专栏首页JadePeng的技术博客C# 词典数据结构设计【附demo】

C# 词典数据结构设计【附demo】

分析

要建立词典,最基本的应该有词典的描述信息、词典索引文件以及词典数据文件。 /// <summary> /// 索引文件 /// </summary> string idxFile = "dic.idx"; /// <summary> /// 数据文件 /// </summary> string dictfile = "dic.dict"; /// <summary> /// 词典信息文件 /// </summary> string ifoFile = "dic.ifo"; 我们建立对应的三个类

详细的代码如下:

/// 
    ///  词语解释
    /// 
    class DictWord
    {
        /// 
        /// 解析
        /// 
        public string Description
        {
            get;
            set;
        }
    }

    /// 
    /// 词典索引
    /// 
    class DictIndex
    {
        /// 
        /// 词语
        /// 
        public string Word
        {
            get;
            set;
        }

        /// 
        /// 偏移
        /// 
        public int Offset
        {
            get;
            set;
        }

        /// 
        /// 数据大小
        /// 
        public int DataSize
        {
            get;
            set;
        }
    }

    /// 
    /// 词典信息
    /// 
    class DictInfo
    {
        /// 
        /// 词典名称
        /// 
        public string BookName
        {
            get;
            set;
        }

        /// 
        /// 收录词数
        /// 
        public int WordCount
        {
            get;
            set;
        }

        /// 
        /// 当前偏移
        /// 
        public int CurrentOffset
        {
            get;
            set;
        }
    }

数据结构说明:

  1. 描述信息包含词典名字,词典词语数量
  2. 索引文件存储的是排好顺序词语的索引,每个索引包含词语名称、存在数据文件中的偏移量、以及数据块大小,排序的目的在于查找时直接用二分查找节省查找时间。
  3. 数据块就简单了,就纯粹的数据

建立词典

建立词典比较简单,首先,定义几个变量来存储词典相关信息:         DictInfo info;         SortedList<string, DictIndex> indexs;         List<DictWord> words;

ps: SortedList能直接排序,不用我们再手动排序了

然后我们来看添加词语:

/// 
        /// 添加词语
        /// 
        ///  
        ///  
        public void Add(string word, string description)
        {
 
            words.Add(new DictWord() { Description = description });
            indexs.Add(word, new DictIndex { DataSize = Encoding.UTF8.GetBytes(description).Length, Offset = info.CurrentOffset, Word = word });
            // 数量++
            info.WordCount++;
            // 偏移++
            info.CurrentOffset += Encoding.UTF8.GetBytes(description).Length;
        }

非常简单,就是添加索引,同时把词典的数量加1

最后来看怎么存储到文件:

/// <summary>
/// 保存
/// </summary>
public void Save()
{
 
    StringBuilder dicBuilder = new StringBuilder();
    dicBuilder.AppendLine(string.Format("BookName={0}", info.BookName));
    dicBuilder.AppendLine(string.Format("WordCount={0}", info.WordCount));
    dicBuilder.AppendLine(string.Format("CurrentOffset={0}", info.CurrentOffset));
    File.WriteAllText(ifoFile, dicBuilder.ToString(), Encoding.UTF8);
 
    dicBuilder = new StringBuilder();
 
    using (BinaryWriter idxWriter = new BinaryWriter(File.Open(dictfile, FileMode.Create)))
    {
        foreach (var word in words)
        {
            idxWriter.Write(Encoding.UTF8.GetBytes(word.Description));
        }
    }
 
    using (BinaryWriter idxWriter = new BinaryWriter(File.Open(idxFile, FileMode.Create)))
    {
        foreach (var index in indexs)
        {
            // 分块大小  128+4+4  = 136
 
            // word 最长128
            byte[] word = new byte[128];
            var wordData = Encoding.UTF8.GetBytes(index.Key);
            var length = Math.Min(128, wordData.Length);
            for (var i = 0; i < length; i++)
            {
                word[i] = wordData[i];
            }
            idxWriter.Write(word);
            byte[] re = new byte[4];
 
            idxWriter.Write(index.Value.Offset);
            idxWriter.Write(index.Value.DataSize);
        }
    }
 
}

这里注意下word最多能存128个字节,每个index区地大小为128+4+4 = 136字节

查询词典

前面做这么多准备,不都是为了查询吗?木有查询,神马都是浮云!

前面说到了,索引文件存储的是排序好的词语列表,所以查询就比较简单了 先给出两个辅助方法:             idxStream = new FileStream(idxFile, FileMode.Open);             idxReader = new BinaryReader(idxStream);             dictStream = new FileStream(dictfile, FileMode.Open);             dictReader = new BinaryReader(dictStream); (1) 获取指定位置的索引

/// 
///  获取指定位置的索引
/// 
///  
/// 
public DictIndex GetWordIndex(int wordIndex)
{
    idxStream.Seek(0, SeekOrigin.Begin);
    idxStream.Seek(wordIndex * 136, SeekOrigin.Begin);
    byte[] word = idxReader.ReadBytes(128);
    var dicIndex = new DictIndex();
    dicIndex.Word = Encoding.UTF8.GetString(word).Replace("\0", "");
    dicIndex.Offset = idxReader.ReadInt32();
    dicIndex.DataSize = idxReader.ReadInt32();
    return dicIndex;
}

(2)获取指定索引对应的词语解释

/// 
///  获取指定词语的解释
/// 
///  
/// 
public string GetWordDescription(DictIndex dictIndex)
{
    dictStream.Seek(0, SeekOrigin.Begin);
    if (dictIndex.Offset != 0)
        dictStream.Seek(dictIndex.Offset, SeekOrigin.Begin);
    byte[] word = dictReader.ReadBytes(dictIndex.DataSize);
    return Encoding.UTF8.GetString(word).Replace("\0", "");
}

现在开始二分查找:

/// 
        /// 获取词语解释
        /// 
        ///  
        /// 
        public string GetDescription(string word)
        {
            var i = 0;
            var mid = info.WordCount / 2;
            var max = info.WordCount;
            DictIndex w = new DictIndex();
            while (i <= max)
            {
                mid = (i + max) / 2;
                w = GetWordIndex(mid);
                if (string.Compare(w.Word, word) > 0)
                {
                    max = mid - 1;
                }
                else if (string.Compare(w.Word, word) < 0)
                {
                    i = mid + 1;
                }
                else
                {
                    break;
                }
            }
 
            return "[" + w.Word + "]\n" + GetWordDescription(w);
        }

此部分完整代码:

/// 
    /// 词典
    /// 
    class Dict
    {
        DictInfo info;
        SortedList indexs;
        List words;

        /// 
        /// 索引文件
        /// 
        string idxFile = "dic.idx";

        /// 
        /// 数据文件
        /// 
        string dictfile = "dic.dict";

        /// 
        /// 词典信息文件
        /// 
        string ifoFile = "dic.ifo";

        BinaryReader idxReader;
        FileStream idxStream;
        BinaryReader dictReader;
        FileStream dictStream;


        /// 
        /// 查询使用
        /// 
        public Dict()
        {
            LoadDictInfo();
            idxStream = new FileStream(idxFile, FileMode.Open);
            idxReader = new BinaryReader(idxStream);
            dictStream = new FileStream(dictfile, FileMode.Open);
            dictReader = new BinaryReader(dictStream);
        }

        /// 
        /// 创建时使用
        /// 
        ///  
        public Dict(string name)
        {
            info = new DictInfo { BookName = name, WordCount = 0, CurrentOffset = 0 };
            indexs = new SortedList();
            words = new List();

        }

        /// 
        /// 获取词语解释
        /// 
        ///  
        /// 
        public string GetDescription(string word)
        {
            var i = 0;
            var mid = info.WordCount / 2;
            var max = info.WordCount;
            DictIndex w = new DictIndex();
            while (i <= max)
            {
                mid = (i + max) / 2;
                w = GetWordIndex(mid);
                if (string.Compare(w.Word, word) > 0)
                {
                    max = mid - 1;
                }
                else if (string.Compare(w.Word, word) < 0)
                {
                    i = mid + 1;
                }
                else
                {
                    break;
                }
            }

            return "[" + w.Word + "]\n" + GetWordDescription(w);
        }



        /// 
        ///  获取指定位置的索引
        /// 
        ///  
        /// 
        public DictIndex GetWordIndex(int wordIndex)
        {
            idxStream.Seek(0, SeekOrigin.Begin);
            idxStream.Seek(wordIndex * 136, SeekOrigin.Begin);
            byte[] word = idxReader.ReadBytes(128);
            var dicIndex = new DictIndex();
            dicIndex.Word = Encoding.UTF8.GetString(word).Replace("\0", "");
            dicIndex.Offset = idxReader.ReadInt32();
            dicIndex.DataSize = idxReader.ReadInt32();
            return dicIndex;
        }

        /// 
        ///  获取指定词语的解释
        /// 
        ///  
        /// 
        public string GetWordDescription(DictIndex dictIndex)
        {
            dictStream.Seek(0, SeekOrigin.Begin);
            if (dictIndex.Offset != 0)
                dictStream.Seek(dictIndex.Offset, SeekOrigin.Begin);
            byte[] word = dictReader.ReadBytes(dictIndex.DataSize);
            return Encoding.UTF8.GetString(word).Replace("\0", "");
        }

        /// 
        /// 添加词语
        /// 
        ///  
        ///  
        public void Add(string word, string description)
        {

            words.Add(new DictWord() { Description = description });
            indexs.Add(word, new DictIndex { DataSize = Encoding.UTF8.GetBytes(description).Length, Offset = info.CurrentOffset, Word = word });
            // 数量++
            info.WordCount++;
            // 偏移++
            info.CurrentOffset += Encoding.UTF8.GetBytes(description).Length;
        }

        /// 
        /// 加载词典信息
        /// 
        void LoadDictInfo()
        {
            var infos = File.ReadAllLines(ifoFile);
            info = new DictInfo
            {
                BookName = infos[0].Replace("BookName=", "").Trim(),
                WordCount = int.Parse(infos[1].Replace("WordCount=", "").Trim()),
                CurrentOffset = int.Parse(infos[2].Replace("CurrentOffset=", "").Trim()),
            };
        }

        /// 
        /// 保存
        /// 
        public void Save()
        {

            StringBuilder dicBuilder = new StringBuilder();
            dicBuilder.AppendLine(string.Format("BookName={0}", info.BookName));
            dicBuilder.AppendLine(string.Format("WordCount={0}", info.WordCount));
            dicBuilder.AppendLine(string.Format("CurrentOffset={0}", info.CurrentOffset));
            File.WriteAllText(ifoFile, dicBuilder.ToString(), Encoding.UTF8);

            dicBuilder = new StringBuilder();

            using (BinaryWriter idxWriter = new BinaryWriter(File.Open(dictfile, FileMode.Create)))
            {
                foreach (var word in words)
                {
                    idxWriter.Write(Encoding.UTF8.GetBytes(word.Description));
                }
            }

            using (BinaryWriter idxWriter = new BinaryWriter(File.Open(idxFile, FileMode.Create)))
            {
                foreach (var index in indexs)
                {
                    // 分块大小  128+4+4  = 136

                    // word 最长128
                    byte[] word = new byte[128];
                    var wordData = Encoding.UTF8.GetBytes(index.Key);
                    var length = Math.Min(128, wordData.Length);
                    for (var i = 0; i < length; i++)
                    {
                        word[i] = wordData[i];
                    }
                    idxWriter.Write(word);
                    byte[] re = new byte[4];

                    idxWriter.Write(index.Value.Offset);
                    idxWriter.Write(index.Value.DataSize);
                }
            }

        }
    }

演示

如图所示

文件夹中放置了许多文本文件,内容为词语的解释

首先、建立词典:

Dict dic = new Dict("病症词典");
  
            var files = new DirectoryInfo(@"G:\Users\Administrator\Desktop\新建文件夹 (3)\新建文件夹 (3)").GetFiles();
            foreach (var file in files)
            {
                Console.WriteLine(file.FullName);
                dic.Add(file.Name.Replace("的症状.txt", ""), File.ReadAllText(file.FullName));
            }
            dic.Save();

然后、把玩一番:

var dict = new Dict();
            while (true)
            {
                Console.Write("请输入词语:");
                var w = Console.ReadLine();
                Stopwatch sw = new Stopwatch();
                sw.Start();
                Console.WriteLine("找到词语:");
                Console.WriteLine(dict.GetDescription(w));
                sw.Stop();
                Console.WriteLine("耗时:" + sw.ElapsedMilliseconds + "ms");

            }

运行结果:

到此为止,谢谢收看!

[[demo下载]]

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

我来说两句

0 条评论
登录 后参与评论

相关文章

  • 使用贝叶斯做英文拼写检查(c#)

    贝叶斯算法可以用来做拼写检查、文本分类、垃圾邮件过滤等工作,前面我们用贝叶斯做了文本分类,这次用它来做拼写检查,参考:How to Write a Spelli...

    用户1177380
  • HTML5录音控件

    最近的项目又需要用到录音,年前有过调研,再次翻出来使用,这里做一个记录。 HTML5提供了录音支持,因此可以方便使用HTML5来录音,来实现录音、语音识别等功能...

    用户1177380
  • IDEA版本彩虹屁插件idea-rainbow-fart,一个在你编程时疯狂称赞你的 IDEA扩展插件

    是否听说过程序员鼓励师,不久前出了一款vscode的插件rainbow-fart,可以在写代码的时候,匹配到特定关键词就疯狂的拍你马屁。

    用户1177380
  • zookeeper源码分析(2)-客户端启动流程

    客户端的入口,负责启动整个客户端。持有ClientCnxn和ZKWatchManager的实例,提供了客户端对节点操作的方法。

    Monica2333
  • PHP中把字符串true/false转成boolean布尔型

    在PHP中,无法使用(bool)或者settype()函数把字符串的”true”和”false”转成布尔型。如果使用上述两种办法,会始终返回true。

    TLingC
  • 如何选型一个合适的框架-分布式任务调度框架选型

    定时任务是大家再开发中一个不可避免的业务,比如在一些电商系统中可能会定时给用户发送生日券,一些对账系统中可能会定时去对账。大概再很久以前每个服务可能就一台机器,...

    用户5397975
  • 【温故知新】概率笔记4——重要公式

    概率公式是概率计算中的重要环节,全概率公式、贝叶斯公式等可以运用于复杂事件的概率, 而所有这些公式又是由基本公式推导出来的。

    统计学家
  • 如何使用UI Configuration将WebClient UI的隐藏字段放出来

    In Service order detail page some fields are by default hidden in standard UI co...

    Jerry Wang
  • python二元表达式

    py3study
  • 身为程序猿,怎能不懂RegExp?

    正则表达式是程序猿的好朋友。这体现在两个方面:一、在我们敲的代码里面,可以用正则表达式非常轻巧、灵便、快捷的完成字符串的操作,比如匹配、搜索、提取子串等。二、我...

    企鹅号小编

扫码关注云+社区

领取腾讯云代金券