C# 词典数据结构设计【附demo】

分析

要建立词典,最基本的应该有词典的描述信息、词典索引文件以及词典数据文件。 /// <summary> /// 索引文件 /// </summary> string idxFile = "dic.idx"; /// <summary> /// 数据文件 /// </summary> string dictfile = "dic.dict"; /// <summary> /// 词典信息文件 /// </summary> string ifoFile = "dic.ifo"; 我们建立对应的三个类

详细的代码如下:

/// 
    ///  词语解释
    /// 
    class DictWord
    {
        /// 
        /// 解析
        /// 
        public string Description
        {
            get;
            set;
        }
    }

    /// 
    /// 词典索引
    /// 
    class DictIndex
    {
        /// 
        /// 词语
        /// 
        public string Word
        {
            get;
            set;
        }

        /// 
        /// 偏移
        /// 
        public int Offset
        {
            get;
            set;
        }

        /// 
        /// 数据大小
        /// 
        public int DataSize
        {
            get;
            set;
        }
    }

    /// 
    /// 词典信息
    /// 
    class DictInfo
    {
        /// 
        /// 词典名称
        /// 
        public string BookName
        {
            get;
            set;
        }

        /// 
        /// 收录词数
        /// 
        public int WordCount
        {
            get;
            set;
        }

        /// 
        /// 当前偏移
        /// 
        public int CurrentOffset
        {
            get;
            set;
        }
    }

数据结构说明:

  1. 描述信息包含词典名字,词典词语数量
  2. 索引文件存储的是排好顺序词语的索引,每个索引包含词语名称、存在数据文件中的偏移量、以及数据块大小,排序的目的在于查找时直接用二分查找节省查找时间。
  3. 数据块就简单了,就纯粹的数据

建立词典

建立词典比较简单,首先,定义几个变量来存储词典相关信息:         DictInfo info;         SortedList<string, DictIndex> indexs;         List<DictWord> words;

ps: SortedList能直接排序,不用我们再手动排序了

然后我们来看添加词语:

/// 
        /// 添加词语
        /// 
        ///  
        ///  
        public void Add(string word, string description)
        {
 
            words.Add(new DictWord() { Description = description });
            indexs.Add(word, new DictIndex { DataSize = Encoding.UTF8.GetBytes(description).Length, Offset = info.CurrentOffset, Word = word });
            // 数量++
            info.WordCount++;
            // 偏移++
            info.CurrentOffset += Encoding.UTF8.GetBytes(description).Length;
        }

非常简单,就是添加索引,同时把词典的数量加1

最后来看怎么存储到文件:

/// <summary>
/// 保存
/// </summary>
public void Save()
{
 
    StringBuilder dicBuilder = new StringBuilder();
    dicBuilder.AppendLine(string.Format("BookName={0}", info.BookName));
    dicBuilder.AppendLine(string.Format("WordCount={0}", info.WordCount));
    dicBuilder.AppendLine(string.Format("CurrentOffset={0}", info.CurrentOffset));
    File.WriteAllText(ifoFile, dicBuilder.ToString(), Encoding.UTF8);
 
    dicBuilder = new StringBuilder();
 
    using (BinaryWriter idxWriter = new BinaryWriter(File.Open(dictfile, FileMode.Create)))
    {
        foreach (var word in words)
        {
            idxWriter.Write(Encoding.UTF8.GetBytes(word.Description));
        }
    }
 
    using (BinaryWriter idxWriter = new BinaryWriter(File.Open(idxFile, FileMode.Create)))
    {
        foreach (var index in indexs)
        {
            // 分块大小  128+4+4  = 136
 
            // word 最长128
            byte[] word = new byte[128];
            var wordData = Encoding.UTF8.GetBytes(index.Key);
            var length = Math.Min(128, wordData.Length);
            for (var i = 0; i < length; i++)
            {
                word[i] = wordData[i];
            }
            idxWriter.Write(word);
            byte[] re = new byte[4];
 
            idxWriter.Write(index.Value.Offset);
            idxWriter.Write(index.Value.DataSize);
        }
    }
 
}

这里注意下word最多能存128个字节,每个index区地大小为128+4+4 = 136字节

查询词典

前面做这么多准备,不都是为了查询吗?木有查询,神马都是浮云!

前面说到了,索引文件存储的是排序好的词语列表,所以查询就比较简单了 先给出两个辅助方法:             idxStream = new FileStream(idxFile, FileMode.Open);             idxReader = new BinaryReader(idxStream);             dictStream = new FileStream(dictfile, FileMode.Open);             dictReader = new BinaryReader(dictStream); (1) 获取指定位置的索引

/// 
///  获取指定位置的索引
/// 
///  
/// 
public DictIndex GetWordIndex(int wordIndex)
{
    idxStream.Seek(0, SeekOrigin.Begin);
    idxStream.Seek(wordIndex * 136, SeekOrigin.Begin);
    byte[] word = idxReader.ReadBytes(128);
    var dicIndex = new DictIndex();
    dicIndex.Word = Encoding.UTF8.GetString(word).Replace("\0", "");
    dicIndex.Offset = idxReader.ReadInt32();
    dicIndex.DataSize = idxReader.ReadInt32();
    return dicIndex;
}

(2)获取指定索引对应的词语解释

/// 
///  获取指定词语的解释
/// 
///  
/// 
public string GetWordDescription(DictIndex dictIndex)
{
    dictStream.Seek(0, SeekOrigin.Begin);
    if (dictIndex.Offset != 0)
        dictStream.Seek(dictIndex.Offset, SeekOrigin.Begin);
    byte[] word = dictReader.ReadBytes(dictIndex.DataSize);
    return Encoding.UTF8.GetString(word).Replace("\0", "");
}

现在开始二分查找:

/// 
        /// 获取词语解释
        /// 
        ///  
        /// 
        public string GetDescription(string word)
        {
            var i = 0;
            var mid = info.WordCount / 2;
            var max = info.WordCount;
            DictIndex w = new DictIndex();
            while (i <= max)
            {
                mid = (i + max) / 2;
                w = GetWordIndex(mid);
                if (string.Compare(w.Word, word) > 0)
                {
                    max = mid - 1;
                }
                else if (string.Compare(w.Word, word) < 0)
                {
                    i = mid + 1;
                }
                else
                {
                    break;
                }
            }
 
            return "[" + w.Word + "]\n" + GetWordDescription(w);
        }

此部分完整代码:

/// 
    /// 词典
    /// 
    class Dict
    {
        DictInfo info;
        SortedList indexs;
        List words;

        /// 
        /// 索引文件
        /// 
        string idxFile = "dic.idx";

        /// 
        /// 数据文件
        /// 
        string dictfile = "dic.dict";

        /// 
        /// 词典信息文件
        /// 
        string ifoFile = "dic.ifo";

        BinaryReader idxReader;
        FileStream idxStream;
        BinaryReader dictReader;
        FileStream dictStream;


        /// 
        /// 查询使用
        /// 
        public Dict()
        {
            LoadDictInfo();
            idxStream = new FileStream(idxFile, FileMode.Open);
            idxReader = new BinaryReader(idxStream);
            dictStream = new FileStream(dictfile, FileMode.Open);
            dictReader = new BinaryReader(dictStream);
        }

        /// 
        /// 创建时使用
        /// 
        ///  
        public Dict(string name)
        {
            info = new DictInfo { BookName = name, WordCount = 0, CurrentOffset = 0 };
            indexs = new SortedList();
            words = new List();

        }

        /// 
        /// 获取词语解释
        /// 
        ///  
        /// 
        public string GetDescription(string word)
        {
            var i = 0;
            var mid = info.WordCount / 2;
            var max = info.WordCount;
            DictIndex w = new DictIndex();
            while (i <= max)
            {
                mid = (i + max) / 2;
                w = GetWordIndex(mid);
                if (string.Compare(w.Word, word) > 0)
                {
                    max = mid - 1;
                }
                else if (string.Compare(w.Word, word) < 0)
                {
                    i = mid + 1;
                }
                else
                {
                    break;
                }
            }

            return "[" + w.Word + "]\n" + GetWordDescription(w);
        }



        /// 
        ///  获取指定位置的索引
        /// 
        ///  
        /// 
        public DictIndex GetWordIndex(int wordIndex)
        {
            idxStream.Seek(0, SeekOrigin.Begin);
            idxStream.Seek(wordIndex * 136, SeekOrigin.Begin);
            byte[] word = idxReader.ReadBytes(128);
            var dicIndex = new DictIndex();
            dicIndex.Word = Encoding.UTF8.GetString(word).Replace("\0", "");
            dicIndex.Offset = idxReader.ReadInt32();
            dicIndex.DataSize = idxReader.ReadInt32();
            return dicIndex;
        }

        /// 
        ///  获取指定词语的解释
        /// 
        ///  
        /// 
        public string GetWordDescription(DictIndex dictIndex)
        {
            dictStream.Seek(0, SeekOrigin.Begin);
            if (dictIndex.Offset != 0)
                dictStream.Seek(dictIndex.Offset, SeekOrigin.Begin);
            byte[] word = dictReader.ReadBytes(dictIndex.DataSize);
            return Encoding.UTF8.GetString(word).Replace("\0", "");
        }

        /// 
        /// 添加词语
        /// 
        ///  
        ///  
        public void Add(string word, string description)
        {

            words.Add(new DictWord() { Description = description });
            indexs.Add(word, new DictIndex { DataSize = Encoding.UTF8.GetBytes(description).Length, Offset = info.CurrentOffset, Word = word });
            // 数量++
            info.WordCount++;
            // 偏移++
            info.CurrentOffset += Encoding.UTF8.GetBytes(description).Length;
        }

        /// 
        /// 加载词典信息
        /// 
        void LoadDictInfo()
        {
            var infos = File.ReadAllLines(ifoFile);
            info = new DictInfo
            {
                BookName = infos[0].Replace("BookName=", "").Trim(),
                WordCount = int.Parse(infos[1].Replace("WordCount=", "").Trim()),
                CurrentOffset = int.Parse(infos[2].Replace("CurrentOffset=", "").Trim()),
            };
        }

        /// 
        /// 保存
        /// 
        public void Save()
        {

            StringBuilder dicBuilder = new StringBuilder();
            dicBuilder.AppendLine(string.Format("BookName={0}", info.BookName));
            dicBuilder.AppendLine(string.Format("WordCount={0}", info.WordCount));
            dicBuilder.AppendLine(string.Format("CurrentOffset={0}", info.CurrentOffset));
            File.WriteAllText(ifoFile, dicBuilder.ToString(), Encoding.UTF8);

            dicBuilder = new StringBuilder();

            using (BinaryWriter idxWriter = new BinaryWriter(File.Open(dictfile, FileMode.Create)))
            {
                foreach (var word in words)
                {
                    idxWriter.Write(Encoding.UTF8.GetBytes(word.Description));
                }
            }

            using (BinaryWriter idxWriter = new BinaryWriter(File.Open(idxFile, FileMode.Create)))
            {
                foreach (var index in indexs)
                {
                    // 分块大小  128+4+4  = 136

                    // word 最长128
                    byte[] word = new byte[128];
                    var wordData = Encoding.UTF8.GetBytes(index.Key);
                    var length = Math.Min(128, wordData.Length);
                    for (var i = 0; i < length; i++)
                    {
                        word[i] = wordData[i];
                    }
                    idxWriter.Write(word);
                    byte[] re = new byte[4];

                    idxWriter.Write(index.Value.Offset);
                    idxWriter.Write(index.Value.DataSize);
                }
            }

        }
    }

演示

如图所示

文件夹中放置了许多文本文件,内容为词语的解释

首先、建立词典:

Dict dic = new Dict("病症词典");
  
            var files = new DirectoryInfo(@"G:\Users\Administrator\Desktop\新建文件夹 (3)\新建文件夹 (3)").GetFiles();
            foreach (var file in files)
            {
                Console.WriteLine(file.FullName);
                dic.Add(file.Name.Replace("的症状.txt", ""), File.ReadAllText(file.FullName));
            }
            dic.Save();

然后、把玩一番:

var dict = new Dict();
            while (true)
            {
                Console.Write("请输入词语:");
                var w = Console.ReadLine();
                Stopwatch sw = new Stopwatch();
                sw.Start();
                Console.WriteLine("找到词语:");
                Console.WriteLine(dict.GetDescription(w));
                sw.Stop();
                Console.WriteLine("耗时:" + sw.ElapsedMilliseconds + "ms");

            }

运行结果:

到此为止,谢谢收看!

[[demo下载]]

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏PaddlePaddle

【进阶篇】命令行参数细节描述

编写|PaddlePaddle 排版|wangp 虽然PaddlePaddle看起来包含了众多参数,但是大部分参数是为开发者提供的,或者已经在集群提交环境中自动...

2844
来自专栏听雨堂

从MapX到MapXtreme2004[9]-标注的强调显示

        如果想要将一个选中的图元强调显示,用红色醒目的文字显示的话,我的思路如下:             1、不可能直接改原先的图元,所以必须要在一个...

1896
来自专栏月色的自留地

macOS的OpenCL高性能计算

1378
来自专栏Google Dart

Flutte部件目录-基本部件(二) 顶

支持以下图像格式:JPEG,PNG,GIF,GIF动画,WebP,WebP动画,BMP和WBMP

502
来自专栏一“技”之长

iOS开发CoreGraphics核心图形框架之五——Patterns模型的应用

    Patterns称为模型可能并不直观,说一个场景我们或许就可以更加容易的理解Patterns。在开发中,开发者经常会遇到这样的需求,将某个图片或者某个图...

793
来自专栏浅探ARKit

基于ARkit和SceneKit检测相机位置和设置2个物体碰撞的事件

######和以往iOS的代理事件不同 它还要多设置categoryBitMask、contactTestBitMask属性的id 用于标志2个物体是否会发生...

41411
来自专栏AIUAI

Caffe2 - (三十三) Detectron 之 roi_data - data loader

2944
来自专栏DannyHoo的专栏

新浪微博项目笔记

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/u010105969/article/details/...

651
来自专栏java 成神之路

java.util.Random 实现原理

2905
来自专栏逍遥剑客的游戏开发

XACT Q&A

1495

扫描关注云+社区