A simple program to make the Chinese Academy of Sciences' ICTCLAS Chinese word segmenter usable by Lucene (C# version)


2011-01-17 17:40:09 | Category: C#.NET

This program uses 吕震宇's .NET port of the Free version of ICTCLAS (SharpICTCLAS) to make ICTCLAS segmentation available to Lucene. The code below is fairly simple; please take a look and point out anything that could be improved.
The Analyzer class:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace ICTCLASForLucene
{
    public class ICTCLASAnalyzer : Analyzer
    {
        // Stop words to filter out, loaded from the noise-word file
        public static readonly System.String[] CHINESE_ENGLISH_STOP_WORDS = new string[368];
        public string NoisePath = Environment.CurrentDirectory + "\\data\\sNoise.txt";

        public ICTCLASAnalyzer()
        {
            // Read one stop word per line until an empty line or end of file is reached
            using (StreamReader reader = new StreamReader(NoisePath, System.Text.Encoding.UTF8))
            {
                string noise = reader.ReadLine();
                int i = 0;
                while (!string.IsNullOrEmpty(noise) && i < CHINESE_ENGLISH_STOP_WORDS.Length)
                {
                    CHINESE_ENGLISH_STOP_WORDS[i] = noise;
                    noise = reader.ReadLine();
                    i++;
                }
            }
        }

        /// <summary>
        /// Constructs an ICTCLASTokenizer filtered by a StandardFilter,
        /// a LowerCaseFilter and a StopFilter.
        /// </summary>
        public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            TokenStream result = new ICTCLASTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, CHINESE_ENGLISH_STOP_WORDS);
            return result;
        }
    }
}
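To actually put ICTCLAS to work for Lucene, the analyzer above is handed to an IndexWriter like any other Lucene analyzer. The following is a minimal indexing sketch, not part of the original code, assuming the same Lucene.Net 1.9/2.0-era API the classes here target; the index directory "index" and the field name "content" are made up for illustration:

using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;

namespace ICTCLASForLucene
{
    class IndexDemo
    {
        static void Main(string[] args)
        {
            // "index" is a hypothetical directory where the index files will be written;
            // the third argument 'true' creates a fresh index.
            IndexWriter writer = new IndexWriter("index", new ICTCLASAnalyzer(), true);

            Document doc = new Document();
            // "content" is a hypothetical field name; Field.Index.TOKENIZED makes Lucene
            // run ICTCLASAnalyzer over the text before indexing it.
            doc.Add(new Field("content", "长春市长春节发表致词", Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);

            writer.Optimize();
            writer.Close();
        }
    }
}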

The Tokenizer class:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

using Lucene.Net.Analysis;
using SharpICTCLAS;

namespace ICTCLASForLucene
{
    class ICTCLASTokenizer : Tokenizer
    {
        int nKind = 2;                  // number of candidate segmentations to request from ICTCLAS
        List<WordResult[]> result;      // segmentation result returned by SharpICTCLAS
        int startIndex = 0;
        int endIndex = 0;
        int i = 1;                      // start at 1 to skip the sentence-begin marker

        /// <summary>The sentence to be segmented.</summary>
        private string sentence;

        /// <summary>Constructs a tokenizer for this Reader.</summary>
        public ICTCLASTokenizer(System.IO.TextReader reader)
        {
            this.input = reader;
            sentence = input.ReadToEnd();
            sentence = sentence.Replace("\r\n", "");
            string DictPath = Path.Combine(Environment.CurrentDirectory, "Data") + Path.DirectorySeparatorChar;
            //Console.WriteLine("Initializing the dictionaries, please wait...");
            WordSegment wordSegment = new WordSegment();
            wordSegment.InitWordSegment(DictPath);
            result = wordSegment.Segment(sentence, nKind);
        }

        /// <summary>
        /// Performs the segmentation: returns the next token in the stream,
        /// or null when the stream is exhausted.
        /// </summary>
        public override Token Next()
        {
            // result[0] is the best segmentation; its first and last entries are the
            // sentence-begin/end markers, which are skipped here.
            while (i < result[0].Length - 1)
            {
                string word = result[0][i].sWord;
                endIndex = startIndex + word.Length - 1;
                Token token = new Token(word, startIndex, endIndex);
                startIndex = endIndex + 1;
                i++;
                return token;
            }
            return null;
        }
    }
}
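The output below can be reproduced with a small driver that feeds a test sentence through the analyzer and prints each token with its offsets. This harness is only a sketch of such a driver, not the original test code; the field name passed to TokenStream is arbitrary, since ICTCLASAnalyzer ignores it:

using System;
using System.IO;
using Lucene.Net.Analysis;

namespace ICTCLASForLucene
{
    class TokenizeDemo
    {
        static void Main(string[] args)
        {
            string text = "毛泽东,周恩来,中华人民共和国在1949年建立,从此开始了新中国的伟大篇章.";
            Analyzer analyzer = new ICTCLASAnalyzer();

            DateTime start = DateTime.Now;
            // The field name "content" is arbitrary here; the analyzer does not use it.
            TokenStream stream = analyzer.TokenStream("content", new StringReader(text));

            Token token;
            while ((token = stream.Next()) != null)
            {
                Console.Write("({0},{1},{2})", token.TermText(), token.StartOffset(), token.EndOffset());
            }
            Console.WriteLine();
            // Rough timing only; it also includes dictionary initialization inside the tokenizer.
            Console.WriteLine("Elapsed: " + (DateTime.Now - start));
        }
    }
}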
Segmentation results:
Sentence to segment: 毛泽东,周恩来,中华人民共和国在1949年建立,从此开始了新中国的伟大篇章.长春市长春节发表致词汉字abc iphone 1265325.98921 fee1212@tom.com http://news.qq.com 100%
Segmentation output:
(毛泽东,0,2)(周恩来,4,6)(中华人民共和国,8,14)(1949年,16,20)(建立,21,22)(从此,24,25)(新,29,29)(中国,30,31)(伟大,33,34)(篇章,35,36)(长春市,38,40)(春节,42,43)(发表,44,45)(致词,46,47)(汉字,48,49)(abc,50,52)(iphone,53,58)(1265325.98921,59,71)(fee1212@tom,72,82)(com,84,86)(http://news,87,97)(qq,99,100)(com,102,104)(100%,105,108)
Elapsed time: 00:00:00.0937500
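At query time the same analyzer should be used, so that the query string is segmented exactly like the indexed text. Below is a search sketch under the same assumptions as the indexing example above (Lucene.Net 1.9/2.0-era API, hypothetical "index" directory and "content" field):

using System;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

namespace ICTCLASForLucene
{
    class SearchDemo
    {
        static void Main(string[] args)
        {
            // Parse the query with ICTCLASAnalyzer so its terms match the indexed tokens.
            QueryParser parser = new QueryParser("content", new ICTCLASAnalyzer());
            Query query = parser.Parse("长春市");

            IndexSearcher searcher = new IndexSearcher("index");
            Hits hits = searcher.Search(query);
            for (int i = 0; i < hits.Length(); i++)
            {
                Console.WriteLine(hits.Doc(i).Get("content"));
            }
            searcher.Close();
        }
    }
}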