
Writing Your Own Lucene Analyzer, Principles: A Brief Look at CJKAnalyzer

2013-10-15 

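CJK here refers to the CJK Unified Ideographs. Their purpose is to give a single code point, in ISO 10646 and the Unicode standard, to ideographs from Chinese, Japanese, Korean, and Vietnamese that share the same origin and meaning and have identical or only slightly different shapes. Most of these are Chinese characters, but the set also covers characters modeled on them, such as Japanese kokuji, hanja specific to Korea, and Vietnamese chữ Nôm. CJK itself abbreviates Chinese, Japanese, and Korean, and, as the name suggests, it covers the scripts of these three languages.

In practice, CJKAnalyzer handles Chinese, Japanese, and Korean text, the last covering both South and North Korean usage.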
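To make "CJK character" concrete, the sketch below uses only the JDK to ask which Unicode script a code point belongs to. It is not Lucene code: the class name `ScriptCheck` and the helper `isCjk` are made up for illustration, and the choice of four scripts simply mirrors the Han, Hiragana, Katakana, and Hangul types checked in the filter code later in this article.

```java
class ScriptCheck {
    // Hypothetical helper: true when the code point belongs to one of the four
    // scripts that the filter code in this article treats as CJK.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    public static void main(String[] args) {
        // U+4E2D (Han), U+3042 (Hiragana), U+30AB (Katakana), U+D55C (Hangul), 'a' (Latin)
        int[] samples = {0x4E2D, 0x3042, 0x30AB, 0xD55C, 'a'};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp) + " -> "
                + Character.UnicodeScript.of(cp) + ", CJK=" + isCjk(cp));
        }
    }
}
```

Because of Han unification, U+4E2D reports the single script HAN whether the surrounding text is Chinese, Japanese, or Korean.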

The classes that make up CJKAnalyzer:


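CJKAnalyzer is the main class.

CJKWidthFilter normalizes character width. Its main job is to fold halfwidth katakana variants into the equivalent kana; it also folds fullwidth ASCII variants into basic Latin.

CJKBigramFilter forms bigrams from adjacent CJK characters. Any run of CJK characters is split into overlapping pairs: ABC -> AB, BC.

CJKTokenizer is the tokenizer kept for compatibility with older versions.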
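The width folding described above can be approximated with the JDK alone. The sketch below uses NFKC normalization as a stand-in; this is an assumption made for illustration, not how CJKWidthFilter is implemented, and NFKC applies further compatibility mappings beyond width. The names `WidthFoldSketch` and `foldWidth` are hypothetical.

```java
import java.text.Normalizer;

class WidthFoldSketch {
    // Approximation only: NFKC maps fullwidth ASCII to basic Latin and
    // halfwidth katakana to ordinary katakana, but it also rewrites other
    // compatibility characters, so it is broader than pure width folding.
    static String foldWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth "ABC" (U+FF21..U+FF23) folds to plain ASCII "ABC".
        System.out.println(foldWidth("\uFF21\uFF22\uFF23"));
        // Halfwidth katakana KA (U+FF76) folds to ordinary katakana KA (U+30AB).
        System.out.println(foldWidth("\uFF76").equals("\u30AB"));
    }
}
```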

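1. The main part of CJKAnalyzer. It first checks the version number. Older versions use CJKTokenizer directly. Newer versions first run StandardTokenizer, which emits every character that is not an English letter or digit as its own token, then CJKWidthFilter to normalize CJK character width, then LowerCaseFilter to lowercase English text, and finally CJKBigramFilter to group the CJK characters into overlapping pairs. The code listed below is the incrementToken() method of CJKBigramFilter, which does the pairing.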
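The four steps of the newer code path can be imitated in plain Java to see what comes out the other end. This is a simplified simulation, not Lucene code: tokenization is reduced to "letter or digit runs versus CJK characters", width folding is approximated with NFKC, and folding and lowercasing are applied to the whole string up front rather than per token. All names (`CjkPipelineSketch`, `analyze`, and the helpers) are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CjkPipelineSketch {
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA || s == Character.UnicodeScript.HANGUL;
    }

    // Emits the buffered run of adjacent CJK code points as overlapping
    // bigrams, or as a single character when the run has length 1.
    static void flushCjk(List<Integer> run, List<String> out) {
        if (run.size() == 1) {
            out.add(new String(Character.toChars(run.get(0))));
        }
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new StringBuilder()
                .appendCodePoint(run.get(i)).appendCodePoint(run.get(i + 1)).toString());
        }
        run.clear();
    }

    // Emits the buffered run of letters and digits as one token.
    static void flushWord(StringBuilder word, List<String> out) {
        if (word.length() > 0) {
            out.add(word.toString());
            word.setLength(0);
        }
    }

    static List<String> analyze(String text) {
        // Width folding (approximated by NFKC) and lowercasing.
        String s = Normalizer.normalize(text, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        List<Integer> run = new ArrayList<>();
        s.codePoints().forEach(cp -> {
            if (isCjk(cp)) {                          // CJK: extend the current CJK run
                flushWord(word, out);
                run.add(cp);
            } else if (Character.isLetterOrDigit(cp)) { // other letters/digits: extend the word
                flushCjk(run, out);
                word.appendCodePoint(cp);
            } else {                                  // anything else ends both runs
                flushWord(word, out);
                flushCjk(run, out);
            }
        });
        flushWord(word, out);
        flushCjk(run, out);
        return out;
    }

    public static void main(String[] args) {
        // "Lucene" followed by four Han characters (U+4E2D U+6587 U+5206 U+8BCD)
        System.out.println(analyze("Lucene\u4e2d\u6587\u5206\u8bcd"));
    }
}
```

For that input the sketch yields the lowercased word followed by three overlapping two-character tokens, which is the shape of output the chain described above is meant to produce.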

public boolean incrementToken() throws IOException {
  while (true) {
    // check whether a CJK bigram is already buffered
    if (hasBufferedBigram()) {

      // case 1: we have multiple remaining codepoints buffered,
      // so we can emit a bigram here.
      // if single-character (unigram) output was requested as well
      if (outputUnigrams) {
        // when also outputting unigrams, we output the unigram first,
        // then rewind back to revisit the bigram.
        // so an input of ABC is A + (rewind)AB + B + (rewind)BC + C
        // the logic in hasBufferedUnigram ensures we output the C,
        // even though it did actually have adjacent CJK characters.
        if (ngramState) {
          flushBigram(); // emit the bigram
        } else {
          flushUnigram(); // emit the unigram, then step back one character (rewind)
          index--;
        }
        // toggle the state so that the next pass takes the other branch
        // of the check above and emits the bigram
        ngramState = !ngramState;
      } else {
        flushBigram();
      }
      return true;
    } else if (doNext()) {
      // case 2: look at the token type. should we form any n-grams?

      String type = typeAtt.type();
      // check the token type: is it a CJK type?
      if (type == doHan || type == doHiragana || type == doKatakana || type == doHangul) {

        // acceptable CJK type: we form n-grams from these.
        // as long as the offsets are aligned, we just add these to our current buffer.
        // otherwise, we clear the buffer and start over.

        if (offsetAtt.startOffset() != lastEndOffset) { // unaligned, clear queue
          if (hasBufferedUnigram()) {

            // we have a buffered unigram, and we peeked ahead to see if we could form
            // a bigram, but we can't, because the offsets are unaligned. capture the state
            // of this peeked data to be revisited next time thru the loop, and dump our unigram.

            loneState = captureState();
            flushUnigram();
            return true;
          }
          index = 0;
          bufferLen = 0;
        }
        refill();
      } else {

        // not a CJK type: we just return these as-is.

        if (hasBufferedUnigram()) {

          // we have a buffered unigram, and we peeked ahead to see if we could form
          // a bigram, but we can't, because it's not a CJK type. capture the state
          // of this peeked data to be revisited next time thru the loop, and dump our unigram.

          loneState = captureState();
          flushUnigram();
          return true;
        }
        return true;
      }
    } else {

      // case 3: we have only zero or 1 codepoints buffered,
      // so not enough to form a bigram. But, we also have no
      // more input. So if we have a buffered codepoint, emit
      // a unigram, otherwise, it's end of stream.

      if (hasBufferedUnigram()) {
        flushUnigram(); // flush our remaining unigram
        return true;
      }
      return false;
    }
  }
}
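The emission order produced by the ngramState toggle is easier to see without the buffering. The sketch below is a hypothetical stand-alone model (the names `UnigramBigramOrder` and `emit` are not from Lucene): it takes one run of adjacent CJK characters and lists the tokens in the order the comments above describe, ignoring offsets, position increments, and the loneState handling for non-CJK tokens.

```java
import java.util.ArrayList;
import java.util.List;

class UnigramBigramOrder {
    // Models the token order for a single run of adjacent CJK characters.
    // With outputUnigrams: unigram, then the bigram starting at the same
    // character, and so on (ABC -> A, AB, B, BC, C).
    // Without it: bigrams only, except that a run of one character is
    // still emitted as a lone unigram (case 3 in the listing above).
    static List<String> emit(String run, boolean outputUnigrams) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            if (outputUnigrams || cps.length == 1) {
                out.add(new String(cps, i, 1)); // the character on its own
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2)); // the character plus its right neighbour
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("ABC", true));  // [A, AB, B, BC, C]
        System.out.println(emit("ABC", false)); // [AB, BC]
        System.out.println(emit("A", false));   // [A]
    }
}
```

The letters A, B, C stand in for CJK characters, following the notation used in the source comments.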




 

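#2 nini_2012, yesterday 09:50
It only splits text into two-character pairs. You call that a tokenizer?
Re: liugang51096557, yesterday 10:54
Reply to nini_2012: Er... a simple tokenizer.
#1 u010123828, yesterday 09:22
Well explained, I understood it right away O(∩_∩)O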
