自己写Lucene分词器原理篇——CJKAnalyzer简单讲解
其中CJK中日韩统一表意文字(CJK Unified Ideographs),目的是要把分别来自中文、日文、韩文、越文中,本质、意义相同、形状一样或稍异的表意文字(主要为汉字,但也有仿汉字如日本国字、韩国独有汉字、越南的喃字)于ISO 10646及Unicode标准内赋予相同编码。CJK 是中文(Chinese)、日文(Japanese)、韩文(Korean)三国文字的缩写。顾名思义,它能够支持这三种文字。
实际上,CJKAnalyzer支持中文、日文、韩文和朝鲜文。
CJKAnalyzer的所有类:
CJKAnalyzer是主类
CJKWidthFilter是负责格式化字符,主要是折叠变种的半宽片假名成等价的假名
CJKBigramFilter是负责把两个CJK字符切割成两个,只要是CJK字符就会两两切割,ABC->AB,BC
CJKTokenizer是兼容低版本的分析器
1.CJKAnalyzer的主要部分,可以看出,先判断版本号,低版本直接用CJKTokenizer,高版本的先用standardanalyzer把非英文字母数字一个个分开,再用CJKWidthFilter格式化CJK字符,再用LowerCaseFilter转换英文成小写,再用CJKBigramFilter把CJK字符切割成两两的。
public boolean incrementToken() throws IOException { while (true) { //判断之前是否暂存了双CJK字符 if (hasBufferedBigram()) { // case 1: we have multiple remaining codepoints buffered, // so we can emit a bigram here. //如果选择了要输出单字切割 if (outputUnigrams) { // when also outputting unigrams, we output the unigram first, // then rewind back to revisit the bigram. // so an input of ABC is A + (rewind)AB + B + (rewind)BC + C // the logic in hasBufferedUnigram ensures we output the C, // even though it did actually have adjacent CJK characters. if (ngramState) { flushBigram();//写双字 } else { flushUnigram();//写单字,然后后退一个字符(rewind) index--; } //这个个对应上面的判断,实现双字输出 ngramState = !ngramState; } else { flushBigram(); } return true; } else if (doNext()) { // case 2: look at the token type. should we form any n-grams? String type = typeAtt.type(); //判断字符属性,是否是CJK字符 if (type == doHan || type == doHiragana || type == doKatakana || type == doHangul) { // acceptable CJK type: we form n-grams from these. // as long as the offsets are aligned, we just add these to our current buffer. // otherwise, we clear the buffer and start over. if (offsetAtt.startOffset() != lastEndOffset) { // unaligned, clear queue if (hasBufferedUnigram()) { // we have a buffered unigram, and we peeked ahead to see if we could form // a bigram, but we can't, because the offsets are unaligned. capture the state // of this peeked data to be revisited next time thru the loop, and dump our unigram. loneState = captureState(); flushUnigram(); return true; } index = 0; bufferLen = 0; } refill(); } else { // not a CJK type: we just return these as-is. if (hasBufferedUnigram()) { // we have a buffered unigram, and we peeked ahead to see if we could form // a bigram, but we can't, because its not a CJK type. capture the state // of this peeked data to be revisited next time thru the loop, and dump our unigram. loneState = captureState(); flushUnigram(); return true; } return true; } } else { // case 3: we have only zero or 1 codepoints buffered, // so not enough to form a bigram. But, we also have no // more input. So if we have a buffered codepoint, emit // a unigram, otherwise, its end of stream. if (hasBufferedUnigram()) { flushUnigram(); // flush our remaining unigram return true; } return false; } } }