Writing Your Own Lucene Analyzer, Principles Part: A Brief Explanation of CJKAnalyzer

2013-10-15


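Here CJK refers to the CJK Unified Ideographs. Their purpose is to assign a single code point in ISO 10646 and Unicode to ideographs from Chinese, Japanese, Korean, and Vietnamese that share the same origin and meaning and have identical or only slightly different shapes (mostly Han characters, but also Han-style characters such as Japanese kokuji, Korea-specific characters, and Vietnamese chữ Nôm). CJK is an abbreviation of Chinese, Japanese, and Korean; as the name suggests, it covers the writing of these three languages.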

In practice, CJKAnalyzer handles Chinese, Japanese, and Korean text, that is, tokens typed as Han, Hiragana, Katakana, or Hangul.
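To make the four scripts concrete, here is a minimal sketch that classifies code points with the JDK's own `Character.UnicodeScript`. It is not Lucene code: the class name `CjkScriptCheck` and the helper `isCjk` are made up for illustration, and Lucene itself decides this from the token type set by the tokenizer rather than from the script of each character.

```java
import java.lang.Character.UnicodeScript;

public class CjkScriptCheck {
    // Hypothetical helper: true when the code point belongs to one of the
    // four scripts discussed above (Han, Hiragana, Katakana, Hangul).
    public static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL;
    }

    public static void main(String[] args) {
        // U+4E2D Han, U+3042 Hiragana, U+30AB Katakana, U+D55C Hangul,
        // followed by a Latin letter and a digit for contrast.
        int[] samples = {0x4E2D, 0x3042, 0x30AB, 0xD55C, 'a', '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X script=%s cjk=%b%n",
                cp, UnicodeScript.of(cp), isCjk(cp));
        }
    }
}
```

Because of Han unification, U+4E2D is the same code point whether the surrounding text is Chinese, Japanese, or Korean, which is why a single analyzer can treat all three alike.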

The classes that make up CJKAnalyzer:


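CJKAnalyzer is the main class.

CJKWidthFilter normalizes character width: it folds fullwidth ASCII variants into basic Latin and halfwidth Katakana variants into the equivalent standard kana.

CJKBigramFilter forms bigrams from adjacent CJK characters: every run of CJK characters is split into overlapping pairs, so ABC -> AB, BC.

CJKTokenizer is the older tokenizer, kept for compatibility with earlier versions.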
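The width folding attributed to CJKWidthFilter above can be approximated with the JDK alone. The sketch below uses NFKC normalization from `java.text.Normalizer` as a stand-in; this is an assumption made for illustration, not Lucene's implementation, and NFKC changes more characters than a width-only filter would. The class name `WidthFoldSketch` and the method `foldWidth` are hypothetical.

```java
import java.text.Normalizer;

public class WidthFoldSketch {
    // Stand-in for width folding: NFKC maps halfwidth Katakana to standard
    // Katakana and fullwidth ASCII to basic Latin, among other changes.
    public static String foldWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF76 is halfwidth Katakana KA, U+FF21 is fullwidth Latin A.
        String folded = foldWidth("\uFF76\uFF21");
        // Print the resulting code points in hex to avoid console encoding issues.
        folded.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Folding before forming bigrams matters because a halfwidth kana followed by a halfwidth voiced sound mark is two code points that combine into one standard kana.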

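1. The main part of CJKAnalyzer. It first checks the version number: older versions use CJKTokenizer directly. Newer versions first run StandardTokenizer (a tokenizer, not StandardAnalyzer) to split the text into typed tokens, which separates CJK characters from English letters and digits, then CJKWidthFilter to normalize the width of CJK characters, then LowerCaseFilter to convert English to lowercase, and finally CJKBigramFilter to cut the CJK characters into overlapping pairs.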
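The order of the steps in the newer code path can be sketched without Lucene. Everything in the block below is a simplified assumption: `tokenize` is a crude stand-in for StandardTokenizer (each CJK code point becomes one token, runs of other letters and digits become one token), NFKC stands in for width folding, and the names `CjkPipelineSketch`, `Token`, `analyze`, and `flush` are hypothetical. It only illustrates the sequence tokenize, fold width, lowercase, bigram.

```java
import java.lang.Character.UnicodeScript;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CjkPipelineSketch {
    // Hypothetical token: text plus the [start, end) offsets it covers.
    static final class Token {
        final String text; final int start; final int end; final boolean cjk;
        Token(String text, int start, int end, boolean cjk) {
            this.text = text; this.start = start; this.end = end; this.cjk = cjk;
        }
    }

    static boolean isCjk(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
            || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    // Step 1: one token per CJK code point, one token per run of other
    // letters or digits; whitespace and punctuation only separate tokens.
    static List<Token> tokenize(String s) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            if (isCjk(cp)) {
                out.add(new Token(s.substring(i, i + len), i, i + len, true));
                i += len;
            } else if (Character.isLetterOrDigit(cp)) {
                int j = i;
                while (j < s.length()) {
                    int c = s.codePointAt(j);
                    if (isCjk(c) || !Character.isLetterOrDigit(c)) break;
                    j += Character.charCount(c);
                }
                out.add(new Token(s.substring(i, j), i, j, false));
                i = j;
            } else {
                i += len;
            }
        }
        return out;
    }

    // Steps 2 to 4: fold width, lowercase, then bigram adjacent CJK tokens.
    public static List<String> analyze(String input) {
        List<String> out = new ArrayList<>();
        List<Token> run = new ArrayList<>(); // current run of adjacent CJK tokens
        for (Token t : tokenize(input)) {
            String text = Normalizer.normalize(t.text, Normalizer.Form.NFKC)
                                    .toLowerCase(Locale.ROOT);
            boolean adjacent = !run.isEmpty() && run.get(run.size() - 1).end == t.start;
            if (!t.cjk || !adjacent) flush(run, out); // run is broken: emit it
            if (t.cjk) run.add(new Token(text, t.start, t.end, true));
            else out.add(text);                       // non-CJK passes through as-is
        }
        flush(run, out);
        return out;
    }

    // A run of one emits the lone character; longer runs emit overlapping pairs.
    static void flush(List<Token> run, List<String> out) {
        if (run.size() == 1) out.add(run.get(0).text);
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(run.get(i).text + run.get(i + 1).text);
        }
        run.clear();
    }

    public static void main(String[] args) {
        // "Lucene", four Han characters (U+4E2D U+6587 U+5206 U+8BCD), "TEST".
        System.out.println(analyze("Lucene \u4E2D\u6587\u5206\u8BCD TEST"));
    }
}
```

For that input the sketch yields the lowercased English words unchanged and three overlapping pairs for the four Han characters.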

The bigram logic lives in the incrementToken() method of CJKBigramFilter:

public boolean incrementToken() throws IOException {
  while (true) {
    // check whether at least two CJK characters are already buffered
    if (hasBufferedBigram()) {

      // case 1: we have multiple remaining codepoints buffered,
      // so we can emit a bigram here.

      // if unigram output was also requested
      if (outputUnigrams) {

        // when also outputting unigrams, we output the unigram first,
        // then rewind back to revisit the bigram.
        // so an input of ABC is A + (rewind)AB + B + (rewind)BC + C
        // the logic in hasBufferedUnigram ensures we output the C,
        // even though it did actually have adjacent CJK characters.

        if (ngramState) {
          flushBigram();   // emit the bigram
        } else {
          flushUnigram();  // emit the unigram, then step back one character (rewind)
          index--;
        }
        // toggle the flag checked above, so unigram and bigram output alternate
        ngramState = !ngramState;
      } else {
        flushBigram();
      }
      return true;
    } else if (doNext()) {

      // case 2: look at the token type. should we form any n-grams?

      String type = typeAtt.type();
      // check the token type: is it a CJK type?
      if (type == doHan || type == doHiragana || type == doKatakana || type == doHangul) {

        // acceptable CJK type: we form n-grams from these.
        // as long as the offsets are aligned, we just add these to our current buffer.
        // otherwise, we clear the buffer and start over.

        if (offsetAtt.startOffset() != lastEndOffset) { // unaligned, clear queue
          if (hasBufferedUnigram()) {

            // we have a buffered unigram, and we peeked ahead to see if we could form
            // a bigram, but we can't, because the offsets are unaligned. capture the state
            // of this peeked data to be revisited next time thru the loop, and dump our unigram.

            loneState = captureState();
            flushUnigram();
            return true;
          }
          index = 0;
          bufferLen = 0;
        }
        refill();
      } else {

        // not a CJK type: we just return these as-is.

        if (hasBufferedUnigram()) {

          // we have a buffered unigram, and we peeked ahead to see if we could form
          // a bigram, but we can't, because its not a CJK type. capture the state
          // of this peeked data to be revisited next time thru the loop, and dump our unigram.

          loneState = captureState();
          flushUnigram();
          return true;
        }
        return true;
      }
    } else {

      // case 3: we have only zero or 1 codepoints buffered,
      // so not enough to form a bigram. But, we also have no
      // more input. So if we have a buffered codepoint, emit
      // a unigram, otherwise, its end of stream.

      if (hasBufferedUnigram()) {
        flushUnigram(); // flush our remaining unigram
        return true;
      }
      return false;
    }
  }
}
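The emit-then-rewind order in case 1 is easier to follow on a single, fully buffered run. The following sketch replays only that part of the state machine in plain Java; it is a simplification under stated assumptions, not the Lucene source: there is no token stream, no offset check, and no captured state, and the names `BigramStateSketch` and `emit` are hypothetical. As in the comment above, the letters A, B, C stand in for CJK characters.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramStateSketch {
    // Replays the unigram/bigram alternation for one run of CJK code points.
    public static List<String> emit(String run, boolean outputUnigrams) {
        int[] buffer = run.codePoints().toArray();
        int index = 0;
        boolean ngramState = false; // false: emit a unigram next, true: a bigram
        List<String> out = new ArrayList<>();
        while (true) {
            if (buffer.length - index > 1) {
                // At least two code points remain, so a bigram is available.
                if (outputUnigrams && !ngramState) {
                    // Emit the unigram and stay put (advance, then rewind).
                    out.add(new String(buffer, index, 1));
                } else {
                    out.add(new String(buffer, index, 2));
                    index++;
                }
                if (outputUnigrams) ngramState = !ngramState;
            } else if (outputUnigrams
                    ? buffer.length - index == 1
                    : buffer.length == 1 && index == 0) {
                // Trailing unigram, or a lone character that has no neighbor.
                out.add(new String(buffer, index, 1));
                index++;
            } else {
                return out; // nothing left to emit
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(emit("ABC", true));  // prints [A, AB, B, BC, C]
        System.out.println(emit("ABC", false)); // prints [AB, BC]
        System.out.println(emit("A", false));   // prints [A]
    }
}
```

The last call shows why a single isolated character is still emitted when unigram output is off: with no neighbor there is no bigram that could carry it.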

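#2 nini_2012, yesterday 09:50

It only splits text into pairs of characters. Does that even count as a tokenizer?

Re: liugang51096557, yesterday 10:54: Reply to nini_2012: Well... it is a simple tokenizer.

#1 u010123828, yesterday 09:22: Explained so clearly that I understood it right away O(∩_∩)O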
