Using Apache Lucene

2013-01-28


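Looking at Lucene's API, it is clear that the index is the central piece.

The index:

An index is a directory containing binary files, much like a database, whose data also lives as files on the file system. We do not manipulate these binary files directly; instead we go through the API that Lucene provides, just as we would use SQL statements to work with a database.

Operations on the index fall into two groups: management and querying. IndexWriter manages the index, and IndexSearcher queries it. Lucene's data structures are Document and Field. A Document represents one record, and a Field represents one attribute of that record. A Document holds multiple Fields, and Field values are Strings, because Lucene only deals with text.

All we need to do is convert the objects in our program into Documents and hand them to Lucene. The list of data in a search result also consists of Documents.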

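Setting up the Lucene environment: I am using version 3.0.x.

The following jars need to be added:

- lucene-core-3.0.1.jar (core library)
- contrib\analyzers\common\lucene-analyzers-3.0.1.jar (analyzers)
- contrib\highlighter\lucene-highlighter-3.0.1.jar (highlighting)
- contrib\memory\lucene-memory-3.0.1.jar (highlighting)
- IKAnalyzer3.2.3.jar (Chinese word segmentation; this jar is not part of Lucene and must be downloaded separately)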

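4. Getting started: Hello World

First, I need to abstract the basic functionality we need from Lucene. From Lucene's API and flow diagram we can see that everything operates on the index: IndexWriter performs the create, update and delete operations that write to the index, and IndexSearcher searches its contents.

We need an interface that describes these basic methods: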

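    /**
     * Basic index operations for Lucene.
     *
     * @author share
     * @param <E> the entity type stored in the index
     */
    public interface IndexService<E> {

        /** Adds an entity to the index. */
        void save(E entity);

        /** Deletes the entity with the given id from the index. */
        void delete(Long id);

        /**
         * Updates an entity. For large data sets you may skip a real update
         * and simply delete the record and save it again.
         */
        void update(E entity);

        /** Searches by query string and returns one page of results. */
        Page<E> search(Page<E> page, String queryString);
    }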

Here are the basics of the Page object:

    import java.util.ArrayList;
    import java.util.List;

    public class Page<T> {
        //-- public constants --//
        public static final String ASC = "asc";
        public static final String DESC = "desc";

        //-- paging parameters --//
        protected int pageNo = 1;
        protected int pageSize = 20;
        protected boolean autoCount = true;

        //-- results --//
        protected List<T> result = new ArrayList<T>();
        protected long totalCount = 0;
        protected int totalPages = 0;
        protected int searchFlag = 0;

        // getters and setters omitted
    }
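The paging fields above imply a little arithmetic that the search implementation later in this article relies on: the offset of the first hit on a page, and the point where the loop over hits must stop. The following is a minimal sketch of that arithmetic. PageMath and its method names are hypothetical helpers for illustration and are not part of the article's Page class; the page-count formula is an assumption about how totalPages would normally be derived from totalCount.

```java
// Sketch only: PageMath is a hypothetical helper, not part of the article's Page class.
final class PageMath {
    private PageMath() {}

    /** Zero-based index of the first hit on the given 1-based page. */
    static int firstResult(int pageNo, int pageSize) {
        return (pageNo - 1) * pageSize;
    }

    /** Number of pages needed to show totalCount hits (ceiling division). */
    static int totalPages(long totalCount, int pageSize) {
        return (int) ((totalCount + pageSize - 1) / pageSize);
    }

    /** Exclusive end index of the page's slice, clamped to the number of hits actually loaded. */
    static int endExclusive(int pageNo, int pageSize, int loadedHits) {
        return Math.min(firstResult(pageNo, pageSize) + pageSize, loadedHits);
    }
}
```

For page 3 with 20 hits per page, firstResult is 40; if only 45 hits were loaded, the loop over that page stops at index 45 rather than 60.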

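Next, design a LuceneUtils class:

    import java.io.File;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriter.MaxFieldLength;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.struts2.ServletActionContext;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    /**
     * When saving or updating through IndexWriter, changes are not persisted
     * to the index until they are committed; closing the IndexWriter also commits.
     * The IndexWriter normally only needs to be closed when the program exits,
     * so call its commit method explicitly after each change.
     * Pay special attention to this.
     */
    public class LuceneUtils {

        private static IndexWriter indexWriter;

        // Index directory: the "Index" folder under the web application root
        public static String path =
                ServletActionContext.getServletContext().getRealPath("/") + "Index";

        static {
            try {
                Directory directory = FSDirectory.open(new File(path));
                // Use IKAnalyzer, a dictionary-based analyzer
                Analyzer analyzer = new IKAnalyzer();
                indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.LIMITED);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        /** Returns the shared IndexWriter. */
        public static IndexWriter getIndexWriter() {
            return indexWriter;
        }

        /** Closes the shared IndexWriter. */
        public static void closeIndexWriter() {
            try {
                indexWriter.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }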

Next we write the class that implements the methods of the generic IndexService interface. (Instead of Lucene's own analyzers, it uses IKAnalyzer, a Chinese analyzer.)

    import java.io.File;
    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.highlight.Formatter;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.Scorer;
    import org.apache.lucene.search.highlight.SimpleFragmenter;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.struts2.ServletActionContext;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;
    import org.wltea.analyzer.lucene.IKAnalyzer;
    import org.wltea.analyzer.lucene.IKQueryParser;
    import org.wltea.analyzer.lucene.IKSimilarity;

    @Service("luceneService")
    @Transactional
    public class LuceneServiceImpl implements IndexService<NewsContent> {

        /**
         * Searches the title and content fields and fills the page with matching news records.
         *
         * @param page        paging parameters; also receives the results
         * @param queryString the text to search for
         */
        @Transactional(readOnly = true)
        public Page<NewsContent> search(Page<NewsContent> page, String queryString) {
            if (queryString == null || queryString.equals("")) {
                return page;
            }
            String path = ServletActionContext.getServletContext().getRealPath("/") + "Index";
            List<NewsContent> newsList = new ArrayList<NewsContent>();
            Directory dirPath = null;
            IndexReader reader = null;
            IndexSearcher searcher = null;
            int firstResult = (page.getPageNo() - 1) * page.getPageSize();
            int maxResult = page.getPageSize();
            try {
                dirPath = FSDirectory.open(new File(path));
                reader = IndexReader.open(dirPath);
                searcher = new IndexSearcher(reader);
                // Use the IKSimilarity similarity evaluator when searching the index
                searcher.setSimilarity(new IKSimilarity());

                // Fields to search
                String[] fields = { "title", "content" };
                IKQueryParser parser = new IKQueryParser();
                Query query = parser.parseMultiField(fields, queryString);

                // Sort by id in descending order
                SortField sf = new SortField("id", SortField.LONG, true);
                Sort sort = new Sort(sf);

                // Uses search(Query query, Filter filter, int n, Sort sort)
                TopDocs tds = searcher.search(query, null, firstResult + maxResult, sort);
                int totalCount = tds.totalHits;
                ScoreDoc[] sd = tds.scoreDocs;
                // Make sure the loop never runs past the end of scoreDocs
                int length = Math.min(firstResult + maxResult, sd.length);

                // 1. Create and configure the highlighter
                // Highlight markup; the default is <B> and </B>
                Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
                Scorer scorer = new QueryScorer(query); // the query to score against
                Highlighter highlighter = new Highlighter(formatter, scorer);
                // Fragment size; the default is 100 characters
                highlighter.setTextFragmenter(new SimpleFragmenter(20));

                // Convert the matching documents into a list of news records
                for (int i = firstResult; i < length; i++) {
                    NewsContent newsContent = new NewsContent();
                    Document docId = reader.document(sd[i].doc);
                    newsContent.setId(Long.valueOf(docId.get("id")));

                    // Returns the highlighted fragment (where the keywords occur most often),
                    // or null if the field does not contain any of the search keywords
                    String text = highlighter.getBestFragment(getAnalyzer(), "title", docId.get("title"));
                    if (text != null) {
                        // Replace the original content with the highlighted text
                        docId.getField("title").setValue(text);
                    }
                    newsContent.setTitle(docId.get("title"));

                    String dateStr = docId.get("addTime");
                    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
                    Date date = sdf.parse(dateStr);
                    newsContent.setAddTime(date);
                    newsContent.setClickNum(Long.valueOf(docId.get("clickNum")));

                    NewsType newsType = new NewsType();
                    newsType.setId(Long.valueOf(docId.get("newsTypeId")));
                    newsContent.setNewsType(newsType);
                    newsList.add(newsContent);
                }
                page.setResult(newsList);
                // Record the total number of hits
                page.setTotalCount(totalCount);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Close in reverse order of opening
                try {
                    if (searcher != null) searcher.close();
                    if (reader != null) reader.close();
                    if (dirPath != null) dirPath.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Returned after the try block: a return inside finally would hide exceptions
            return page;
        }

        /**
         * Adds one news record to the index.
         *
         * @param news the record to index
         */
        @Transactional(readOnly = true)
        public void save(NewsContent news) {
            IndexWriter writer = LuceneUtils.getIndexWriter();
            try {
                Document doc = createDocument(news);
                writer.addDocument(doc);
            } catch (Exception e) {
                System.out.println("Failed to create the index!");
                e.printStackTrace();
            } finally {
                try {
                    writer.optimize();
                    writer.commit();
                    // LuceneUtils.closeIndexWriter();
                } catch (IOException e) { // also covers CorruptIndexException
                    e.printStackTrace();
                }
            }
        }

        /**
         * Deletes the document whose id field matches the given id.
         *
         * @param newsId id of the news record to remove
         */
        @Transactional(readOnly = true)
        public void delete(Long newsId) {
            IndexWriter writer = LuceneUtils.getIndexWriter();
            try {
                Term term = new Term("id", newsId.toString());
                writer.deleteDocuments(term);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    writer.commit();
                    // LuceneUtils.closeIndexWriter();
                } catch (IOException e) { // also covers CorruptIndexException
                    e.printStackTrace();
                }
            }
        }

        /**
         * Replaces the document whose id field matches the given record.
         *
         * @param news the updated record
         */
        @Transactional(readOnly = true)
        public void update(NewsContent news) {
            IndexWriter writer = LuceneUtils.getIndexWriter();
            try {
                Document doc = createDocument(news);
                Term term = new Term("id", news.getId().toString());
                writer.updateDocument(term, doc);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    writer.optimize();
                    writer.commit();
                    // LuceneUtils.closeIndexWriter();
                } catch (IOException e) { // also covers CorruptIndexException
                    e.printStackTrace();
                }
            }
        }

        /**
         * Builds a Lucene Document from a news record.
         *
         * @param news the record to convert
         * @return the Document to index
         */
        private Document createDocument(NewsContent news) {
            Document doc = new Document();
            // The id is indexed as a single untokenized term so that the exact-match
            // Term used by delete/update and the sort on "id" behave predictably
            doc.add(new Field("id", news.getId().toString(), Field.Store.YES, Index.NOT_ANALYZED));
            doc.add(new Field("title", news.getTitle(), Field.Store.YES, Index.ANALYZED));
            doc.add(new Field("content", news.getContent(), Field.Store.YES, Index.ANALYZED));
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
            doc.add(new Field("addTime", sdf.format(news.getAddTime()), Field.Store.YES, Index.ANALYZED));
            doc.add(new Field("newsTypeId", news.getNewsType().getId().toString(), Field.Store.YES, Index.ANALYZED));
            doc.add(new Field("clickNum", news.getClickNum().toString(), Field.Store.YES, Index.ANALYZED));
            return doc;
        }

        /**
         * Returns the analyzer used for highlighting.
         */
        public static Analyzer getAnalyzer() {
            return new IKAnalyzer();
        }
    }

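With these basic operations written, all that remains is to call them when needed.

5. Another core part of Lucene: the analyzer (word segmentation)

Chinese analyzers

Chinese word segmentation is tricky: a single character is not necessarily a word, and a character sequence that is a word in one context may not be one in another. In "帽子和服装" ("hats and clothing"), for example, "和服" ("kimono") is not a word. There are generally three approaches to Chinese segmentation: single-character, bigram, and dictionary-based.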

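- Single-character segmentation: split the text one character at a time. For example, "我们是中国人" becomes "我", "们", "是", "中", "国", "人". (StandardAnalyzer works this way.)
- Bigram segmentation: split the text into overlapping two-character pairs. For example, "我们是中国人" becomes "我们", "们是", "是中", "中国", "国人". (CJKAnalyzer works this way.)
- Dictionary-based segmentation: build candidate words with some algorithm and match them against a prebuilt dictionary; whatever matches is split off as a word. This is generally regarded as the most desirable approach for Chinese. For example, "我们是中国人" becomes "我们", "中国人". (Analyzers of this kind include MMAnalyzer from 极易分词, 庖丁分词 (Paoding), and IKAnalyzer.)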
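The first two approaches in the list above are simple enough to sketch in a few lines of plain Java. This is only an illustration of the splitting rules, not how StandardAnalyzer or CJKAnalyzer are actually implemented; the class and method names are hypothetical, and the code assumes the input contains only Chinese characters from the Basic Multilingual Plane (no punctuation, no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: SimpleTokenizers is a hypothetical class, not a Lucene API.
final class SimpleTokenizers {
    private SimpleTokenizers() {}

    /** Single-character segmentation: every character becomes its own token. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Bigram segmentation: every pair of adjacent characters becomes a token. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

Calling SimpleTokenizers.unigrams("我们是中国人") yields the six single characters, and SimpleTokenizers.bigrams("我们是中国人") yields the five overlapping pairs listed above.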

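Other Chinese analyzers include:

1. 极易分词: MMAnalyzer. The last version is 1.5.3, updated 2007-12-05; it does not support Lucene 3.0.

2. 庖丁分词 (Paoding): PaodingAnalyzer. The last version is 2.0.4, updated 2008-06-03; it does not support Lucene 3.0.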

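The Chinese analyzer used here is IKAnalyzer; home page: http://www.oschina.net/p/ikanalyzer.

It implements two dictionary-based methods: forward and reverse full segmentation, and forward and reverse maximum-matching segmentation. IKAnalyzer is a third-party analyzer that extends Lucene's Analyzer class and targets Chinese text. See its documentation for details on usage.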
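To give a feel for what maximum matching means, here is a minimal sketch of the forward variant only: at each position, take the longest dictionary word that starts there, falling back to a single character. This is a simplified illustration and not IKAnalyzer's actual implementation; the class name ForwardMaxMatch and the tiny dictionary in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: a simplified forward maximum-matching segmenter, not IKAnalyzer's algorithm.
final class ForwardMaxMatch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Splits text greedily, preferring the longest dictionary word at each position. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary containing "我们", "中国", "中国人", "帽子", "和服" and "服装", segmenting "我们是中国人" gives [我们, 是, 中国人] (a real analyzer would typically drop "是" as a stop word). Segmenting "帽子和服装" gives [帽子, 和服, 装], which reproduces the "和服" ambiguity described earlier and suggests why a single greedy pass is not enough on its own.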

Note: the extension dictionary and stop-word files must be UTF-8 encoded, with a blank line added at the top of each file.
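One way to satisfy both requirements is to generate the file from code instead of relying on an editor's encoding settings. The sketch below writes one word per line in UTF-8 with an empty first line. The class name and the one-word-per-line layout are assumptions for illustration; check the IKAnalyzer documentation for the exact file names and configuration it expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch only: DictionaryFileWriter is a hypothetical helper, not part of IKAnalyzer.
final class DictionaryFileWriter {
    private DictionaryFileWriter() {}

    /** Writes the words to target as UTF-8, one per line, preceded by a blank line. */
    static void write(Path target, List<String> words) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("");        // the blank first line that the note asks for
        lines.addAll(words);  // assumed layout: one word per line
        Files.write(target, lines, StandardCharsets.UTF_8);
    }
}
```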
