Lucene (1) Reading related books

2012-06-29

1. Introduction to Lucene
1.1. Field Type
Keyword: not analyzed, but indexed and stored in the index files. For example: URL, file directory, date, user name, user number, mobile phone number.

UnIndexed: neither analyzed nor indexed, but stored in the index files. Use this for values you only need to display together with a search hit, for example a URL or a database primary key.

UnStored: analyzed and indexed, but not stored in the index files. For example, the content of an HTML page or a text document.

Text: analyzed and indexed. If the value is a String it is also stored; if it is a Reader it is not stored.
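
The four field types above differ only in three flags: stored, indexed and analyzed. The following stdlib-only sketch summarizes them as a lookup table. The enum FieldKind and its method names are hypothetical and exist only for illustration; they are not Lucene classes. In Lucene 3.x the same combinations are expressed with Field.Store and Field.Index, for example Field.Store.YES with Field.Index.NOT_ANALYZED for the Keyword case.

```java
// Hypothetical summary of the legacy field factory methods; not part of Lucene.
enum FieldKind {
    //        stored, indexed, analyzed
    KEYWORD(true, true, false),
    UNINDEXED(true, false, false),
    UNSTORED(false, true, true),
    TEXT(true, true, true); // models the String case; a Reader value is not stored

    final boolean stored;
    final boolean indexed;
    final boolean analyzed;

    FieldKind(boolean stored, boolean indexed, boolean analyzed) {
        this.stored = stored;
        this.indexed = indexed;
        this.analyzed = analyzed;
    }

    /** True when a query can match on this field. */
    boolean searchable() {
        return indexed;
    }

    /** True when the original value can be read back from a search hit. */
    boolean retrievable() {
        return stored;
    }
}

class FieldKindDemo {
    public static void main(String[] args) {
        // Print one line per field kind with its three flags.
        for (FieldKind kind : FieldKind.values()) {
            System.out.println(kind + " stored=" + kind.stored
                    + " indexed=" + kind.indexed + " analyzed=" + kind.analyzed);
        }
    }
}
```

Read this way, a database primary key that is only displayed fits UNINDEXED, while a large page body that is searched but never displayed fits UNSTORED.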

1.2. Update
There seems to be no in-place update for documents in the index files; a document can only be deleted and then added again.
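
To make the delete-then-add idea concrete, here is a toy in-memory model written with the Java standard library only. ToyIndex and its methods are hypothetical names and are not Lucene APIs. In Lucene 3.x, IndexWriter.updateDocument(Term, Document) packages the same two steps: it deletes the documents containing the term and then adds the new document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy in-memory "index" (hypothetical, not Lucene) showing update = delete + add.
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<Map<String, String>>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Remove every document whose field equals the given value; return how many were removed.
    int delete(String field, String value) {
        int removed = 0;
        Iterator<Map<String, String>> it = docs.iterator();
        while (it.hasNext()) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // An "update" is just a delete of the old version followed by an add of the new one.
    void update(String field, String value, Map<String, String> newDoc) {
        delete(field, value);
        add(newDoc);
    }

    int size() {
        return docs.size();
    }

    Map<String, String> get(int position) {
        return docs.get(position);
    }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> first = new HashMap<String, String>();
        first.put("id", "1");
        first.put("title", "old title");
        index.add(first);

        Map<String, String> second = new HashMap<String, String>();
        second.put("id", "1");
        second.put("title", "new title");
        index.update("id", "1", second);

        System.out.println(index.size() + " document(s), title=" + index.get(0).get("title"));
    }
}
```

The important design point is the key used for the delete step: it should identify exactly one logical document, which is why an unanalyzed id field is a natural choice.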

1.3. Document and Field boost
Boost on a document:
doc.add(Field.Keyword("senderEmail", senderEmail));
doc.add(Field.Text("senderName", senderName));
doc.add(Field.Text("subject", subject));
doc.add(Field.UnStored("body", body));
// endsWithIgnoreCase is not a java.lang.String method; treat it as a helper
// provided by whatever type getSenderDomain() returns.
if (getSenderDomain().endsWithIgnoreCase(COMPANY_DOMAIN)) {
    // setBoost takes a float, so the literal needs the f suffix
    doc.setBoost(1.5f);
} else if (getSenderDomain().endsWithIgnoreCase(BAD_DOMAIN)) {
    doc.setBoost(0.1f);
}
writer.addDocument(doc);

Boost on a field:
Field senderNameField = Field.Text("senderName", senderName);
Field subjectField = Field.Text("subject", subject);
subjectField.setBoost(1.2f);
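
The document-boost rule in this section relies on an endsWithIgnoreCase call that java.lang.String does not provide. A minimal stdlib-only sketch of that rule could look like the following. BoostRules, boostFor and the two domain values are hypothetical; only the boost values 1.5 and 0.1 come from the snippet, and 1.0 is used as the neutral boost for everything else.

```java
// Hypothetical helper that picks a document boost from the sender's domain.
class BoostRules {
    // Placeholder domains, assumed for illustration only.
    static final String COMPANY_DOMAIN = "example.com";
    static final String BAD_DOMAIN = "spam.example.org";

    // Case-insensitive suffix test built on String.regionMatches.
    static boolean endsWithIgnoreCase(String text, String suffix) {
        int offset = text.length() - suffix.length();
        return offset >= 0 && text.regionMatches(true, offset, suffix, 0, suffix.length());
    }

    // Returns the boost to pass to setBoost for a given sender domain.
    static float boostFor(String senderDomain) {
        if (endsWithIgnoreCase(senderDomain, COMPANY_DOMAIN)) {
            return 1.5f; // messages from the company rank higher
        } else if (endsWithIgnoreCase(senderDomain, BAD_DOMAIN)) {
            return 0.1f; // messages from the unwanted domain rank lower
        }
        return 1.0f; // neutral boost
    }
}

class BoostRulesDemo {
    public static void main(String[] args) {
        System.out.println(BoostRules.boostFor("Mail.EXAMPLE.com"));
        System.out.println(BoostRules.boostFor("spam.example.org"));
        System.out.println(BoostRules.boostFor("other.net"));
    }
}
```

Keeping the rule in one small method makes it easy to call right before writer.addDocument(doc) and to adjust the values later without touching the indexing code.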

2. Learn from my old projects easyview and easySearch
2.1 Build easyview
Clone and build the memcached client package, then upload it to my local Nexus repository:
>git clone https://github.com/gwhalin/Memcached-Java-Client.git
>ant

Sometimes a build against my local Nexus fails with a message like this:
:8081/nexus/content/groups/public was cached in the local repository, resolution
will not be reattempted until the update interval of repo has elapsed or
updates are forced -> [Help 1]

To fix it, delete the .cache directory and the paoding directory in the local repository, so that Maven downloads the POM again.

2.2 Build easySearch

3. Study the latest Lucene and integrate it with easyview and easySearch
ANALYZED vs. ANALYZED_NO_NORMS
Norms are created so that documents can be scored quickly at query time. They are usually all loaded into memory, so that a query run against the index can score the search results quickly.

No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is lower memory usage during searching, because norms take up one byte of RAM per indexed field for every document in the index.
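
The "one byte per indexed field per document" rule makes the memory cost easy to estimate. The sketch below only does that arithmetic; NormsEstimate and normsBytes are hypothetical names, and the figures of 10 million documents and 5 fields are made-up inputs, not measurements.

```java
// Back-of-the-envelope estimate of the RAM used by norms during searching.
class NormsEstimate {
    // One byte per document for every indexed field that keeps norms.
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long bytes = normsBytes(10000000L, 5);
        System.out.println(bytes + " bytes"); // prints "50000000 bytes"
    }
}
```

With these inputs the norms cost about 50 MB of RAM, so switching fields that need neither boosts nor length normalization to ANALYZED_NO_NORMS removes their share of that total.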

Most of the classes are the same as in easyview. The changes are in LuceneServiceImpl.java:

package com.sillycat.easyhunter.plugin.lucene;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import com.sillycat.easyhunter.common.StringUtil;

public class LuceneServiceImpl implements LuceneService {

protected final Log log = LogFactory.getLog(getClass());

private Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_36);

// default index file path
private static final String INDEX_PATH = "D:\\lucene\\index";

private String indexPath;

public List<Document> search(String[] keys, String search, boolean isMore) {
IndexSearcher searcher = null;
IndexReader reader = null;
ScoreDoc[] hits = null;
Directory dir = null;
List<Document> documents = null;
Query query = null;
try {
dir = FSDirectory.open(new File(this.getIndexPath()));
reader = IndexReader.open(dir);
searcher = new IndexSearcher(reader);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
Version.LUCENE_36, keys, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.OR);
query = queryParser.parse(search);
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}

TopDocs results = null;
int numTotalHits = 0;

// 5 pages first
try {
results = searcher.search(query, 5 * 10);
hits = results.scoreDocs;
numTotalHits = results.totalHits;
if (isMore && numTotalHits > 0) {
// total pages
hits = searcher.search(query, numTotalHits).scoreDocs;
}
} catch (IOException e) {
e.printStackTrace();
}

// hits stays null when the search above failed, so only loop when there are hits
if (hits != null && hits.length > 0) {
documents = new ArrayList<Document>(hits.length);
for (int i = 0; i < hits.length; i++) {
try {
Document doc = searcher.doc(hits[i].doc);
documents.add(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
try {
searcher.close();
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
return documents;

}

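/**
* Search a single field.
*
* @param key
*            the field to search, e.g. the context field
* @param search
*            the text to search for, e.g. find documents whose context contains "I love you"
* @param isMore
*            true to return all hits, false to return only the first 5 pages (50 hits)
*/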
public List<Document> search(String key, String search, boolean isMore) {
IndexSearcher searcher = null;
IndexReader reader = null;
ScoreDoc[] hits = null;
Directory dir = null;
List<Document> documents = null;
Query query = null;
try {
dir = FSDirectory.open(new File(this.getIndexPath()));
reader = IndexReader.open(dir);
searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_36, key,
analyzer);
query = parser.parse(search);
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}

TopDocs results = null;
int numTotalHits = 0;

// 5 pages first
try {
results = searcher.search(query, 5 * 10);
hits = results.scoreDocs;
numTotalHits = results.totalHits;
if (isMore && numTotalHits > 0) {
// total pages
hits = searcher.search(query, numTotalHits).scoreDocs;
}
} catch (IOException e) {
e.printStackTrace();
}

// hits stays null when the search above failed, so only loop when there are hits
if (hits != null && hits.length > 0) {
documents = new ArrayList<Document>(hits.length);
for (int i = 0; i < hits.length; i++) {
try {
Document doc = searcher.doc(hits[i].doc);
documents.add(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
try {
searcher.close();
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
return documents;

}

/**
* Build the index.
*
* @param list
*            the objects to index
* @param isCreat
*            true to create a new index, replacing any existing one; false to add to or update the existing index
*/
public void buildIndex(List<LuceneObject> list, boolean isCreat) {
Directory dir = null;
IndexWriter writer = null;
try {
dir = FSDirectory.open(new File(this.getIndexPath()));
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
analyzer);
if (isCreat) {
// Create a new index in the directory, removing any
// previously indexed documents:
iwc.setOpenMode(OpenMode.CREATE);
} else {
// Add new documents to an existing index:
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
writer = new IndexWriter(dir, iwc);
} catch (IOException e) {
e.printStackTrace();
}

Iterator<LuceneObject> iterator = list.iterator();
Document doc = null;
LuceneObject bo = null;
try {
while (iterator.hasNext()) {
bo = (LuceneObject) iterator.next();
doc = bo.buildindex();
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
writer.addDocument(doc);
} else {
Term term = new Term("id", doc.get("id"));
writer.updateDocument(term, doc);
}
}
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
writer.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}

public String getIndexPath() {
if (StringUtil.isBlank(indexPath)) {
indexPath = INDEX_PATH;
}
return indexPath;
}

public void setIndexPath(String indexPath) {
this.indexPath = indexPath;
}

}
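
Both search methods first ask for 5 * 10 hits, that is, five result pages of ten hits each. How the caller cuts that list into pages is not shown in this post, so the following is only a sketch under that assumption. Paging and page are hypothetical names, and pages are numbered from 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that slices an already fetched hit list into pages.
class Paging {
    // Returns the hits of the given 1-based page; an empty list when the page is past the end.
    static <T> List<T> page(List<T> hits, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        int from = Math.min((page - 1) * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        // Copy the sub-list so the result does not depend on the original list.
        return new ArrayList<T>(hits.subList(from, to));
    }
}

class PagingDemo {
    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int i = 0; i < 55; i++) {
            hits.add(i);
        }
        // The last page of a 55-hit result holds only the remaining 5 hits.
        System.out.println(Paging.page(hits, 6, 10));
    }
}
```

With a helper like this, the first 50 hits cover pages 1 to 5, and the isMore flag is only needed once the user moves past the fifth page.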

The test case is as follows:
package com.sillycat.easyhunter.model;

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.junit.Assert;

import org.apache.lucene.document.Document;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import com.sillycat.easyhunter.plugin.lucene.LuceneObject;
import com.sillycat.easyhunter.plugin.lucene.LuceneService;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "file:src/test/resources/test-context.xml" })
public class ArticleLuceneServiceTest {

@Autowired
@Qualifier("articleLuceneService")
private LuceneService articleLuceneService;

@Test
public void dumy() {
Assert.assertTrue(true);
}

@Test
public void search() throws Exception {
List<LuceneObject> list = new ArrayList<LuceneObject>();
Article a1 = new Article();
a1.setAuthor("罗华");
a1.setContent("罗华用中文写的一篇文章，发布在网站上。");
a1.setGmtCreate(new Date());
a1.setId("1");
a1.setTitle("中文的技术BLOG");
a1.setWebsiteURL("http://sillycat.iteye.com");
Article a2 = new Article();
a2.setAuthor("康怡怡");
a2.setContent("罗华用中文语言写的一篇文章，发布在网页上。");
a2.setGmtCreate(new Date());
a2.setId("2");
a2.setTitle("英文的BLOG");
a2.setWebsiteURL("http://hi.baidu.com/luohuazju");
list.add(a1);
list.add(a2);
articleLuceneService.buildIndex(list, true);
List<Document> results = articleLuceneService.search("content", "网页",
true);
Assert.assertNotNull(results);
assertEquals(1, results.size());
Document doc = results.get(0);
assertEquals("康怡怡", doc.get("author"));
assertEquals("2", doc.get("id"));
assertEquals("英文的BLOG", doc.get("title"));

results = articleLuceneService.search("content", "中文", true);
Assert.assertNotNull(results);
assertEquals(2, results.size());

results = articleLuceneService.search(new String[]{"title", "content", "author"}, "技术", true);
Assert.assertNotNull(results);
assertEquals(1, results.size());

results = articleLuceneService.search(new String[]{"title", "content", "author"}, "康怡怡", true);
Assert.assertNotNull(results);
assertEquals(1, results.size());
}
}

references:
http://www.iteye.com/topic/1116581
http://sillycat.iteye.com/blog/563586

http://www.slideshare.net/wangscu/jessica-2
http://www.iteye.com/topic/1119600
http://www.iteye.com/topic/1122348
http://lucene.apache.org/
http://lucene.apache.org/core/
http://lucene.apache.org/core/3_6_0/index.html
http://lucene.apache.org/core/3_6_0/gettingstarted.html
