[Quoted] Research on Oracle Full-Text Search (Complete, 7)
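3.7 Stoplist attribute
A stoplist lets you exclude very common words such as is, a, and this from the index, since indexing them adds little value. By default, the system uses the stoplist that corresponds to the database language. (From the documentation: Stoplists identify the words in your language that are not to be indexed. In English, you can also identify stopthemes that are not to be indexed. By default, the system indexes text using the system-supplied stoplist that corresponds to your database language.) The languages for which Oracle Text supplies a stoplist include English, French, German, Spanish, Chinese, Dutch, and Danish.
The stoplists covered here are basic_stoplist, empty_stoplist, default_stoplist, and multi_stoplist. Of these, BASIC_STOPLIST and MULTI_STOPLIST are the types passed to ctx_ddl.create_stoplist, while CTXSYS.EMPTY_STOPLIST and CTXSYS.DEFAULT_STOPLIST are stoplists supplied by the system.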
3.7.1 Basic_stoplist
Creates a user-defined stoplist. The documentation says very little about stoplists, only a few lines.
Example:
Create table my_stop (id number, docs varchar2(1000));
Insert into my_stop values (1, 'Stoplists identify the words in your language that are not to be indexed.');
Insert into my_stop values (2, 'You can also identify stopthemes that are not to be indexed');
Commit;
-- Create a basic stoplist (it is empty until stopwords are added)
Begin
Ctx_ddl.create_stoplist('test_stoplist', 'basic_stoplist');
End;
/
Create index ind_m_stop on my_stop(docs) indextype is ctxsys.context
parameters ('stoplist test_stoplist');
Select * from my_stop where contains(docs, 'words') > 0;
Begin
Ctx_ddl.add_stopword('test_stoplist', 'language'); -- add a stopword
ctx_ddl.sync_index('ind_m_stop', '2m'); -- synchronize the index
End;
/
-- Still returns the row: adding a stopword and synchronizing does not change the existing index
Select * from my_stop where contains(docs, 'language') > 0;
-- Recreate the index so that it picks up the new stopword
Drop index ind_m_stop;
Create index ind_m_stop on my_stop(docs) indextype is ctxsys.context
parameters ('stoplist test_stoplist');
Select * from my_stop where contains(docs, 'language') > 0; -- the stopword now takes effect
After adding a stopword and synchronizing the index, the word can still be found; the index has to be recreated for the stopword to take effect.
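The behaviour above is consistent with the index keeping its own copy of the stoplist, taken when the index is created, so that later changes to the test_stoplist preference are not seen by an existing index. As a sketch of an alternative to dropping and recreating the index, Oracle Text documents an ADD STOPWORD clause for ALTER INDEX. The statements below assume the ind_m_stop index from the example above and use the word identify purely for illustration; whether the REBUILD keyword is required may differ between releases, so check the reference for your version.

```sql
-- Add a stopword directly to the existing index instead of recreating it
-- (assumption: the REBUILD PARAMETERS form is accepted by your release)
Alter index ind_m_stop rebuild parameters ('add stopword identify');

-- List the stopwords that the index itself currently holds
Select ixv_value
from ctx_user_index_values
where ixv_index_name = 'IND_M_STOP'
and ixv_class = 'STOPLIST';
```

Comparing this output with ctx_stopwords for test_stoplist is one way to see whether the preference and the index have drifted apart.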
3.7.2 Empty_stoplist
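This stoplist contains no stopwords. It suits cases where nothing should be filtered out, for example when words such as is, this, and a must remain searchable.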
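As a minimal sketch, an index that keeps every word can be built by naming the system-supplied empty stoplist in the index parameters. The index name ind_m_nostop is hypothetical, and the my_stop table from 3.7.1 is assumed to exist; the earlier index is dropped first because the same column cannot carry two CONTEXT indexes.

```sql
-- Remove the earlier index on my_stop(docs), if it exists
Drop index ind_m_stop;

-- Index every token, including words such as "is", "this" and "a"
Create index ind_m_nostop on my_stop(docs) indextype is ctxsys.context
parameters ('stoplist ctxsys.empty_stoplist');

-- With no stopwords in the index, a common word should now be searchable
Select * from my_stop where contains(docs, 'the') > 0;
```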
3.7.3 Default_stoplist
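A newly created basic_stoplist contains no stopwords, whereas default_stoplist is a basic stoplist that comes with predefined default stopwords. The default stopword data differs from language to language.
Example: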
-- If my_stop still exists from 3.7.1, drop it first: Drop table my_stop;
Create table my_stop (id number, docs varchar2(1000));
Insert into my_stop values (1, 'Stoplists identify the words in your language that are not to be indexed.');
Insert into my_stop values (2, 'You can also identify stopthemes that are not to be indexed');
Commit;
-- Create a lexer; different lexer settings default to different stoplists
Begin
ctx_ddl.create_preference('test_b_lexer', 'basic_lexer');
End;
/
-- Drop the index if it is left over from an earlier run
drop index ind_m_word;
-- Create the index with the default stoplist, default_stoplist
Create index ind_m_word on my_stop(docs) indextype is ctxsys.context
Parameters ('lexer test_b_lexer stoplist ctxsys.default_stoplist');
-- Check whether these words are in the default stoplist
Select * from my_stop where contains(docs, 'the') > 0;
Select * from my_stop where contains(docs, 'stopthemes') > 0;
-- Add stopwords to the default stoplist (it is owned by ctxsys)
conn ctxsys/ctxsys;
Begin
ctx_ddl.add_stopword('default_stoplist', 'stopthemes'); -- add a stopword
ctx_ddl.add_stopword('default_stoplist', 'words');
ctx_ddl.remove_stopword('default_stoplist', 'words'); -- remove a stopword
End;
/
-- The index must be recreated for the change to take effect
conn oratext/oratext;
drop index ind_m_word;
Create index ind_m_word on my_stop(docs) indextype is ctxsys.context
Parameters ('lexer test_b_lexer stoplist ctxsys.default_stoplist');
Select * from my_stop where contains(docs, 'words') > 0;
Select * from my_stop where contains(docs, 'stopthemes') > 0;
-- Related data dictionary views
Select * from ctx_preferences where pre_name = 'DEFAULT_LEXER';
Select * from ctx_stopwords where spw_stoplist = 'DEFAULT_STOPLIST';
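The dictionary queries can be narrowed to the columns that matter when checking a stoplist. The sketch below assumes the standard Oracle Text views ctx_stoplists and ctx_stopwords with their usual spl_ and spw_ column prefixes; verify the column names against the view descriptions in your release.

```sql
-- Stoplists defined in the system, with their type and stopword count
Select spl_owner, spl_name, spl_type, spl_count
from ctx_stoplists
order by spl_owner, spl_name;

-- Stopwords of one stoplist, with the language each applies to (if any)
Select spw_word, spw_language
from ctx_stopwords
where spw_stoplist = 'DEFAULT_STOPLIST'
order by spw_word;
```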
3.7.4 multi_stoplist
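A multi-language stoplist is intended for tables whose documents are in different languages. (From the documentation: A multi-language stoplist is useful when you use the MULTI_LEXER to index a table that contains documents in different languages, such as English, German, and Japanese.)
When adding a stopword you can specify a language for it, in which case it applies only to that language; by default a stopword applies to all languages.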
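-- Create a multi_stoplist
begin
ctx_ddl.create_stoplist('multistop1', 'MULTI_STOPLIST');
ctx_ddl.add_stopword('multistop1', 'Die', 'german');
ctx_ddl.add_stopword('multistop1', 'Or', 'english');
end;
/
As with the other stoplists, a stopword added after the index exists can still be found even after synchronizing; the index has to be recreated for it to take effect.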
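The stoplist above only becomes useful once an index combines it with a MULTI_LEXER and a language column. The following is a sketch under stated assumptions: the table my_multi, its lang column, and all preference and index names other than multistop1 are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: the lang column tells Oracle Text the language of each row
Create table my_multi (id number, lang varchar2(20), docs varchar2(1000));

Begin
  -- One sub-lexer per language, combined under a MULTI_LEXER
  ctx_ddl.create_preference('my_english_lexer', 'basic_lexer');
  ctx_ddl.create_preference('my_german_lexer', 'basic_lexer');
  ctx_ddl.create_preference('my_multi_lexer', 'multi_lexer');
  -- 'default' handles every language that has no sub-lexer of its own
  ctx_ddl.add_sub_lexer('my_multi_lexer', 'default', 'my_english_lexer');
  ctx_ddl.add_sub_lexer('my_multi_lexer', 'german', 'my_german_lexer');
End;
/

-- The language column selects both the sub-lexer and the matching stopwords
Create index ind_m_multi on my_multi(docs) indextype is ctxsys.context
parameters ('lexer my_multi_lexer stoplist multistop1 language column lang');
```

With this setup, the stopword Die would be expected to apply only to rows whose lang value identifies German, and Or only to English rows.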
3.7.5 Reference scripts
-- Create a stoplist:
Begin
Ctx_ddl.create_stoplist('test_stoplist', 'basic_stoplist');
End;
/
-- Drop a stoplist:
begin
ctx_ddl.drop_stoplist('test_stoplist');
end;
/
-- Add a stopword:
begin
ctx_ddl.add_stopword('default_stoplist', 'stopthemes');
end;
/
-- Remove a stopword:
begin
ctx_ddl.remove_stopword('default_stoplist', 'words');
end;
/
3.8 Theme queries
A theme query returns results based on what a document is about, not merely on how well individual words match. For example, the query ABOUT(US politics) may return documents on 'US presidential elections' and 'US foreign policy'. (From the documentation: An ABOUT query is a query on a document theme. A document theme is a concept that is sufficiently developed in the text. For example, an ABOUT query on US politics might return documents containing information about US presidential elections and US foreign policy. Documents need not contain the exact phrase US politics to be returned.)
10g supports theme queries in only two languages: English and French.
Example:
-- Enable theme indexing on a CONTEXT index
-- (assumes a table my_about with a text column DOCS already exists)
BEGIN
CTX_DDL.CREATE_PREFERENCE('TEST_ABOUT', 'BASIC_LEXER');
CTX_DDL.SET_ATTRIBUTE('TEST_ABOUT', 'INDEX_THEMES', 'YES');
CTX_DDL.SET_ATTRIBUTE('TEST_ABOUT', 'INDEX_TEXT', 'YES');
END;
/
-- The preference is referenced unqualified so that it resolves in the schema that created it
CREATE INDEX IND_m_about ON my_about(DOCS) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER TEST_ABOUT');
-- Query
SELECT * FROM my_about WHERE CONTAINS(DOCS, 'ABOUT(US politics)') > 0;
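Because a theme query ranks documents by how strongly they relate to the concept, it is often more useful to return the score and sort by it. A minimal sketch, assuming the my_about table and index from the example above; the label 1 simply ties SCORE to the matching CONTAINS call.

```sql
-- Return theme matches ordered by relevance, most relevant first
SELECT SCORE(1) AS relevance, t.*
FROM my_about t
WHERE CONTAINS(DOCS, 'ABOUT(US politics)', 1) > 0
ORDER BY SCORE(1) DESC;
```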