Converting Solr (and similar) data into Mahout vectors
Reference: http://mylazycoding.blogspot.com/2012/03/cluster-apache-solr-data-using-apache_13.html
Lately, I have been working on integrating Apache Mahout algorithms with Apache Solr, and have been able to hook Solr up to Mahout's classification and clustering algorithms. I will post a series of blogs on this integration. This post guides you through clustering your Solr data using Mahout's K-Means clustering algorithm. First, enable term vectors on the field you want to cluster in the Solr schema:
<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>
In order for Mahout to create vectors from a Lucene index, the first and foremost requirement is that the index contain term vectors. A term vector is a document-centric view of the terms and their frequencies (as opposed to the inverted index, which is a term-centric view) and is not on by default.
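To make that document-centric view concrete, here is a minimal sketch (not from the original post) that reads a term vector back out of an index with the Lucene 3.x API generation the snippets below assume; the index path and the "text" field name are hypothetical:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

public class ShowTermVector {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    // The term vector of document 0 in the "text" field: the terms of that one document
    // and how often each occurs, rather than the term-centric inverted-index view.
    TermFreqVector tfv = reader.getTermFreqVector(0, "text");
    if (tfv != null) { // null if term vectors were not enabled on the field
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + " -> " + freqs[i]);
      }
    }
    reader.close();
  }
}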
For this example, I’m going to use Solr’s example, located in <Solr Home>/example
In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:
<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>
For pure Lucene, you will need to set the TermVector option on during Field creation, as in:
Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
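For context, a minimal, hypothetical sketch of indexing one document with term vectors enabled might look like the following; the index path, analyzer and Version constant are assumptions layered on top of the Lucene 3.x Field constructor shown above:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexWithTermVectors {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), config);

    Document doc = new Document();
    // Enable term vectors on the "text" field so Mahout can build vectors from it later.
    doc.add(new Field("text", "foo bar baz", Field.Store.NO,
                      Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}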
From here, it’s as simple as pointing Mahout’s new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:
<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2
A few things to note about this command:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Creating Vectors from Text
For clustering documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering algorithms. These approaches are described below.
NOTE: Your Lucene index must be created with the same version of Lucene used in Mahout. Check Mahout's POM file to get the version number, otherwise you will likely get "Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11" as an error.
Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index.
For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using Solr as it can ingest things like PDFs, XML, Office, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene website or check out Lucene In Action by Erik Hatcher, Otis Gospodnetic and Mike McCandless.
To get started, make sure you get a fresh copy of Mahout from SVN and are comfortable building it. It defines interfaces and implementations for efficiently iterating over a data source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces a Mahout Vector file and term dictionary which can then be used for clustering. The main code for driving this is the Driver program located in the org.apache.mahout.utils.vectors package. The Driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:
$MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
  --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
  <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
  --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50
This uses the index specified by --dir and the body field in it, and writes the vectors to the output location and the dictionary to dict.txt. It only outputs 50 vectors. If you don't specify --max, then all the documents in the index are output.
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
  --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
Mahout also has utilities to generate vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and the value to be the Text content in UTF-8 format.
You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.
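If you want to produce such a SequenceFile from your own code instead of using the seqdirectory utility described next, a minimal sketch (not from the wiki; paths and document ids are hypothetical) using the plain Hadoop SequenceFile API could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteDocsAsSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/docs-seqfile/chunk-0");

    // Key: unique document id as Text; value: the UTF-8 document content as Text.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      writer.append(new Text("/doc1.txt"), new Text("contents of the first document"));
      writer.append(new Text("/doc2.txt"), new Text("contents of the second document"));
    } finally {
      writer.close();
    }
  }
}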
Mahout has a nifty utility which reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The document id generated is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt
From the examples directory run
$MAHOUT_HOME/bin/mahout seqdirectory \
  --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
  <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
  <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
  <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
For Mahout 0.3: from the sequence file generated in the above step, run the following to generate vectors.
$MAHOUT_HOME/bin/mahout seq2sparse \
  -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
  <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
  <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
  <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
  <--minSupport <MINIMUM SUPPORT> 2> \
  <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
  <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF, VALUE BETWEEN 0-100> 99> \
  <--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \
  <-seq <Create SequentialAccessVectors> {false|true, required for running some algorithms (LDA, Lanczos)}>
--minSupport is the minimum frequency a word must have to be considered a feature. --minDF is the minimum number of documents a word must appear in.
--maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents) for the word to be considered a good feature; this helps remove high-frequency features like stop words.
If you are in the happy position of already owning a document-processing pipeline (for texts, images, or whatever items you wish to treat), the question arises of how to convert its output into the Mahout vector format. Probably the easiest way to go is to implement your own Iterable<Vector> (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
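Below is a hedged sketch of how that might look end to end, assuming Mahout's SequenceFileVectorWriter from the org.apache.mahout.utils.vectors.io package (which wraps a Hadoop SequenceFile.Writer keyed by LongWritable with VectorWritable values); the output path and the toy vectors standing in for VectorIterable are made up:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter;
import org.apache.mahout.utils.vectors.io.VectorWriter;

public class WritePipelineVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/pipeline/part-out.vec"); // hypothetical output path

    // Stand-in for your own Iterable<Vector> implementation (the VectorIterable above).
    List<Vector> vectors = Arrays.<Vector>asList(
        new DenseVector(new double[] {1.0, 0.0, 2.0}),
        new DenseVector(new double[] {0.5, 3.0, 0.0}));

    SequenceFile.Writer seqWriter =
        SequenceFile.createWriter(fs, conf, out, LongWritable.class, VectorWritable.class);
    VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
    long numDocs = vectorWriter.write(vectors, Long.MAX_VALUE);
    vectorWriter.close();
    System.out.println("Wrote " + numDocs + " vectors");
  }
}

The resulting file can then be fed to the clustering jobs in the same way as the output of lucene.vector or seq2sparse above.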