Mahout source code analysis: DistributedLanczosSolver (1), hands-on
Mahout version: 0.7; Hadoop version: 1.0.4; JDK: 1.7.0_25, 64-bit.
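This post starts a series on SVD, i.e. dimensionality reduction. In Mahout you can list the algorithm's command-line parameters by running MAHOUT_HOME/bin/mahout svd -h, or look them up on the corresponding page of the official site. The SVD parameters used in this hands-on run are as follows: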
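The parsed arguments are recorded in the "Command line arguments" entry of the terminal log further down, so the call can be sketched from there. The class below is only an illustration: the class and method names are hypothetical, the values are copied from that log entry, and leaving out --startPhase, --endPhase, --maxError, --minEigenvalue and --inMemory rests on the assumption that the logged values for those are defaults rather than something passed explicitly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that rebuilds the svd argument list from the logged values. */
public class SvdArgsSketch {

    /** Returns flag/value pairs as they would appear on the command line. */
    static String[] buildArgs() {
        List<String> args = new ArrayList<String>();
        add(args, "--input", "hdfs://ubuntu:9000/svd/input/wind");
        add(args, "--output", "hdfs://ubuntu:9000/svd/output");
        add(args, "--numRows", "178");      // rows of the input matrix
        add(args, "--numCols", "14");       // columns of the input matrix
        add(args, "--rank", "3");           // number of singular vectors requested
        add(args, "--symmetric", "square"); // value exactly as it appears in the log
        add(args, "--cleansvd", "true");    // the log shows cleanup jobs after the solver
        add(args, "--tempDir", "hdfs://ubuntu:9000/svd/temp");
        return args.toArray(new String[0]);
    }

    private static void add(List<String> args, String flag, String value) {
        args.add(flag);
        args.add(value);
    }

    public static void main(String[] ignored) {
        // Print the arguments on one line, space separated (Java 7 compatible).
        StringBuilder line = new StringBuilder();
        for (String arg : buildArgs()) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(arg);
        }
        System.out.println(line);
    }
}
```

An array like this would then be handed to the solver's entry point, or the same flag/value pairs typed after the svd subcommand on the command line.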
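The following code converts the file above into a sequence file whose values are of type VectorWritable:
One of the outputs is cleanEigenvectors (the final output):
The terminal output is as follows: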
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/D:/workspase/mahout/lib/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/workspase/mahout/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/10/28 00:14:13 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
13/10/28 00:14:13 INFO common.AbstractJob: Command line arguments: {--cleansvd=[true], --endPhase=[2147483647], --inMemory=[false], --input=[hdfs://ubuntu:9000/svd/input/wind], --maxError=[0.05], --minEigenvalue=[0.0], --numCols=[14], --numRows=[178], --output=[hdfs://ubuntu:9000/svd/output], --rank=[3], --startPhase=[0], --symmetric=[square], --tempDir=[hdfs://ubuntu:9000/svd/temp]}
13/10/28 00:15:43 INFO lanczos.LanczosSolver: Finding 3 singular vectors of matrix with 178 rows, via Lanczos
13/10/28 00:15:45 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/28 00:15:49 INFO mapred.JobClient: Running job: job_201310220012_0023
13/10/28 00:15:50 INFO mapred.JobClient: map 0% reduce 0%
13/10/28 00:18:40 INFO mapred.JobClient: map 100% reduce 0%
13/10/28 00:19:11 INFO mapred.JobClient: map 100% reduce 100%
13/10/28 00:19:16 INFO mapred.JobClient: Job complete: job_201310220012_0023
13/10/28 00:19:16 INFO mapred.JobClient: Counters: 30
13/10/28 00:19:16 INFO mapred.JobClient: Job Counters
13/10/28 00:19:16 INFO mapred.JobClient: Launched reduce tasks=1
13/10/28 00:19:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=241868
13/10/28 00:19:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/28 00:19:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/28 00:19:16 INFO mapred.JobClient: Launched map tasks=2
13/10/28 00:19:16 INFO mapred.JobClient: Data-local map tasks=2
13/10/28 00:19:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=28198
13/10/28 00:19:16 INFO mapred.JobClient: File Input Format Counters
13/10/28 00:19:16 INFO mapred.JobClient: Bytes Read=23216
13/10/28 00:19:16 INFO mapred.JobClient: File Output Format Counters
13/10/28 00:19:16 INFO mapred.JobClient: Bytes Written=220
13/10/28 00:19:16 INFO mapred.JobClient: FileSystemCounters
13/10/28 00:19:16 INFO mapred.JobClient: FILE_BYTES_READ=238
13/10/28 00:19:16 INFO mapred.JobClient: HDFS_BYTES_READ=23967
13/10/28 00:19:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70111
13/10/28 00:19:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=220
13/10/28 00:19:16 INFO mapred.JobClient: Map-Reduce Framework
13/10/28 00:19:16 INFO mapred.JobClient: Map output materialized bytes=244
13/10/28 00:19:16 INFO mapred.JobClient: Map input records=178
13/10/28 00:19:16 INFO mapred.JobClient: Reduce shuffle bytes=244
13/10/28 00:19:16 INFO mapred.JobClient: Spilled Records=4
13/10/28 00:19:16 INFO mapred.JobClient: Map output bytes=228
13/10/28 00:19:16 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
13/10/28 00:19:16 INFO mapred.JobClient: CPU time spent (ms)=53940
13/10/28 00:19:16 INFO mapred.JobClient: Map input bytes=21094
13/10/28 00:19:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=172
13/10/28 00:19:16 INFO mapred.JobClient: Combine input records=2
13/10/28 00:19:16 INFO mapred.JobClient: Reduce input records=2
13/10/28 00:19:16 INFO mapred.JobClient: Reduce input groups=1
13/10/28 00:19:16 INFO mapred.JobClient: Combine output records=2
13/10/28 00:19:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=414433280
13/10/28 00:19:16 INFO mapred.JobClient: Reduce output records=1
13/10/28 00:19:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2934288384
13/10/28 00:19:16 INFO mapred.JobClient: Map output records=2
13/10/28 00:19:16 INFO lanczos.LanczosSolver: 1 passes through the corpus so far...
13/10/28 00:19:17 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/28 00:19:18 INFO mapred.JobClient: Running job: job_201310220012_0024
13/10/28 00:19:19 INFO mapred.JobClient: map 0% reduce 0%
13/10/28 00:20:06 INFO mapred.JobClient: map 100% reduce 0%
13/10/28 00:20:25 INFO mapred.JobClient: map 100% reduce 100%
13/10/28 00:20:30 INFO mapred.JobClient: Job complete: job_201310220012_0024
13/10/28 00:20:30 INFO mapred.JobClient: Counters: 30
13/10/28 00:20:30 INFO mapred.JobClient: Job Counters
13/10/28 00:20:30 INFO mapred.JobClient: Launched reduce tasks=1
13/10/28 00:20:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=79120
13/10/28 00:20:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/28 00:20:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/28 00:20:30 INFO mapred.JobClient: Launched map tasks=2
13/10/28 00:20:30 INFO mapred.JobClient: Data-local map tasks=2
13/10/28 00:20:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15191
13/10/28 00:20:30 INFO mapred.JobClient: File Input Format Counters
13/10/28 00:20:30 INFO mapred.JobClient: Bytes Read=23216
13/10/28 00:20:30 INFO mapred.JobClient: File Output Format Counters
13/10/28 00:20:30 INFO mapred.JobClient: Bytes Written=220
13/10/28 00:20:30 INFO mapred.JobClient: FileSystemCounters
13/10/28 00:20:30 INFO mapred.JobClient: FILE_BYTES_READ=238
13/10/28 00:20:30 INFO mapred.JobClient: HDFS_BYTES_READ=23967
13/10/28 00:20:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70105
13/10/28 00:20:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=220
13/10/28 00:20:30 INFO mapred.JobClient: Map-Reduce Framework
13/10/28 00:20:30 INFO mapred.JobClient: Map output materialized bytes=244
13/10/28 00:20:30 INFO mapred.JobClient: Map input records=178
13/10/28 00:20:30 INFO mapred.JobClient: Reduce shuffle bytes=244
13/10/28 00:20:30 INFO mapred.JobClient: Spilled Records=4
13/10/28 00:20:30 INFO mapred.JobClient: Map output bytes=228
13/10/28 00:20:30 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
13/10/28 00:20:30 INFO mapred.JobClient: CPU time spent (ms)=18830
13/10/28 00:20:30 INFO mapred.JobClient: Map input bytes=21094
13/10/28 00:20:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=172
13/10/28 00:20:30 INFO mapred.JobClient: Combine input records=2
13/10/28 00:20:30 INFO mapred.JobClient: Reduce input records=2
13/10/28 00:20:30 INFO mapred.JobClient: Reduce input groups=1
13/10/28 00:20:30 INFO mapred.JobClient: Combine output records=2
13/10/28 00:20:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=406294528
13/10/28 00:20:30 INFO mapred.JobClient: Reduce output records=1
13/10/28 00:20:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2934288384
13/10/28 00:20:30 INFO mapred.JobClient: Map output records=2
13/10/28 00:20:30 INFO lanczos.LanczosSolver: 2 passes through the corpus so far...
13/10/28 00:20:30 INFO lanczos.LanczosSolver: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
13/10/28 00:20:30 INFO lanczos.LanczosSolver: Eigenvector 0 found with eigenvalue 0.0
13/10/28 00:20:30 INFO lanczos.LanczosSolver: Eigenvector 1 found with eigenvalue 190.66366814935847
13/10/28 00:20:30 INFO lanczos.LanczosSolver: Eigenvector 2 found with eigenvalue 10886.664434372891
13/10/28 00:20:30 INFO lanczos.LanczosSolver: LanczosSolver finished.
13/10/28 00:20:30 INFO decomposer.DistributedLanczosSolver: Persisting 3 eigenVectors and eigenValues to: hdfs://ubuntu:9000/svd/output/rawEigenvectors
13/10/28 00:20:31 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/28 00:20:31 INFO mapred.JobClient: Running job: job_201310220012_0025
13/10/28 00:20:32 INFO mapred.JobClient: map 0% reduce 0%
13/10/28 00:21:06 INFO mapred.JobClient: map 50% reduce 0%
13/10/28 00:21:12 INFO mapred.JobClient: map 100% reduce 0%
13/10/28 00:21:30 INFO mapred.JobClient: map 100% reduce 100%
13/10/28 00:21:35 INFO mapred.JobClient: Job complete: job_201310220012_0025
13/10/28 00:21:35 INFO mapred.JobClient: Counters: 30
13/10/28 00:21:35 INFO mapred.JobClient: Job Counters
13/10/28 00:21:35 INFO mapred.JobClient: Launched reduce tasks=1
13/10/28 00:21:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=61438
13/10/28 00:21:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/28 00:21:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/28 00:21:35 INFO mapred.JobClient: Launched map tasks=2
13/10/28 00:21:35 INFO mapred.JobClient: Data-local map tasks=2
13/10/28 00:21:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=17342
13/10/28 00:21:35 INFO mapred.JobClient: File Input Format Counters
13/10/28 00:21:35 INFO mapred.JobClient: Bytes Read=23216
13/10/28 00:21:35 INFO mapred.JobClient: File Output Format Counters
13/10/28 00:21:35 INFO mapred.JobClient: Bytes Written=220
13/10/28 00:21:35 INFO mapred.JobClient: FileSystemCounters
13/10/28 00:21:35 INFO mapred.JobClient: FILE_BYTES_READ=238
13/10/28 00:21:35 INFO mapred.JobClient: HDFS_BYTES_READ=24061
13/10/28 00:21:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70108
13/10/28 00:21:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=220
13/10/28 00:21:35 INFO mapred.JobClient: Map-Reduce Framework
13/10/28 00:21:35 INFO mapred.JobClient: Map output materialized bytes=244
13/10/28 00:21:35 INFO mapred.JobClient: Map input records=178
13/10/28 00:21:35 INFO mapred.JobClient: Reduce shuffle bytes=244
13/10/28 00:21:35 INFO mapred.JobClient: Spilled Records=4
13/10/28 00:21:35 INFO mapred.JobClient: Map output bytes=228
13/10/28 00:21:35 INFO mapred.JobClient: Total committed heap usage (bytes)=291512320
13/10/28 00:21:35 INFO mapred.JobClient: CPU time spent (ms)=13100
13/10/28 00:21:35 INFO mapred.JobClient: Map input bytes=21094
13/10/28 00:21:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=172
13/10/28 00:21:35 INFO mapred.JobClient: Combine input records=2
13/10/28 00:21:35 INFO mapred.JobClient: Reduce input records=2
13/10/28 00:21:35 INFO mapred.JobClient: Reduce input groups=1
13/10/28 00:21:35 INFO mapred.JobClient: Combine output records=2
13/10/28 00:21:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=417210368
13/10/28 00:21:35 INFO mapred.JobClient: Reduce output records=1
13/10/28 00:21:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2934288384
13/10/28 00:21:35 INFO mapred.JobClient: Map output records=2
13/10/28 00:21:36 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/28 00:21:36 INFO mapred.JobClient: Running job: job_201310220012_0026
13/10/28 00:21:37 INFO mapred.JobClient: map 0% reduce 0%
13/10/28 00:22:11 INFO mapred.JobClient: map 100% reduce 0%
13/10/28 00:22:29 INFO mapred.JobClient: map 100% reduce 100%
13/10/28 00:22:34 INFO mapred.JobClient: Job complete: job_201310220012_0026
13/10/28 00:22:34 INFO mapred.JobClient: Counters: 30
13/10/28 00:22:34 INFO mapred.JobClient: Job Counters
13/10/28 00:22:34 INFO mapred.JobClient: Launched reduce tasks=1
13/10/28 00:22:34 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=46990
13/10/28 00:22:34 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/28 00:22:34 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/28 00:22:34 INFO mapred.JobClient: Launched map tasks=2
13/10/28 00:22:34 INFO mapred.JobClient: Data-local map tasks=2
13/10/28 00:22:34 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15142
13/10/28 00:22:34 INFO mapred.JobClient: File Input Format Counters
13/10/28 00:22:34 INFO mapred.JobClient: Bytes Read=23216
13/10/28 00:22:34 INFO mapred.JobClient: File Output Format Counters
13/10/28 00:22:34 INFO mapred.JobClient: Bytes Written=220
13/10/28 00:22:34 INFO mapred.JobClient: FileSystemCounters
13/10/28 00:22:34 INFO mapred.JobClient: FILE_BYTES_READ=238
13/10/28 00:22:34 INFO mapred.JobClient: HDFS_BYTES_READ=24061
13/10/28 00:22:34 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70105
13/10/28 00:22:34 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=220
13/10/28 00:22:34 INFO mapred.JobClient: Map-Reduce Framework
13/10/28 00:22:34 INFO mapred.JobClient: Map output materialized bytes=244
13/10/28 00:22:34 INFO mapred.JobClient: Map input records=178
13/10/28 00:22:34 INFO mapred.JobClient: Reduce shuffle bytes=244
13/10/28 00:22:34 INFO mapred.JobClient: Spilled Records=4
13/10/28 00:22:34 INFO mapred.JobClient: Map output bytes=228
13/10/28 00:22:34 INFO mapred.JobClient: Total committed heap usage (bytes)=336338944
13/10/28 00:22:34 INFO mapred.JobClient: CPU time spent (ms)=4670
13/10/28 00:22:34 INFO mapred.JobClient: Map input bytes=21094
13/10/28 00:22:34 INFO mapred.JobClient: SPLIT_RAW_BYTES=172
13/10/28 00:22:34 INFO mapred.JobClient: Combine input records=2
13/10/28 00:22:34 INFO mapred.JobClient: Reduce input records=2
13/10/28 00:22:34 INFO mapred.JobClient: Reduce input groups=1
13/10/28 00:22:34 INFO mapred.JobClient: Combine output records=2
13/10/28 00:22:34 INFO mapred.JobClient: Physical memory (bytes) snapshot=426409984
13/10/28 00:22:34 INFO mapred.JobClient: Reduce output records=1
13/10/28 00:22:34 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2934288384
13/10/28 00:22:34 INFO mapred.JobClient: Map output records=2
13/10/28 00:22:34 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/28 00:22:35 INFO mapred.JobClient: Running job: job_201310220012_0027
13/10/28 00:22:36 INFO mapred.JobClient: map 0% reduce 0%
13/10/28 00:22:54 INFO mapred.JobClient: map 100% reduce 0%
13/10/28 00:23:03 INFO mapred.JobClient: map 100% reduce 33%
13/10/28 00:23:09 INFO mapred.JobClient: map 100% reduce 100%
13/10/28 00:23:14 INFO mapred.JobClient: Job complete: job_201310220012_0027
13/10/28 00:23:14 INFO mapred.JobClient: Counters: 30
13/10/28 00:23:14 INFO mapred.JobClient: Job Counters
13/10/28 00:23:14 INFO mapred.JobClient: Launched reduce tasks=1
13/10/28 00:23:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=25276
13/10/28 00:23:14 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/28 00:23:14 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/28 00:23:14 INFO mapred.JobClient: Launched map tasks=2
13/10/28 00:23:14 INFO mapred.JobClient: Data-local map tasks=2
13/10/28 00:23:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14956
13/10/28 00:23:14 INFO mapred.JobClient: File Input Format Counters
13/10/28 00:23:14 INFO mapred.JobClient: Bytes Read=23216
13/10/28 00:23:14 INFO mapred.JobClient: File Output Format Counters
13/10/28 00:23:14 INFO mapred.JobClient: Bytes Written=220
13/10/28 00:23:14 INFO mapred.JobClient: FileSystemCounters
13/10/28 00:23:14 INFO mapred.JobClient: FILE_BYTES_READ=238
13/10/28 00:23:14 INFO mapred.JobClient: HDFS_BYTES_READ=24031
13/10/28 00:23:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70105
13/10/28 00:23:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=220
13/10/28 00:23:14 INFO mapred.JobClient: Map-Reduce Framework
13/10/28 00:23:14 INFO mapred.JobClient: Map output materialized bytes=244
13/10/28 00:23:14 INFO mapred.JobClient: Map input records=178
13/10/28 00:23:14 INFO mapred.JobClient: Reduce shuffle bytes=244
13/10/28 00:23:14 INFO mapred.JobClient: Spilled Records=4
13/10/28 00:23:14 INFO mapred.JobClient: Map output bytes=228
13/10/28 00:23:14 INFO mapred.JobClient: Total committed heap usage (bytes)=336338944
13/10/28 00:23:14 INFO mapred.JobClient: CPU time spent (ms)=2680
13/10/28 00:23:14 INFO mapred.JobClient: Map input bytes=21094
13/10/28 00:23:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=172
13/10/28 00:23:14 INFO mapred.JobClient: Combine input records=2
13/10/28 00:23:14 INFO mapred.JobClient: Reduce input records=2
13/10/28 00:23:14 INFO mapred.JobClient: Reduce input groups=1
13/10/28 00:23:14 INFO mapred.JobClient: Combine output records=2
13/10/28 00:23:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=423587840
13/10/28 00:23:14 INFO mapred.JobClient: Reduce output records=1
13/10/28 00:23:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2934288384
13/10/28 00:23:14 INFO mapred.JobClient: Map output records=2
I will go through the details tomorrow.
Share, grow, enjoy.
When reposting, please credit the blog: http://blog.csdn.net/fansy1990