Practical Applications of Mahout Recommendation Algorithms (Part 2)

2012-07-30

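Making recommendations from Wikipedia link relationships

Data size: 130,160,392 links from 5,706,070 articles, to 3,773,865. There are no rating values (a link only indicates that two articles are related, so LogLikelihoodSimilarity can be used).

Distributed (MapReduce) recommenders generally run slowly, so they are usually not suitable for online recommendation.

Implementation:

Item-based recommendation in practice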

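Item similarity matrix over items 101, 102, 103, 104, 105, 106, 107 (cell values not preserved), multiplied by U3:

Item    U3 preference    Result (R)
101     2.0              40.0
102     0.0              18.5
103     0.0              24.5
104     4.0              40.0
105     4.5              26.0
106     0.0              16.5
107     5.0              15.5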

The table above shows the recommendation result computed from the item similarity matrix and U3's preferences for some of the items.

Hadoop implementation of the Mahout recommender: org.apache.mahout.cf.taste.hadoop.RecommenderJob

Implementation steps

1. Generate the user vectors

1） Input files are treated as (Long,String) pairs by the framework, where the Long key is a position in the file and String value is the line of the text file. Example: 239 / 98955: 590 22 9059

2） Each line is parsed into user ID and several item IDs by a map function. The function emits new key-value pairs: user ID mapped to item ID, for each item ID. Example: 98955 / 590

3） The framework collects all item IDs that were mapped to each user ID together.

4） A reduce function constructs a Vector from all item IDs for the user, and outputs the user ID mapped to the user's preference vector. All values in this vector are 0 or 1. Example: 98955 / [590:1.0, 22:1.0, 9059:1.0]

The result keeps, for every user, a list of the items related to that user.
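The four steps above can be imitated in a few lines of plain Python. This is only a sketch of the data flow, not Mahout's code: the function names are made up for illustration, and a dictionary stands in for the grouping that the MapReduce framework performs between map and reduce.

```python
from collections import defaultdict

def map_line(line):
    """Parse 'userID: item1 item2 ...' and emit one (userID, itemID) pair per item."""
    user_part, items_part = line.split(":")
    user_id = int(user_part)
    for token in items_part.split():
        yield user_id, int(token)

def reduce_user(user_id, item_ids):
    """Build a sparse 0/1 preference vector, stored as a dict itemID -> 1.0."""
    return user_id, {item_id: 1.0 for item_id in item_ids}

def build_user_vectors(lines):
    grouped = defaultdict(list)  # stands in for the framework's group-by-key step
    for line in lines:
        for user_id, item_id in map_line(line):
            grouped[user_id].append(item_id)
    return dict(reduce_user(user_id, items) for user_id, items in grouped.items())

# The example line from the text.
user_vectors = build_user_vectors(["98955: 590 22 9059"])
print(user_vectors)
```

Items missing from the dictionary play the role of the 0 entries, which is why a sparse representation is enough here.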

2. Compute the similarity (co-occurrence) matrix

1） Input is user IDs mapped to Vectors of user preferences -- the output of the last MapReduce. Example: 98955 / [590:1.0,22:1.0,9059:1.0]

2） The map function determines all co-occurrences from one user’s preferences, and emits one pair of item IDs for each co-occurrence -- item ID mapped to item ID. Both mappings, from one item ID to the other and vice versa, are recorded. Example: 590 / 22

The map step records every pair of related items found inside each user vector.

3） The framework collects, for each item, all co-occurrences mapped from that item.

4） The reducer counts, for each item ID, all co-occurrences that it receives and constructs a new Vector, which represents all co-occurrences for one item with count of number of times they have co-occurred. These can be used as the rows -- or columns -- of the co-occurrence matrix. Example: 590 / [22:3.0,95:1.0,…,9059:1.0,…]

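This produces the co-occurrence matrix (the weights are obtained by counting the item pairs).

3. Multiply the vectors from step 1 by the matrix from step 2 to obtain the recommendations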
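A minimal Python sketch of the co-occurrence counting described above. The names and the two user vectors are hypothetical, and the sketch assumes that an item is not paired with itself, a detail the steps above leave open.

```python
from collections import defaultdict
from itertools import permutations

def map_cooccurrences(user_vector):
    """Emit (itemA, itemB) for every ordered pair of distinct items of one user.

    Ordered pairs cover both directions, A -> B and B -> A.
    """
    for item_a, item_b in permutations(sorted(user_vector), 2):
        yield item_a, item_b

def build_cooccurrence(user_vectors):
    """Count the emitted pairs per item, giving one sparse row (or column) per item."""
    matrix = defaultdict(lambda: defaultdict(float))
    for vector in user_vectors.values():
        for item_a, item_b in map_cooccurrences(vector):
            matrix[item_a][item_b] += 1.0
    return {item: dict(row) for item, row in matrix.items()}

# Hypothetical input: two users sharing items 590 and 22.
user_vectors = {
    1: {590: 1.0, 22: 1.0, 9059: 1.0},
    2: {590: 1.0, 22: 1.0},
}
cooccurrence = build_cooccurrence(user_vectors)
print(cooccurrence[590])
```

Because both directions of every pair are counted, the resulting matrix is symmetric, so each stored vector can be read either as a row or as a column.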

for each row i in the co-occurrence matrix
    compute dot product of row vector i with the user vector
    assign dot product to ith element of R

(This is the usual form of the recommendation computation.)

=>

Because the similarity matrix is symmetric about its diagonal, the algorithm above is equivalent to the one below:

assign R to be the zero vector
for each column i in the co-occurrence matrix
    multiply column vector i by the ith element of the user vector
    add this vector to R
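The two formulations can be compared directly in Python. The 3x3 symmetric matrix and the user vector below are invented for illustration only; they are not the Wikipedia data or the 101 to 107 example.

```python
def recommend_by_rows(matrix, user_vector):
    """R[i] = dot product of row i with the user vector."""
    return [sum(cell * pref for cell, pref in zip(row, user_vector)) for row in matrix]

def recommend_by_columns(matrix, user_vector):
    """R = sum over i of (column i scaled by the ith element of the user vector)."""
    size = len(matrix)
    result = [0.0] * size
    for i, pref in enumerate(user_vector):
        if pref == 0.0:
            continue  # a zero preference contributes nothing, so the column is skipped
        for j in range(size):
            result[j] += matrix[j][i] * pref
    return result

# Hypothetical symmetric co-occurrence matrix and user vector.
matrix = [
    [2.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 1.0, 1.0],
]
user_vector = [4.0, 0.0, 2.0]

by_rows = recommend_by_rows(matrix, user_vector)
by_columns = recommend_by_columns(matrix, user_vector)
print(by_rows, by_columns)
```

The column form only touches the columns of items the user has expressed a preference for, and each column can be processed independently of the others.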

Actual computation:

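Matrix columns: 101, 102, 103, 104, 105, 106, 107 (only one cell value per row survives, and only for some rows; "-" marks a missing cell)

Item    Matrix cell (column unknown)    U3 preference    Result (R)
101     -                               2.0              40.0
102     4.5                             0.0              18.5
103     4.5                             0.0              24.5
104     -                               4.0              40.0
105     -                               4.5              26.0
106     4.5                             0.0              16.5
107     4.5                             5.0              15.5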

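Notes:

1. Items that are not in the user vector (items the user has not rated) do not affect the final output.

(In the table above, U3's preference for item 102 is 0, so column 102 is multiplied by 0 and does not affect the final result.)

Because the number of items is far larger than the number of non-zero entries in the user vector (the items for which a preference has been expressed), this greatly reduces the amount of computation.

2. The column vectors are well suited to distributed storage and are completely independent of one another.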

Mapper 1:

1） Input for mapper 1 is the co-occurrence matrix: item IDs as keys, mapped to columns as Vectors. Example: 590 / [22:3.0,95:1.0,…,9059:1.0,…]

2） The map function simply echoes its input, but with the Vector wrapped in a VectorOrPrefWritable.

Mapper 2:

1） Input for mapper 2 is again the user vectors: user IDs as keys, mapped to preference Vectors. Example: 98955 / [590:1.0,22:1.0,9059:1.0]

2） For each non-zero value in the user vector, the map function outputs item ID mapped to the user ID and preference value, wrapped in a VectorOrPrefWritable. Example: 590 / [98955:1.0]

3） The framework collects together, by item ID, the co-occurrence column and all user ID / preference value pairs.

Steps for computing the final preference value of each item

1） Input to the mapper is all co-occurrence column / user records. Example: 590 / [22:3.0,95:1.0,…,9059:1.0,…] and 590 / [98955:1.0]

2） For each associated user, the mapper outputs the co-occurrence column multiplied by that user's preference value, keyed by user ID. Example: 98955 / [22:3.0,95:1.0,…,9059:1.0,…]

3） The framework collects these partial products together, by user.

4） The reducer unpacks this input and sums all the vectors, which gives the user's final recommendation vector (call it R). Example: 98955 / [22:4.0,45:3.0,95:11.0,…,9059:1.0,…]

Sorting this output gives the recommendations.
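The join, multiply and sum stages above can be sketched in plain Python as follows. The function names and the small data set are hypothetical, dictionaries replace the MapReduce grouping, and dropping items the user already has before ranking is an added assumption, not something the steps above state.

```python
from collections import defaultdict

def join_by_item(columns, user_vectors):
    """Pair each item's co-occurrence column with the (userID, preference) entries for that item."""
    prefs_by_item = defaultdict(list)
    for user_id, vector in user_vectors.items():
        for item_id, pref in vector.items():
            if pref != 0.0:  # only non-zero preferences are emitted
                prefs_by_item[item_id].append((user_id, pref))
    for item_id, column in columns.items():
        yield item_id, column, prefs_by_item.get(item_id, [])

def partial_products(joined):
    """For every user attached to a column, emit the column scaled by that user's preference."""
    for _item_id, column, user_prefs in joined:
        for user_id, pref in user_prefs:
            yield user_id, {other: count * pref for other, count in column.items()}

def sum_by_user(partials):
    """Add up all partial vectors of each user to get the recommendation vector R."""
    totals = defaultdict(lambda: defaultdict(float))
    for user_id, vector in partials:
        for item_id, value in vector.items():
            totals[user_id][item_id] += value
    return {user_id: dict(vector) for user_id, vector in totals.items()}

def top_n(recommendation, known_items, n):
    """Sort by score, skipping items the user already has (an assumption of this sketch)."""
    candidates = [(item, score) for item, score in recommendation.items()
                  if item not in known_items]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical co-occurrence columns and one user vector.
columns = {
    590: {22: 3.0, 95: 1.0},
    22: {590: 3.0, 95: 2.0},
    95: {590: 1.0, 22: 2.0},
}
user_vectors = {98955: {590: 1.0, 22: 1.0}}

R = sum_by_user(partial_products(join_by_item(columns, user_vectors)))
best = top_n(R[98955], known_items=user_vectors[98955], n=1)
print(R, best)
```

Column 95 is never multiplied because user 98955 has no preference for item 95, which is the saving that the notes on zero entries describe.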

Diagram: structure of the Recommender running on Hadoop

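Another way to use Mahout with Hadoop: run the same recommender engine on several machines

(The data is copied to every machine, which limits the data size, and each machine runs the recommender for a subset of the users.)

Advantage: existing recommender implementations do not have to be modified.

Limitation: the data size is still bounded; it must stay within what a single machine can process.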

Usage example:

bin/hadoop jar target/mahout-core-0.4-SNAPSHOT.job \
  org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob \
  -Dmapred.input.dir=input/ua.base \
  -Dmapred.output.dir=output \
  --recommenderClassName \
  org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender
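The idea behind this mode, every machine holding the full data but serving only its share of the users, can be sketched in Python. Everything below is hypothetical: the "machines" are loop iterations, and most_popular_unseen is a stand-in for whatever non-distributed recommender is plugged in, not SlopeOneRecommender.

```python
from collections import Counter

def most_popular_unseen(data, user_id, n=1):
    """Placeholder non-distributed recommender: the most frequent items the user lacks."""
    counts = Counter(item for items in data.values() for item in items)
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked if item not in data[user_id]][:n]

def partition_users(user_ids, num_machines):
    """Split the users round-robin into one subset per machine."""
    return [user_ids[i::num_machines] for i in range(num_machines)]

def run_on_machine(recommender, full_data, user_subset):
    """Each machine sees the complete data but only computes for its own users."""
    return {user_id: recommender(full_data, user_id) for user_id in user_subset}

# Hypothetical data: user ID -> set of item IDs.
data = {1: {101, 102}, 2: {101, 103}, 3: {101, 102, 104}, 4: {103}}

partitions = partition_users(sorted(data), num_machines=2)
merged = {}
for subset in partitions:  # in a real cluster these runs would happen in parallel
    merged.update(run_on_machine(most_popular_unseen, data, subset))
print(partitions, merged)
```

The recommender function is used unchanged, which mirrors the advantage listed above, while the need to pass the full data to every call mirrors the limitation.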
