Spark MLlib Collaborative Filtering API, Annotated


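I recently read through Spark's collaborative filtering API and wrote a first version of a product recommender based on the example code it provides. Below I annotate some of the API functions in that module, in case someone finds it useful, and to strengthen my own understanding along the way. On with the big data journey.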

Annotated reference for the pyspark.mllib.recommendation module

class pyspark.mllib.recommendation.MatrixFactorizationModel(java_model)[source]
A matrix factorisation model trained by regularized alternating least-squares.

>>> from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
>>> # Rating data as (user, product, rating). Rather than plain tuples as in the
>>> # official example, it is better to build the RDD from Rating objects, which reads more clearly.
>>> ratings = sc.parallelize([r1, r2, r3])
>>> # trainImplicit trains on implicit feedback: the values are not explicit ratings,
>>> # in contrast to explicit training with train. The second argument is the rank of
>>> # the U and V factors in A ≈ U * V; seed is the random seed used to initialize them.
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> model.predict(2, 2)
0.4...
>>> testset = sc.parallelize([(1, 2), (1, 1)])
>>> model = ALS.train(ratings, 2, seed=0)
>>> model.predictAll(testset).collect()
[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, rating=1.9...)]
>>> model = ALS.train(ratings, 4, seed=10)
>>> # userFeatures() returns the U matrix of the factorization mentioned above
>>> model.userFeatures().collect()
[(1, array('d', [...])), (2, array('d', [...]))]
>>> # Recommend the top 2 users for product 1
>>> model.recommendUsers(1, 2)
[Rating(user=2, product=1, rating=1.9...), Rating(user=1, product=1, rating=1.0...)]
>>> # Recommend the top 2 products for user 1
>>> model.recommendProducts(1, 2)
[Rating(user=1, product=2, rating=1.9...), Rating(user=1, product=1, rating=1.0...)]
>>> model.rank
4
>>> # One row of the U matrix from the factorization above
>>> first_user = model.userFeatures().take(1)[0]
>>> latents = first_user[1]
>>> len(latents)
4
>>> # The product feature matrix is the V matrix mentioned above
>>> model.productFeatures().collect()
[(1, array('d', [...])), (2, array('d', [...]))]
>>> first_product = model.productFeatures().take(1)[0]
>>> latents = first_product[1]
>>> len(latents)
4
>>> products_for_users = model.recommendProductsForUsers(1).collect()
>>> len(products_for_users)
2
>>> products_for_users[0]
(1, (Rating(user=1, product=2, rating=...),))
>>> users_for_products = model.recommendUsersForProducts(1).collect()
>>> len(users_for_products)
2
>>> users_for_products[0]
(1, (Rating(user=2, product=1, rating=...),))
>>> model = ALS.train(ratings, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
3.73...
>>> df = sqlContext.createDataFrame([Rating(1, 1, 1.0), Rating(1, 2, 2.0), Rating(2, 1, 2.0)])
>>> model = ALS.train(df, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
3.73...
>>> model = ALS.trainImplicit(ratings, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
0.4...
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> # Save the model and load it back
>>> sameModel = MatrixFactorizationModel.load(sc, path)
>>> sameModel.predict(2, 2)
0.4...
>>> sameModel.predictAll(testset).collect()
[Rating(...
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

New in version 0.9.0.

classmethod load(sc, path)[source]
Loads a model from the given path.

New in version 1.3.1.

predict(user, product)[source]
Predicts the rating for the given user and product.

New in version 0.9.0.
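
A minimal sketch of what predict computes, assuming the usual matrix factorization reading in which a predicted rating is the dot product of the user's and the product's latent vectors. The feature values and the helper name predict_rating are made up for illustration; this is plain Python, not the MLlib implementation.

```python
# Hypothetical latent features keyed by ID, shaped like the pairs that
# userFeatures() and productFeatures() return: (id, feature vector).
user_features = {1: [1.0, 0.5], 2: [2.0, 0.0]}
product_features = {1: [1.0, 0.0], 2: [1.5, 1.0]}


def predict_rating(user, product):
    """Dot product of the user's and the product's latent vectors."""
    u = user_features[user]
    v = product_features[product]
    return sum(a * b for a, b in zip(u, v))


print(predict_rating(1, 2))  # 1.0 * 1.5 + 0.5 * 1.0
```

The length of each vector in this sketch plays the role of the rank parameter: a model trained with rank 4 would carry four numbers per user and per product.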

predictAll(user_product)[source]
Returns an RDD of predicted ratings (Rating objects) for the input (user, product) pairs.

New in version 0.9.0.

productFeatures()[source]
This is the V matrix of the factorization. Returns a paired RDD, where the first element is the product ID and the second is an array of features corresponding to that product.

New in version 1.2.0.

rank[source]
Rank of the factor matrices (the number of latent features), as set when the model was trained.

New in version 1.4.0.

recommendProducts(user, num)[source]
Recommends the top “num” products for a given user and returns a list of Rating objects sorted by predicted rating in descending order.

New in version 1.4.0.
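
The top-"num" selection can be sketched in plain Python as "score every product, keep the highest". This is only an illustration under assumptions: the latent features are invented, Rating is a namedtuple stand-in for the pyspark class, and recommend_products is a hypothetical helper, not the MLlib code path.

```python
import heapq
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (illustration only).
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Hypothetical latent features, keyed by user ID and product ID.
user_features = {1: [1.0, 0.5], 2: [2.0, 0.0]}
product_features = {1: [1.0, 0.0], 2: [1.5, 1.0], 3: [0.2, 0.1]}


def recommend_products(user, num):
    """Score every product for `user` and keep the `num` highest scores."""
    u = user_features[user]
    scored = [
        Rating(user, product, sum(a * b for a, b in zip(u, v)))
        for product, v in product_features.items()
    ]
    # nlargest returns its results sorted from highest to lowest score.
    return heapq.nlargest(num, scored, key=lambda r: r.rating)


top = recommend_products(1, 2)
print([r.product for r in top])
```

In the doctest output earlier, user 1 is recommended products 2 and 1, both of which that user already rated, so filtering out already-seen items appears to be left to the caller.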

recommendProductsForUsers(num)[source]
Recommends the top “num” products for each user. The number returned may be less than this.

recommendUsers(product, num)[source]
Recommends the top “num” users for a given product and returns a list of Rating objects sorted by predicted rating in descending order.

New in version 1.4.0.

recommendUsersForProducts(num)[source]
Recommends the top “num” users for each product. The number returned may be less than this.

userFeatures()[source]
This is the U matrix of the factorization. Returns a paired RDD, where the first element is the user ID and the second is an array of features corresponding to that user.

New in version 1.2.0.

class pyspark.mllib.recommendation.ALS[source]
Alternating Least Squares matrix factorization

New in version 0.9.0.

classmethod train(ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False, seed=None)[source]
Train a matrix factorization model given an RDD of ratings given by users to some products, in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

New in version 0.9.0.
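
To make the alternating idea concrete, here is a toy rank-1 version in plain Python: hold the product factors fixed and solve a small ridge regression for each user, then swap roles. It is a sketch under simplifying assumptions (rank 1, a fixed starting point instead of a seeded random one, a single machine, plain lambda_ regularization) and makes no claim to match MLlib's blocked implementation or its exact regularization scaling. The name als_rank1 is hypothetical.

```python
def als_rank1(ratings, iterations=20, lambda_=0.01):
    """Alternate closed-form updates of scalar user and product factors."""
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    user_f = {u: 1.0 for u in users}  # fixed start, not a random one
    product_f = {p: 1.0 for p in products}
    for _ in range(iterations):
        # Fix the product factors, solve the ridge regression for each user.
        for u in users:
            seen = [(p, r) for uu, p, r in ratings if uu == u]
            user_f[u] = sum(r * product_f[p] for p, r in seen) / (
                lambda_ + sum(product_f[p] ** 2 for p, _ in seen))
        # Fix the user factors, solve the ridge regression for each product.
        for p in products:
            seen = [(u, r) for u, pp, r in ratings if pp == p]
            product_f[p] = sum(r * user_f[u] for u, r in seen) / (
                lambda_ + sum(user_f[u] ** 2 for u, _ in seen))
    return user_f, product_f


# Same three ratings as in the doctest: (user, product, rating).
ratings = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 2.0)]
user_f, product_f = als_rank1(ratings)
for u, p, r in ratings:
    print((u, p), r, round(user_f[u] * product_f[p], 2))
```

Each update exactly minimizes the regularized squared error over one side while the other is held fixed, which is why the loop never makes the objective worse.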

classmethod trainImplicit(ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01, nonnegative=False, seed=None)[source]
Train a matrix factorization model given an RDD of ‘implicit preferences’ given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

New in version 0.9.0.
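
As I understand it, the implicit variant follows the implicit-feedback formulation cited in the Spark documentation (Hu, Koren and Volinsky): a raw value is not treated as a rating but is split into a binary preference and a confidence weight that grows with alpha. The sketch below shows only that transformation; the function name is hypothetical and the exact form inside MLlib may differ.

```python
def to_preference_and_confidence(value, alpha=0.01):
    """Split one implicit observation into (preference, confidence).

    Assumed formulation: preference is 1 when any positive interaction
    was observed, and confidence is 1 + alpha * |value|.
    """
    preference = 1.0 if value > 0 else 0.0
    confidence = 1.0 + alpha * abs(value)
    return preference, confidence


# A user who clicked a product 4 times, with a deliberately large alpha.
print(to_preference_and_confidence(4.0, alpha=0.5))
```

Under this reading the model approximates a preference between 0 and 1 rather than a rating, which would be consistent with the trainImplicit doctest predicting about 0.4 for the pair (2, 2) where the explicit models predict about 3.73.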

class pyspark.mllib.recommendation.Rating[source]
Represents a (user, product, rating) tuple.

<span class="gp">>>> </span><span class="n">r</span> <span class="o">=</span> <span class="n">Rating</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">)</span>
<span class="gp">>>> </span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">user</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">product</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">rating</span><span class="p">)</span>
<span class="go">(1, 2, 5.0)</span>
<span class="gp">>>> </span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">r</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">r</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="go">(1, 2, 5.0)</span>

New in version 1.2.0.
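
Following the earlier advice to build the input from Rating objects rather than bare tuples, here is a small sketch of parsing text lines into that shape. The "user,product,rating" line format and the parse_rating helper are assumptions for illustration, and Rating here is a namedtuple stand-in with the same fields as the pyspark class so the snippet runs without Spark.

```python
from collections import namedtuple

# Stand-in with the same fields as pyspark.mllib.recommendation.Rating.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Parse a hypothetical 'user,product,rating' text line."""
    user, product, rating = line.split(",")
    return Rating(int(user), int(product), float(rating))


lines = ["1,1,1.0", "1,2,2.0", "2,1,2.0"]
ratings = [parse_rating(line) for line in lines]
print(ratings[1].product, ratings[1].rating)
```

With a SparkContext available, the same function could presumably be mapped over an RDD of lines, for example sc.textFile(path).map(parse_rating), importing Rating from pyspark.mllib.recommendation instead of defining the stand-in.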

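Copyright notice: original article from this site, published by admin on 2017-05-27 (4,958 characters in the original).
Reposting: unless otherwise noted, articles on this site are published under the CC-4.0 license; please credit the source when reposting.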