Spark MLlib Collaborative Filtering: An Annotated Translation of the API

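I recently went through Spark's collaborative filtering API and wrote a product recommendation program based on the example code it provides. Here I translate and annotate some of the module's API functions, in case someone finds them useful and to deepen my own understanding along the way. Onward down the big-data road.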

class pyspark.mllib.recommendation.MatrixFactorizationModel(java_model)[source]

A matrix factorisation model trained by regularized alternating least-squares.

>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
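Rating data in the form (user, product, rating). I would suggest not writing it the way the official docs do: it is better to build the RDD from Rating objects, which reads more clearly.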
>>> ratings = sc.parallelize([r1, r2, r3])
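trainImplicit trains on implicit feedback: the values are not explicit ratings, in contrast to explicit training with train. The second argument is the rank of the matrices U and V in the factorization A ≈ U * V^T; seed is the random seed used to initialize the factor matrices.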
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> model.predict(2, 2)
0.4...

>>> testset = sc.parallelize([(1, 2), (1, 1)])
>>> model = ALS.train(ratings, 2, seed=0)
>>> model.predictAll(testset).collect()
[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, rating=1.9...)]

>>> model = ALS.train(ratings, 4, seed=10)
This is the U matrix from the factorization mentioned above, one feature vector per user.
>>> model.userFeatures().collect()
[(1, array('d', [...])), (2, array('d', [...]))]

Recommend users for a product (here the top 2 users for product 1)
>>> model.recommendUsers(1, 2)
[Rating(user=2, product=1, rating=1.9...), Rating(user=1, product=1, rating=1.0...)]
Recommend products for a user (here the top 2 products for user 1)
>>> model.recommendProducts(1, 2)
[Rating(user=1, product=2, rating=1.9...), Rating(user=1, product=1, rating=1.0...)]
>>> model.rank
4

One row of the U matrix from the factorization above; its length equals the rank.
>>> first_user = model.userFeatures().take(1)[0]
>>> latents = first_user[1]
>>> len(latents)
4

The product feature matrix is the V matrix mentioned above; it describes the products.
>>> model.productFeatures().collect()
[(1, array('d', [...])), (2, array('d', [...]))]

>>> first_product = model.productFeatures().take(1)[0]
>>> latents = first_product[1]
>>> len(latents)
4

>>> products_for_users = model.recommendProductsForUsers(1).collect()
>>> len(products_for_users)
2
>>> products_for_users[0]
(1, (Rating(user=1, product=2, rating=...),))

>>> users_for_products = model.recommendUsersForProducts(1).collect()
>>> len(users_for_products)
2
>>> users_for_products[0]
(1, (Rating(user=2, product=1, rating=...),))

>>> model = ALS.train(ratings, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
3.73...

>>> df = sqlContext.createDataFrame([Rating(1, 1, 1.0), Rating(1, 2, 2.0), Rating(2, 1, 2.0)])
>>> model = ALS.train(df, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
3.73...

>>> model = ALS.trainImplicit(ratings, 1, nonnegative=True, seed=10)
>>> model.predict(2, 2)
0.4...

>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
Save the model and reload it
>>> sameModel = MatrixFactorizationModel.load(sc, path)
>>> sameModel.predict(2, 2)
0.4...
>>> sameModel.predictAll(testset).collect()
[Rating(...
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

New in version 0.9.0.

classmethod load(sc, path)[source]: Load a model from the given path.

New in version 1.3.1.

predict(user, product)[source]: Predicts rating for the given user and product.

New in version 0.9.0.
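
As far as I understand the model, the predicted rating for a (user, product) pair is the dot product of that user's and that product's latent feature vectors, i.e. the arrays returned by userFeatures() and productFeatures(). The plain-Python sketch below illustrates this without Spark. The dictionaries `user_features` and `product_features`, their values, and the helper `predict_rating` are made up for illustration and are not part of pyspark.

```python
# Made-up latent factors with rank 2; in Spark these would come from
# model.userFeatures() and model.productFeatures().
user_features = {1: [0.5, 1.0], 2: [1.0, 0.0]}
product_features = {1: [2.0, 0.0], 2: [0.0, 2.0]}


def predict_rating(user, product):
    """Dot product of the user's and the product's feature vectors."""
    u = user_features[user]
    v = product_features[product]
    return sum(a * b for a, b in zip(u, v))


print(predict_rating(1, 2))  # 0.5 * 0.0 + 1.0 * 2.0 = 2.0
```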

predictAll(user_product)[source]: Returns predicted ratings for the input (user, product) pairs. The result is an RDD of Rating objects, which is why the example above calls collect() on it.

New in version 0.9.0.

productFeatures()[source]

Returns a paired RDD, where the first element is the product and the second is an array of features corresponding to that product. Together these arrays form the V matrix of the factorization.

New in version 1.2.0.

rank[source]: Rank for the features in this model, i.e. the rank of the factor matrices, which is set as a parameter when the model is trained.

New in version 1.4.0.

recommendProducts(user, num)[source]: Recommends the top “num” number of products for a given user and returns a list of Rating objects sorted by the predicted rating in descending order.

New in version 1.4.0.
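
The "top num, sorted by predicted rating in descending order" behaviour can be sketched in plain Python as follows. This is only an illustration of the selection step, not Spark's implementation: the `Rating` namedtuple is a stand-in for the pyspark class, and `scores_for_user_1` and `recommend_products` are hypothetical names with made-up values.

```python
import heapq
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating so the sketch runs without Spark.
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Made-up predicted scores for user 1, keyed by product ID.
scores_for_user_1 = {10: 3.2, 11: 4.7, 12: 1.5, 13: 4.1}


def recommend_products(user, scores, num):
    """Return up to `num` Rating tuples, highest predicted rating first."""
    top = heapq.nlargest(num, scores.items(), key=lambda item: item[1])
    return [Rating(user, product, score) for product, score in top]


top2 = recommend_products(1, scores_for_user_1, 2)
print(top2)
```

Asking for more items than there are candidates simply returns fewer results, which matches the "number returned may be less than this" remark on the batch variants.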

recommendProductsForUsers(num)[source]: Recommends top “num” products for all users. The number returned may be less than this.

recommendUsers(product, num)[source]: Recommends the top “num” number of users for a given product and returns a list of Rating objects sorted by the predicted rating in descending order.

New in version 1.4.0.

recommendUsersForProducts(num)[source]: Recommends top “num” users for all products. The number returned may be less than this.

userFeatures()[source]: Returns the U matrix of the factorization as a paired RDD, where the first element is the user and the second is an array of features corresponding to that user.

New in version 1.2.0.

class pyspark.mllib.recommendation.ALS[source]

Alternating Least Squares matrix factorization

New in version 0.9.0.

classmethod train(ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False, seed=None)[source]: Train a matrix factorization model given an RDD of ratings given by users to some products, in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

New in version 0.9.0.
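
To make the "alternating" idea concrete, here is a minimal plain-Python sketch of regularized ALS restricted to rank 1, where each least-squares step has a simple closed form. It is a toy under stated assumptions, not Spark's algorithm: Spark handles arbitrary rank, distributes the work across blocks, and may scale the regularization term differently. The names `als_rank1`, `squared_error` and `lam` (playing the role of lambda_) are hypothetical.

```python
def als_rank1(ratings, iterations=10, lam=0.01):
    """Rank-1 ALS on (user, product, rating) triples.

    Returns two dicts: one scalar factor per user and one per product.
    """
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    x = {u: 1.0 for u in users}     # user factors (the U side)
    y = {p: 1.0 for p in products}  # product factors (the V side)
    for _ in range(iterations):
        # Hold the product factors fixed and solve for each user in closed form.
        for user in users:
            seen = [(p, r) for u, p, r in ratings if u == user]
            x[user] = sum(r * y[p] for p, r in seen) / (
                lam + sum(y[p] ** 2 for p, _ in seen))
        # Hold the user factors fixed and solve for each product.
        for product in products:
            seen = [(u, r) for u, p, r in ratings if p == product]
            y[product] = sum(r * x[u] for u, r in seen) / (
                lam + sum(x[u] ** 2 for u, _ in seen))
    return x, y


def squared_error(ratings, x, y):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - x[u] * y[p]) ** 2 for u, p, r in ratings)


ratings = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 2.0)]
x, y = als_rank1(ratings)
print(squared_error(ratings, x, y))
```

Each half-step minimizes the regularized squared error with the other side held fixed, so the objective never increases from one iteration to the next.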

classmethod trainImplicit(ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01, nonnegative=False, seed=None)[source]: Train a matrix factorization model given an RDD of ‘implicit preferences’ given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

New in version 0.9.0.
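
In the implicit-feedback formulation that, to my knowledge, trainImplicit follows, a raw value such as a click count is not treated as a rating. Instead it is split into a binary preference and a confidence weight that grows with the value, controlled by alpha. The sketch below assumes the common form preference = 1 if value > 0 else 0 and confidence = 1 + alpha * value; the helper name and the click data are hypothetical, and Spark's internals may differ in detail.

```python
def to_preference_and_confidence(value, alpha=0.01):
    """Split a raw implicit-feedback value into (preference, confidence).

    Assumed formulation: preference = 1 if value > 0 else 0,
    confidence = 1 + alpha * value. `alpha` defaults to the 0.01 shown
    in the trainImplicit signature.
    """
    preference = 1.0 if value > 0 else 0.0
    confidence = 1.0 + alpha * value
    return preference, confidence


# Hypothetical click counts: (user, product, clicks).
clicks = [(1, 1, 1.0), (1, 2, 20.0), (2, 1, 0.0)]
for user, product, count in clicks:
    print(user, product, to_preference_and_confidence(count))
```

Under this assumption a larger alpha makes frequently observed pairs count more heavily relative to unobserved ones.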

class pyspark.mllib.recommendation.Rating[source]

Represents a (user, product, rating) tuple.

>>> r = Rating(1, 2, 5.0)
>>> (r.user, r.product, r.rating)
(1, 2, 5.0)
>>> (r[0], r[1], r[2])
(1, 2, 5.0)

New in version 1.2.0.
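
Following the earlier suggestion to build the input from Rating objects rather than bare tuples, here is a small sketch of parsing text records into that shape. So that it runs without Spark, `Rating` is a namedtuple stand-in with the same field names as the pyspark class; the comma-separated input format and the helper `parse_rating` are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in with the same field names as pyspark.mllib.recommendation.Rating,
# used here only so the sketch runs without a Spark installation.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Parse a 'user,product,rating' line (a made-up input format)."""
    user, product, rating = line.split(",")
    return Rating(int(user), int(product), float(rating))


lines = ["1,1,1.0", "1,2,2.0", "2,1,2.0"]
ratings = [parse_rating(line) for line in lines]
print(ratings[1].product, ratings[1].rating)  # 2 2.0
```

With Spark, the same function could be mapped over an RDD of lines (for example one produced by sc.textFile on a hypothetical input file), importing the real Rating class instead of the stand-in.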

Posted in: bigdata

2017-05-27

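Reposting: unless otherwise noted, articles on this site are published under the CC-4.0 license; please credit the source when reposting.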
