Feature Engineering (6): Data Preprocessing, Data Transformation


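The previous post covered data preprocessing with dummy encoding. This post covers data transformation: mapping existing numeric features to new ones, for example with a polynomial expansion or a custom function.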
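As a minimal sketch of the custom-function case, assuming only the Python standard library, the block below applies a logarithmic transform to one feature column and then undoes it with the matching exponential. The helper `transform_column` and the sample values are hypothetical and are not taken from the original post.

```python
import math


def transform_column(values, func):
    """Apply a custom element-wise function to one feature column."""
    return [func(v) for v in values]


# Hypothetical right-skewed feature values (illustration only).
incomes = [0.0, 9.0, 99.0, 999.0]

# Logarithmic transform: log10(1 + x) compresses large values.
log_incomes = transform_column(incomes, lambda v: math.log10(1.0 + v))

# Exponential transform: the inverse mapping 10**y - 1.
restored = transform_column(log_incomes, lambda y: 10.0 ** y - 1.0)

print(log_incomes)
print(restored)
```

Adding 1 before taking the logarithm keeps the transform defined when a feature value is 0.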

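Common data transformations are based on polynomials, exponential functions, or logarithmic functions. For 2 features, the degree-2 polynomial transformation is: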

$$ (x_1, x_2) \mapsto (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) $$
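The mapping can be sketched in plain Python. The helper `poly_expand` below is a hypothetical name, not a library function. It lists every monomial of the inputs up to a given degree, ordered by total degree, so for two features it reproduces the six terms of the formula.

```python
from itertools import combinations_with_replacement


def poly_expand(x, degree=2):
    """Return all monomials of the entries of x up to the given degree.

    Terms are ordered by total degree, starting with the constant 1.
    """
    terms = []
    for d in range(degree + 1):
        # Each index tuple picks the factors of one monomial, e.g. (0, 1) -> x1 * x2.
        for idx in combinations_with_replacement(range(len(x)), d):
            value = 1.0
            for i in idx:
                value *= x[i]
            terms.append(value)
    return terms


# (1, x1, x2, x1^2, x1*x2, x2^2) for x1 = 0.5, x2 = 2.0
print(poly_expand([0.5, 2.0]))  # → [1.0, 0.5, 2.0, 0.25, 1.0, 4.0]
```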

A look at the sklearn function

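from sklearn.datasets import load_iris
from sklearn.preprocessing import PolynomialFeatures
irisdata = load_iris()  # iris dataset with 4 numeric features
# The default degree=2 expands the 4 features into 15 polynomial terms
data = PolynomialFeatures().fit_transform(irisdata.data)
print(data[0:5])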
[[  1.     5.1    3.5    1.4    0.2   26.01  17.85   7.14   1.02  12.25
    4.9    0.7    1.96   0.28   0.04]
 [  1.     4.9    3.     1.4    0.2   24.01  14.7    6.86   0.98   9.     4.2
    0.6    1.96   0.28   0.04]
 [  1.     4.7    3.2    1.3    0.2   22.09  15.04   6.11   0.94  10.24
    4.16   0.64   1.69   0.26   0.04]
 [  1.     4.6    3.1    1.5    0.2   21.16  14.26   6.9    0.92   9.61
    4.65   0.62   2.25   0.3    0.04]
 [  1.     5.     3.6    1.4    0.2   25.    18.     7.     1.    12.96
    5.04   0.72   1.96   0.28   0.04]]
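Each row of this output has 15 values because the four iris features are expanded into every monomial of total degree at most 2. The count follows the standard formula C(n + d, d) for n features and degree d. The helper `n_poly_terms` below is a hypothetical name used only to illustrate that count.

```python
from math import comb


def n_poly_terms(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= degree in n_features variables."""
    total = comb(n_features + degree, degree)
    # Without the constant term 1 there is one column fewer.
    return total if include_bias else total - 1


print(n_poly_terms(4, 2))                      # → 15
print(n_poly_terms(2, 2, include_bias=False))  # → 5
```

The count grows quickly with both the number of features and the degree, so high-degree expansions of wide datasets produce very wide matrices.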

A look at the Spark function

>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.ml.feature import PolynomialExpansion
>>> df = sqlContext.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2, inputCol="dense", outputCol="expanded")
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])

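Note: Spark and sklearn differ slightly. By default sklearn includes the constant term 1 as the first column (it can be turned off with include_bias=False), while Spark's output has no constant term. The order of the remaining terms also differs: the Spark example returns (x_1, x_1^2, x_2, x_1 x_2, x_2^2).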
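The difference can be checked on the input (0.5, 2.0) from the Spark example. In the sketch below, the sklearn-style list is computed by hand from the degree-2 formula rather than by running sklearn, and the Spark list is copied from the output above. The variable names are illustrative only.

```python
# Degree-2 expansion of (x1, x2) = (0.5, 2.0).
# sklearn-style order, worked out by hand: 1, x1, x2, x1^2, x1*x2, x2^2
sklearn_style = [1.0, 0.5, 2.0, 0.25, 1.0, 4.0]
# Spark output copied from the example: x1, x1^2, x2, x1*x2, x2^2
spark_style = [0.5, 0.25, 2.0, 1.0, 4.0]

# Dropping the leading constant leaves the same five values in a different order.
without_bias = sklearn_style[1:]
same_values = sorted(without_bias) == sorted(spark_style)
same_order = without_bias == spark_style

print(same_values, same_order)  # → True False
```

The two libraries carry the same information apart from the constant column, but column positions cannot be assumed to line up when moving a pipeline from one to the other.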

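Copyright notice: original article from this site, published by admin on 2017-08-18, 1002 characters in total.
Reprint notice: unless otherwise stated, articles on this site are released under the CC-4.0 license. Please credit the source when reprinting.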