pandas中遍历DataFrame行

2,285次阅读
没有评论

pandas中遍历DataFrame行

有如下 Pandas DataFrame:

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
inp = [{<span class="hljs-string">'c1'</span>:<span class="hljs-number">10</span>, <span class="hljs-string">'c2'</span>:<span class="hljs-number">100</span>}, {<span class="hljs-string">'c1'</span>:<span class="hljs-number">11</span>,<span class="hljs-string">'c2'</span>:<span class="hljs-number">110</span>}, {<span class="hljs-string">'c1'</span>:<span class="hljs-number">12</span>,<span class="hljs-string">'c2'</span>:<span class="hljs-number">120</span>}]
df = pd.DataFrame(inp)
<span class="hljs-keyword">print</span> df

上面代码输出:

   <span class="hljs-attribute">c1</span>   c2
<span class="hljs-number">0</span>  <span class="hljs-number">10</span>  <span class="hljs-number">100</span>
<span class="hljs-number">1</span>  <span class="hljs-number">11</span>  <span class="hljs-number">110</span>
<span class="hljs-number">2</span>  <span class="hljs-number">12</span>  <span class="hljs-number">120</span>

现在需要遍历上面DataFrame的行。对于每一行,都希望能够通过列名访问对应的元素(单元格中的值)。也就是说,需要类似如下的功能:

<span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.rows:
   <span class="hljs-keyword">print</span> row[<span class="hljs-string">'c1'</span>], row[<span class="hljs-string">'c2'</span>]

Pandas 可以这样做吗?

我找到了similar question。但这并不能给我需要的答案,里面提到:

<span class="hljs-keyword">for</span> date, row <span class="hljs-keyword">in</span> df.T.iteritems():

要么

<span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.iterrows():

但是我不明白row对象是什么,以及我如何使用它。

最佳解决方案

要以 Pandas 的方式迭代遍历DataFrame的行,可以使用:

  • DataFrame.iterrows()
    <span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df.iterrows():
        <span class="hljs-keyword">print</span> row[<span class="hljs-string">"c1"</span>], row[<span class="hljs-string">"c2"</span>]
    
  • DataFrame.itertuples()
    <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.itertuples(index=<span class="hljs-keyword">True</span>, name=<span class="hljs-string">'Pandas'</span>):
        <span class="hljs-keyword">print</span> getattr(row, <span class="hljs-string">"c1"</span>), getattr(row, <span class="hljs-string">"c2"</span>)
    

itertuples()应该比iterrows()

但请注意,根据文档(目前 Pandas 0.19.1):

  • iterrows:数据的dtype可能不是按行匹配的,因为iterrows返回一个系列的每一行,它不会保留行的dtypes(dtypes跨DataFrames列保留)*
  • iterrows:不要修改行

    你不应该修改你正在迭代的东西。这不能保证在所有情况下都能正常工作。根据数据类型的不同,迭代器返回一个副本而不是一个视图,写入它将不起作用。

    改用DataFrame.apply()

    <span class="hljs-attr">new_df</span> = df.apply(lambda x: x * <span class="hljs-number">2</span>)
    
  • itertuples:列名称将被重命名为位置名称,如果它们是无效的Python标识符,重复或以下划线开头。对于大量的列(> 255),返回常规元组。

第二种方案: apply

您也可以使用df.apply()遍历行并访问函数的多个列。

docs: DataFrame.apply()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">valuation_formula</span><span class="hljs-params">(x, y)</span>:</span>
    <span class="hljs-keyword">return</span> x * y * <span class="hljs-number">0.5</span>

df[<span class="hljs-string">'price'</span>] = df.apply(<span class="hljs-keyword">lambda</span> row: valuation_formula(row[<span class="hljs-string">'x'</span>], row[<span class="hljs-string">'y'</span>]), axis=<span class="hljs-number">1</span>)

第三种方案:iloc

您可以使用df.iloc函数,如下所示:

<span class="hljs-selector-tag">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-selector-tag">in</span> <span class="hljs-selector-tag">range</span>(0, <span class="hljs-selector-tag">len</span>(<span class="hljs-selector-tag">df</span>)):
    <span class="hljs-selector-tag">print</span> <span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.iloc</span><span class="hljs-selector-attr">[i]</span><span class="hljs-selector-attr">['c1']</span>, <span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.iloc</span><span class="hljs-selector-attr">[i]</span><span class="hljs-selector-attr">['c2']</span>

第四种方案:略麻烦,但是更高效,将DataFrame转为List

您可以编写自己的实现namedtuple的迭代器

<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> namedtuple

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">myiter</span><span class="hljs-params">(d, cols=None)</span>:</span>
    <span class="hljs-keyword">if</span> cols <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    <span class="hljs-keyword">else</span>:
        j = [d.columns.get_loc(c) <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> cols]
        v = d.values[:, j].tolist()

    n = namedtuple(<span class="hljs-string">'MyTuple'</span>, cols)

    <span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> iter(v):
        <span class="hljs-keyword">yield</span> n(*line)

这相当于pd.DataFrame.itertuples,但是效率更高。


将自定义函数用于给定的DataFrame:

<span class="hljs-keyword">list</span>(myiter(df))

[MyTuple(c1=<span class="hljs-number">10</span>, c2=<span class="hljs-number">100</span>), MyTuple(c1=<span class="hljs-number">11</span>, c2=<span class="hljs-number">110</span>), MyTuple(c1=<span class="hljs-number">12</span>, c2=<span class="hljs-number">120</span>)]

或与pd.DataFrame.itertuples

<span class="hljs-keyword">list</span>(df.itertuples(index=<span class="hljs-keyword">False</span>))

[Pandas(c1=<span class="hljs-number">10</span>, c2=<span class="hljs-number">100</span>), Pandas(c1=<span class="hljs-number">11</span>, c2=<span class="hljs-number">110</span>), Pandas(c1=<span class="hljs-number">12</span>, c2=<span class="hljs-number">120</span>)]

全面的测试

我们测试了所有可用列:

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">iterfullA</span><span class="hljs-params">(d)</span>:</span>
    <span class="hljs-keyword">return</span> list(myiter(d))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">iterfullB</span><span class="hljs-params">(d)</span>:</span>
    <span class="hljs-keyword">return</span> list(d.itertuples(index=<span class="hljs-keyword">False</span>))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">itersubA</span><span class="hljs-params">(d)</span>:</span>
    <span class="hljs-keyword">return</span> list(myiter(d, [<span class="hljs-string">'col3'</span>, <span class="hljs-string">'col4'</span>, <span class="hljs-string">'col5'</span>, <span class="hljs-string">'col6'</span>, <span class="hljs-string">'col7'</span>]))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">itersubB</span><span class="hljs-params">(d)</span>:</span>
    <span class="hljs-keyword">return</span> list(d[[<span class="hljs-string">'col3'</span>, <span class="hljs-string">'col4'</span>, <span class="hljs-string">'col5'</span>, <span class="hljs-string">'col6'</span>, <span class="hljs-string">'col7'</span>]].itertuples(index=<span class="hljs-keyword">False</span>))

res = pd.DataFrame(
    index=[<span class="hljs-number">10</span>, <span class="hljs-number">30</span>, <span class="hljs-number">100</span>, <span class="hljs-number">300</span>, <span class="hljs-number">1000</span>, <span class="hljs-number">3000</span>, <span class="hljs-number">10000</span>, <span class="hljs-number">30000</span>],
    columns=<span class="hljs-string">'iterfullA iterfullB itersubA itersubB'</span>.split(),
    dtype=float
)

<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> res.index:
    d = pd.DataFrame(np.random.randint(<span class="hljs-number">10</span>, size=(i, <span class="hljs-number">10</span>))).add_prefix(<span class="hljs-string">'col'</span>)
    <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> res.columns:
        stmt = <span class="hljs-string">'{}(d)'</span>.format(j)
        setp = <span class="hljs-string">'from __main__ import d, {}'</span>.format(j)
        res.at[i, j] = timeit(stmt, setp, number=<span class="hljs-number">100</span>)

res.groupby(res.columns.str[<span class="hljs-number">4</span>:<span class="hljs-number">-1</span>], axis=<span class="hljs-number">1</span>).plot(loglog=<span class="hljs-keyword">True</span>);

pandas中遍历DataFrame行

pandas中遍历DataFrame行

admin
版权声明:本站原创文章,由admin2018-05-13发表,共计2426字。
转载提示:除特殊说明外本站文章皆由CC-4.0协议发布,转载请注明出处。
评论(没有评论)