hive sql执行顺序以及相关例子

4,544次阅读
没有评论
日常使用的数据查询工具都是hive,了解一下hive sql执行顺序,其实与mysql之间的差异不是太大,参考mysql的执行顺序,hive的执行顺序如下
  1. from
  2. on
  3. join
  4. where
  5. group by
  6. having
  7. select
  8. distinct
  9. union
  10. order by
举例
explainselectcity,ad_type,device,sum(cnt)ascntfromtb_pmp_raw_log_basic_analysiswhereday=’2016-05-28’andtype =0andmedia =’sohu’and(deal_id =”ordeal_id =’-‘ordeal_idisNULL)groupbycity,ad_type,device
对应的stage任务如下
STAGE DEPENDENCIES:
  Stage-1is a root stage
  Stage-0is a root stage
STAGE PLANS:
  Stage: Stage-1
   MapReduce
     MapOperator Tree:
          TableScan
            alias: tb_pmp_raw_log_basic_analysis
            Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
            Filter Operator
              predicate: (((deal_id=”)or(deal_id=’-‘))ordeal_id isnull) (type: boolean)
              Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
             SelectOperator
                expressions: city (type:string), ad_type (type:string), device (type:string), cnt (type: bigint)
                outputColumnNames: city, ad_type, device, cnt
                Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
               GroupByOperator
                  aggregations:sum(cnt)
                  keys: city (type:string), ad_type (type:string), device (type:string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
                  Reduce Output Operator
                    key expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string)
                    sortorder:+++
                   Map-reducepartition columns: _col0 (type:string), _col1 (type:string), _col2 (type:string)
                    Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
                    value expressions: _col3 (type: bigint)
      Reduce Operator Tree:
       GroupByOperator
          aggregations:sum(VALUE._col0)
          keys: KEY._col0 (type:string), KEY._col1 (type:string), KEY._col2 (type:string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
         SelectOperator
            expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string), _col3 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
            File Output Operator
              compressed:false
              Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit:-1
具体介绍如下
**stage1的map阶段**
        TableScan:from加载表,描述中有行数和大小等
        Filter Operator:where过滤条件筛选数据,描述有具体筛选条件和行数、大小等
        Select Operator:筛选列,描述中有列名、类型,输出类型、大小等。
        Group By Operator:分组,描述了分组后需要计算的函数,keys描述用于分组的列,outputColumnNames为输出的列名,可以看出列默认使用固定的别名_col0,以及其他信息
        Reduce Output Operator:map端本地的reduce,进行本地的计算,然后按列映射到对应的reduce
**stage1的reduce阶段Reduce Operator Tree**
        Group By Operator:总体分组,并按函数计算。map计算后的结果在reduce端的合并。描述类似。mode: mergepartial是说合并map的计算结果。map端是hash映射分组
        Select Operator:最后过滤列用于输出结果
        File Output Operator:输出结果到临时文件中,描述介绍了压缩格式、输出文件格式。
        stage0第二阶段没有,这里可以实现limit 100的操作
一些填坑点相关的记录
(1)join  on  where的使用
  • 对于inner join 不会有影响
  • 对于外连接有影响
    a. WHERE 操作是对在join之后操作
    b. ON操作是在join之前
举例说明: 下面给出相应的表数据
   1.documents:
    |id   |name       |
    ——–|————-|
    |1    |Document1  |
    |2    |Document2  |
    |3    |Document3  |
    |4    |Document4  |
    |5    |Document5  |
   2.downloads:
    |id  |document_id  |username|
    |——|—————|———-|
    |1   |1            |sandeep |
    |2   |1            |simi    |
    |3   |2            |sandeep |
    |4   |2            |reya    |
    |5   |3            |simi    |
a) 使用where操作
由于是在jion之后才进行筛选操作,使用left join操作实际上会产生7条数据,由于ID在downloads表中没有4,5,所以对应的数据都是NULL
此时where的条件筛选的作用出来了,只有username=‘sandeep’的数据才会被保留,因此符合这个条件的只有两条数据
 SELECTdocuments.name,downloads.id
   FROMdocuments
   LEFTOUTERJOINdownloads
     ONdocuments.id=downloads.document_id
   WHEREusername=’sandeep’
Forabove query the intermediatejointablewill looklikethis.
   |id(fromdocuments)|name        |id(fromdownloads)|document_id|username|
   |——————–|————–|———————|————-|———-|
   |1                 |Document1   |1                  |1          |sandeep |
   |1                 |Document1   |2                  |1          |simi    |
   |2                 |Document2   |3                  |2          |sandeep |
   |2                 |Document2   |4                  |2          |reya    |
   |3                 |Document3   |5                  |3          |simi    |
   |4                 |Document4   |NULL               |NULL       |NULL    |
   |5                 |Document5   |NULL               |NULL       |NULL    |
  After applying the`WHERE`clauseandselecting the listed attributes,the result will be:
  |name        |id|
  |————–|—-|
  |Document1   |1 |
  |Document2   |3 |
b) 使用on,是在join之前就参与条件筛选
 SELECTdocuments.name,downloads.id
 FROMdocuments
   LEFTOUTERJOINdownloads
     ONdocuments.id=downloads.document_id
       ANDusername=’sandeep’
在没有select之前进行获取的左连接的数据如下所示:
 
   |id(fromdocuments)|name        |id(fromdownloads)|document_id|username|
   |——————–|————–|———————|————-|———-|
   |1                 |Document1   |1                  |1          |sandeep |
   |2                 |Document2   |3                  |2          |sandeep |
   |3                 |Document3   |NULL               |NULL       |NULL    |
   |4                 |Document4   |NULL               |NULL       |NULL    |
   |5                 |Document5   |NULL               |NULL       |NULL    |
上面数据表中没有满足两个条件的输出的数据都是null,只有满足两个条件的数据才会输出实际的数据结果,比如前两个数据
最终输出的结果数据如下所示:
 
  |name      |id  |
  |————|——|
  |  Document1|1   |
  |  Document2|3   |
  |  Document3|NULL|
  |  Document4|NULL|
  |  Document5|NULL|
 
admin
版权声明:本站原创文章,由 admin 2017-12-27发表,共计6140字。
转载说明:除特殊说明外本站文章皆由CC-4.0协议发布,转载请注明出处。
评论(没有评论)