There are two ways to sort a DataFrame: sort() or orderBy(). This post walks through how to use both methods.
Building the DataFrame
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
  ("Michael","Sales","NY",86000,56,20000),
  ("Robert","Sales","CA",81000,30,23000),
  ("Maria","Finance","CA",90000,24,23000),
  ("Raman","Finance","CA",99000,40,24000),
  ("Scott","Finance","NY",83000,36,19000),
  ("Jen","Finance","NY",79000,53,15000),
  ("Jeff","Marketing","CA",80000,25,18000),
  ("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.printSchema()
df.show()
printSchema() and show() produce the following output:
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)
|-- bonus: integer (nullable = false)
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
Using the sort function
The sort function sorts by one or more columns; by default the sort order is ascending.
Syntax
sort(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
sort(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]
Example
df.sort("department","state").show(false)
df.sort(col("department"),col("state")).show(false)
Both calls above produce the same result; the only difference is how the columns are passed, either as column-name strings or as Column objects. The Column form is what lets you attach a sort direction with .asc or .desc, as shown later. Both produce the output below.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
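Since the setup code already imports spark.implicits._, the Column form can also be written with the $ string interpolator, which is equivalent to col(...):
df.sort($"department", $"state").show(false)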
Using the orderBy function
orderBy behaves the same as sort.
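In fact, in the Dataset API orderBy simply delegates to sort, so the two are interchangeable. Paraphrasing the definitions in org.apache.spark.sql.Dataset:
// orderBy forwards straight to sort (paraphrased, not a verbatim copy of the Spark source)
def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols: _*)
def orderBy(sortExprs: Column*): Dataset[T] = sort(sortExprs: _*)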
Syntax
orderBy(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
orderBy(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]
Example
df.orderBy("department","state").show(false)
df.orderBy(col("department"),col("state")).show(false)
Controlling the sort order
If you do not specify a direction, the sorting functions default to ascending. The examples below show how to set the direction explicitly.
df.sort(col("department").asc,col("state").asc).show(false)
df.orderBy(col("department").asc,col("state").asc).show(false)
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
If you want descending order, use desc instead; see the example below.
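These calls also appear in the complete code at the end; here department is sorted ascending and state descending:
df.sort(col("department").asc, col("state").desc).show(false)
df.orderBy(col("department").asc, col("state").desc).show(false)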
Sorting functions
Spark SQL also provides standalone asc and desc functions (in org.apache.spark.sql.functions) for specifying the sort direction. They return Column expressions and are meant to be passed to sort or orderBy:
df.sort(asc("department"), desc("state")).show(false)
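As a side note, for data that may contain null values, org.apache.spark.sql.functions also ships null-aware variants such as asc_nulls_first, asc_nulls_last, desc_nulls_first and desc_nulls_last. A minimal sketch (this sample dataset has no nulls, so the output matches the plain asc/desc call above):
df.sort(asc_nulls_last("department"), desc_nulls_first("state")).show(false)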
Sorting in a SQL statement
Register the DataFrame as a temporary view, then use a regular ORDER BY clause:
df.createOrReplaceTempView("EMP")
spark.sql("select employee_name, department, state, salary, age, bonus from EMP order by department asc, state desc").show(false)
Complete code
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SortExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.printSchema()
  df.show()

  df.sort("department","state").show(false)
  df.sort(col("department"),col("state")).show(false)

  df.orderBy("department","state").show(false)
  df.orderBy(col("department"),col("state")).show(false)

  df.sort(col("department").asc,col("state").asc).show(false)
  df.orderBy(col("department").asc,col("state").asc).show(false)

  df.sort(col("department").asc,col("state").desc).show(false)
  df.orderBy(col("department").asc,col("state").desc).show(false)

  // Standalone sorting functions from org.apache.spark.sql.functions
  df.sort(asc("department"), desc("state")).show(false)

  // Sorting through a SQL statement on a temporary view
  df.createOrReplaceTempView("EMP")
  spark.sql("select employee_name, department, state, salary, age, bonus from EMP order by department asc, state desc").show(false)
}