Split one big parquet file into multiple parquet files by a key


I'd like to split a big Parquet file into multiple Parquet files in different folders in HDFS, so that I can build a partitioned table (whether in Hive, Drill, or Spark SQL) on top of it.

Data example:

+-----+------+
|model|  num1|
+-----+------+
|  v80| 195.0|
|  v80| 750.0|
|  v80| 101.0|
|  v80|   0.0|
|  v80|   0.0|
|  v80| 720.0|
|  v80|1360.0|
|  v80| 162.0|
|  v80| 150.0|
|  v90| 450.0|
|  v90| 189.0|
|  v90| 400.0|
|  v90| 120.0|
|  v90|  20.3|
|  v90|   0.0|
|  v90|  84.0|
|  v90| 555.0|
|  v90|   0.0|
|  v90|   9.0|
|  v90|  75.6|
+-----+------+

The resulting folder structure should be grouped by the "model" field:

+
|
+-----model=v80
|       |
|       +----- xxx.parquet
+-----model=v90
        |
        +----- xxx.parquet

I tried to script it like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  case class Infos(name: String, name1: String)
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val rdd = sqlContext.read
    .load("hdfs://nameservice1/user/hive/warehouse/a_e550_parquet")
    .select("model", "num1")
    .limit(10000)

  // group the rows by the "model" column
  val tmpRdd = rdd.map { item => (item(0), Infos(item.getString(0), item.getString(1))) }
    .groupByKey()

  // write one Parquet folder per key
  for (item <- tmpRdd) {
    import sqlContext.implicits._
    val df = item._2.toSeq.toDF()
    df.write.mode(SaveMode.Overwrite).parquet("hdfs://nameservice1/tmp/model=" + item._1)
  }
}

It just threw a NullPointerException.

You should use partitionBy on the DataFrame writer; there is no need to group by key. (The NullPointerException most likely comes from using sqlContext inside the loop over the RDD, which runs on the executors, where the SQLContext is not available.) The code below should give you what you want.

val df = sqlContext.read.parquet("hdfs://nameservice1/user/hive/warehouse/a_e550_parquet")
  .select("model", "num1")
  .limit(10000)

// the output directory is an example (the /tmp base path from the question);
// Spark creates one model=... subfolder per distinct value
df.write.partitionBy("model").mode(SaveMode.Overwrite).parquet("hdfs://nameservice1/tmp/")
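Not part of the original answer, but a sketch of how the result can be consumed, assuming the write above targeted hdfs://nameservice1/tmp/ (the base of the path used in the question) and that "models" is just a placeholder table name. Spark discovers the model=... subdirectories and exposes model as a partition column:

// read the partitioned directory back; "model" is reconstructed from the folder names
val partitioned = sqlContext.read.parquet("hdfs://nameservice1/tmp/")

// a filter on the partition column only reads the matching model=... folder
partitioned.filter(partitioned("model") === "v80").show()

// the same data can be queried with Spark SQL via a temporary table
partitioned.registerTempTable("models")
sqlContext.sql("SELECT model, COUNT(*) AS cnt FROM models GROUP BY model").show()

Hive or Drill can likewise be pointed at the same directory as an external table partitioned by model; that is what the Hive-style model=... folder names are for.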
