I'd like to split a big Parquet file into multiple Parquet files in different folders in HDFS, so that I can build a partitioned table (whether in Hive, Drill, or Spark SQL) on top of it.
Data example:
+-----+------+
|model|  num1|
+-----+------+
|  v80| 195.0|
|  v80| 750.0|
|  v80| 101.0|
|  v80|   0.0|
|  v80|   0.0|
|  v80| 720.0|
|  v80|1360.0|
|  v80| 162.0|
|  v80| 150.0|
|  v90| 450.0|
|  v90| 189.0|
|  v90| 400.0|
|  v90| 120.0|
|  v90|  20.3|
|  v90|   0.0|
|  v90|  84.0|
|  v90| 555.0|
|  v90|   0.0|
|  v90|   9.0|
|  v90|  75.6|
+-----+------+
The resulting folder structure should be grouped by the "model" field:
+
|
+-----model=v80
      |
      |
      |
      +----- xxx.parquet
+-----model=v90
      |
      |
      |
      +----- xxx.parquet
I tried this script:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  case class Infos(name: String, name1: String)
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val rdd = sqlContext.read.load("hdfs://nameservice1/user/hive/warehouse/a_e550_parquet")
    .select("model", "num1").limit(10000)
  val tmpRDD = rdd.map { item =>
    (item(0), Infos(item.getString(0), item.getString(1)))
  }.groupByKey()
  // write each group of rows to its own model=<value> directory
  for (item <- tmpRDD) {
    import sqlContext.implicits._
    val df = item._2.toSeq.toDF()
    df.write.mode(SaveMode.Overwrite).parquet("hdfs://nameservice1/tmp/model=" + item._1)
  }
}
It just threw a NullPointerException.
You should use partitionBy on the DataFrame writer. There is no need to group by key yourself. The code below should give you what you want.
val df = sqlContext.read.parquet("hdfs://nameservice1/user/hive/warehouse/a_e550_parquet")
  .select("model", "num1").limit(10000)

// partitionBy creates one model=<value> subdirectory per distinct model value under the output path
df.write.partitionBy("model").mode(SaveMode.Overwrite).parquet("hdfs://nameservice1/tmp")
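If you then want the partitioned table the question mentions on top of that layout, here is a minimal sketch. It assumes Spark 1.x with the sc from the question, that num1 is a double, a made-up table name a_e550_by_model, and the hdfs://nameservice1/tmp location used above; adjust these to your setup.

import org.apache.spark.sql.hive.HiveContext

// A HiveContext (not the plain SQLContext above) is needed to run Hive DDL.
val hiveContext = new HiveContext(sc)

// External table over the directories written above. Only num1 lives in the data files;
// model is encoded in the model=<value> directory names, so it is declared as a partition column.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS a_e550_by_model (num1 DOUBLE)
    |PARTITIONED BY (model STRING)
    |STORED AS PARQUET
    |LOCATION 'hdfs://nameservice1/tmp'""".stripMargin)

// Register the model=v80 / model=v90 directories as partitions in the metastore
// (ALTER TABLE ... ADD PARTITION per directory works as well).
hiveContext.sql("MSCK REPAIR TABLE a_e550_by_model")

// Pure Spark SQL alternative: partition discovery restores the model column when
// you read the parent directory back, with no metastore involved.
val partitioned = hiveContext.read.parquet("hdfs://nameservice1/tmp")
partitioned.filter("model = 'v80'").show()

Once the table is in the metastore, Hive can query it by partition, and Drill can reach it through its Hive storage plugin.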