Apache Spark Scala: How to maintain order of values while grouping an RDD by key


I may be asking a basic question, and I apologize for that; I didn't find an answer on the internet. I have a paired RDD and want to use aggregateByKey, concatenating the values per key. The value that occurs first in the input RDD should come first in the aggregated RDD.

Input RDD [Int, Int]:

    2 20
    1 10
    2 8
    2 25

Output RDD (aggregated RDD):

    2 20 8 25
    1 10

I tried aggregateByKey and groupByKey; both give me output, but the order of values is not maintained. Please suggest how to handle this.
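For illustration, a minimal reproduction of the issue might look like this (a sketch assuming a live SparkContext named sc, as in spark-shell; the variable name pairs is made up for this example):

    val pairs = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))
    pairs.groupByKey().collect()
    // Possible result: Array((1,CompactBuffer(10)), (2,CompactBuffer(8, 25, 20)))
    // The values for key 2 can come back in any order.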

Since groupByKey and aggregateByKey indeed cannot preserve order, you'll have to artificially add a "hint" to each record, so you can order by that hint after grouping:

    import org.apache.spark.rdd.RDD

    val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))

    val withIndex: RDD[(Int, (Long, Int))] = input
      .zipWithIndex()                          // adds an index to each record, used to order the result
      .map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))

    val result: RDD[(Int, List[Int])] = withIndex
      .groupByKey()
      .map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order the values and remove the index
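As a quick sanity check (a sketch, assuming the code above has run), collecting result should show each key's values in their original encounter order, while the order of the keys themselves is still not guaranteed:

    result.collect().foreach { case (k, vs) => println(s"$k ${vs.mkString(" ")}") }
    // Expected output (key order may vary):
    // 2 20 8 25
    // 1 10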

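Since the question asked about aggregateByKey specifically: the same index trick carries over. The sketch below is a hypothetical variant, not part of the original answer, and resultAgg is a made-up name:

    // Hypothetical aggregateByKey variant of the same index trick (not from the original answer)
    val resultAgg: RDD[(Int, List[Int])] = withIndex
      .aggregateByKey(List.empty[(Long, Int)])(
        (acc, iv) => iv :: acc, // accumulate (index, value) pairs within each partition
        (a, b) => a ++ b        // merge the per-partition lists
      )
      .map { case (k, ivs) => (k, ivs.sortBy(_._1).map(_._2)) } // sort by index to restore the original order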