I have a PySpark DataFrame with 48,790 rows, of which 37,109 have income == 0. I want to reduce those 37,109 rows to a random 10,000, so that only 10,000 rows with income == 0 are left (I am balancing classes for an ML algorithm).
How can I remove 10,000 rows from the DataFrame?
I tried:

    data8 = data6.filter("income == 0")
    data9 = data8.sample(False, 10000 / float(data8.count()))
    print data6.count(), data8.count(), data9.count()
    # output: 48790 37109 10094
But the next step raises an error:
    data10 = data6.subtract(data9)
    data10.count()

    Py4JJavaError: An error occurred while calling o3692.count.
    : java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
Here is the schema of data6:
    StructType(List(StructField(features,VectorUDT,true),StructField(income,DoubleType,true)))
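One workaround that sidesteps subtract entirely is to build the balanced DataFrame directly: keep every row with income != 0 and union in a random sample of the income == 0 rows. A minimal sketch under that assumption (the seed is illustrative, and sample() gives an approximate row count):

    zero = data6.filter("income == 0")
    nonzero = data6.filter("income != 0")

    # Fraction that keeps roughly 10,000 of the 37,109 zero-income rows.
    fraction = 10000 / float(zero.count())

    # sample() is approximate, so expect about 10,000 rows, not exactly 10,000.
    balanced = nonzero.union(zero.sample(False, fraction, seed=42))
    print balanced.count()

Because nothing is subtracted, the VectorUDT column is never compared and the error above never occurs.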
I ran into the same problem on Spark 2.1.0 when using the DataFrame "except" function. It is confusing, and it appears to be a bug in "except".
the "except" or "subtract" operator trigger leftanti plan, call "joinselection.apply" function, in "extractequijoinkeys.unapply" function called, call "literal.default" function, "literal.default" function not support datatype of vectorudt, throws runtime exception. cause of issue.