I have a DataFrame created by running sqlContext.read on a Parquet file.
The DataFrame consists of 300 M rows. I need to use these rows as input to another function, but I want to do it in smaller batches to prevent an OOM error.
Currently, I am using df.head(1000000) to read the first 1M rows, but I cannot find a way to read the subsequent rows. I tried df.collect(), but it gives me a Java OOM error.
I want to iterate over this DataFrame. I tried adding another column with the withColumn() API to generate a unique set of values to iterate over, but none of the existing columns in the DataFrame have solely unique values.
For example, I tried val df = df1.withColumn("newcolumn", df1("col") + 1) as well as val df = df1.withColumn("newcolumn", lit(i += 1)), and neither returns a sequential set of values.
Is there any other way to get the first n rows of a DataFrame and then the next n rows, something that works like the range function of sqlContext?
You can simply use the limit and except APIs of Dataset or DataFrame, as follows:
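long count = df.count();
int limit = 50;
while (count > 0) {
    // Dataset<Row> in Spark 2.x and later; the type is DataFrame in Spark 1.x
    Dataset<Row> df1 = df.limit(limit);
    df1.show();          // prints the first 50 rows, then the next 50, and so on
    df = df.except(df1); // drop the rows that were just processed
    count = count - limit;
}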
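Two caveats about the loop above: except is a set difference, so duplicate rows are collapsed, and limit without an ordering is not guaranteed to return the same rows on every evaluation. As a possible alternative that also speaks to the "sequential column" part of the question, here is a minimal Scala sketch that attaches a consecutive index with RDD.zipWithIndex and then filters by index range. The helper names withRowIndex and forEachBatch and the column name row_idx are made up for illustration, and the sketch assumes the row order of the underlying RDD stays the same once the indexed DataFrame is cached.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: appends a 0-based, consecutive "row_idx" column.
def withRowIndex(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  // zipWithIndex pairs every row with its position across all partitions.
  val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  val schema = StructType(
    df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  sqlContext.createDataFrame(indexed, schema)
}

// Hypothetical driver loop: passes up to `batchSize` rows at a time to `process`.
def forEachBatch(indexed: DataFrame, batchSize: Long)(process: Array[Row] => Unit): Unit = {
  indexed.cache() // compute the index once instead of once per batch
  val total = indexed.count()
  var start = 0L
  while (start < total) {
    val end = start + batchSize
    // Select the half-open index range [start, end).
    val batch = indexed.filter(col("row_idx") >= start && col("row_idx") < end)
    process(batch.collect()) // only one batch is brought to the driver at a time
    start = end
  }
  indexed.unpersist()
}
```

It could be called as forEachBatch(withRowIndex(sqlContext, df), 1000000L)(rows => myFunction(rows)), where myFunction stands in for the function that consumes the rows. Each batch is still collected to the driver, so the batch size has to fit in driver memory.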