Spark IllegalArgumentException: Size exceeds Integer.MAX_VALUE


I have a PySpark program doing a simple data conversion on a large number of records, and sporadically I get the following error. I've added code to make sure none of the IntegerType values we write out are outside the normal 32-bit integer range, so the out-of-range size appears to be an internal value, not something I pass in.
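For what it's worth, the range check looks roughly like the sketch below; "value_col" is a placeholder for each column that gets written out as IntegerType, not the real column name:

    # Rough sketch of the range check; "value_col" stands in for each
    # integer column. Values are checked before being cast/written.
    from pyspark.sql import functions as F

    INT_MIN, INT_MAX = -2147483648, 2147483647

    bad_rows = data_frame.filter(
        (F.col("value_col") < INT_MIN) | (F.col("value_col") > INT_MAX)
    )
    assert bad_rows.count() == 0  # every value fits in 32 bits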

This happens on a simple call to count(). The Python line that triggers the error is simply:

    data_frame_count = data_frame.count()

I am running on Spark 2.0 under Amazon EMR 5.0.

I've searched on this error and found a conversation thread focused on fancier logistic regression processing. What I'm doing here is much simpler: load the data, do basic parsing/cleaning of the data, and write the data back out, roughly the pipeline sketched below.
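This is only an outline; the paths, input format, and the cleaning step are placeholders for the real logic:

    # Hypothetical outline of the job, not the exact code.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("simple-conversion").getOrCreate()

    data_frame = spark.read.json("s3://my-bucket/input/")            # load
    data_frame = data_frame.filter(F.col("value_col").isNotNull())   # parse/clean
    data_frame_count = data_frame.count()                            # fails sporadically here
    data_frame.write.parquet("s3://my-bucket/output/")               # write out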

    Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:497)
        at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:475)
        at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:475)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:475)
        at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:280)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
        at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:60)
        at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:158)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

It seems there is an open issue for it:

https://issues.apache.org/jira/browse/spark-1476

From the description:

The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB. This has implications not just for managed blocks in use, but also for shuffle blocks (memory mapped blocks are limited to 2GB, even though the API allows for long), ser-deser via byte array backed outstreams (SPARK-1391), etc. This is a severe limitation for use of Spark on non-trivial datasets.
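If I'm reading that correctly, any single partition that serializes to more than 2GB could trigger this, so one workaround I'm considering is repartitioning into more, smaller partitions before the count. A minimal sketch (the partition count of 2000 is an arbitrary guess, not a tuned value):

    # Split the data into more, smaller partitions so no single block
    # approaches the 2GB ByteBuffer limit described in SPARK-1476.
    data_frame = data_frame.repartition(2000)
    data_frame_count = data_frame.count()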

