I have a PySpark program doing a simple data conversion on a large number of records, and I sporadically get the following error. I've added code to make sure none of the IntegerType values I write are outside the normal 32-bit integer range, so the out-of-range size appears to be an internal value, not something I pass in.
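For illustration, the kind of range check I mean looks roughly like the following sketch (the column name value_col is a placeholder, not my real schema):

from pyspark.sql import functions as F

INT32_MIN, INT32_MAX = -2147483648, 2147483647

# Null out any value that does not fit a signed 32-bit integer
# before casting the column to IntegerType.
data_frame = data_frame.withColumn(
    "value_col",
    F.when(
        (F.col("value_col") >= INT32_MIN) & (F.col("value_col") <= INT32_MAX),
        F.col("value_col"),
    ).otherwise(F.lit(None)).cast("int"),
)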
The failure happens on a simple call to count(). The Python line that triggers the error is just:
data_frame_count = data_frame.count()
I am running on Spark 2.0 under Amazon EMR 5.0.
I've searched on this error, but the discussion threads I find focus on fancier logistic regression processing. What I'm doing here is simple: load the data, do basic parsing/cleaning of it, and write it back out.
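For concreteness, the overall shape of the job is roughly the following sketch (the paths, formats, and column names are placeholders, not the real ones):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("simple-conversion").getOrCreate()

# Load the raw records (placeholder path and format).
raw = spark.read.json("s3://example-bucket/input/")

# Basic parsing/cleaning: trim strings, drop rows missing a key field.
data_frame = (raw
              .withColumn("name", F.trim(F.col("name")))
              .dropna(subset=["id"]))

# This count is where the failure shows up sporadically.
data_frame_count = data_frame.count()

# Write the converted data back out (placeholder path).
data_frame.write.mode("overwrite").parquet("s3://example-bucket/output/")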
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
	at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
	at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
	at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:497)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:475)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:475)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:475)
	at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:280)
	at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
	at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:60)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:158)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:745)
It seems there is an open issue for this:
https://issues.apache.org/jira/browse/spark-1476
From the description:
The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB. This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API allows for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc. This is a severe limitation for the use of Spark on non-trivial datasets.
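Since the limit applies per block, one common mitigation until the issue is resolved is to keep individual partitions well under 2GB, for example by repartitioning before the action that fails. A minimal sketch (the partition count of 400 is only illustrative and has to be tuned so that total data size divided by partitions stays comfortably below 2GB):

# Spread the data over more partitions so no single shuffle or
# cached block approaches the 2GB ByteBuffer limit.
data_frame = data_frame.repartition(400)
data_frame_count = data_frame.count()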