OutOfMemoryError - Hazelcast v3.6.4


We're seeing nodes in our cluster hitting an OutOfMemoryError (OOME) every now and then. The logs from the machine (xx.xx.xx.187) that hit the issue are shown below. A communication problem between xx.xx.xx.187 and xx.xx.xx.184 (which is another node in the cluster) seems to be causing the problem, but we're not sure of it. Once the machine hits the OOME, Hazelcast shuts down.

The following logs are of interest:

2016-08-28T10:30:44:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Address[xx.xx.xx.184]:45000 has sent MasterConfirmation, but this node is not master!
2016-08-28T10:31:14:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Address[xx.xx.xx.184]:45000 has sent MasterConfirmation, but this node is not master!
2016-08-28T10:31:44:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Address[xx.xx.xx.184]:45000 has sent MasterConfirmation, but this node is not master!
2016-08-28T10:31:51:986+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.184]:45000] lost. Reason: Socket explicitly closed
2016-08-28T10:31:51:987+0200 - com.hazelcast.cluster.ClusterService INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Removing Member [xx.xx.xx.184]:45000
2016-08-28T10:31:51:987+0200 - com.hazelcast.partition.InternalPartitionService INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Removing Member [xx.xx.xx.184]:45000
2016-08-28T10:31:51:988+0200 - com.hazelcast.cluster.ClusterService INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4]
Members [3] {
    Member [xx.xx.xx.186]:45000
    Member [xx.xx.xx.187]:45000
    Member [xx.xx.xx.185]:45000
}
2016-08-28T10:31:51:988+0200 - com.hazelcast.transaction.TransactionManagerService INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Committing/rolling-back alive transactions of Member [xx.xx.xx.184]:45000, UUID: 3fb9cff4-a33d-4ca1-b51d-6b387fbb0856
2016-08-28T10:31:52:497+0200 - com.hazelcast.nio.tcp.SocketAcceptorThread INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Accepting socket connection from /xx.xx.xx.184:54655
2016-08-28T10:31:52:497+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Established socket connection between /xx.xx.xx.187:45000 and /xx.xx.xx.184:54655
2016-08-28T10:32:14:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] MasterConfirmation has been received from Address[xx.xx.xx.184]:45000, but it is not a member of this cluster!
2016-08-28T10:32:14:718+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.184]:45000] lost. Reason: java.io.EOFException[Remote socket closed!]
2016-08-28T10:32:14:718+0200 - com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] hz._hzInstance_1_dev.IO.thread-in-1 Closing socket to endpoint Address[xx.xx.xx.184]:45000, Cause: java.io.EOFException: Remote socket closed!
2016-08-28T10:32:30:801+0200 - com.hazelcast.nio.tcp.SocketAcceptorThread INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Accepting socket connection from /xx.xx.xx.184:34362
2016-08-28T10:32:30:802+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Established socket connection between /xx.xx.xx.187:45000 and /xx.xx.xx.184:34362
2016-08-28T10:32:30:812+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring join check from Address[xx.xx.xx.184]:45000, because this node is not master...
2016-08-28T10:34:30:801+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring join check from Address[xx.xx.xx.184]:45000, because this node is not master...
2016-08-28T10:36:30:800+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring join check from Address[xx.xx.xx.184]:45000, because this node is not master...
2016-08-28T10:38:30:802+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring join check from Address[xx.xx.xx.184]:45000, because this node is not master...
2016-08-28T10:38:30:845+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.184]:45000] lost. Reason: java.io.EOFException[Remote socket closed!]
2016-08-28T10:38:30:845+0200 - com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] hz._hzInstance_1_dev.IO.thread-in-1 Closing socket to endpoint Address[xx.xx.xx.184]:45000, Cause: java.io.EOFException: Remote socket closed!
2016-08-28T10:38:36:857+0200 - com.hazelcast.cluster.ClusterService INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4]
Members [4] {
    Member [xx.xx.xx.186]:45000
    Member [xx.xx.xx.187]:45000
    Member [xx.xx.xx.185]:45000
    Member [xx.xx.xx.184]:45000
}
2016-08-28T10:38:36:857+0200 - com.hazelcast.nio.tcp.InitConnectionTask INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connecting to /xx.xx.xx.184:45000, timeout: 0, bind-any: true
2016-08-28T10:38:36:858+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Established socket connection between /xx.xx.xx.187:58928 and /xx.xx.xx.184:45000
...... ....
2016-08-28T23:07:44:727+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [xx.xx.xx.186]:45000 since it is expired (now: Sun Aug 28 23:07:47 CEST 2016, timestamp: Sun Aug 28 23:05:12 CEST 2016)
2016-08-28T23:07:44:727+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [xx.xx.xx.186]:45000 since it is expired (now: Sun Aug 28 23:07:47 CEST 2016, timestamp: Sun Aug 28 23:05:07 CEST 2016)
2016-08-28T23:07:45:056+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [xx.xx.xx.186]:45000 since it is expired (now: Sun Aug 28 23:07:48 CEST 2016, timestamp: Sun Aug 28 23:05:17 CEST 2016)
2016-08-28T23:07:52:793+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] System clock apparently jumped from 2016-08-28T23:05:03.295 to 2016-08-28T23:07:52.792 since last heartbeat (+164497 ms)
2016-08-28T23:07:52:794+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 164497 ms, Heartbeat-Timeout: 300000 ms
2016-08-28T23:09:34:960+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.185]:45000] lost. Reason: Socket explicitly closed
2016-08-28T23:09:34:960+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.186]:45000] lost. Reason: Socket explicitly closed
2016-08-28T23:09:34:961+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Connection [Address[xx.xx.xx.184]:45000] lost. Reason: Socket explicitly closed
2016-08-28T23:09:34:962+0200 - com.hazelcast.instance.Node WARNING: [xx.xx.xx.187]:45000 [dev] [3.6.4] Terminating forcefully...
2016-08-28T23:09:34:962+0200 - com.hazelcast.instance.Node INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Shutting down connection manager...
2016-08-28T23:09:34:963+0200 - com.hazelcast.instance.Node INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Shutting down node engine...
2016-08-28T23:09:35:291+0200 - com.hazelcast.instance.NodeExtension INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Destroying node NodeExtension.
2016-08-28T23:09:35:291+0200 - com.hazelcast.instance.Node INFO: [xx.xx.xx.187]:45000 [dev] [3.6.4] Hazelcast Shutdown is completed in 329 ms.

The OutOfMemoryError is caused by NonBlockingSocketWriter.writeQueue growing until it consumes around 6 GB.
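(For context, the forced termination in the log is presumably Hazelcast's default out-of-memory handling, which shuts the instance down once an OOME is caught on a Hazelcast thread. If it helps the next time this happens, a custom handler can be registered to capture extra diagnostics before the member goes down. The snippet below is only a minimal sketch against the 3.6 API; the handler body is a placeholder, not something we currently run.)

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.OutOfMemoryHandler;

    public class OomeDiagnostics {
        public static void main(String[] args) {
            // Replace Hazelcast's default OOME handling (which terminates the
            // instance) with a handler that records diagnostics first.
            Hazelcast.setOutOfMemoryHandler(new OutOfMemoryHandler() {
                @Override
                public void onOutOfMemory(OutOfMemoryError oome, HazelcastInstance[] instances) {
                    // Placeholder diagnostics: log, alert, trigger a heap dump, etc.
                    System.err.println("Hazelcast caught OOME: " + oome.getMessage());
                    // Then terminate the affected members, as the default handler would.
                    for (HazelcastInstance instance : instances) {
                        instance.getLifecycleService().terminate();
                    }
                }
            });

            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        }
    }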

Could someone please go through the logs and suggest what might be causing the issue? Any guidance is appreciated.

You should first try to sync the clocks of the machines, for example using NTP. If you are already doing that, what's the system load on each of the machines? You might have a GC problem, which means the JVM stops for too long, the heartbeat is ignored for that reason, and maybe the member is considered dead, which causes the reconnect.
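A couple of concrete checks: ntpq -p shows whether NTP is actually keeping the clock in sync, and GC logging (-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log on Java 7/8) shows whether long pauses line up with the ignored heartbeats. If long pauses cannot be avoided right away, the heartbeat tolerance can be raised and back pressure enabled so internal queues (such as the write queue that grew to ~6 GB) stay bounded. The sketch below uses the 3.6 property names as I recall them, and the values are placeholders to adapt, not recommendations:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class ClusterTuningSketch {
        public static void main(String[] args) {
            Config config = new Config();

            // Heartbeat interval between members (seconds).
            config.setProperty("hazelcast.heartbeat.interval.seconds", "5");

            // How long a member may stay silent before it is declared dead.
            // Your log shows the 3.6 default of 300 s (Heartbeat-Timeout: 300000 ms);
            // raising it only buys time for long GC pauses, it does not fix them.
            config.setProperty("hazelcast.max.no.heartbeat.seconds", "600");

            // Back pressure throttles invocations so internal queues cannot
            // grow without bound while a remote member is slow or unreachable.
            config.setProperty("hazelcast.backpressure.enabled", "true");

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }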

