I have deployed EVCache to a fairly high-load production environment, and we appeared to have some kind of network glitch where cache calls ended up taking 10 seconds per call (or multiples of 10, i.e. 20 or 30 seconds). Even though I had set the max timeout to 100ms, this resulted in threads backing up and the server grinding to a halt.
As it was production we had very limited logging, and after extensive testing in QA I was unable to replicate the issue. Would anyone have any idea what could cause this?
evcache version: 1.0.5
spymemcached version: 2.7.3
ConfigurationManager.getConfigInstance();
EVCacheClientPoolManager.getInstance().initEVCache(applicationName);
evCache = (new EVCache.Builder()).setAppName(applicationName).enableZoneFallback().build();
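One mitigation I'm considering while the root cause is unknown (a generic sketch, not EVCache-specific; the class and method names here are my own) is to bound each cache call with an outer timeout, so that a client-side stall cannot hold a request thread for 10+ seconds even if the cache client's own timeout is bypassed:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCacheCall {
    // Dedicated pool for cache calls so stalled calls don't occupy request threads.
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    /**
     * Run the supplied cache call, giving up after maxMillis and
     * returning the fallback value (treat as a cache miss).
     */
    static <T> T callWithTimeout(Callable<T> cacheCall, long maxMillis, T fallback) {
        Future<T> f = POOL.submit(cacheCall);
        try {
            return f.get(maxMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt the stalled call rather than letting it pile up
            return fallback;
        } catch (Exception e) {
            return fallback; // any cache failure degrades to a miss
        }
    }
}
```

This doesn't explain the 10-second stalls, but it would at least stop them from cascading into a full thread backup.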
Config
EC2_AVAILABILITY_ZONE=zone1
${evcache.pool.provider}=com.netflix.evcache.pool.standalone.ZoneClusteredEVCacheClientPoolImpl
${evcache.application.name}.${evcache.server.zone.a.name}.EVCacheClientPool.hosts=server1,server2,server3
${evcache.application.name}.${evcache.server.zone.b.name}.EVCacheClientPool.hosts=server4,server5,server6
${evcache.application.name}.EVCacheClientPool.poolSize=1
${evcache.application.name}.EVCacheClientPool.readTimeout=100
Logs before the problem
We received a number of these warnings from a heartbeat thread we had running:
MemcachedConnection] - Could not redistribute to another node, retrying primary node for key11111111
Then, about 1 minute later (during which there were a number of successful cache calls), we started having the issue.
Logs around the problem time
[ n.s.m.MemcachedConnection] - Closing, and reopening {QA sa=/serverip_1:serverport, #Rops=2, #Wops=162, #iq=0, topRop=Cmd: -1 Opaque: -1 Keys: , topWop=Cmd: 0 Opaque: 1865615 Key: Anotherkey_V11, toWrite=0, interested=5}, attempt 0.
[ n.s.m.p.b.BinaryMemcachedNodeImpl] - Discarding partially completed op: Cmd: -1 Opaque: -1 Keys:
[5ad1c8ab56d9445fa5aa4b0034c8c93d n.s.m.MemcachedConnection] - Could not redistribute to another node, retrying primary node for Anotherkey2_V11.
[ n.s.m.MemcachedConnection] - Closing, and reopening {QA sa=/serverip_2:serverport, #Rops=2, #Wops=220, #iq=0, topRop=Cmd: 1 Opaque: 1865530 Key: cckey_V11 Cas: 0 Exp: 0 Flags: 1 Data Length: 222, topWop=Cmd: 0 Opaque: 1865532 Key: cckey_V11, toWrite=0, interested=8}, attempt 0.
[ n.s.m.p.b.BinaryMemcachedNodeImpl] - Discarding partially completed op: Cmd: 1 Opaque: 1865530 Key: cckey_V11 Cas: 0 Exp: 0 Flags: 1 Data Length: 222
[ n.s.m.MemcachedConnection] - Closing, and reopening {QA sa=/serverip_3:serverport, #Rops=2, #Wops=180, #iq=0, topRop=Cmd: -1 Opaque: -1 Keys: , topWop=Cmd: 0 Opaque: 1865589 Key: cckey2_V11, toWrite=0, interested=5}, attempt 0.
[ n.s.m.p.b.BinaryMemcachedNodeImpl] - Discarding partially completed op: Cmd: -1 Opaque: -1 Keys:
Also, I noticed the constant DEFAULT_OP_QUEUE_MAX_BLOCK_TIME in the spymemcached code, which happens to equal 10 seconds, matching our per-call stalls. Does anyone know what this value is for?
DefaultConnectionFactory
public static final long DEFAULT_OP_QUEUE_MAX_BLOCK_TIME =
TimeUnit.SECONDS.toMillis(10);
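My working theory (an assumption from reading the spymemcached source, not confirmed) is that this constant bounds how long a caller blocks when trying to add an operation to a node's already-full input queue, which is separate from the per-operation read timeout. If the write queue backs up during a network glitch (note the #Wops=162/220/180 counts in the logs above), each call could block for up to 10 seconds just to enqueue. A minimal, self-contained sketch of that bounded-blocking behavior using a plain ArrayBlockingQueue standing in for the op queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class OpQueueBlockDemo {
    /**
     * Try to enqueue an op, blocking for at most maxBlockMillis if the
     * queue is full -- the role DEFAULT_OP_QUEUE_MAX_BLOCK_TIME appears
     * to play in spymemcached (10_000 ms by default).
     */
    static boolean tryEnqueue(BlockingQueue<String> opQueue, String op, long maxBlockMillis)
            throws InterruptedException {
        return opQueue.offer(op, maxBlockMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // A tiny op queue that is already full, standing in for the
        // backed-up write queue during the network glitch.
        BlockingQueue<String> opQueue = new ArrayBlockingQueue<>(1);
        opQueue.put("pending-op");

        // With the queue full, this blocks for the full timeout before
        // giving up -- regardless of any read timeout set on individual gets.
        long start = System.nanoTime();
        boolean accepted = tryEnqueue(opQueue, "new-op", 100);
        long blockedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("accepted=" + accepted + ", blockedMs=" + blockedMs);
    }
}
```

If that reading is right, it might be possible to lower this via ConnectionFactoryBuilder (I believe it has a setOpQueueMaxBlockTime method, though I haven't verified whether EVCache exposes that knob).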
Any ideas or help would be most welcome.