elasticsearch - Elasticsearch nodes disconnecting


We have an issue where nodes in the cluster leave the cluster without apparent reason.

We run Elasticsearch v0.20.6 on JVM 7u25 and use unicast discovery.

This is an embedded ES instance, with 7 nodes in the cluster. Nodes 47, 48, 49 and 50 are in one location (network); nodes 24, 25 and 26 are in another.

The same thing happens after a while every time, with index files deleted between tests. One of the nodes 24, 25 or 26 thinks it is the master (which in turn leads to a split-brain scenario; OK, I understand why that happens, the question is why the disconnect is happening).
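A common mitigation for the split-brain part of this (not for the disconnect itself) is to require a majority of master-eligible nodes before a master can be elected, using the `discovery.zen.minimum_master_nodes` setting. The sketch below only shows the arithmetic. It assumes all 7 nodes are master-eligible, which the post does not state, and the function name is purely illustrative.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Assumption: all 7 nodes in this cluster are master-eligible.
quorum = minimum_master_nodes(7)
print(f"discovery.zen.minimum_master_nodes: {quorum}")
```

With 7 nodes this gives 4, so the three nodes 24, 25 and 26 could not form a cluster of their own if they lost contact with the other location.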

First, node47 is elected master. The other nodes join, and things run smoothly for a couple of hours or so.

Then suddenly, around 19:10, come the first traces of something visibly going wrong:

node47:
2013-08-14 19:09:49,243 debug [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][t#3]{new i/o worker #3}) [local] disconnected [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}], channel closed event
2013-08-14 19:09:54,109 debug [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][t#3]{new i/o worker #3}) [local] disconnected [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}], channel closed event
2013-08-14 19:10:06,008 debug [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][t#4]{new i/o worker #4}) [local] disconnected [[local][da-t28gdrtwgadrkcvxs-w][inet[/**node25**:8800]]{local=false}], channel closed event
2013-08-14 19:10:34,253 trace [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][generic][t#19]) [local] [node  ] [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}] transport disconnected (with verified connect)
2013-08-14 19:10:34,259 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#24]) [local] connected node [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}]
2013-08-14 19:10:34,259 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#25]) [local] connected node [[local][da-t28gdrtwgadrkcvxs-w][inet[/**node25**:8800]]{local=false}]
2013-08-14 19:10:34,273 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#26]) [local] connected node [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}]
2013-08-14 19:10:34,290 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#27]) [local] disconnected [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}]

node24:
2013-08-14 19:10:35,167 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#4]{new i/o worker #4}) [local] [master] pinging master [local][y01tgbuzrg-jiipq7nqlzg][inet[/**node47**:8800]]{local=false} not exists on it, act if master failure
2013-08-14 19:10:35,170 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#4]{new i/o worker #4}) [local] [master] stopping fault detection against master [[local][y01tgbuzrg-jiipq7nqlzg][inet[/**node47**:8800]]{local=false}], reason [master failure, not exists on master, act master failure]
2013-08-14 19:10:35,171 info  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][t#1]) [local] master_left [[local][y01tgbuzrg-jiipq7nqlzg][inet[/**node47**:8800]]{local=false}], reason [do not exists on master, act master failure]
2013-08-14 19:10:35,174 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterservice#updatetask][t#1]) [local] [master] restarting fault detection against master [[local][jrrrd5y8r8whn1zakjynbw][inet[/**node45**:8800]]{local=false}], reason [possible elected master since master left (reason = not exists on master, act master failure)]
2013-08-14 19:10:35,181 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#1]) [local] disconnected [[local][y01tgbuzrg-jiipq7nqlzg][inet[/**node47**:8800]]{local=false}]
2013-08-14 19:10:36,233 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#4]{new i/o worker #4}) [local] [master] pinging master [local][jrrrd5y8r8whn1zakjynbw][inet[/**node45**:8800]]{local=false} no longer master
2013-08-14 19:10:36,235 info  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][t#5]) [local] master_left [[local][jrrrd5y8r8whn1zakjynbw][inet[/**node45**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:36,235 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#4]{new i/o worker #4}) [local] [master] stopping fault detection against master [[local][jrrrd5y8r8whn1zakjynbw][inet[/**node45**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:36,241 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterservice#updatetask][t#1]) [local] [master] restarting fault detection against master [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}], reason [possible elected master since master left (reason = no longer master)]
2013-08-14 19:10:36,245 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#5]) [local] disconnected [[local][jrrrd5y8r8whn1zakjynbw][inet[/**node45**:8800]]{local=false}]
2013-08-14 19:10:37,359 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#3]{new i/o worker #3}) [local] [master] pinging master [local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false} no longer master
2013-08-14 19:10:37,361 info  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][t#10]) [local] master_left [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:37,363 debug [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][t#3]{new i/o worker #3}) [local] [master] stopping fault detection against master [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:37,393 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#10]) [local] disconnected [[local][v7fxnzilr-gviyz2dowv2w][inet[/**node26**:8800]]{local=false}]

As far as I can read the logs, this is what is happening:

19:09:49,243 - a channel closed event for node24 is received on node47 (the master), and node24 is disconnected
19:10:34,273 - the connection to node24 is established again
19:10:34,290 - node24 is "disconnected" again
19:10:35,167 - node24 pings the master (node47), but the master does not have node24 in its list of nodes, and node24 treats this as a master failure
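A timeline like this can be pulled out of the transport log mechanically. The following is a minimal sketch, assuming one log entry per line and the exact line layout quoted in this post (including the `**` around host names, which are anonymization markers here). The regular expression and the function name are illustrative, not part of Elasticsearch.

```python
import re

# Matches the time of day, the transport action, and the peer host in the
# org.elasticsearch.transport.netty lines quoted in this post.
EVENT = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3}) .*?\[local\] "
    r"(?P<action>disconnected|connected node) "
    r".*?inet\[/\*\*(?P<peer>\w+)\*\*:\d+\]"
)


def transport_events(lines):
    """Yield (time, peer, 'connected' | 'disconnected') per matching line."""
    for line in lines:
        match = EVENT.search(line)
        if match:
            action = match.group("action").split()[0]
            yield match.group("time"), match.group("peer"), action


# Three node47 entries about node24, copied from the log above.
sample = [
    "2013-08-14 19:09:49,243 debug [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][t#3]{new i/o worker #3}) [local] disconnected [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}], channel closed event",
    "2013-08-14 19:10:34,273 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#26]) [local] connected node [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}]",
    "2013-08-14 19:10:34,290 debug [org.elasticsearch.transport.netty] (elasticsearch[local][generic][t#27]) [local] disconnected [[local][vbxjxeqgriynfzvk-1jciw][inet[/**node24**:8800]]{local=false}]",
]

for time_of_day, peer, action in transport_events(sample):
    print(time_of_day, peer, action)
```

Lines from other loggers, such as the zen fault detection entries, do not match the pattern and are skipped.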

All of this happens within a second, so no timeouts that I know of are at work here. Also, there is no large GC pause or measurable slowdown in this period or before it.

I'm at a loss: why does this happen? If it is a network issue, what should be tested on the network side?

To answer myself with the actual reason for this behavior:

A TCP connection between two nodes was disconnected (while the connections to the other nodes were kept). I recreated this using the utility tcpkill.
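The failure mode itself, one connection being reset while the others stay up, can be illustrated locally without tcpkill. The sketch below is not how tcpkill is invoked and involves no Elasticsearch code. It assumes a Linux or macOS host, and it closes one of two loopback connections with `SO_LINGER` set to 0 so that the peer sees a reset rather than an orderly shutdown. All names are illustrative.

```python
import socket
import struct


def reset_connection(conn: socket.socket) -> None:
    """Close with SO_LINGER=0 so the kernel sends RST instead of FIN."""
    linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0 seconds
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    conn.close()


def demo():
    """Return (a_alive, b_alive) after resetting only connection A."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(2)
    addr = server.getsockname()

    a = socket.create_connection(addr, timeout=2.0)
    conn_a, _ = server.accept()
    b = socket.create_connection(addr, timeout=2.0)
    conn_b, _ = server.accept()

    reset_connection(conn_a)   # kill A only
    conn_b.sendall(b"ping")    # B keeps working

    try:
        a_alive = a.recv(4) != b""
    except OSError:            # typically ConnectionResetError
        a_alive = False
    b_alive = b.recv(4) == b"ping"

    for sock in (a, b, conn_b, server):
        sock.close()
    return a_alive, b_alive


print(demo())
```

The point of the demo is only that the two peers end up with different views of the same server, which is the situation node24 and the other nodes were in with respect to node47.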

Sadly, the Elasticsearch zen discovery does not handle such errors well, and all sorts of strange outcomes are possible. A node that loses its connection to the master starts a master election, and may confuse the other nodes.

