Need some help troubleshooting Cassandra DB lockups

From: Lynn Dixon 
------------------------------------------------------
Hello all.
I have a machine that is running a Cassandra DB in a cluster of 4 pretty
beefy machines.  Occasionally one or two nodes in the cluster will just go
wonky and completely lock the machine up.  It locks up so badly that the
console is damn near unresponsive.  There will be high load averages, but
very little CPU activity.  Cassandra is running via java.

Here is output from strace of the java PID that is running Cassandra:
http://pastebin.com/D2r3CeDx

I know very little about Cassandra, and the folks who wrote the custom
application that uses this DB seem to know very little about
troubleshooting Cassandra.

Is there anyone out there on the LUG who is familiar with it and wouldn't
mind helping me?

Thanks!
Lynn

===============================================================
From: Aaron welch
------------------------------------------------------
Cassandra optimization and troubleshooting is a rare art.  I would suggest
using one of these tools to watch the systems to see when they begin to
fail:

http://wiki.apache.org/cassandra/Administration%20Tools

I worked with a local startup to get Cassandra running for some big data
stuff, but they finally wrote it off because it was too much of a pain to
manage and operate.

-AW
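
If nodetool is installed alongside Cassandra on those nodes (it normally
ships with it), a few read-only commands are a cheap way to snapshot node
health before and during a hang; the host below is just the default and may
differ in your setup:

    nodetool -h localhost ring              # which nodes are Up/Down and how data is balanced
    nodetool -h localhost info              # heap usage, uptime, and load for this node
    nodetool -h localhost tpstats           # thread pool stats; large "Pending" counts point at a bottleneck
    nodetool -h localhost compactionstats   # whether a big compaction is running when the node goes sideways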

===============================================================
From: Mike Harrison
------------------------------------------------------
Disclaimer: I've spent the day with 3 Frenchmen who barely speak English or
Linux, and am drinking to make the headache go away.  We installed
VirtualBox, and Ubuntu 12.04 with a LAMP stack under a desktop on a crappy

===============================================================
From: Justin McAteer
------------------------------------------------------
The first line of your strace output shows a serious problem.  Not only is
the process spending 75% of its time in lock calls, but 35% of the calls
are returning an error.

My first wild guess, based on the very limited information, is to check the
current number of open file handles, both for the process/shell and
system-wide.  See here:
http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/

It might be wise to first look into the error codes associated with all of
those errors; it may be enlightening.  If you run a full strace and capture
to a file, you can grep for '= -1' and maybe 'futex', and you will probably
see thousands of instances of the same error.

Thanks,
Justin McAteer
Tel: (256) 694-9195
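
For what it's worth, a sketch of that capture-and-grep step; <java_pid> and
the output path are just placeholders, and the strace run only needs to
cover a minute or so of the hang:

    strace -f -p <java_pid> -o /tmp/cassandra.strace     # attach to all threads, Ctrl-C after a minute or so
    grep -c 'futex' /tmp/cassandra.strace                 # total lock calls captured
    grep 'futex' /tmp/cassandra.strace | grep -c '= -1'   # how many of them returned an error
    grep 'futex' /tmp/cassandra.strace | grep '= -1' | head   # eyeball the error names (EAGAIN, ETIMEDOUT, ...)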

===============================================================
From: Lynn Dixon
------------------------------------------------------
Mike,

These machines are quite beefy and should have plenty of resources.  They
are Cisco UCS blades, with each pair of machines in a separate pod, 4
machines total.  Each one has 2 12-core processors and 32 (I think) gig of
RAM.  They are not using local disks, but are attached to SSD-based LUNs,
with each LUN having as many as 4 paths in Active-Active to an EMC VMax
SAN.  And these are just the QA machines; PROD is about twice as beefy.

Justin,

One problem I am having is that when this problem occurs, the machine is
damn near unresponsive.  Just logging in to the console can take 10
minutes, and then trying to run any commands will result in another 10-20
minute wait.  This makes diagnosing the problem incredibly frustrating.

Here is an album of screenshots I snapped from the console during one of
its hangups:  http://imgur.com/a/4UMTI

The machine doesn't have any disk activity going on, and there are no
writes waiting in queue.  The only way to remedy the problem is to do a
kill -9 on the PID of the java application that's containing Cassandra.
Once that's killed off, the machine returns to normal.

In an effort to make the machine responsive when a hangup occurs, I
configured the user that's launching the Cassandra instance to run all
their programs with a niceness of 10.  That didn't make a difference.

I admittedly know very little about strace, so I am going to have to dig
into it.  Here are the filesystem limits:

[root@SCHGCDBQ402A ~]# cat /proc/sys/fs/file-max
1613971
[root@SCHGCDBQ402A ~]# ulimit -Hn
4096
[root@SCHGCDBQ402A ~]# ulimit -Sn
1024
[root@SCHGCDBQ402A ~]# su - liquid
[liquid@SCHGCDBQ402A ~]$ ulimit -Hn
1000000
[liquid@SCHGCDBQ402A ~]$ ulimit -Sn
1000000
[liquid@SCHGCDBQ402A ~]$

The user "liquid" is the user that is launching the process.  The next time
one of the nodes hangs (which will probably be tomorrow) I will try to grab
a full strace to a file.
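
Since the root and liquid shells show different limits, it may also be worth
confirming what the running java process actually inherited the next time it
comes up; <java_pid> is a placeholder for the Cassandra PID:

    cat /proc/<java_pid>/limits | grep 'open files'   # the nofile limit the process really has
    ls /proc/<java_pid>/fd | wc -l                    # how many descriptors it currently holds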

===============================================================
From: Stephen Kraus
------------------------------------------------------
Could it have too many cores assigned to Cassandra?  I know SQL Server gets
pissy with more than 4 cores sometimes.

===============================================================
From: Justin McAteer
------------------------------------------------------
Have you tried suspending the process without killing it?  This might make
it easier for you to work.  i.e.:

kill -SIGSTOP <pid>

You can attach strace to the stopped process and it should resume
automatically.  This command should attach, start the process back, and
capture only the futex calls, showing the return code and any associated
error message:

strace -p <pid> -e trace=futex -s 80 -o output.txt

Thanks,
Justin McAteer
Tel: (256) 694-9195
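
If it doesn't pick back up on its own after strace attaches, sending SIGCONT
should kick it; the whole sequence might look like this (again, <pid> is the
Cassandra java PID):

    kill -SIGSTOP <pid>                                      # freeze the process so the box becomes responsive
    strace -p <pid> -e trace=futex -s 80 -o output.txt &     # attach and log futex calls in the background
    kill -SIGCONT <pid>                                       # resume it under strace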

===============================================================
From: Mike Harrison
------------------------------------------------------
Kudos.. Then I stand corrected.. :)

Thou hast what should be sufficient hardware to extract magic pixie dust
from seawater.

===============================================================
From: Aaron welch
------------------------------------------------------
Did you change the default heap size to correspond to the amount of RAM in
each machine (set to half)?  What is the active dataset size?  When you are
doing write-to-disk ops you could be exceeding the heap size and dumping to
disk too often.  The number of open files would point to that.

Also: "Nodes seem to freeze after some period of time"

Check your system.log for messages from the GCInspector.  If the
GCInspector is indicating that either the ParNew or ConcurrentMarkSweep
collectors took longer than 15 seconds, there is a very high probability
that some portion of the JVM is being swapped out by the OS.  One way this
might happen is if the mmap DiskAccessMode is used without JNA support.
The address space will be exhausted by mmap, and the OS will decide to swap
out some portion of the JVM that isn't in use, but eventually the JVM will
try to GC this space.  Adding the JNA libraries will solve this (they
cannot be shipped with Cassandra due to carrying a GPL license, but are
freely available), or the DiskAccessMode can be switched to mmap_index_only.
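
A couple of quick greps will answer most of that; the paths below assume a
stock package layout and may need adjusting for wherever your install put
things:

    grep -i 'GCInspector' /var/log/cassandra/system.log | tail -20        # long ParNew/CMS pauses get logged here
    grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh   # the heap actually configured
    grep -i 'disk_access_mode' /etc/cassandra/cassandra.yaml               # mmap vs mmap_index_only vs standard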

===============================================================
From: Billy
------------------------------------------------------
The biggest problem I see with the symptoms is that your box locks up --
even the OS.

Java threads can do that, but only if they are misbehaved and not sleeping
properly.  Still, the preemptive CPU scheduler in the Linux kernel should
handle that better.  Nice should also make a difference -- if this was a
purely CPU-bound issue.

Thus, it might not be CPU bound.  Check swap usage.  This could be the swap
spiral of death.

Load average is defined as the number of tasks sitting in the run queue.
Divide that number by your cores to get a rough estimate of how many tasks
had to wait for a CPU time slice.  Anything in the range of 10x your cores
means you're having IO issues (like swap, or access to the OS volume for
things like shared .so libraries).

I've seen this via an app using a tmpfs mount for scratch files that
happened to be larger than expected at times the jvm needed to allocate its
full max heap from the OS.  Caused swap chaos.  Was an easy fix once we
started monitoring the swap usage over time (zabbix).

We run zabbix on our systems, and with it, we were able to notify our
storage team about an impacting issue.  The symptom was a hung process and
strange out-of-order entries in the log files.  However, it was in reality
the storage subsystem for the virtual machines' (VMware-like) host system.
They had a bug in their zstor appliance, which we got the vendor to patch.
However, if we hadn't been running zabbix, we would have assumed it was our
app.

This storage issue manifested itself as 100% IO-wait CPU utilization across
multiple VMs at the same time.  Once we identified that they shared the
same storage pool, it was easy to track down the issue.

So, I would also look outside the java process for this one.  Maybe memory
constraints, swap, or another storage-type issue.

GC pauses shouldn't hang the machine.  The jvm, yes.  Machine, most likely
no.  A full GC cycle (the slow, full one that locks up the jvm) is normally
synchronous, in that only a few special system jvm threads are allowed to
run -- everything else in the jvm is suspended.  This allows the jvm to
move memory pointers around and calculate strong references from its list
of objects.  So, even if the GC was in full tilt, there should only be a
few GC java threads running, so your OS should still be responsive.

--b
Sent from my iPhone
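
For what it's worth, a cheap way to catch the swap-or-IO-wait case on the
next hang is to leave a couple of stock tools running in a screen session
(no particular paths or packages assumed):

    vmstat 5            # watch si/so (swap in/out) and wa (IO wait) while the node degrades
    free -m             # how much swap is actually in use
    cat /proc/loadavg   # e.g. a load of 240 on your 24 cores means roughly 10 waiting tasks per core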