Coherence 3.5: Service Guardian (Deadlock Detection)
This article is part two of a three-part series on my favorite new features in Coherence 3.5.
One of the great benefits of using a modern JVM is deadlock detection. At my previous job I remember helping to track down an intermittent issue with our Swing desktop client. It was eventually solved by giving our support/QA staff instructions on how to generate a thread dump when the issue surfaced (which is much harder on Windows than on Unix/Linux-based OSes). Once they sent us the thread dump, which conveniently listed the deadlocked threads at the bottom, fixing the issue was trivial.
Deadlocks can and do happen in distributed systems, and unfortunately there isn't a good general mechanism for detecting distributed deadlocks. However, Oracle Coherence 3.5 brings us closer with a new feature we call the Service Guardian. The concept behind the guardian is to ensure that each of the threads under our control is responsive; when one is not, the cluster node should take action. Out of the box you can configure it to remove the node from the cluster (the default) or to shut down the JVM. You can also provide an implementation of ServiceFailurePolicy to supply custom handling of detected deadlocks.
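To give a feel for what a custom policy might look like, here is a minimal sketch. I'm assuming the com.tangosol.net.ServiceFailurePolicy callbacks shown below, so verify them against the 3.5 Javadoc before relying on this; the System.err logging is purely illustrative:

```java
import com.tangosol.net.Cluster;
import com.tangosol.net.Guardable;
import com.tangosol.net.Service;
import com.tangosol.net.ServiceFailurePolicy;

/**
 * Sketch of a ServiceFailurePolicy that logs before delegating to
 * default-like behavior. Callback signatures are assumed from the
 * 3.5 Javadoc; check them before use.
 */
public class LoggingFailurePolicy implements ServiceFailurePolicy {
    public void onGuardableRecovery(Guardable guardable, Service service) {
        // Called when a guarded thread appears stuck; attempt a soft recovery.
        System.err.println("Attempting recovery of: " + guardable);
        guardable.recover();
    }

    public void onGuardableTerminate(Guardable guardable, Service service) {
        // Called when recovery failed; abandon the stuck guardable so the
        // service can be restarted.
        System.err.println("Terminating: " + guardable);
        guardable.terminate();
    }

    public void onServiceFailed(Cluster cluster) {
        // Last resort: the service could not be restarted. The default
        // policy removes this node from the cluster.
        System.err.println("Service failed; leaving the cluster");
        cluster.stop();
    }
}
```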
Deadlocks can have especially bad consequences in a distributed system, since there are inherent dependencies between nodes. In my experience, deadlocks in clusters come about for one of three reasons:
Bugs in customer code
Concurrent programming is difficult enough; mix in distributed computing and you can get into some sticky situations. Several times in the past I've seen deadlocks occur within event handling code. Here's one way that event handlers can deadlock:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.MapEvent;
import com.tangosol.util.MapListener;

/**
 * @author pperalta Jul 20, 2009
 */
public class GuardianDemo implements MapListener {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("test");
        cache.addMapListener(new GuardianDemo());
        while (true) {
            int nKey = RANDOM.nextInt(10);
            try {
                cache.lock(nKey, -1);
                List listValue = (List) cache.get(nKey);
                if (listValue == null) {
                    listValue = new ArrayList();
                }
                listValue.add(System.currentTimeMillis());
                cache.put(nKey, listValue);
            } finally {
                // GuardianDemo.java:40 in the thread dumps below
                cache.unlock(nKey);
            }
        }
    }

    public void entryInserted(MapEvent evt) {
    }

    public void entryUpdated(MapEvent evt) {
        NamedCache cache = (NamedCache) evt.getSource();
        Object nKey = evt.getKey();
        try {
            // GuardianDemo.java:56 in the thread dumps below
            cache.lock(nKey, -1);
            List listValue = (List) cache.get(nKey);
            if (listValue.size() > 0) {
                Object lValue = listValue.remove(0);
                cache.put(nKey, listValue);
                System.out.println("Removed " + lValue + " from " + nKey);
            }
        } finally {
            cache.unlock(nKey);
        }
    }

    public void entryDeleted(MapEvent evt) {
    }

    private static Random RANDOM = new Random(System.currentTimeMillis());
}
```
When you register a map listener with Coherence, a background thread is spawned to handle events. Upon receiving an event, Coherence queues it up for the event handler (the customer-provided implementation of MapListener) to process. If events are being handled more slowly than they are being generated, Coherence will throttle the creation of new events so that the event queue does not grow unbounded (and eventually exhaust the heap).
A bit of a digression: the event throttling is not a new feature of Coherence; it has been around since at least 3.2.
When I ran this code with Coherence 3.4, it ran for a while but eventually stopped:
```
Oracle Coherence Version 3.4.2/411p7
 Grid Edition: Development mode
Copyright (c) 2000-2009 Oracle. All rights reserved.
...
Removed 1248134614674 from 9
Removed 1248134614692 from 9
Removed 1248134614697 from 9
Removed 1248134614699 from 9
Removed 1248134614703 from 9
Removed 1248134614706 from 9
Removed 1248134614717 from 9
Removed 1248134614708 from 6
Removed 1248134614713 from 3
Removed 1248134614719 from 6
Removed 1248134614727 from 6
Removed 1248134614723 from 3
Removed 1248134614701 from 5
Removed 1248134614709 from 8
Removed 1248134614732 from 8
Removed 1248134614736 from 3
Removed 1248134614725 from 7
Removed 1248134614729 from 5
Removed 1248134614745 from 3
Removed 1248134614733 from 8
...
```
When it stopped running, I captured a thread dump:
"DistributedCache:EventDispatcher" daemon prio=5 tid=0x01019e40 nid=0x83e200 in Object.wait() [0xb1113000..0xb1113d90] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:31) - locked <0x295b8d88> (a com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$LockRequest$Poll) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.lock(DistributedCache.CDB:37) at com.tangosol.util.ConverterCollections$ConverterConcurrentMap.lock(ConverterCollections.java:2024) at com.tangosol.util.ConverterCollections$ConverterNamedCache.lock(ConverterCollections.java:2539) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.lock(DistributedCache.CDB:1) at com.tangosol.coherence.component.util.SafeNamedCache.lock(SafeNamedCache.CDB:1) at com.tangosol.examples.guardian.GuardianDemo.entryUpdated(GuardianDemo.java:56) at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195) at com.tangosol.util.MapEvent.dispatch(MapEvent.java:164) at com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556) at com.tangosol.coherence.component.util.SafeNamedCache.translateMapEvent(SafeNamedCache.CDB:7) at com.tangosol.coherence.component.util.SafeNamedCache.entryUpdated(SafeNamedCache.CDB:1) at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.dispatch(DistributedCache.CDB:22) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.entryUpdated(DistributedCache.CDB:1) at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195) at com.tangosol.coherence.component.util.CacheEvent.run(CacheEvent.CDB:18) at com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onNotify(Service.CDB:19) at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:37) at java.lang.Thread.run(Thread.java:613) ... 
"main" prio=5 tid=0x01001480 nid=0xb0801000 waiting on condition [0xb07ff000..0xb0800148] at java.lang.Thread.sleep(Native Method) at com.tangosol.coherence.component.util.Daemon.sleep(Daemon.CDB:9) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$EventDispatcher.drainOverflow(Grid.CDB:15) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.post(Grid.CDB:17) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.send(Grid.CDB:1) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:12) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.unlock(DistributedCache.CDB:32) at com.tangosol.util.ConverterCollections$ConverterConcurrentMap.unlock(ConverterCollections.java:2032) at com.tangosol.util.ConverterCollections$ConverterNamedCache.unlock(ConverterCollections.java:2555) at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.unlock(DistributedCache.CDB:1) at com.tangosol.coherence.component.util.SafeNamedCache.unlock(SafeNamedCache.CDB:1) at com.tangosol.examples.guardian.GuardianDemo.main(GuardianDemo.java:40) |
We can see that the event dispatcher thread is waiting to acquire a lock on a key. The main thread holds that lock and (in a bit of an ironic twist) is attempting to release it. However, the throttling mechanism has kicked in, and it won't allow any more operations on the cache until the event queue is drained, which will never happen: the thread responsible for draining the event queue is itself stuck waiting for the lock to be released.
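For what it's worth, one way to avoid this particular trap is to never perform blocking cache operations on the event dispatcher thread itself, for example by handing the work to an executor. Here is a sketch of that pattern; the single-thread pool and the class name are just illustrative choices:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.tangosol.net.NamedCache;
import com.tangosol.util.MapEvent;
import com.tangosol.util.MultiplexingMapListener;

// Sketch: perform cache mutations on a separate thread so the event
// dispatcher is never blocked on a lock held by another thread.
public class AsyncRemovingListener extends MultiplexingMapListener {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    protected void onMapEvent(final MapEvent evt) {
        if (evt.getId() != MapEvent.ENTRY_UPDATED) {
            return;
        }
        final NamedCache cache = (NamedCache) evt.getSource();
        final Object oKey = evt.getKey();
        // The dispatcher thread returns immediately; the lock/get/put/unlock
        // sequence runs on the executor thread instead.
        executor.submit(new Runnable() {
            public void run() {
                try {
                    cache.lock(oKey, -1);
                    List listValue = (List) cache.get(oKey);
                    if (listValue != null && !listValue.isEmpty()) {
                        Object oValue = listValue.remove(0);
                        cache.put(oKey, listValue);
                        System.out.println("Removed " + oValue + " from " + oKey);
                    }
                } finally {
                    cache.unlock(oKey);
                }
            }
        });
    }
}
```

With this arrangement the event queue drains quickly, so the throttle never blocks the main thread while it holds a lock.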
Now, let’s run it with Coherence 3.5:
```
Oracle Coherence Version 3.5/459
 Grid Edition: Development mode
Copyright (c) 2000, 2009, Oracle and/or its affiliates. All rights reserved.
...
Removed 1248136418346 from 2
Removed 1248136418361 from 2
Removed 1248136418363 from 6
Removed 1248136418365 from 3
Removed 1248136418366 from 6
Removed 1248136418369 from 2
Removed 1248136418367 from 3
Removed 1248136418371 from 6
Removed 1248136418376 from 6
Removed 1248136418389 from 2
Removed 1248136418383 from 6
Removed 1248136418384 from 3
...
Removed 1248136419975 from 3
Removed 1248136420113 from 2
Removed 1248136420114 from 7
Removed 1248136420116 from 2
2009-07-20 20:33:40.473/6.683 Oracle Coherence GE 3.5/459 <Warning> (thread=main, member=1): The event queue appears to be stuck.
Removed 1248136420076 from 1
2009-07-20 20:33:40.475/6.685 Oracle Coherence GE 3.5/459 <Error> (thread=main, member=1): Full Thread Dump
...
Thread[DistributedCache:EventDispatcher,5,Cluster]
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:474)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:31)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.lock(DistributedCache.CDB:37)
    com.tangosol.util.ConverterCollections$ConverterConcurrentMap.lock(ConverterCollections.java:2024)
    com.tangosol.util.ConverterCollections$ConverterNamedCache.lock(ConverterCollections.java:2539)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.lock(DistributedCache.CDB:1)
    com.tangosol.coherence.component.util.SafeNamedCache.lock(SafeNamedCache.CDB:1)
    com.tangosol.examples.guardian.GuardianDemo.entryUpdated(GuardianDemo.java:56)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:166)
    com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
    com.tangosol.coherence.component.util.SafeNamedCache.translateMapEvent(SafeNamedCache.CDB:7)
    com.tangosol.coherence.component.util.SafeNamedCache.entryUpdated(SafeNamedCache.CDB:1)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.dispatch(DistributedCache.CDB:22)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.entryUpdated(DistributedCache.CDB:1)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
    com.tangosol.coherence.component.util.CacheEvent.run(CacheEvent.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onNotify(Service.CDB:26)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:613)
...
```
```
Thread[main,5,main]
    java.lang.Thread.dumpThreads(Native Method)
    java.lang.Thread.getAllStackTraces(Thread.java:1460)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:585)
    com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:791)
    com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.drainOverflow(Service.CDB:45)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$EventDispatcher.drainOverflow(Grid.CDB:9)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.post(Grid.CDB:17)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.send(Grid.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:12)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.unlock(DistributedCache.CDB:32)
    com.tangosol.util.ConverterCollections$ConverterConcurrentMap.unlock(ConverterCollections.java:2032)
    com.tangosol.util.ConverterCollections$ConverterNamedCache.unlock(ConverterCollections.java:2555)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.unlock(DistributedCache.CDB:1)
    com.tangosol.coherence.component.util.SafeNamedCache.unlock(SafeNamedCache.CDB:1)
    com.tangosol.examples.guardian.GuardianDemo.main(GuardianDemo.java:40)
```
Here the guardian took the following actions:
- Detected the stuck event dispatcher thread:

  <Warning> (thread=main, member=1): The event queue appears to be stuck.

- Printed the stack of each thread
- Restarted the cluster threads to keep the node up and running
Much nicer than having a deadlocked node!
Bugs in Coherence code
Despite our best efforts, bugs (including deadlocks) do occasionally appear in Coherence, just as they do in any other product. In particular, the kind of deadlock with the worst consequences is one that involves a service thread. Everything in Coherence (the clustering logic, replicated caches, distributed caches, remote invocations, statistics, etc.) is implemented internally with queues and threads that process the messages on those queues. These are the service threads, and they are the lifeblood of Coherence. Should this type of defect slip into a future version of Coherence, the guardian will detect the condition and take corrective action so that the node (and the cluster) can continue to function.
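As an aside, the guardian judges liveness by heartbeats from the threads it guards. If your own long-running code executes on a service thread (inside an entry processor, for instance), it can issue a heartbeat so that slow-but-healthy work is not mistaken for a stuck thread. Here is a minimal sketch, assuming the static GuardSupport.heartbeat() call (the GuardSupport class appears in the thread dump above); doExpensiveWork is a hypothetical placeholder:

```java
import com.tangosol.net.GuardSupport;
import com.tangosol.util.InvocableMap;
import com.tangosol.util.processor.AbstractProcessor;

// Sketch: a long-running entry processor that tells the guardian it is
// still making progress between chunks of work.
public class SlowProcessor extends AbstractProcessor {
    public Object process(InvocableMap.Entry entry) {
        for (int i = 0; i < 10; i++) {
            doExpensiveWork(entry); // hypothetical helper
            // Signal the guardian that this service thread is alive.
            GuardSupport.heartbeat();
        }
        return null;
    }

    private void doExpensiveWork(InvocableMap.Entry entry) {
        // placeholder for real work against the entry
    }
}
```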
Bugs in the JVM/Operating System
In the absence of bugs in customer or Coherence code, we do occasionally see bugs in the JVM and/or the operating system that result in locked-up service threads. Perhaps the most notorious of these involved early versions of NPTL on Linux. In a nutshell, we saw that threads occasionally missed notifications: threads waiting in Object.wait() would never receive the Object.notify() or Object.notifyAll() sent to them. I've also seen older JVMs with buggy implementations of the wait/notify mechanism produce the same results.
One of our goals with Coherence is to keep your application up and running at all times, even when components of the system (hardware, databases, and so on) fail. The Service Guardian is one more tool in our arsenal that brings us closer to that goal.