NFJS Boston Day 2

Here are my highlights of No Fluff Just Stuff Day 2:

Garbage Collector Friendly Programming

Garbage collection is an interesting topic to me for several reasons, the main reason being that poor GC performance is very harmful in distributed environments. When you have a peer to peer system such as Coherence, any one node can directly communicate with any other node at any point in time to service a request. If a node that needs to service several requests is in a long GC pause, it isn’t just that node that is affected. Every JVM that is waiting for a response from that node also experiences high latency, thus causing a cascading effect. (More on what to do about this later.)

Brian Goetz described the evolution of GC in Java, starting with the single threaded mark and sweep algorithms up to the modern generational collector. What they each have in common is the tracking of allocation roots (which include static variables and local variables allocated on thread stacks) and the traversal of object references starting at these roots. The implementation of the generational collector is (roughly) as follows:

  • New objects are allocated in the young generation space
  • When a minor GC occurs, objects that are still reachable are copied to the survivor space. The remaining objects are removed
  • Eventually objects that live long enough in the survivor space are moved to the old generation.
  • If a minor collection fails to free up enough space in the young generation area, a full collection (which is much more expensive) will occur in the old generation space.

An implementation detail is that the JVM must track references from the old generation to the young generation in order to know which GC roots to traverse when performing a minor collection. This means that the more “old” objects there are pointing to “new” objects, the more work the collector has to do. In a practical sense, this means that allocating new objects is preferred to reference field updates.

It took a while to wrap my head around that last statement, so let me attempt to demonstrate. Let’s say I have a map that contains objects with many fields. If I wanted to update some of those fields I could do it like this:

Map map = ...
MyObject o = map.get(key);
o.setField1(f1);
o.setField3(f3);
o.setField5(f5);

Or I could do it like this:

Map map = ...
MyObject oOld = map.get(key);
MyObject oNew = new MyObject(f1, oOld.getField2(), f3, oOld.getField4(), f5);
map.put(key, oNew);

The first example updates three reference fields (assuming the object in the map is in the old generation), whereas the second is updating one (the reference held by Map.Entry) – and as an added bonus the second implementation is thread safe (assuming that MyObject is immutable and the Map implementation is also thread safe.) If anyone has a better (or more correct) example of this concept, I’d be happy to see it!
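Here is a minimal runnable sketch of the second approach. The three-field MyObject, the class name ImmutableUpdateSketch and the update helper are all assumptions made for illustration; the point is only that the old object is never mutated and that a single reference (the map entry) is swapped.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ImmutableUpdateSketch {

    /** Hypothetical immutable value type: every field is final. */
    static final class MyObject {
        final String field1;
        final String field2;
        final String field3;

        MyObject(String field1, String field2, String field3) {
            this.field1 = field1;
            this.field2 = field2;
            this.field3 = field3;
        }

        @Override
        public String toString() {
            return field1 + "," + field2 + "," + field3;
        }
    }

    /** Swaps in a fresh copy carrying new values for field1 and field3. */
    public static void update(ConcurrentMap<String, MyObject> map,
                              String key, String f1, String f3) {
        while (true) {
            MyObject oOld = map.get(key);
            if (oOld == null) {
                return; // nothing to update
            }
            MyObject oNew = new MyObject(f1, oOld.field2, f3);
            // Succeeds only if no other thread replaced the entry in the meantime
            if (map.replace(key, oOld, oNew)) {
                return;
            }
        }
    }

    public static String demo() {
        ConcurrentMap<String, MyObject> map = new ConcurrentHashMap<>();
        map.put("k", new MyObject("a", "b", "c"));
        update(map, "k", "x", "z");
        return map.get("k").toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "x,b,z"
    }
}
```

The retry loop is one way to keep the read-copy-replace sequence safe when several threads update the same key; a plain put would also work if only one thread ever writes.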

Many other concepts were covered including: why you should use finally instead of finalizers to clean up, weak references, soft references, and tracking down memory leaks. Capturing heap dumps to track memory usage is a technique I recommend to customers (this works much better than speculating about where unexpected memory allocation is coming from). I especially recommend configuring the JVM to generate a heap dump when an OutOfMemoryError is thrown. My favorite tool to read heap dumps is Eclipse MAT. Heap histograms are also a nice lightweight approach to analyzing memory problems.

When customers ask for suggestions on GC tuning, my recommendation is to keep it simple: fixed size 1 GB heaps (on the Sun VM), and don’t use more than 75% of the heap. I usually don’t recommend any specific tuning parameters, as the GC algorithms are constantly improving and any exotic flags that may have worked in older JVMs (assuming they helped in the first place) may not work so well in newer JVMs. The best advice I can give is to not fill up the heap as this will cause more frequent full collections.

Inside the Modern JVM

NFJS tends to cover languages in the JVM other than Java (such as Groovy, Scala, JRuby, etc.) This is a testament to the strength and viability of the modern JVM. Brian covered some of the advancements and (quite frankly) rocket science that goes into the JVM, HotSpot in particular. No matter what happens to Java (which isn’t going away any time soon), the JVM will be around for a very long time.

In a nutshell: why is Java, a supposedly interpreted language, faster than C++ in many cases? The answer is that the JVM determines which optimizations to make at run time instead of compile time, which is the opposite approach of C++ and other native languages. Optimization at runtime is far more effective, since the JVM has hard statistics of real world usage to draw on as opposed to the speculation and guessing that happens when everything has to be compiled to machine code before execution.

The overall theme of this talk (and the previous ones) is to write simple clean code – the runtime recognizes common usage patterns in Java and is built to optimize these patterns.

Java Collections

Ted Neward gave an engaging and entertaining talk on the Java Collections API. To be honest I was familiar with most of the material, but he is a fun speaker to watch, in spite of the fact that he gave me a good ribbing for showing up to his talk after it had already started! He is quite biased against arrays and towards collections, which made me think back to a web/remote services API I designed a few years ago. I exclusively used arrays as the collection type for this API for two reasons:

  1. To make SOAP/cross language interoperability much simpler (least common denominator – every language does arrays)
  2. There were no generics at the time; using arrays instead of generics in the interfaces meant that I could explicitly define the type of the array

The second item is not as important anymore now that we have generics, so I’m inclined to agree that arrays should be used sparingly nowadays.

Another interesting point is that iteration over collections should be done using a functor instead of a plain iterator. For example:

List<Name> names = ...; 
MyListOps.apply(new MyApplyFn<Name>() { 
  public void apply(Name n) { 
    // use n 
  } 
}, names);

This allows the possibility of processing the collection in multiple threads.
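To make the snippet above concrete, here is one possible shape for MyListOps and MyApplyFn. Neither type was shown in the talk notes, so both definitions are assumptions, and this version simply applies the function sequentially. Because the caller no longer controls the loop, the body of apply could later be changed to split the list across threads without touching the calling code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MyListOps {

    /** Callback invoked once per element (the "functor"). */
    public interface MyApplyFn<T> {
        void apply(T item);
    }

    /** Sequential implementation; the iteration strategy is hidden from callers. */
    public static <T> void apply(MyApplyFn<T> fn, List<T> items) {
        for (T item : items) {
            fn.apply(item);
        }
    }

    /** Usage example: collects the upper-cased names that were visited. */
    public static List<String> demo() {
        List<String> names = Arrays.asList("Ada", "Grace");
        final List<String> seen = new ArrayList<String>();
        MyListOps.apply(new MyApplyFn<String>() {
            public void apply(String n) {
                seen.add(n.toUpperCase());
            }
        }, names);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[ADA, GRACE]"
    }
}
```

Note that a parallel version would require the function itself to be thread safe; the ArrayList used in demo would not be.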

What’s coming in Java 7

This was Ted Neward’s next talk which was just as interesting and opinionated. Here are the highlights:

  • The release is targeted for early 2010
  • There is no official JSR for Java 7
  • Most of the information on what is going into Java 7 can be found on the blog of Joe Darcy of Sun.
  • Alex Miller has a huge page on his blog detailing what is in and what is out. This is information that he is aggregating from around the web.

One of the most compelling additions to Java 7 is JSR 292, which introduces the bytecode invokedynamic. The implications of this addition are narrated by Charles Nutter of JRuby. There are other syntactic conveniences making it in; however it will not include closures (a fairly controversial topic.)

NFJS Boston Day 1

Yesterday was day 1 of No Fluff Just Stuff in Boston. Here is a highlight of the sessions I attended.

JSF 2.0

David Geary presented an overview of JSF 2.0. The reference implementation is known as Project Mojarra which falls under the GlassFish family. The current reference implementation as well as the 2.0 RC can be downloaded from the project site.

Here are the highlights of what is new:

  • Facelets (templated XHTML with an expression language) instead of JSP
  • Annotations and convention instead of XML for configuration
  • Bookmark friendly views
  • Improved error messages
  • Richer event model
  • Ajax integration

I personally don’t do any web development, however I was interested in this topic as some of our Coherence*Web customers do use JSF.

The Java Memory Model

The first time I heard about the JMM was during my interview with Tangosol, which incidentally occurred almost three years ago to the day. It isn’t often that you learn so much at a job interview. This was a clear indication that a job at Tangosol would result in me learning many new things from people much smarter than me. Consequently, I had at least a passing familiarity with many of the items covered during this talk. However, I was glad to see that Brian Goetz is as clear a speaker as he is a writer, which is a real treat. The ability to transfer knowledge on complex topics not well understood by most people in the industry in such a straightforward manner is not something to be taken for granted.

I’ll try to provide a succinct 60 second overview of the JMM and why it matters if you’re a Java developer. It can be boiled down to a simple question:

If you assign a variable as such in thread A:

x = 5;

Under which circumstances will the following evaluate to true in thread B?

x == 5

If the write and read of variable x are done outside of a synchronized block and x is not declared volatile, then there is no guarantee that thread B will ever see the updated value. Straight from the slide deck:

  • The memory effects of one thread may not be immediately visible to other threads
  • Modern microprocessors exhibit a higher level of asynchronous and nondeterministic behavior than “when we were kids”
  • Compilers may reorder instructions (if permitted by language semantics) to achieve higher performance

In a nutshell, modern multiprocessor and multicore machines perform aggressive caching to improve performance, at the cost of nondeterministic behavior as described above. Therefore, any time variables are shared between threads, not only do you have to worry about ensuring that threads don’t step on each other (causing data corruption), but you also have to make sure that the updated value is visible to all threads.

Therefore the synchronized keyword actually performs double duty: it defines boundaries for critical sections, and it also ensures that variables written inside one are visible to any thread that subsequently synchronizes on the same lock. The volatile keyword indicates that reads and writes of a variable must not be satisfied from a thread-local cached copy: a write will be visible to every thread that subsequently reads the variable.
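A small sketch of the x = 5 question, assuming the calling thread plays the role of thread B. The class and field names are made up for illustration. The volatile flag is what makes the plain write to x visible: everything written before the volatile write is visible to a thread that later reads the flag and sees true.

```java
public class VisibilitySketch {
    private static int x;                  // plain field: no visibility guarantee on its own
    private static volatile boolean ready; // volatile: establishes the happens-before edge

    public static int run() {
        x = 0;
        ready = false;

        Thread threadA = new Thread(() -> {
            x = 5;        // (1) ordinary write
            ready = true; // (2) volatile write publishes (1) as well
        }, "thread-A");
        threadA.start();

        // (3) volatile read; if ready were not volatile this loop might never exit
        while (!ready) {
            Thread.yield();
        }
        return x;         // (4) guaranteed to observe the write from (1)
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "5"
    }
}
```

Removing the volatile modifier would not necessarily make the program fail on a given machine, which is exactly what makes this class of bug hard to reproduce.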

These IBM developerWorks articles describe the JMM in much greater detail:
Fixing the Java Memory Model, Part 1
Fixing the Java Memory Model, Part 2

Are All Web Applications Broken?

This talk builds upon the previous JMM talk and looks at a practical example of where this knowledge becomes important: web applications. Some web applications (i.e. servlets or web frameworks that run on servlets) are stateless, at least at the web tier. Many use the database for state, in which case the database is handling concurrency.

However, some web applications do track state internally, either as a member variable in the servlet (uncommon) or in the ServletContext (more common.) In this case it is definitely up to the developer to handle concurrency correctly, keeping in mind the lessons of the JMM.

The more subtle (and common) pitfall is in handling HttpSession objects. The common assumption is that access to a session object does not necessarily need to be thread safe since each session is scoped to a specific user who will presumably make one request at a time. However this assumption does not hold true in (at least) the following cases:

  • Your app uses frames
  • Your app uses Ajax
  • Your app is a portlet
  • Your user has an itchy trigger finger and likes to triple click on links

In these cases, you can easily have multiple threads accessing a session. Setting and getting attributes in a session is likely thread safe, but there’s no guarantee that the objects in the session themselves are thread safe. This is especially evident when distributing sessions, as this requires the container to serialize session attributes. Best practices for objects in a session are as follows:

  • Use immutable objects (you may have to combine them with atomic operations via AtomicReference if there are check-then-act or read-modify-write actions.)
  • Use thread safe objects (e.g. ConcurrentHashMap instead of HashMap)
  • Don’t put plain JavaBeans in a session!
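The first recommendation can be sketched without the servlet API. CartState and SessionStateSketch are hypothetical names; in a real application the AtomicReference holder would be the object stored as the session attribute. Each simulated request performs a read-modify-write with compareAndSet and retries if another request got in first.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SessionStateSketch {

    /** Immutable snapshot of per-user state (hypothetical example type). */
    static final class CartState {
        final int itemCount;

        CartState(int itemCount) {
            this.itemCount = itemCount;
        }

        CartState withOneMore() {
            return new CartState(itemCount + 1);
        }
    }

    /** Atomic read-modify-write: retry when another thread changed the state first. */
    static void addItem(AtomicReference<CartState> ref) {
        while (true) {
            CartState current = ref.get();
            if (ref.compareAndSet(current, current.withOneMore())) {
                return;
            }
        }
    }

    /** Simulates several concurrent requests against the same session attribute. */
    public static int demo(int threads, int addsPerThread) {
        final AtomicReference<CartState> ref =
                new AtomicReference<>(new CartState(0));
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < addsPerThread; j++) {
                    addItem(ref);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return ref.get().itemCount;
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // prints "4000"
    }
}
```

With a mutable JavaBean and a plain itemCount++ the same test would be free to lose updates.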

For more, see the IBM developerWorks article.

How to get an OutOfMemoryError without trying

In the past few months, I’ve seen customers run into mysterious OutOfMemoryErrors that seem to come out of nowhere. For the most part their apps are working fine, then out of the blue the heap blows up, and it is never reproducible. In each case, the culprit turned out to be something like the following:

public static void main(String[] asArgs)
    {
    final int    nCount = 5;
    final int    nRange = 1000;
    final Map    map    = new HashMap();
    final Random random = new Random();
 
    final Runnable r = new Runnable()
        {
        public void run()
            {
            while (true)
                {
                int nKey = random.nextInt(nRange);
                if (random.nextBoolean())
                    {
                    map.put(nKey, System.currentTimeMillis());
                    }
                else
                    {
                    map.remove(nKey);
                    }
                }
            }
        };
 
    Thread[] threads = new Thread[nCount];
 
    System.out.println("Starting " + nCount +
            " threads, range = " + nRange);
 
    for (int i = 0; i < threads.length; i++)
        {
        threads[i] = new Thread(r, "Thread " + i);
        threads[i].start();
        }
    }

See the bug? The problem here is with multiple threads using a java.util.HashMap in the absence of synchronization. One would imagine that at worst this usage would result in inaccurate data in the map. However, this turns out not to be the case.

Running under Java 1.5 on OS X, this runs for a few seconds before it gets stuck in an infinite loop (evidenced by the CPU spiking to 100%):

"Thread 4" prio=5 tid=0x0100c350 nid=0x853000 runnable [0xb0e8e000..0xb0e8ed90]
        at java.util.HashMap.put(HashMap.java:420)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:30)
        at java.lang.Thread.run(Thread.java:613)
 
"Thread 3" prio=5 tid=0x0100bde0 nid=0x852200 runnable [0xb0e0d000..0xb0e0dd90]
        at java.util.HashMap.removeEntryForKey(HashMap.java:614)
        at java.util.HashMap.remove(HashMap.java:584)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:34)
        at java.lang.Thread.run(Thread.java:613)
 
"Thread 2" prio=5 tid=0x0100ba20 nid=0x851200 runnable [0xb0d8c000..0xb0d8cd90]
        at java.util.HashMap.removeEntryForKey(HashMap.java:614)
        at java.util.HashMap.remove(HashMap.java:584)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:34)
        at java.lang.Thread.run(Thread.java:613)
 
"Thread 1" prio=5 tid=0x0100b610 nid=0x850400 runnable [0xb0d0b000..0xb0d0bd90]
        at java.util.HashMap.removeEntryForKey(HashMap.java:614)
        at java.util.HashMap.remove(HashMap.java:584)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:34)
        at java.lang.Thread.run(Thread.java:613)
 
"Thread 0" prio=5 tid=0x0100b430 nid=0x84f600 runnable [0xb0c8a000..0xb0c8ad90]
        at java.util.HashMap.removeEntryForKey(HashMap.java:614)
        at java.util.HashMap.remove(HashMap.java:584)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:34)
        at java.lang.Thread.run(Thread.java:613)

Under 1.6, it runs for about a minute before I get:

java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.resize(HashMap.java:462)
        at java.util.HashMap.addEntry(HashMap.java:755)
        at java.util.HashMap.put(HashMap.java:385)
        at com.tangosol.examples.misc.HashMapTest$1.run(HashMapTest.java:30)
        at java.lang.Thread.run(Thread.java:637)

I configured the VM to generate a heap dump upon OutOfMemoryError. Here are some screenshots from Eclipse MAT:

[Screenshot: MAT 1]

[Screenshot: MAT 2]

Both of these behaviors can be explained by race conditions that corrupt the HashMap internal data structures causing infinite loops, the latter case resulting in an OOME. This behavior is described in this Stack Overflow thread, which links to this blog post describing one of the possible race conditions in detail.
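For comparison, here is a sketch of the same workload using java.util.concurrent.ConcurrentHashMap. The class name SafeMapTest and the bounded iteration count are my additions so that the program terminates; otherwise it mirrors the test above. Each thread also gets its own Random, which is not required for correctness but avoids contention.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapTest {

    /** Runs nCount threads that randomly put/remove keys in [0, nRange). */
    public static int run(int nCount, final int nRange, final int nIterations) {
        final Map<Integer, Long> map = new ConcurrentHashMap<>();
        Thread[] threads = new Thread[nCount];

        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                Random random = new Random(); // one instance per thread
                for (int n = 0; n < nIterations; n++) {
                    int nKey = random.nextInt(nRange);
                    if (random.nextBoolean()) {
                        map.put(nKey, System.currentTimeMillis());
                    } else {
                        map.remove(nKey);
                    }
                }
            }, "Thread " + i);
            threads[i].start();
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        // The map can never hold more than nRange distinct keys
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("Final size: " + run(5, 1000, 100000));
    }
}
```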

The lessons to be learned here are:

  • When using non thread safe data structures, make sure that only one thread will access them at a time, or switch to a thread safe data structure.
  • Configure JVMs in production to generate a heap dump upon an OutOfMemoryError (this has helped us track down various OOMEs for customers), and consider configuring the JVM to shut down if this error is thrown. The Coherence production checklist provides information on how to configure these settings on various JVMs.
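As one way to enforce the second lesson, an application could check at startup whether it was launched with the heap dump option. The sketch below assumes a HotSpot-based JVM, where the option is -XX:+HeapDumpOnOutOfMemoryError; other JVMs use different settings. The class name is made up, and the check only looks at the arguments reported by the RuntimeMXBean.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class HeapDumpFlagCheck {

    // HotSpot option that writes a heap dump when an OutOfMemoryError is thrown
    static final String FLAG = "-XX:+HeapDumpOnOutOfMemoryError";

    /** Returns true if the given JVM arguments contain the heap dump option. */
    public static boolean hasHeapDumpFlag(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to main)
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!hasHeapDumpFlag(jvmArgs)) {
            System.err.println("Warning: JVM started without " + FLAG);
        }
    }
}
```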

Coherence 3.5: POF Extractor/Updater

This article is part 3 of a 3 part series on my favorite new features in Coherence 3.5.

Ever since Coherence added support for .NET clients (and more recently C++ clients, which are implied wherever .NET is mentioned below), we’ve been asked this question:

When do I have to provide both .NET and Java implementations of my classes?

With each new release of Coherence, it becomes less of a requirement to provide .NET and Java implementations of cache objects. Here is a timeline of the evolution of multi language support:

Coherence 3.2/3.3

Support for .NET clients. .NET objects are serialized into a platform neutral serialization format (POF) and sent over TCP to a proxy server. The proxy server deserializes these objects and serializes them into Java format before sending into the grid for storage, thus the requirement for .NET and Java versions of each type.

Coherence 3.4

Support for .NET and C++ clients. Grid is enhanced to allow for POF binaries to be stored natively in the grid, thus removing the deserialization/serialization step previously required in the proxy servers. .NET and Java versions of cached objects are required for:

  • Entry processors
  • Queries
  • Cache Store
  • Key association

For applications with .NET clients that only do puts and gets, there is no need for Java versions of their objects in the grid.

Coherence 3.5

New in 3.5 is the ability for cache servers to extract and update data in POF binaries without deserializing the binary into an object. This is done via PofExtractors and PofUpdaters. A PofExtractor is an implementation of ValueExtractor, which is an interface that defines how to extract data from objects. The most common extractor in use today is ReflectionExtractor, which simply means that the provided method will be invoked on the target object, and the result from that method is returned.

This means that operations that rely on extractors (such as queries and some entry processors) can now be executed on the server side without needing Java classes to represent the data types.

Here is an example. Let’s say you have the following type (I wrote it in Java, but this could also be done in .NET)

public class Person
        implements PortableObject
    {
    public Person()
        {
        }
 
    public Person(String sFirstName, String sLastName, String sEmail)
        {
        m_sFirstName = sFirstName;
        m_sLastName = sLastName;
        m_sEmail = sEmail;
        }
 
    // getters and setters omitted.. 
 
    public void readExternal(PofReader in)
            throws IOException
        {
        m_sFirstName = in.readString(FIRST_NAME);
        m_sLastName  = in.readString(LAST_NAME);
        m_sEmail     = in.readString(EMAIL);
        }
 
    public void writeExternal(PofWriter out)
            throws IOException
        {
        out.writeString(FIRST_NAME, m_sFirstName);
        out.writeString(LAST_NAME, m_sLastName);
        out.writeString(EMAIL, m_sEmail);
        }
 
    private String m_sFirstName;
    private String m_sLastName;
    private String m_sEmail;
 
    public static final int FIRST_NAME = 0;
    public static final int LAST_NAME  = 1;
    public static final int EMAIL      = 2;
    }

Now for some sample code on executing a query:

NamedCache pofCache = CacheFactory.getCache("pof");
 
// These names are fictitious: any resemblance to real people
// is coincidental!
pofCache.put(1, new Person("Bob", "Smith", "bob.smith@google.com"));
pofCache.put(2, new Person("Jane", "Doe", "jane.doe@yahoo.com"));
pofCache.put(3, new Person("Fred", "James", "fred.james@oracle.com"));
pofCache.put(4, new Person("Amy", "Jones", "amy.jones@oracle.com"));
pofCache.put(5, new Person("Ted", "Black", "ted.black@google.com"));
 
// Query for oracle.com addresses
Set keys = pofCache.keySet(new LikeFilter(new PofExtractor(Person.EMAIL), 
        "%@oracle.com", '\\', false));
 
assert keys.size() == 2;
assert keys.contains(3);
assert keys.contains(4);

The cache configuration (note the system-property override in the serializer config; this comes into play later):

<?xml version="1.0"?>
 
<!DOCTYPE cache-config SYSTEM "cache-config.dtd">
 
<cache-config>
 
  <caching-scheme-mapping>
    <cache-mapping>
      <cache-name>pof</cache-name>
      <scheme-name>pof</scheme-name>
    </cache-mapping>
  </caching-scheme-mapping>
 
  <caching-schemes>
    <distributed-scheme>
      <scheme-name>pof</scheme-name>
      <serializer>
        <class-name>com.tangosol.io.pof.ConfigurablePofContext</class-name>
        <init-params>
          <init-param>
            <param-type>string</param-type>
            <param-value system-property="pof.config">pof-config.xml</param-value>
          </init-param>
        </init-params>
      </serializer>
      <service-name>PofDistributedService</service-name>
      <backing-map-scheme>
        <local-scheme />
      </backing-map-scheme>
      <autostart>true</autostart>
    </distributed-scheme>
 
  </caching-schemes>
 
</cache-config>

And, the POF configuration on the client:

<!DOCTYPE pof-config SYSTEM "pof-config.dtd">
 
<pof-config>
  <user-type-list>
    <include>coherence-pof-config.xml</include>
 
    <user-type>
      <type-id>10001</type-id>
      <class-name>com.tangosol.examples.pof.Person</class-name>
    </user-type>
 
  </user-type-list>
</pof-config>

To run this, I set up a cache server without adding any extra classes to the classpath. I only provided the above cache configuration, and I supplied the following on the command line:

-Dpof.config=coherence-pof-config.xml

Why did I do this? The server side does not need to know about the client’s POF configuration since it does not need to deserialize the objects. Therefore I’m simply supplying the default POF configuration that ships with Coherence.

Given the addition of this new feature, we can modify the list from 3.4 as such:

  • Entry processors (no longer required for those that rely on extractors and updaters)
  • Queries (no longer required)
  • Cache Store
  • Key association

To summarize, the introduction of POF extractors and updaters means that .NET clients only need Java implementations of their respective classes when performing CacheStore operations and/or key association.

Coherence 3.5: Service Guardian (Deadlock Detection)

This article is part 2 of a 3 part series on my favorite new features in Coherence 3.5.

One of the great benefits of using a modern JVM is deadlock detection. At my previous job I remember helping to track down an intermittent issue with our Swing desktop client that was eventually solved by providing instructions to our support/QA staff on how to generate a thread dump when the issue surfaced (which is much harder on Windows than on Unix/Linux based OSes.) Once they sent us the thread dump (which so conveniently printed the threads that were deadlocked at the bottom), fixing the issue was trivial.
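The same detection that shows up at the bottom of a thread dump is also available programmatically through java.lang.management.ThreadMXBean (Java 6 and later). The sketch below deliberately deadlocks two threads and asks the JVM to find them. The class name and the polling loop are my own, and it assumes a JVM that supports monitoring of java.util.concurrent locks, as HotSpot does.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDetectionSketch {

    /** Deadlocks two threads on purpose and returns how many the JVM reports. */
    public static int detect() {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Opposite acquisition order is what produces the deadlock
        Thread t1 = new Thread(grabInOrder(lockA, lockB, bothHoldFirst), "demo-1");
        Thread t2 = new Thread(grabInOrder(lockB, lockA, bothHoldFirst), "demo-2");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        try {
            // Poll for up to roughly five seconds
            for (int i = 0; i < 500 && ids == null; i++) {
                Thread.sleep(10);
                ids = mx.findDeadlockedThreads(); // null when no deadlock exists
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Break the deadlock so the demo threads can exit
            t1.interrupt();
            t2.interrupt();
        }
        return ids == null ? 0 : ids.length;
    }

    private static Runnable grabInOrder(ReentrantLock first, ReentrantLock second,
                                        CountDownLatch latch) {
        return () -> {
            first.lock();
            try {
                latch.countDown();
                latch.await();             // wait until both threads hold their first lock
                second.lockInterruptibly(); // blocks forever unless interrupted
                second.unlock();
            } catch (InterruptedException e) {
                // interrupted by detect() once the deadlock has been observed
            } finally {
                first.unlock();
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + detect());
    }
}
```

This only sees deadlocks inside a single JVM, which is why it does not help with the distributed case discussed next.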

Deadlocks can and do happen in distributed systems, and unfortunately there isn’t a good mechanism to detect distributed deadlocks. However, Oracle Coherence 3.5 does bring us closer with a new feature we call the Service Guardian. The concept behind the guardian is to ensure that each of the threads under our control is responsive; when one is not, the cluster node should take action. Out of the box you can configure it to remove the node from the cluster (default) or shut down the JVM. You can also provide an implementation of ServiceFailurePolicy to provide custom handling of detected deadlocks.

Deadlocks can have especially bad consequences in a distributed system since there are inherent dependencies between nodes. In my experience, I’ve seen deadlocks in clusters due to one of three reasons:

Bugs in customer code

Concurrent programming is difficult enough; mix it in with distributed computing and you can get into some sticky situations. Several times in the past I’ve seen deadlocks occur within event handling code. Here’s one way that event handlers can deadlock:

/**
 * @author pperalta Jul 20, 2009
 */
public class GuardianDemo
        implements MapListener
    {
    public static void main(String[] args)
        {
        NamedCache cache = CacheFactory.getCache("test");
        cache.addMapListener(new GuardianDemo());
        while (true)
            {
            int nKey = RANDOM.nextInt(10);
            try
                {
                cache.lock(nKey, -1);
                List listValue = (List) cache.get(nKey);
                if (listValue == null)
                    {
                    listValue = new ArrayList();
                    }
                listValue.add(System.currentTimeMillis());
                cache.put(nKey, listValue);
                }
            finally
                {
                cache.unlock(nKey);
                }
            }
        }
 
    public void entryInserted(MapEvent evt)
        {
        }
 
    public void entryUpdated(MapEvent evt)
        {
        NamedCache cache = (NamedCache) evt.getSource();
        Object     nKey  = evt.getKey();
 
        try
            {
            cache.lock(nKey, -1);
            List listValue = (List) cache.get(nKey);
            if (listValue.size() > 0)
                {
                Object lValue = listValue.remove(0);
                cache.put(nKey, listValue);
                System.out.println("Removed " + lValue + " from " + nKey);
                }
            }
        finally
            {
            cache.unlock(nKey);
            }
        }
 
    public void entryDeleted(MapEvent evt)
        {
        }
 
    private static Random RANDOM = new Random(System.currentTimeMillis());
    }

When registering a map listener with Coherence, a background thread will be spawned to handle events. Upon receiving an event, Coherence will queue it up for the event handler (the customer provided implementation of MapListener) to process. If we notice that events are being handled slower than they are being generated, then we will attempt to throttle the creation of new events so as to not allow the event queue to grow unbounded (and eventually exhaust the heap.)

A bit of a digression: the event throttling is not a new feature of Coherence; it has been around since at least 3.2.

When I ran this code with Coherence 3.4, it ran for a while but eventually stopped:

Oracle Coherence Version 3.4.2/411p7
 Grid Edition: Development mode
Copyright (c) 2000-2009 Oracle. All rights reserved.
...
Removed 1248134614674 from 9
Removed 1248134614692 from 9
Removed 1248134614697 from 9
Removed 1248134614699 from 9
Removed 1248134614703 from 9
Removed 1248134614706 from 9
Removed 1248134614717 from 9
Removed 1248134614708 from 6
Removed 1248134614713 from 3
Removed 1248134614719 from 6
Removed 1248134614727 from 6
Removed 1248134614723 from 3
Removed 1248134614701 from 5
Removed 1248134614709 from 8
Removed 1248134614732 from 8
Removed 1248134614736 from 3
Removed 1248134614725 from 7
Removed 1248134614729 from 5
Removed 1248134614745 from 3
Removed 1248134614733 from 8
...

When it stopped running, I captured a thread dump:

"DistributedCache:EventDispatcher" daemon prio=5 tid=0x01019e40 nid=0x83e200 in Object.wait() [0xb1113000..0xb1113d90]
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:474)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:31)
	- locked <0x295b8d88> (a com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$LockRequest$Poll)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.lock(DistributedCache.CDB:37)
	at com.tangosol.util.ConverterCollections$ConverterConcurrentMap.lock(ConverterCollections.java:2024)
	at com.tangosol.util.ConverterCollections$ConverterNamedCache.lock(ConverterCollections.java:2539)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.lock(DistributedCache.CDB:1)
	at com.tangosol.coherence.component.util.SafeNamedCache.lock(SafeNamedCache.CDB:1)
	at com.tangosol.examples.guardian.GuardianDemo.entryUpdated(GuardianDemo.java:56)
	at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195)
	at com.tangosol.util.MapEvent.dispatch(MapEvent.java:164)
	at com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
	at com.tangosol.coherence.component.util.SafeNamedCache.translateMapEvent(SafeNamedCache.CDB:7)
	at com.tangosol.coherence.component.util.SafeNamedCache.entryUpdated(SafeNamedCache.CDB:1)
	at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.dispatch(DistributedCache.CDB:22)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.entryUpdated(DistributedCache.CDB:1)
	at com.tangosol.util.MapEvent.dispatch(MapEvent.java:195)
	at com.tangosol.coherence.component.util.CacheEvent.run(CacheEvent.CDB:18)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onNotify(Service.CDB:19)
	at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:37)
	at java.lang.Thread.run(Thread.java:613)
 
...
 
"main" prio=5 tid=0x01001480 nid=0xb0801000 waiting on condition [0xb07ff000..0xb0800148]
	at java.lang.Thread.sleep(Native Method)
	at com.tangosol.coherence.component.util.Daemon.sleep(Daemon.CDB:9)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$EventDispatcher.drainOverflow(Grid.CDB:15)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.post(Grid.CDB:17)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.send(Grid.CDB:1)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:12)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.unlock(DistributedCache.CDB:32)
	at com.tangosol.util.ConverterCollections$ConverterConcurrentMap.unlock(ConverterCollections.java:2032)
	at com.tangosol.util.ConverterCollections$ConverterNamedCache.unlock(ConverterCollections.java:2555)
	at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.unlock(DistributedCache.CDB:1)
	at com.tangosol.coherence.component.util.SafeNamedCache.unlock(SafeNamedCache.CDB:1)
	at com.tangosol.examples.guardian.GuardianDemo.main(GuardianDemo.java:40)

We can see that the event dispatcher thread is waiting to acquire a lock for a key. The main thread has that key locked and (in a bit of an ironic twist) is attempting to release the lock. However, the throttling mechanism has kicked in, and it won’t allow any more operations on the cache until the event queue is drained, which will never happen since the thread responsible for draining the event queue is stuck waiting for the lock to be released.

Now, let’s run it with Coherence 3.5:

Oracle Coherence Version 3.5/459
 Grid Edition: Development mode
Copyright (c) 2000, 2009, Oracle and/or its affiliates. All rights reserved.
...
Removed 1248136418346 from 2
Removed 1248136418361 from 2
Removed 1248136418363 from 6
Removed 1248136418365 from 3
Removed 1248136418366 from 6
Removed 1248136418369 from 2
Removed 1248136418367 from 3
Removed 1248136418371 from 6
Removed 1248136418376 from 6
Removed 1248136418389 from 2
Removed 1248136418383 from 6
Removed 1248136418384 from 3
...
Removed 1248136419975 from 3
Removed 1248136420113 from 2
Removed 1248136420114 from 7
Removed 1248136420116 from 2
2009-07-20 20:33:40.473/6.683 Oracle Coherence GE 3.5/459 <Warning> (thread=main, member=1): The event queue appears to be stuck.
Removed 1248136420076 from 1
2009-07-20 20:33:40.475/6.685 Oracle Coherence GE 3.5/459 <Error> (thread=main, member=1): Full Thread Dump
...
Thread[DistributedCache:EventDispatcher,5,Cluster]
        java.lang.Object.wait(Native Method)
        java.lang.Object.wait(Object.java:474)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:31)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.lock(DistributedCache.CDB:37)
        com.tangosol.util.ConverterCollections$ConverterConcurrentMap.lock(ConverterCollections.java:2024)
        com.tangosol.util.ConverterCollections$ConverterNamedCache.lock(ConverterCollections.java:2539)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.lock(DistributedCache.CDB:1)
        com.tangosol.coherence.component.util.SafeNamedCache.lock(SafeNamedCache.CDB:1)
        com.tangosol.examples.guardian.GuardianDemo.entryUpdated(GuardianDemo.java:56)
        com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
        com.tangosol.util.MapEvent.dispatch(MapEvent.java:166)
        com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
        com.tangosol.coherence.component.util.SafeNamedCache.translateMapEvent(SafeNamedCache.CDB:7)
        com.tangosol.coherence.component.util.SafeNamedCache.entryUpdated(SafeNamedCache.CDB:1)
        com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.dispatch(DistributedCache.CDB:22)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap$ProxyListener.entryUpdated(DistributedCache.CDB:1)
        com.tangosol.util.MapEvent.dispatch(MapEvent.java:210)
        com.tangosol.coherence.component.util.CacheEvent.run(CacheEvent.CDB:18)
        com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onNotify(Service.CDB:26)
        com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
        java.lang.Thread.run(Thread.java:613)
...
Thread[main,5,main]
        java.lang.Thread.dumpThreads(Native Method)
        java.lang.Thread.getAllStackTraces(Thread.java:1460)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        java.lang.reflect.Method.invoke(Method.java:585)
        com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:791)
        com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.drainOverflow(Service.CDB:45)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$EventDispatcher.drainOverflow(Grid.CDB:9)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.post(Grid.CDB:17)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.send(Grid.CDB:1)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:12)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.unlock(DistributedCache.CDB:32)
        com.tangosol.util.ConverterCollections$ConverterConcurrentMap.unlock(ConverterCollections.java:2032)
        com.tangosol.util.ConverterCollections$ConverterNamedCache.unlock(ConverterCollections.java:2555)
        com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.unlock(DistributedCache.CDB:1)
        com.tangosol.coherence.component.util.SafeNamedCache.unlock(SafeNamedCache.CDB:1)
        com.tangosol.examples.guardian.GuardianDemo.main(GuardianDemo.java:40)

Here the guardian took the following actions:

  1. Detected the stuck event dispatcher thread: <Warning> (thread=main, member=1): The event queue appears to be stuck.
  2. Printed the stacks of each thread
  3. Restarted the cluster threads in order to keep the node up and running

Much nicer than having a deadlocked node!
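
The detect, dump, and recover sequence can be pictured as a simple watchdog. This is a hedged sketch in plain Java and not the Coherence guardian itself: the class GuardianSketch, its heartbeat method, and the timeout handling are hypothetical, and recovery here is a plain interrupt where the real guardian restarts the cluster threads. The one detail taken from the dump above is Thread.getAllStackTraces(), which appears in the main thread's stack.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical watchdog illustrating the three guardian actions; not a Coherence class.
public class GuardianSketch {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
    private final Thread guarded;
    private final long timeoutMillis;

    public GuardianSketch(Thread guarded, long timeoutMillis) {
        this.guarded = guarded;
        this.timeoutMillis = timeoutMillis;
    }

    /** The guarded thread calls this whenever it makes progress. */
    public void heartbeat() {
        lastHeartbeat.set(System.currentTimeMillis());
    }

    /** Step 1: a thread counts as stuck when its last heartbeat is older than the timeout. */
    public boolean isStuck(long nowMillis) {
        return nowMillis - lastHeartbeat.get() > timeoutMillis;
    }

    /** Step 2: collect the stack of every live thread. */
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            sb.append(entry.getKey()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("        ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Runs all three steps; returns true if recovery was attempted. */
    public boolean checkAndRecover(long nowMillis) {
        if (!isStuck(nowMillis)) {
            return false;
        }
        System.err.println("The guarded thread appears to be stuck.");
        System.err.println(dumpAllThreads());
        guarded.interrupt();   // Step 3: simplified recovery, assumed for this sketch
        return true;
    }
}
```

In a real system checkAndRecover would be called periodically from a separate thread, with the current time passed in; taking the time as a parameter just keeps the sketch easy to exercise.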

Bugs in Coherence code

Despite our best efforts, bugs (including deadlocks) do occasionally appear in Coherence, just as in any other product. In particular, the kind of deadlock that has the worst consequences is one that involves a service thread. Everything in Coherence (the clustering logic, replicated caches, distributed caches, remote invocations, statistics, etc.) is implemented internally using queues and threads that are responsible for processing the messages in those queues. These are the service threads, and they are the lifeblood of Coherence. If this type of defect should slip into a future version of Coherence, the guardian will detect the condition and take corrective action to allow the node (and the cluster) to continue to function.

Bugs in the JVM/Operating System

Even in the absence of bugs in customer or Coherence code, we do occasionally see bugs in the JVM and/or the operating system that result in locked up service threads. Perhaps the most notorious of these involved early versions of NPTL on Linux. In a nutshell, we saw that threads occasionally missed notifications (in other words, threads that were in Object.wait() would never receive the Object.notify() or Object.notifyAll() that we sent to them). I’ve also seen older JVMs with buggy implementations of the wait/notify mechanism that produced the same results.
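
To make the failure mode concrete: a thread that calls Object.wait() with no timeout depends entirely on the notification arriving. The sketch below shows the usual wait/notify handshake with one common defensive variation, a bounded wait that rechecks the condition periodically so a lost notification delays the waiter instead of hanging it. This is an illustration in plain Java, not a description of what Coherence does internally; the class name and the 50 ms recheck interval are assumptions.

```java
// Hypothetical example of a wait/notify handshake that tolerates a lost notification.
public class TimedWaitSketch {
    private final Object monitor = new Object();
    private boolean ready;

    /** Sets the condition and notifies any waiters. */
    public void signal() {
        synchronized (monitor) {
            ready = true;
            monitor.notifyAll();
        }
    }

    /** Waits up to timeoutMillis for the condition; returns whether it became true. */
    public boolean await(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (monitor) {
            while (!ready) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false;
                }
                try {
                    // Bounded wait: even if the notification is lost, the loop
                    // wakes up and rechecks the condition within 50 ms.
                    monitor.wait(Math.min(remainingMillis, 50));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return ready;
                }
            }
            return true;
        }
    }
}
```

The important part is that the waiter tests the ready flag itself and does not rely on the wake-up alone, so a missed notify costs at most one recheck interval.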

One of our goals with Coherence is to keep your application up and running at all times, even when components of the system (hardware, databases, etc.) fail. The guardian is one more tool in our arsenal to bring us closer to that goal.