Thoughts on software development and other stuff

Coherence 3.5: Improved Partition Distribution Algorithm

without comments

This article is part 1 of a 3 part series on my favorite new features in Coherence 3.5.

A partition is a unit of storage and transfer in a Coherence partitioned/distributed cache. Out of the box we ship with 257 partitions, which means that clusters with 16 or fewer storage enabled nodes do not need to have this setting modified. Larger clusters require more partitions, as this will ensure even distribution of data across the grid. Prior to Coherence 3.5, our guideline for setting the partition count was to take the number of storage enabled nodes, square it, and round up to the next prime number. (See partition-count setting here).

When new nodes are added to a cluster, the senior node (the oldest in the grid) is responsible for coordinating ownership of the partitions. Note that this is not a single point of failure since the departure of a senior node will result in the promotion of the next oldest node which will seamlessly take over the responsibilities of the senior. However, when a large cluster with a large partition count is starting up, the senior node does have a lot of (CPU) work to do regarding the management of partitions, and this becomes noticeable when starting up many nodes simultaneously. For this reason, large clusters are typically started up a few nodes at a time, the exact number depending on the CPU capabilities of the box.

I decided to test this out with a hypothetical cluster consisting of 100 storage nodes. Using the old partitioning guidelines, this would require 10,007 partitions. I started up 4 nodes at the same (since there are four cores on this box) time with this partition count:

Coherence 3.4 with 10,007 partitions

As you can see, we experienced about 95% CPU usage for about 15 seconds, followed by a couple of minutes of 5% usage.

In Coherence 3.5, we improved the distribution algorithm to require fewer partitions for large clusters. The new guidelines indicate that each partition should store no more than 50MB. In my hypothetical example, I’m running 100 nodes, each with a 512MB heap. Taking backups and headroom into account I can store 170MB of primary data on each node. If I take a conservative approach and store about 25MB per partition, that takes me to 7 partitions per node. 7 * 100 rounded up to the next prime is 701.

To summarize, we’re going from 10,001 partitions in 3.4 to 701 partitions in 3.5.

Here is the same CPU chart when starting up 4 nodes in 3.5 with this partition count:


This time we experience a very brief momentary spike of 65% CPU usage, after which we are using no CPU. Both charts cover about a four minute period.

This enhancement will drastically reduce the amount of time required to start up large clusters, and it will result in less CPU consumption by the senior member. The larger the cluster, the larger the benefit!

Written by Patrick Peralta

July 20th, 2009 at 4:46 pm