Thoughts on software development and other stuff

An Introduction to Data Grids for Database Developers

with 3 comments

A little over a month ago I attended Collaborate 09 in Orlando. While having lunch one day I was lucky enough to run into The Oracle Nerd. He provided a good description of our encounter (hint: I’m the “engineer on the Coherence team” he mentions.) I first encountered his blog via this thread, which turned out to be his first exposure to data grids.

Expanding on that, we both agreed that a writeup on data grids for DBAs (or, as he prefers to be called, a [lower case] dba) would be useful. Little did he know what an awful procrastinator I am. :) I’m going to limit the scope of this introduction to applications that use relational databases, as opposed to addressing applications that don’t use any kind of relational database or grid-oriented apps (that will have to be a topic for another time.)

Smarter Caching

An obvious (or maybe not so obvious depending on who you ask) first step in scaling a database application is to cache as much as you can. This is fairly easy to do if you have a single app server hitting a database. It becomes more interesting however as you add more app servers to the mix. For instance:

  • Is it OK if the caches on your app servers are out of sync?
  • What happens if one of the app servers wants to update an item in the cache?
  • How do you minimize the number of database hits to refresh the cache?
  • What if you don’t have enough memory on the app server to cache everything?

This is where a data grid can come in handy. Each of these items are easily addressed by Coherence:

The view of a Coherence cache will always be consistent across all nodes. If an app server updates the cache, all nodes will have instant access to that data. In fact, servers can register to receive notifications when data does change.

Most caching in app servers uses a pattern we call “cache aside,” meaning that it is up to the application to

  • check to see if the data is cached
  • if not, then load it from the database, place it into the cache, and return the item to the caller

A better approach is to use the “read through” pattern, meaning that it is up to the cache to load the data from the database upon a cache miss. The benefits to this approach are

  • application code is much simpler; it assumes that all data can be read through the cache API
  • if multiple threads (in a single JVM or across multiple JVMs) access the same item that is not in the cache, a single thread will read through to the database to load that item

This is a big win for the database; instead of answering the same question repeatedly, the database can answer the question once and all app servers benefit.

If expiry is desired for a cache, Coherence can be configured to perform “refresh ahead” on a cache. For example, if a cache is configured to expire after 5 minutes, you can configure a refresh ahead value (say 3 minutes) to determine when the value will be reloaded from the database. If an item is >3 minutes old (but not yet expired) and a thread requests that value, the currently cached value will be returned, and the value will be asynchronously refreshed in the background.

If more storage capacity is required, all you have to do is add servers. Generally there is no extra configuration required.

All in all, it provides a very sophisticated cache that can drastically reduce the number of SELECTs issued against a database.

Scaling Writes

This next data grid feature may be a bit more controversial for database developers. In the previous example, we can assume that the database is still the system of record (a.k.a the source of Truth.) For situations where we always want the database to hold the Truth, data grids can have caches configured to use the “write through” topology. This means that updates made to the cache will be synchronously written to the database, just like any other database app. However, if you have dozens of app servers with dozens of threads each writing to the database simultaneously, scalability will definitely be a concern. In this case, the cache can be configured to write the updates to the database asynchronously; this is known as “write behind” topology. Here are some of the objections that I’ve heard (and my response):

  • What happens if I lose a server? Coherence maintains a backup of each entry in the cache, and it keeps track of items that have not been flushed to the database. If a JVM or a machine is lost, the data (and the fact that it still needs to be flushed) will not be lost.
  • Are there any ordering guarantees? What about referential integrity? Let’s say you had a cache for Person and a cache for Address. If Address has a foreign key dependency on Person, then write behind is not a good fit. There is no guarantee that Person will make it out to the database before Address. In this case you’d have to combine the two into an object, and the write behind routine would know to insert Person before Address.

This may change someday, but the reality of write-behind today is that the database write should never fail (short of the database itself failing, in which case the item can simply be requeued until the database comes back up.) Read only caches can generally be easily retrofitted into an existing application; the same is not always true for write behind.

However, it is a very powerful apparatus in the data grid toolbox to help scale database applications.

Transient Data That You Don’t Want To Lose

Sometimes an application has transient data that does not need to be stored in a database, but it ends up being stored in the database anyway (because you don’t want to lose it.) A good example of this is HTTP sessions. You don’t want to lose a session if a web server goes down, but the database just seems like overkill for this. The scalability of the application will be limited, not to mention the scripts that will have to be written to clear out expired sessions.

For the specific case of HTTP sessions, Coherence*Web provides a solution to store sessions in Coherence caches. This is an OOTB solution; it will work with just about any J2EE compliant web application. It also works across many popular web containers, including open source and proprietary (Oracle and non Oracle) containers.

Hopefully this is a good broad introduction to data grids for database developers. Comments or follow up questions are welcome.

Written by Patrick Peralta

June 13th, 2009 at 11:16 pm