Charles Lamb

Consulting MTS

Cisco and Oracle NoSQL Database

Sun, 2011-10-09 18:00

The Oracle NoSQL Database development team has been working closely with the Cisco UCS team.  It is a great partnership: we collaborate on performance and scalability testing using their UCS C-Series Rack-Mount Servers and Cisco Nexus 5500 Series switches, and we have access to Cisco's large cluster to run tests and proofs of concept at massive scale.

I am planning to write some blog entries describing the results. Cisco has produced a solution brief about Oracle NoSQL Database on the UCS platform.

Oracle NoSQL Database vs Berkeley DB Java Edition

Thu, 2011-10-06 06:34

I've been watching the twitter-sphere for comments about Oracle NoSQL Database.  There are a number of common questions and misconceptions floating around that I'll address here:

Misconception #1: "Oracle NoSQL Database is just Berkeley DB Java Edition rebranded."; "Oracle NoSQL Database sounds like it's just Berkeley DB with extra bits."

When we built NoSQL Database, we recognized that Berkeley DB Java Edition HA provided us with lots of necessary, but not sufficient, elements for a NoSQL store.  For instance, JE/HA gives us:

  • ACID Transactions
  • Persistence
  • High Availability
  • High Throughput
  • Large Capacity
  • Lights out administration

And you could even argue that its key/value data model is already "NoSQL".  But we believe that NoSQL means something more to most people, like:

  • Data distribution
  • Dynamic partitioning (aka "sharding")
  • Load balancing
  • Monitoring and Administration
  • Predictable latency
  • Multi-node backup

So although NoSQL Database is built using BDB JE/HA as the underlying, battle-tested, storage system (why reinvent the wheel?), NoSQL Database adds a large amount of infrastructure on top of it to bring it into the NoSQL realm.  As my colleague Chao Huang says, "BDB JE is like an engine. NoSQL Database is the car built with the engine."

Misconception #2: "Oracle NoSQL Database has the same API as Berkeley DB Java Edition"

I realize that at the time of this writing we have not released the software so the reader has no way of looking at the javadoc to see the actual NoSQL Database API, but suffice it to say that the API is not the same as BDB JE.  The interface is Java, and it provides CRUD, iteration, and CAS (aka "RMW") capabilities on key/value pairs.  There is also a major/minor key capability.  All key/value pairs with the same major key reside on the same "Rep Group" (a Rep Group is just a BDB JE HA replication group of a master and N replicas).  That way, records can be clustered (e.g. put all records related to "Fred" on the same node).  One other (slight) difference between the BDB JE and NoSQL Database APIs is that the former uses byte[] for keys and the latter uses Strings for keys.  Both use byte[] for the data portion.
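Since the API has not been released yet, here is a rough, hypothetical sketch of the major/minor key idea described above. All class and method names below are invented for illustration; only the behavior is taken from the post: every record sharing a major key hashes to the same Rep Group, so related records cluster together.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch -- not the actual (unreleased) Oracle NoSQL Database API.
// Models major/minor key clustering: the major key alone decides placement,
// so all of "Fred"'s records land in the same simulated Rep Group.
public class MajorMinorKeySketch {
    static final int NUM_REP_GROUPS = 3;

    // One map per simulated Rep Group; entries are keyed by "major/minor".
    static final Map<Integer, Map<String, byte[]>> groups = new HashMap<>();

    static int groupFor(String majorKey) {
        // Only the major key participates in placement.
        return Math.abs(majorKey.hashCode()) % NUM_REP_GROUPS;
    }

    static void put(String majorKey, String minorKey, byte[] value) {
        groups.computeIfAbsent(groupFor(majorKey), g -> new HashMap<>())
              .put(majorKey + "/" + minorKey, value);
    }

    static byte[] get(String majorKey, String minorKey) {
        Map<String, byte[]> g = groups.get(groupFor(majorKey));
        return g == null ? null : g.get(majorKey + "/" + minorKey);
    }

    public static void main(String[] args) {
        // Both records share the major key "Fred", so both land in one group.
        put("Fred", "email", "fred@example.com".getBytes());
        put("Fred", "phone", "555-1212".getBytes());
        System.out.println(new String(get("Fred", "email")));
    }
}
```

Note that, per the post, real keys are Strings and values are byte[]; the grouping shown here is the point, not the storage mechanics.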

(Non-) Misconception #3: "Oracle is adding network bindings to Berkeley DB Java, branding it Oracle NoSQL. I am curious how easy setup and development will be."

Let me address the second question first (ease of setup/development).  Although this isn't a misconception, it is a good question.  In general it is difficult for the average developer who wants to try out a large distributed store to find sufficient hardware to get a reasonably sized cluster going.  Well, maybe it's not difficult for you, but it sure is for all of us -- we have to claw and scratch for every machine we use(*).  So George (one of our developers) put together what we call "kvlite", a single-process version of Oracle NoSQL Database.  kvlite is really easy to start up (one simple command-line invocation) and gives the user a good way of trying out the API without a lot of muss and fuss.  The "server side" is in no way tuned for performance, but it lets you get things going really quickly so you can kick the tires, try out your application code, etc. while your sysadmins and IT folks scrounge the real hardware for you to use for deployment.

(*) We actually have several large clusters to do development and performance testing at our disposal.

And now the first part of the question (adding network bindings to Berkeley DB Java Edition).  Hmm, that's kind of, sort of true.  Let me try to reframe the statement.  BDB JE HA allows a user to perform operations on either the master (for updates and reads) or the replicas (for reads).  The most common objection that we encounter is that the application has to "know" which nodes are the master and the replicas (for routing updates and read requests appropriately).  There is no network layer in BDB JE/HA to handle this for you.  Oracle NoSQL Database provides this capability.  You link in the kvclient.jar (the "driver") to your application, and presto, you can make your CRUD (or iteration) method calls on your K/V Store.  The kvclient.jar figures out which node to route the request to (it knows which Rep Group holds the key value pair and which node in that Rep Group is the master).  So in that sense, it adds a network layer to BDB, but the API is different from BDB so I wouldn't exactly call it a network binding.  There's a lot of infrastructure and intelligence (e.g. load balancing) built into the kvclient "driver".
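As a rough illustration of what the "driver" has to do, consider the sketch below. The RepGroup/route names are invented here, not the product's API: the client hashes the major key to find the owning Rep Group, then sends writes to that group's master and reads to a replica.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of client-side request routing; names are invented.
// A real driver also tracks topology changes, load balances, etc.
public class DriverRoutingSketch {
    record RepGroup(String master, List<String> replicas) {}

    // A static two-group topology for illustration.
    static final Map<Integer, RepGroup> TOPOLOGY = Map.of(
        0, new RepGroup("node0a", List.of("node0b", "node0c")),
        1, new RepGroup("node1a", List.of("node1b", "node1c")));

    // Step 1: the major key alone decides which Rep Group owns the record.
    static int groupFor(String majorKey) {
        return Math.abs(majorKey.hashCode()) % TOPOLOGY.size();
    }

    // Step 2: updates must go to the group's master; reads may go to a replica.
    static String route(String majorKey, boolean isWrite) {
        RepGroup g = TOPOLOGY.get(groupFor(majorKey));
        return isWrite ? g.master() : g.replicas().get(0);
    }

    public static void main(String[] args) {
        // Both requests land on nodes inside the same Rep Group.
        System.out.println("write -> " + route("Fred", true));
        System.out.println("read  -> " + route("Fred", false));
    }
}
```

The application never sees any of this; it just calls CRUD methods and the driver picks the node.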

Steve Jobs, 1955 - 2011

Thu, 2011-10-06 05:48

I respectfully contemplate the impact Steve Jobs has had on our industry and the world.

Oracle NoSQL Database

Mon, 2011-10-03 07:27

Today at Oracle OpenWorld, we are announcing Oracle NoSQL Database.  From the datasheet:

Oracle NoSQL Database provides network-accessible multi-terabyte distributed key/value pair storage that offers predictable latency. That is, it services network requests to store and retrieve data which is organized into key-value pairs. It offers full Create, Read, Update and Delete (CRUD) operations, with adjustable durability guarantees.  Oracle NoSQL Database is designed to be a highly available and extremely scalable system, with predictable levels of throughput and latency, while requiring minimal administrative interaction.

My colleagues and I have been working hard to bring this project to fruition and it's truly exciting for all of us to see it roll out the door (as well as to be able to finally talk about it in public).  It will come in two versions, an Open Source Community Edition, and a value-add "Enterprise Edition".  Initially, both Editions will have the same feature set, but in subsequent releases there will be differentiation between the two. My colleague Margo Seltzer has written a fine whitepaper which describes the system.  If you have the time, it's an easy read.

In future posts to this blog I hope to talk about some of the great performance and scaling numbers we're seeing in our tests.  To demonstrate the system's capabilities, we've been working with two very fine corporate partners to run tests on clusters of up to 192 nodes.

We also announced the Oracle Big Data Appliance, an "engineered system" which will run (among other things) Oracle NoSQL Database.

Berkeley DB Java Edition 4.1.7

Thu, 2011-01-06 02:51

Berkeley DB Java Edition 4.1.7 is a patch release consisting of three important fixes. We strongly recommend that all users of 4.1.x upgrade to this release.

These fixes include:

[#19346] - Fix a bug that could cause an EnvironmentFailureException with LOG_FILE_NOT_FOUND during recovery, meaning that the JE environment cannot be opened. We strongly recommend that all applications using JE 4.1.6 upgrade immediately. The bug was introduced in JE 4.1.6 and is not present in earlier releases.

[#19312] - Fixed a bug that prevents using a DPL class converter mutation for a proxy class. Previously, an exception such as the following was thrown:

Exception in thread "main" java.lang.ClassCastException:
com.sleepycat.persist.raw.RawObject cannot be cast to
at com.sleepycat.persist.impl.ProxiedFormat.newInstance(
at com.sleepycat.persist.impl.RecordInput.readObject(
at com.sleepycat.persist.impl.ReflectionAccessor$
at com.sleepycat.persist.impl.ReflectionAccessor.readNonKeyFields(

Thanks to James Li for reporting this on OTN and helping us to identify the problem in the source code.

[#19321] - Fix a bug that caused a hard deadlock when attempting to abort a transaction in one thread, while performing operations using the transaction in another thread. Now, rather than a hard deadlock, an IllegalStateException will be thrown in this circumstance. Thanks to Jervin on OTN for reporting this.
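To illustrate the behavior change, here is a toy model (not JE's actual implementation): a transaction that detects an abort racing with an in-flight operation fails fast with an IllegalStateException instead of deadlocking.

```java
// Toy model of the fix's observable behavior; not JE source code.
public class TxnAbortSketch {
    static class Txn {
        private boolean inUse = false;
        private boolean aborted = false;

        synchronized void beginOp() {
            if (aborted) throw new IllegalStateException("txn aborted");
            inUse = true;
        }

        synchronized void endOp() { inUse = false; }

        synchronized void abort() {
            if (inUse) {
                // Pre-fix: this situation could hard-deadlock.
                // Post-fix: fail fast instead.
                throw new IllegalStateException(
                    "cannot abort: operation in progress in another thread");
            }
            aborted = true;
        }
    }

    public static void main(String[] args) {
        Txn txn = new Txn();
        txn.beginOp();           // thread A is mid-operation
        try {
            txn.abort();         // thread B attempts the abort
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```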

The complete list of changes is in the change log page.

Product documentation can be found at:

Download the source code including the pre-compiled JAR, complete documentation, and the entire test suite as a single package.

Berkeley DB Java Edition 4.1 Improvements

Fri, 2010-10-29 05:16

The new release of Berkeley DB Java Edition (JE), Release 4.1, includes several new features which drastically improve out-of-cache performance. When a JE application has a data set that does not fit entirely in cache, and there is no particular working set that does, the application sees the best performance when as many of the internal btree nodes (the index) as possible are kept in cache. 4.1 includes improvements that make JE's cache management more efficient, and it provides statistics to help determine cache efficiency.

It's worth giving a shout-out to the Sun ISV Engineering lab people who were invaluable in this effort. They let us use a lot of their big-iron hardware for 3 months of intense tuning and performance analysis, all before the merger was completed.

The first important new feature is Concurrent Eviction. In past releases, cache eviction was carried out by JE daemon threads, by application threads calling JE operations, and by an optional single evictor thread. The actual eviction operation was serialized, which could create a bottleneck where many threads could be seen waiting on the eviction method.

In 4.1.6, cache eviction is no longer serialized and can be executed concurrently. In addition, JE now has a dedicated configurable thread pool which will do cache eviction when memory limits are reached. Eviction is done by this dedicated pool, by other JE daemon threads, and by application threads. JE attempts to skew the eviction workload toward the pool and daemon threads, in order to offload application threads.
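The idea can be sketched with a toy cache. This is illustrative only (JE's evictor uses an LRU-based policy and far more machinery): a dedicated pool performs eviction when a size limit is exceeded, so application threads are not serialized behind a single evictor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration of eviction offloaded to a dedicated thread pool.
public class EvictorPoolSketch {
    static final int CACHE_LIMIT = 100;
    static final Map<Integer, byte[]> cache = new ConcurrentHashMap<>();
    static final ExecutorService evictorPool = Executors.newFixedThreadPool(2);

    static void put(int key, byte[] value) {
        cache.put(key, value);
        if (cache.size() > CACHE_LIMIT) {
            // Offload eviction to the pool instead of doing it inline,
            // so this (application) thread returns immediately.
            evictorPool.submit(EvictorPoolSketch::evictSome);
        }
    }

    static void evictSome() {
        // Evict arbitrary entries until back under the limit
        // (JE uses an LRU-like policy; this sketch does not).
        var it = cache.keySet().iterator();
        while (cache.size() > CACHE_LIMIT && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 500; i++) put(i, new byte[16]);
        evictorPool.shutdown();
        evictorPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("cache size after eviction: " + cache.size());
    }
}
```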

The second important feature is in-memory btree Internal Node (IN) compression, which is targeted at reducing the memory needed to hold these nodes in cache. One of the optimizations reduces the in-memory footprint of an IN when only a small portion of it has been referenced, as would be the case when data records are accessed in random order, or when only a subset of the data is accessed. It does not help if the application is doing (e.g.) a database-wide cursor traversal. A second optimization in this area applies when a key's length is less than or equal to 16 bytes, which can be true either when the key is naturally small, or when key prefixing is enabled through DatabaseConfig.setKeyPrefixing().

A user ran tests comparing JE 4.0.103 and JE 4.1 in a read-only workload and shared the results with us. When the database fits completely in the cache (4 GB of memory), performance is about the same. Dropping the cache to 2 GB (all INs still fit into memory), performance (throughput and latency) improves 5%. When the cache is further reduced to various values between 1 GB and 512 MB (only some of the INs fit in memory), performance improves more than 3x.

One other interesting note about these tests is that the test configuration has enough memory to hold the database in the file system cache (even though they did not allocate enough memory to the JE cache to hold all of the database). The net of this is that there is no "true" IO occurring, but rather all IO is only to the file system cache. By putting the data in the file system cache rather than on the Java heap (and therefore the JE cache), GC overhead is reduced while still maintaining "in-memory" performance (since the data was in the file system cache).

What is also interesting about this is that the existing JE-tuning adage that out-of-cache scenarios should adjust the je.evictor.lruOnly and je.evictor.nodesPerScan parameters is changing. By varying these values in 4.1 from the recommended norms (false and 100, respectively), the user is able to achieve even better performance. We will of course be updating our FAQ entries to state the new recommended values.
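For reference, the pre-4.1 advice corresponds to a je.properties fragment like the one below. The new recommended values had not been published at the time of this post, so none are shown here.

```properties
# Pre-4.1 out-of-cache tuning advice; 4.1 changes these recommendations
# (see the FAQ for updated values once published).
je.evictor.lruOnly=false
je.evictor.nodesPerScan=100
```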

Naturally we're very excited about these results and want to share them with you. Stay tuned for more news when we have the results of read/write workloads.

Berkeley DB Java Edition on ZFS Tuning Note

Tue, 2010-08-24 02:27

I have been spending some time tuning a continuous write load on a Solaris 10/ZFS-based system. One issue I observed was a drop in throughput every several seconds. Removing the fsync() calls from JE (you would never want to do this normally) smoothed out the dips in throughput, which pointed to IO issues.

My colleague Sam pointed me at this discuss-zfs entry. And indeed, adding

set zfs:zfs_write_limit_override = 0x2000000

to /etc/system (with the fsync() calls put back into JE) does in fact seem to smooth things out (0x2000000 = 33MB). Of course, you'll want to adjust that parameter based on what size of on-board disk cache you have.

One way to get a rough indication of whether this is a potential problem is to use iostat and see if there are IO spikes.

Berkeley DB Java Edition High Availability Performance Whitepaper

Thu, 2010-06-03 07:53

Over the past few months we've been working on measuring the impact of HA on JE performance when running on large configurations. The results are documented in a whitepaper that I wrote.

Berkeley DB Java Edition 4.0.103 Available

Mon, 2010-05-03 02:19

We'd like to let you know that JE 4.0.103 is now available. The patch release contains both small features and bug fixes, many of which were prompted by feedback on this forum. Some items to note:

  • New CacheMode values for more control over cache policies, and new statistics to enable better interpretation of caching behavior. These are just one initial part of our continuing work in progress to make JE caching more efficient.

  • Fixes for proper cache utilization calculations when using the -XX:+UseCompressedOops JVM option.

  • A variety of other bug fixes.

There are no file format or API changes. As always, we encourage users to move promptly to this new release.