Tuesday, April 10, 2012

HumptyDumpty in NoSQL land

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.” -- Lewis Carroll's Through the Looking Glass

I've recently been trying to understand more about these "NoSQL" systems, and how they work.

One interesting question is what they mean by "consistency". There is lots of talk about consistency, and eventual consistency, and the CAP theorem, and things like that.

And it's all very vague.

For example, do a search like this: (Google search for "HBase strong consistency"), and you'll find lots of pages like this that say things like:

if you search online posts related to HBase and Cassandra comparisons, you will regularly find the HBase community explaining that they have chosen CP, while Cassandra has chosen AP – no doubt mindful of the fact that most developers need consistency (the C) at some level.

Indeed, HBase's own documentation says:

Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.

So I guess that the HBase development team is choosing to define "strongly consistent" as "not 'eventually consistent'". Which isn't very much of a definition, in my opinion.

If you search still more, you'll find more detailed information, such as this HBase page on ACID semantics, which admits that:

HBase is not an ACID compliant database.

and then proceeds to completely re-define the famous ACID properties that Jim Gray set forth nearly 35 years ago.

It's very instructive to compare the original relational database definitions of the ACID properties versus the HBase definitions.

First, here are the classic relational DBMS definitions, from the above Wikipedia article:

Atomicity

Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.

Consistency

The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof.

Isolation

Isolation refers to the requirement that no transaction should be able to interfere with another transaction. One way of achieving this is to ensure that no transactions that affect the same rows can run concurrently, since their sequence, and hence the outcome, might be unpredictable. This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails.

Durability

Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently. If the database crashes immediately thereafter, it should be possible to restore the database to the state after the last transaction committed.
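To make the "all or nothing" wording concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it, and the transfer and balances helpers are all made up for illustration; the only point is that when the second statement of a transaction fails, the first one is undone as well.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back along with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Calling transfer(conn, "alice", "bob", 500) returns False and leaves both balances exactly as they were, even though the credit to bob had already been applied inside the transaction when the debit failed.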

Now, here are the HBase definitions, from the HBase ACID semantics page:

Definitions

For the sake of common vocabulary, we define the following terms:

Atomicity

an operation is atomic if it either completes entirely or not at all

Consistency

all actions cause the table to transition from one valid state directly to another (eg a row will not disappear during an update, etc)

Isolation

an operation is isolated if it appears to complete independently of any other concurrent transaction

Durability

any update that reports "successful" to the client will not be lost

Visibility

an update is considered visible if any subsequent read will see the update as having been committed

These aren't even remotely close to the same definitions!
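One way to see the gap is that the relational definition of atomicity is scoped to a whole transaction, while the second set of definitions is scoped to a single operation. The toy sketch below is not HBase's client API and makes no claim about how HBase is implemented; ToyRowStore and transfer are hypothetical names, and the store simply makes each single-row put atomic and nothing more.

```python
import threading


class ToyRowStore:
    """Hypothetical in-memory store: each put() is atomic for one row only."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, columns):
        # All columns of this one row are replaced together, or not at all.
        with self._lock:
            self._rows[key] = dict(columns)

    def get(self, key):
        with self._lock:
            return dict(self._rows[key])


def transfer(store, src, dst, amount, fail_between=False):
    """Two independent row operations; nothing groups them into a transaction."""
    s, d = store.get(src), store.get(dst)
    store.put(dst, {"balance": d["balance"] + amount})
    if fail_between:
        # Simulated client failure after the first put has already succeeded.
        raise RuntimeError("simulated crash between the two puts")
    store.put(src, {"balance": s["balance"] - amount})
```

With alice at 100 and bob at 50, a transfer of 30 that fails between the two puts leaves bob at 80 and alice still at 100. Every individual operation was "atomic" in the per-operation sense, yet the result is a state that the transaction-level definition would never allow.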

It's not at all clear what the NoSQL community is trying to do by re-defining all these words, and it's doubly not clear why the entire computing industry appears to be going along with it.

Why not define new terminology? Why change the meanings of words that have had precise definitions for about as long as general purpose computers have been in use?

3 comments:

  1. This is where we miss Jim Gray. He would have gone and talked to everyone in a friendly way and then written a paper and then everyone would agree on terms. I suppose you could argue he already wrote that paper 30 years ago. Alas I do not think that C Mohan's efforts will have the same effect.

  2. I read HBase's statement on ACID non-compliance recently and had a similar reaction. Mine went something like this:

    "'HBase is not an ACID compliant database.' Well, at least they're being clear and honest." I should have stopped there, but I read the re-definitions and also scratched my head over the Humpty Dumpty ACID terminology, which, according to those maintaining that document, should be "ACIDV". "Atomic" seems fairly accurate. "Consistent" would be accurate if "table" were replaced with "database" or even "data store", regardless of HBase's underlying logical architecture or any physical implementation. "Isolation" and "durability" are definitely amended and seem to have strings attached. "Visibility" is irrelevant because it's ACID, not "ACIDV".

    I didn’t bother reading much detail since they’re using their own “common vocabulary”, which might be acceptable if explained and within context of long-standing, accepted, and cited definitions. But that’s absent, plus the detail is convoluted. So, perhaps this is an attempt to bend definitions and dodge hard, cold facts that maybe HBase is suitable for the digital equivalent of high-performance trash hoarding. I did read that rows must never support time travel, at which point I became disappointed because I was hoping to use HBase to go back in time and destroy it (paradox intended, but “NoSQL” discourse seems littered with contradictions).

    And yes, the world misses both E.F. Codd and Jim Gray, but the likes of C.J. Date and Fabian Pascal are still fighting the good fight, and all is not lost since there are posts and replies like these.

  3. I also read it, and HBase does not guarantee consistency across multiple rows. It only guarantees that changes within a single row are consistent, so if two separate processes try to update the same series of rows, some rows may end up with one process's values while others end up with the other's. With that said, is it really ACID? There are now also Impala from Cloudera and Phoenix from Salesforce. A lot of development is going on, but getting to true ACID-like properties will take a lot more work.
