Tuesday, September 29, 2009

The mystery of the extra database read

I find that I simultaneously both love and hate being stumped.

My object-relational library at work has been acting up. We have an internal protocol for handling object modifications, which is designed to maintain the clarity of the object cache: when an object is updated, we move it from the primary cache into a per-transaction lookaside cache for the duration of the modification process; at commit time, the updated objects are restored to the main cache.

Unfortunately, once in a blue moon, under quite high load, the cache loses track of an object. Something goes wrong during the commit process, and at least one object fails to be restored to the cache.

Because of the way the cache works, the symptoms of this are almost unnoticeable, as the library code simply reloads the object from the database the next time it is referenced. So for the longest time all we noticed was that we were doing slightly more database access than we expected. Eventually we noticed in our database traces that we were occasionally reading objects which ought to have been found in the cache, and slowly over time I've been trying to track down why that is.

I've been generating theories about how this could be, and constructing experiments or building assertions and scaffolding into the code, to prove or disprove the theories.

So far, all the theories have been disproven.

I call this the "Sherlock Holmes" technique of debugging: "once you have eliminated the impossible, whatever else remains, no matter how improbable, must be the truth." Unfortunately, I still have more impossibilities to eliminate, and I haven't yet found the truth.

My friend Walt says: "if you aren't struggling, you aren't learning", and he's absolutely right. It takes a challenging and infuriating problem to drag you deeply enough into the code to reach that breakthrough of discovery and insight.

As a believer in rationality and logic, I know that eventually, this problem too will succumb, and I'll arrive at a full explanation of the behavior I'm seeing.

But until then, I'm stumped.

No comments:

Post a Comment