Saturday, January 28, 2012

DERBY-5584

This week I've been spending a bit of time working on DERBY-5584.

This is a bug that I introduced into Derby in October, 2009, when I re-wrote Derby's GROUP BY engine to add multi-level aggregation support (DERBY-3002).

The new bug is rather complicated, because it involves (a) multiply-scanned result sets and (b) DISTINCT aggregates.

Let's see if I can explain those two concepts, starting first with DISTINCT aggregates.

Aggregates are a feature of the SQL language. The SQL statement normally reports on individual rows in your table(s): each row in your result can be traced back to a particular row in a particular table (possibly multiple tables, joined together). Aggregates, however, are functions which collapse the detail of individual rows of data into coarser-grained summarizations.

The classic aggregates in the core SQL language are: MIN, MAX, COUNT, SUM, and AVG, and these have been part of the language for decades. Most DBMS implementations add additional aggregate functions, but it's sufficient for our purposes to consider the base 5.

An example of an aggregate query would be to figure out the average mileage of flights between San Francisco and San Antonio:

SELECT AVG(flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')

When performing these sorts of aggregate queries, it's often quite useful to break things down by grouping the results based on some other fields. For example, suppose we wanted to know whether the distance flown tends to vary depending on the day of the week (do weekend flights get to take a shorter flight path?):

SELECT departure_day_of_week, AVG(flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')
   GROUP BY departure_day_of_week

Executing a query such as this is performed (in Derby, as in many other databases), by sorting the data on the grouping column(s) and then processing it in order. So we'd: (a) find all the rows matching the WHERE clause, (b) pull the departure_day_of_week and flight_miles columns out of each row, (c) sort the rows by departure_day_of_week, and then pump all that data through the aggregation engine.

The aggregation engine would then be able to see the rows in a single pass: first all the "FRI" rows, then all the "MON" rows, then "SAT" rows, "SUN" rows, "THU" rows, and "TUE", and lastly the "WED" rows. For each group of rows, the engine computes the AVG(flight_miles) in the obvious straightforward fashion (computing the total sum and the number of instances and dividing). The result of the query is 7 rows, with the average for each day.

Note that the author of the query generally includes the grouping columns in the result, to make the result easy to read.

You can GROUP BY multiple columns, but the overall principle is the same: the columns used to sort the records are exactly the columns used to group the results.

DISTINCT aggregates add some additional complexity. Suppose that, if two flights happen to have exactly the same flight_miles, we only want to consider the flight in the average once; that is, we only want to compute the average over all the unique values of flight_miles for that particular departure_day_of_week.

(I admit, this is very contrived, but hopefully it's still obvious what we're talking about here.)

In this case, the programmer can use the word DISTINCT to specify this:

SELECT departure_day_of_week, AVG(DISTINCT flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')
   GROUP BY departure_day_of_week

This adds a small additional wrinkle into the query execution, as we need to ensure that we can refine the results to consider each unique combination of destination_day_of_week and flight_miles only once.

Derby accomplishes this by including the DISTINCT aggregate column into the sorting process, so that the rows are sorted by the compound key (departure_day_of_week, flight_miles). This way, all the rows which continue duplicate values for this pair of columns arrive together, and we can consider only the first such row and discard the others.

When I re-wrote the GROUP BY engine as part of DERBY-3002, I considered this problem, and implemented it, but I made a mistake. Note that, above, I observed that "the columns used to sort the records are exactly the columns used to group the results". However, with DISTINCT aggregates, this isn't precisely true, as there is one extra column in the sort key (the DISTINCT aggregate column) which isn't used to group the results, just to sort them.

In DERBY-3002, I handled that special case by: after the sort, but before the grouping pass, I removed the DISTINCT aggregate column from the sort key.

This worked, but ...

For most queries, the natural flow of query execution processes each intermediate result once. This is efficient and so the database works hard to do this whenever possible.

However, there are some cases in which an intermediate result must be processed multiple times. One example is the so-called "Cartesian Product" query:

SELECT a.*, b.* FROM a, b

In this query, the database produces all possible combinations of rows from tables a and b: for each row in table a, and each row in table b, there is a row in the result containing those values from a's row and b's row.

In such a query, Derby uses a brute force technique and simply implements the query as: for each row in a, read each row of b, and emit the values.

This means that we read the inner table (table b), multiple times, once per row in table a.

This was where my bug came in: it is possible that the inner table is a GROUP BY query of its own, which includes a DISTINCT aggreate.

When this happens, my code that removed the DISTINCT aggregate column from the sort key was causing problems. The first time we read the inner table, everything was fine, but then the next time, the ordering/grouping/sorting columns were all messed up (I removed the column from the key, but didn't add it back for the next go-round).

In DERBY-5584, you can see a clear and simple example of a query which demonstrates the bug.

The fix, I believe, is that instead of physically removing the column from the sort key, we need to instead teach the grouping logic that it may need to consider only a leading subset of the columns in the sort key as the grouping column.

I've implemented that change and am hoping it solves the bug.

No comments:

Post a Comment