Sunday, May 17, 2009

Group By Rollup

About two years ago, I did some investigation and prototyping of a new Derby feature: GROUP BY ROLLUP.

I got distracted by some other work, and time passed. But recently Knut noticed that I hadn't done anything on the project in a while, and encouraged me to return to it.

So I have.

In SQL, the GROUP BY statement instructs the DBMS to partition the data into separate groups based on the values of one or more columns; usually it is used with computations which evaluate a COUNT, SUM, AVG, or other function across the groups. So, for example, it is common to see:


  • SELECT D.NAME, COUNT(E.ID) FROM EMPLOYEE E, DEPARMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME

  • SELECT D.NAME, SUM(PROJ.BUDGET) FROM PROJECT PROJ, DEPARTMENT D WHERE PROJ.DEPT_ID = D.ID GROUP BY D.NAME

  • SELECT D.NAME, AVG(E.SALARY) AS AVG_SALARY FROM EMPLOYEE E, DEPARTMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME ORDER BY AVG_SALARY

The first statement finds the number of employees in each department; the second statement finds the total budget for each department's projects; the third statement finds the average salary for each department.

You can group by more than one column, so if you wanted to find the budget for each department, for each project type, for each month, you'd write something like:


SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
FROM DEPARTMENT D, PROJECT P
WHERE D.ID = P.DEPT_ID
GROUP BY D.NAME, P.MONTH, P.TYPE


This statement produces a detailed breakdown, but when producing a report, the user commonly wants to "roll up" these detailed groups so that they get an overall sum for:

  • All the project types for this department this month (GROUP BY D.NAME, P.MONTH), as well as

  • All the project types for the entire budget year for this department (GROUP BY D.NAME)



The GROUP BY ROLLUP feature is designed for precisely this purpose, as it allows the user to issue a single statement which produces groupings at all those levels of detail in a single result set:


SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
FROM DEPARTMENT D, PROJECT P
WHERE D.ID = P.DEPT_ID
GROUP BY ROLLUP ( D.NAME, P.MONTH, P.TYPE )


The prototype work seems to be still quite relevant, as that part of Derby hasn't changed substantially. Knut noticed a few things that need cleaning up, so I'll start with those.

I suspect that the major pieces to tidy this up and ready it for commit will be:
  • More testing. I wrote a couple trivial tests, but more are required.
  • Documentation. Some additional information is required to describe what the new feature is and how to use it.
  • Performance. The ROLLUP feature is essentially a shorthand for issuing N separate SQL statements, so as long as it doesn't perform any worse than the separate SQL statements, it's a useful feature because of its convenience. But in principle it should enable some optimizations which would cause it to perform substantially better in many cases, so I'll want to have a look at whether we can do any of those improvements.

If I can find a little bit of time, I should be able to accomplish this over the next few months, and see if I can get the project over the hurdle from prototype to operational.

No comments:

Post a Comment