I have written some very basic tools for grouping, pivoting, unioning and subtotaling datasets obtained from non-DB sources (e.g. CSV, OLTP systems). The "group by" methods sit at the heart of many of these.

However, I am sure a large amount of work has gone into making efficient algorithms for grouping data... and I am sure I am not using them. And my Google-fu has completely failed to turn anything up.

Are there any good online sources or books describing the better methods for producing grouped data?

Or should I just start looking in the MySQL source or something similar?

One very handy way to "group by" some field (or set of fields and expressions, but I'll use "field" for simplicity!-) is if you can arrange to walk over the results before grouping (RBG) in a sorted way -- you don't actually care about the sorting (save in the common case in which an ORDER BY is also there and just happens to be on the same field as the GROUP BY!-), but rather about the "side-effect" property of ordering -- that rows in RBG with the same value for the grouping field come right after each other, so you can accumulate until the grouping field changes, then emit/yield the results accumulated so far, and go on to reinitialize the accumulators with the new row (the one with a different value of the grouping field) -- make sure to "just initialize the accumulators" at the very start, AND "just emit/yield accumulated results" at the very end, of course.
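As a rough illustration of the sorted walk described above, here is a minimal Python sketch. The function name `group_sorted`, the sample rows and the choice of a count plus a sum as accumulators are all assumptions made for the example, not anything taken from a real system.

```python
from operator import itemgetter


def group_sorted(rows, key, value):
    """Yield (group_key, count, total) for rows already sorted by `key`."""
    started = False
    current = count = total = None
    for row in rows:
        k = row[key]
        if not started or k != current:
            if started:
                # Grouping field changed: emit what was accumulated so far.
                yield current, count, total
            # (Re)initialize the accumulators for the new group.
            started, current, count, total = True, k, 0, 0
        count += 1
        total += row[value]
    if started:
        # Emit the last group at the very end.
        yield current, count, total


# Hypothetical sample data; the sort stands in for input that arrives ordered.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
rows.sort(key=itemgetter("region"))
result = list(group_sorted(rows, "region", "amount"))
print(result)  # [('east', 2, 17), ('west', 1, 5)]
```

Only one group's accumulators are alive at any time, which is the attraction of this approach when the input is already ordered.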

If that doesn't work, you can hash the grouping field and use a hash table for the results being accumulated for each group -- for each row in RBG, hash the grouping field, check if it was already present as a key in the hash table, if not put it there with accumulators suitably initialized from the RBG row, else update the accumulators per the RBG row. You just emit everything at the end. The problem of course is that you're taking up more memory until the end!-)
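The hash-table variant might look roughly like this in Python, where a plain `dict` plays the role of the hash table. Again, `group_hashed`, the sample rows and the count/sum accumulators are hypothetical names and choices picked for the sketch.

```python
def group_hashed(rows, key, value):
    """Return {group_key: [count, total]} for rows in any order."""
    groups = {}
    for row in rows:
        k = row[key]
        acc = groups.get(k)
        if acc is None:
            # First row for this key: initialize accumulators from the row.
            groups[k] = [1, row[value]]
        else:
            # Key already present: update its accumulators in place.
            acc[0] += 1
            acc[1] += row[value]
    # Nothing can be emitted until every row has been seen.
    return groups


# Hypothetical sample data, deliberately left unsorted.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
totals = group_hashed(rows, "region", "amount")
print(totals)  # {'east': [2, 17], 'west': [1, 5]}
```

The dict holds one entry per distinct key until the end, which is the memory cost mentioned above.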

Those are the two fundamental approaches. Do you want pseudocode for each, BTW?

You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.

Aggregate measures such as counts, averages, mins, maxs, sums and stdevs can be quickly analyzed by any number of dimensions using an OLAP database.

See this summary of OLAP on MSDN.

Give an example CSV file and the kind of result wanted and I might be able to rustle up a solution in Python for you.

Python has the csv module and list/generator comprehensions that can help with this kind of thing.
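For a flavour of what that could look like, here is a small sketch that reads CSV text with `csv.DictReader` and subtotals it with `itertools.groupby` and a generator expression. The column names `region` and `amount` and the inline sample data are made up for the example; `io.StringIO` simply stands in for an open file.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical CSV content; io.StringIO stands in for an open file.
sample = io.StringIO("region,amount\neast,10\nwest,5\neast,7\n")

reader = csv.DictReader(sample)
# groupby only merges adjacent rows, so sort by the grouping field first.
rows = sorted(reader, key=itemgetter("region"))

# One subtotal per region, summed with a generator expression.
subtotals = {
    region: sum(int(r["amount"]) for r in group)
    for region, group in groupby(rows, key=itemgetter("region"))
}
print(subtotals)  # {'east': 17, 'west': 5}
```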

  • Paddy.