So, the question is one of design. I'm gathering a lot of performance data with many key-value pairs: pretty much everything in /proc/cpuinfo, /proc/meminfo, /proc/loadavg, plus a lot of other things, from about three hundred hosts. Right now, I just need to display the most recent slice of data in my UI. I'll probably end up doing some analysis of the collected data to track down performance problems later on, but this is a new application, so I'm not sure exactly what I'm looking for performance-wise yet.

I could structure the data in the db -- have a column for each key I'm gathering. The table would end up being O(100) columns wide, it would be a pain to insert into the db, and I'd have to add new columns if I start gathering a new stat. But it would be easy to sort and analyze the data using plain SQL.

Or I could just dump my unstructured blob of data into the table: maybe three columns -- host id, timestamp, and a serialized version of my array, probably as JSON in a TEXT field.
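To make it concrete, here's roughly what I mean by the two options (table and column names below are just placeholders I made up):

    -- option 1: one wide, structured table (only a few of the ~100 columns shown)
    CREATE TABLE host_stats (
        host_id       INTEGER   NOT NULL,
        collected_at  TIMESTAMP NOT NULL,
        cpu_mhz       FLOAT,
        mem_total_kb  BIGINT,
        mem_free_kb   BIGINT,
        load_avg_1m   FLOAT,
        -- ...one column per stat I collect...
        PRIMARY KEY (host_id, collected_at)
    );

    -- option 2: unstructured blob, serialized by the application
    CREATE TABLE host_stats_blob (
        host_id       INTEGER   NOT NULL,
        collected_at  TIMESTAMP NOT NULL,
        stats_json    TEXT      NOT NULL,  -- JSON-encoded key/value array
        PRIMARY KEY (host_id, collected_at)
    );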

Which should I do? Will I regret it if I go with the unstructured approach? When doing analysis, should I just convert the fields I'm interested in and create a new, more structured table? What trade-offs am I missing here?

I'd say that if you need to run SQL queries to calculate things like min/max/avg, or to do sorting, limits, or joins based on the values, then you should create the 100+ columns. That's what I would do.
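For instance, with one column per stat the analysis queries stay trivial. A sketch, assuming a wide table and column names along the lines of what you described (the date arithmetic here is the MySQL flavor):

    -- min/max/avg 1-minute load per host over the last day
    SELECT host_id,
           MIN(load_avg_1m) AS min_load,
           MAX(load_avg_1m) AS max_load,
           AVG(load_avg_1m) AS avg_load
    FROM   host_stats
    WHERE  collected_at >= NOW() - INTERVAL 1 DAY
    GROUP  BY host_id
    ORDER  BY avg_load DESC;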

You don't state which brand of database you're using, but most should handle 100+ columns in a table without any risk of inefficiency.

Please don't use the Entity-Attribute-Value antipattern -- the key/value design that many people will suggest. It's nice and easy to insert any arbitrary collection of key/value pairs into such a design, but queries that would be simple against a conventional table with one column per attribute become insanely difficult and inefficient with the EAV design. You also lose several advantages of using an SQL database, such as data types and constraints.
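To illustrate the difference: in an EAV table (hypothetical names below), even correlating two stats for the same host and moment requires a self-join plus casts, because every value is stored as untyped text:

    -- EAV sketch: one row per (host, time, stat name, stat value);
    -- comparing two stats means joining the table to itself and casting strings
    SELECT l.host_id,
           l.collected_at,
           CAST(l.stat_value AS DECIMAL(10,2)) AS load_avg_1m,
           CAST(m.stat_value AS DECIMAL(20,0)) AS mem_free_kb
    FROM   stats_kv AS l
    JOIN   stats_kv AS m
           ON  m.host_id      = l.host_id
           AND m.collected_at = l.collected_at
    WHERE  l.stat_name = 'load_avg_1m'
      AND  m.stat_name = 'mem_free_kb';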

I think

performance_data

        host_id
        key
        value
        timestamp

is the right structure. You'll be able to query the specific subsets from the specific hosts at the specific times to do your analysis.
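For example (guessing at values, and quoting "key" since it's a reserved word in some databases), pulling one stat for a few hosts over a time window is a single query:

    SELECT host_id, "key", value, timestamp
    FROM   performance_data
    WHERE  "key"    = 'load_avg_1m'
      AND  host_id  IN (101, 102, 103)
      AND  timestamp BETWEEN '2024-01-01 00:00' AND '2024-01-02 00:00'
    ORDER  BY host_id, timestamp;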

Here's an alternative: use more than one table.

An obvious schema design would be a table each for cpuinfo, meminfo, loadavg, etc. You might also end up with a miscellaneous_stats table, depending on what you're including in "a lot of other things".

This approach has several attractive features:

  • simplified column naming.
  • easy to report against a related subset of statistics, e.g. all of meminfo. Probably better performance too.
  • less hassle to add a column. If you start gathering a new cpuinfo statistic, all the cpuinfo columns stay clumped together, whereas in the One Big Table you'd end up with columns 1-15 and then column 94.
  • granularity of recording. For example, you might not want to log cpuinfo as often as meminfo.

You should have a master table of stats_runs to hold things like HOST, TIMESTAMP, etc., instead of duplicating those details in each table.
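A rough sketch of what I mean (names and types are only illustrative):

    -- master table: one row per collection run
    CREATE TABLE stats_runs (
        run_id    INTEGER PRIMARY KEY,
        host      VARCHAR(64) NOT NULL,
        run_time  TIMESTAMP   NOT NULL
    );

    -- one table per category of statistics, keyed back to the run
    CREATE TABLE cpuinfo (
        run_id    INTEGER NOT NULL REFERENCES stats_runs (run_id),
        cpu_mhz   FLOAT,
        cache_kb  INTEGER
        -- ...other cpuinfo columns...
    );

    CREATE TABLE meminfo (
        run_id        INTEGER NOT NULL REFERENCES stats_runs (run_id),
        mem_total_kb  BIGINT,
        mem_free_kb   BIGINT
        -- ...other meminfo columns...
    );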

I have two working assumptions underlying this proposal:

  1. You are going to do some analysis of this data (if you're not going to analyze it, why bother collecting it?).
  2. SQL is still the best mechanism for data crunching, although flat-file tools are improving all the time.