We're looking to build a piece of software that receives log files from a large number of devices. We're expecting around 20 million log rows per day (about 2 KB per log line, so roughly 40 GB of raw data per day).

I've developed plenty of software, but never with this volume of input data. The data must be searchable, sortable, and groupable by source IP, dest IP, alert level, etc.

It should also merge similar log records ("occurred 6 times", etc.).

Any ideas and suggestions on what kind of design and database to use, and general thinking around this, would be much appreciated.

UPDATE:
Found this presentation, which looks like a similar scenario. Any thoughts on it? http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density

Take a look at this, it may be useful: https://github.com/facebook/scribe

There are a few things you should consider:

1) A message queue - drop each incoming log line onto a queue and let another part of the system (a worker) take care of it when time permits (see the sketch after this list).

2) NoSQL - Redis, MongoDB, Cassandra
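
To make the queue idea concrete, here's a minimal sketch in Python, assuming Redis as the broker; the queue name, log format, parser and store() step are placeholders, not a recommendation for how your pipeline should look.

    # Minimal sketch of the queue/worker split, assuming Redis as the broker.
    # Queue name, log format and the parse/store steps are illustrative only.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Receiver: accept a raw log line and get it off the wire as fast as possible.
    def enqueue_log_line(raw_line: str) -> None:
        r.rpush("log_queue", raw_line)

    # Worker: runs separately, parses and stores lines when time permits.
    def worker_loop() -> None:
        while True:
            _, raw_line = r.blpop("log_queue")  # blocks until a line is available
            record = parse_line(raw_line.decode())
            store(record)

    def parse_line(line: str) -> dict:
        # Placeholder parser; the real format depends on your devices.
        src, dst, level, message = line.split("|", 3)
        return {"source_ip": src, "dest_ip": dst,
                "alert_level": int(level), "message": message}

    def store(record: dict) -> None:
        print(json.dumps(record))  # swap in your real datastore here (MongoDB, Cassandra, ...)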

I believe your real problem will be in querying the data, not in storing it.

You'll also most likely want a scalable solution. A number of NoSQL databases are distributed, and you might need that.

I'd base many choices on how users will most frequently be selecting subsets of the data -- by device? by date? by source IP? You want to keep the number of indexes low and use only the ones you need to get the job done.
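
As a rough illustration, assuming MongoDB with pymongo (the collection and field names here are made up), creating only the indexes that match those access patterns could look like this:

    # Sketch, assuming MongoDB via pymongo; collection and field names are illustrative.
    from pymongo import MongoClient, ASCENDING, DESCENDING

    logs = MongoClient()["logs_db"]["logs"]

    # Index only the most common access patterns, e.g. "logs for a device, newest first"
    # and "logs from a source IP, newest first".
    logs.create_index([("device_id", ASCENDING), ("timestamp", DESCENDING)])
    logs.create_index([("source_ip", ASCENDING), ("timestamp", DESCENDING)])

    # A query the first index serves: the latest 100 logs for one device.
    recent = logs.find({"device_id": "dev-42"}).sort("timestamp", -1).limit(100)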

For low-cardinality columns where indexing overhead is high yet the value of an index is low, e.g. alert-level, I'd recommend a trigger that creates rows in another table to identify rows corresponding to emergency situations (e.g. where alert-level > x), so that alert-level itself doesn't have to be indexed, but you can still quickly find all high-alert-level rows.
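
Here's a rough sketch of that idea using SQLite trigger syntax from Python (trigger syntax differs between databases, and the table names and the alert threshold of 3 are just examples):

    # Sketch of the side-table trigger; SQLite syntax, illustrative names and threshold.
    import sqlite3

    conn = sqlite3.connect("logs.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS logs (
        id INTEGER PRIMARY KEY,
        source_ip TEXT,
        dest_ip TEXT,
        alert_level INTEGER,
        message TEXT
    );

    -- Small side table holding only the high-alert rows; alert_level itself stays unindexed.
    CREATE TABLE IF NOT EXISTS high_alerts (
        log_id INTEGER,
        alert_level INTEGER
    );

    CREATE TRIGGER IF NOT EXISTS flag_high_alerts
    AFTER INSERT ON logs
    WHEN NEW.alert_level > 3
    BEGIN
        INSERT INTO high_alerts (log_id, alert_level) VALUES (NEW.id, NEW.alert_level);
    END;
    """)
    conn.commit()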

Since users are updating the logs, you could move handled rows older than 'x' days out of the active log and into an archive log, which should improve performance for ad-hoc queries.
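
A sketch of that archiving job, again in SQLite; the 'handled' flag, 'created_at' timestamp, 'logs_archive' table and 30-day cutoff are all assumptions you'd replace with your own schema:

    # Sketch: move handled rows older than `days` days from the active table to an archive.
    # Column names, the archive table and the cutoff are assumptions, not a fixed schema.
    import sqlite3

    def archive_old_rows(db_path: str = "logs.db", days: int = 30) -> None:
        cutoff = f"-{days} days"
        conn = sqlite3.connect(db_path)
        with conn:
            # Copy old, already-handled rows into the archive table...
            conn.execute("""
                INSERT INTO logs_archive
                SELECT * FROM logs
                WHERE handled = 1 AND created_at < datetime('now', ?)
            """, (cutoff,))
            # ...then remove them from the active table so ad-hoc queries stay fast.
            conn.execute("""
                DELETE FROM logs
                WHERE handled = 1 AND created_at < datetime('now', ?)
            """, (cutoff,))
        conn.close()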

For identifying recurring problems (the same problem on the same device, the same problem on the same IP, or the same problem on all devices made by the same manufacturer or from the same manufacturing run, for example), you could identify the subset of columns that defines that particular kind of problem and then create (in a trigger) a hash of the values in those columns. That way, all problems of the same kind would have the same hash value. You could have multiple columns like this -- it would depend on your definition of "similar problem", on how many different problem kinds you wanted to track, and on the subset of columns you'd need to enlist to define each kind of problem.

If you index the hash-value column, your users would be able to very quickly answer the question, "Are we seeing this kind of problem frequently?" They'd look at the current row, grab its hash value, and then search the database for other rows with that hash value.
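
A minimal sketch of that fingerprint hash in Python; the suggestion above computes it in a trigger, but the same value can be computed at insert time in application code. The columns chosen here to define a "problem kind" (device model plus error code) are only an example:

    # Sketch of a "problem kind" fingerprint; which columns define a kind is up to you.
    import hashlib

    def problem_hash(record: dict, columns=("device_model", "error_code")) -> str:
        # Same values in these columns -> same hash -> "same kind of problem".
        key = "|".join(str(record.get(c, "")) for c in columns)
        return hashlib.sha1(key.encode("utf-8")).hexdigest()

    record = {"device_model": "X-200", "error_code": "E42", "source_ip": "10.0.0.5"}
    record["problem_hash"] = problem_hash(record)

    # Store the record with its problem_hash and index that column; then
    # "are we seeing this kind of problem frequently?" becomes an equality lookup, e.g.
    #   SELECT COUNT(*) FROM logs WHERE problem_hash = ?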

A web search for "Stackoverflow logging device data" produced a large number of hits.

Here is one. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.