I've got a couple of large databases with more than 100 million records. They contain the following:

  1. A unique key.
  2. An integer value, not unique, but used for sorting the query.
  3. A VARCHAR(200).

I have these in a MySQL MyISAM table now. My thought was, hey, I'll just set up a covering index over the data (sketched below, after the query), and it should pull the results out reasonably fast. Queries are of the form...

select valstr,account 
    from datatable 
    where account in (12349809, 987987223,...[etc]) 
    order by orderPriority;
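
For reference, the covering index I have in mind would look something like this (a sketch, using the column names from the query above):

    -- All three queried columns are in the index, so MySQL can
    -- satisfy the query from the index alone, without touching rows.
    ALTER TABLE datatable
        ADD INDEX ix_covering (account, orderPriority, valstr);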

This seemed OK in some tests, but on our more recent installation it's terribly slow. It seems faster to have no index at all, which seems odd.

Anyway, I'm thinking: maybe a different database? We use a data-warehousing DB for other parts of the system, but it's not well suited for anything involving text. Any free, or fairly cheap, DBs are an option, as long as they have reasonably useful API access. SQL optional.

Thanks in advance.

-Kevin

Here is a reasonably sized example of a MySQL database using the InnoDB engine, which uses clustered indexes, on a table with approximately 125 million rows and a query runtime of 0.021 seconds, which seems fairly reasonable:

http://stackoverflow.com/questions/3534597/rewriting-mysql-select-to-reduce-time-and-writing-tmp-to-disk/3535735#3535735

http://pastie.org/1105206
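
If it helps, here is a minimal sketch of that kind of setup (my own table and column names, not the schema from the links above):

    -- InnoDB clusters the rows on the primary key, so a lookup
    -- by account lands directly on the row data.
    CREATE TABLE datatable (
        account       BIGINT       NOT NULL,
        orderPriority INT          NOT NULL,
        valstr        VARCHAR(200) NOT NULL,
        PRIMARY KEY (account)
    ) ENGINE=InnoDB;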

Other helpful links:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

Hope it proves of interest.

CouchDB gives you storage by key, and you can create views to do the query/sorting. A second option could be Cassandra, but there's quite a large learning curve.

CouchDB, MongoDB, and Riak are all going to be good at finding the key (account) relatively quickly.

The problems you're going to have (with any solution) are related to the "order by" and "account in" clauses.

Problem #1: account in

120M records likely means gigs of data: with a VARCHAR(200) column, 120M rows can easily run to 20+ GB. You probably have an index over a gig. The reason this is a problem is that your "in" clause can easily span the whole index. If you look for accounts "0000001" and "9999581" you probably have to load a lot of the index.

So just to find the records, your DB first has to load potentially a gig into memory. Then to actually load the data it has to go back to the disk again. If the "accounts" in the in clause aren't "close together", then you're going back multiple times to fetch various blocks. At some point it becomes faster to just do a table scan than to load both the index and the table.
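
A quick way to see how much work a given "in" list causes is EXPLAIN (a sketch against the table from the question; the account values are just examples):

    EXPLAIN
    SELECT valstr, account
        FROM datatable
        WHERE account IN (12349809, 987987223)
        ORDER BY orderPriority;

The rows column is the optimizer's estimate of how many entries it will examine; if it approaches the table size, you're in table-scan territory.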

Then you get to problem #2...

Problem #2: order by

If you have lots of data coming back from the "in" clause, then order by is just another layer of slowness. With an "order by" the server can't stream the data. Instead it has to load all of the records into memory, then sort them, and then stream them.
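
The same EXPLAIN shows this too: "Using filesort" in the Extra column means the matched rows are being collected and sorted before anything is returned. Comparing runtimes with and without the sort isolates its cost (example values are mine):

    -- With the sort: rows have to be buffered and sorted first.
    SELECT valstr, account
        FROM datatable
        WHERE account IN (12349809, 987987223)
        ORDER BY orderPriority;

    -- Without it: rows can stream back as soon as they are found.
    SELECT valstr, account
        FROM datatable
        WHERE account IN (12349809, 987987223);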

Solutions:

  1. Have lots of RAM. If the RAM can't fit the whole index, then the loads are going to be slow.
  2. Try limiting the number of "in" items. Even 20 or 30 items in this clause can make the query really slow (one workaround is sketched after this list).
  3. Try a Key-Value database?
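
On point #2, one workaround (my sketch, not something from the question) is to stage the keys in a temporary table and join against it instead of building a long "in" list:

    -- Stage the wanted keys, then join; the optimizer drives the
    -- lookup from the small key table instead of a huge IN list.
    CREATE TEMPORARY TABLE wanted (account BIGINT PRIMARY KEY);
    INSERT INTO wanted VALUES (12349809), (987987223);

    SELECT d.valstr, d.account
        FROM datatable d
        JOIN wanted w ON w.account = d.account
        ORDER BY d.orderPriority;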

I'm a big fan of K/V databases, but you have to look at point #1. If you don't have lots of RAM and you do have lots of data, then the system will run slowly no matter what DB you use. That RAM-to-DB-size ratio is really important if you want good performance in these situations (small lookups in large datasets).