Suppose you have a really large table, say a few billion unordered rows, and you want to index it for fast searches. Or maybe you are going to bulk load it and order it on disk with a clustered index. Obviously, once you get to a volume of data this size you have to stop assuming you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).

Can anyone give me some clues about how databases handle large amounts of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data, but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?

Multiway merge sort is the keyword to look up for sorting huge amounts of data that don't fit in memory.
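To make that concrete, here is a minimal sketch of an external multiway merge sort in Python, assuming the input is a text file with one integer per line (the file names and chunk size are made up for the example). It sorts memory-sized chunks into temporary "run" files and then does a single k-way merge over them with `heapq.merge`:

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, chunk_size=1_000_000):
    """External multiway merge sort: sort chunks that fit in memory,
    then do one k-way merge over the sorted run files."""
    run_paths = []

    # Pass 1: read the input in memory-sized chunks, sort each chunk,
    # and write it out as a sorted run file.
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)

    # Pass 2: merge all runs with a heap; only one line per run
    # is in memory at any time.
    runs = [open(p) for p in run_paths]
    try:
        merged = heapq.merge(*(map(int, r) for r in runs))
        with open(output_path, "w") as out:
            out.writelines(f"{x}\n" for x in merged)
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)

# e.g. external_sort("huge_unsorted.txt", "sorted.txt")
```

Real engines do essentially this, just with binary pages and more runs merged per pass.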

As far as I know, most indexes use some kind of B-tree, which does not need to keep everything in memory. You can just put the nodes of the tree in a file and then jump to various positions within that file. This can also be used for sorting.
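Here is a toy sketch of that idea, not a real B-tree implementation: the node layout (leaf flag, key count, four key slots, five child slots) is an assumption invented for the example, but it shows how a search can seek to one fixed-size node at a time instead of holding the tree in memory:

```python
import struct

# Toy on-disk node layout (assumed for this sketch, not any real
# engine's format): leaf flag, key count, 4 key slots, 5 child slots.
NODE_FMT = "<Bi4i5i"
NODE_SIZE = struct.calcsize(NODE_FMT)

def read_node(f, node_id):
    """Fetch one fixed-size node from the index file by seeking to it."""
    f.seek(node_id * NODE_SIZE)
    fields = struct.unpack(NODE_FMT, f.read(NODE_SIZE))
    is_leaf, nkeys = fields[0], fields[1]
    keys = fields[2:2 + nkeys]
    children = fields[6:6 + nkeys + 1]
    return is_leaf, keys, children

def btree_search(f, key, root_id=0):
    """Descend from the root; only the nodes on the search path are read."""
    node_id = root_id
    while True:
        is_leaf, keys, children = read_node(f, node_id)
        i = 0
        while i < len(keys) and key > keys[i]:
            i += 1
        if i < len(keys) and key == keys[i]:
            return True          # key found
        if is_leaf:
            return False         # reached a leaf without finding it
        node_id = children[i]    # follow the appropriate child pointer
```

With a high branching factor, a few node reads are enough to find any key in billions of rows, which is why disks and B-trees go together so well.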

You would have to partition your data set in some way and spread each partition across a separate server's RAM. Even a billion 32-bit ints is 4 GB of RAM right there, and that's only your index.
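The routing itself can be as simple as hashing the key to pick an owner. The server names and count below are hypothetical, just to illustrate the idea:

```python
SERVERS = ["node0", "node1", "node2", "node3"]   # hypothetical cluster

def server_for(key: int) -> str:
    """Route a key to the partition (server) that owns it."""
    return SERVERS[hash(key) % len(SERVERS)]
```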

For low-cardinality data, for example gender (which has only 2 values, male and female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
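A minimal sketch of the general idea (not Oracle's actual on-disk format): one bitmap per distinct value, where bit i says whether row i has that value, so each row costs roughly one bit per distinct value:

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit i says whether row i has that value."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row_id
    return bitmaps

def rows_matching(bitmaps, value):
    """Decode a bitmap back into the matching row ids."""
    bm = bitmaps.get(value, 0)
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

genders = ["M", "F", "F", "M", "F"]
bitmaps = build_bitmaps(genders)
print(rows_matching(bitmaps, "F"))   # [1, 2, 4]
```

Combining predicates then becomes cheap bitwise AND/OR operations on the bitmaps.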

Are you building a database engine?

Edit: I built a disc-based database system back in the mid '90s.

Fixed-size records are the simplest to work with because the file offset for locating a record can simply be calculated as a multiple of the record size. I also had some with variable record sizes.

My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for every column I wanted to search on. I took an open-source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy, and then reading each data record from the main data file according to the indexed order was also easy. I didn't have to do any in-memory sorting, and the system was way faster than the available RDBMS systems that would run on a client machine at the time.
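The original system converted a binary search tree to disc access; as a simpler stand-in for the same idea, here is a sketch of a binary search over a sorted on-disk index file of fixed-size (key, record number) entries. The entry layout is assumed for the example; the point is that each probe is one seek plus one small read, so nothing has to be loaded into memory:

```python
import struct

ENTRY_FMT = "<ii"                       # (key, record number), both 32-bit
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)

def index_lookup(index_file, key):
    """Binary search a sorted on-disk index of fixed-size entries."""
    index_file.seek(0, 2)                       # seek to end to get file size
    n_entries = index_file.tell() // ENTRY_SIZE
    lo, hi = 0, n_entries - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        index_file.seek(mid * ENTRY_SIZE)       # jump straight to the middle entry
        k, rec_no = struct.unpack(ENTRY_FMT, index_file.read(ENTRY_SIZE))
        if k == key:
            return rec_no                       # record number in the data file
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None                                 # key not present
```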

For fixed-record-size data, the index can just keep track of the record number. For variable-length data records, the index just needs to keep the offset within the file where the record begins, and each record has to start with a structure that identifies its length.
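Both lookups are a seek plus a read. In this sketch the record size and the 4-byte length header are made-up layout choices, just to show the two cases side by side:

```python
import struct

RECORD_SIZE = 128          # assumed fixed record size for this sketch

def read_fixed_record(data_file, record_number):
    """Fixed-size records: the offset is just a multiple of the record size."""
    data_file.seek(record_number * RECORD_SIZE)
    return data_file.read(RECORD_SIZE)

def read_variable_record(data_file, offset):
    """Variable-length records: the index stores the byte offset, and the
    record itself starts with a 4-byte length header."""
    data_file.seek(offset)
    (length,) = struct.unpack("<i", data_file.read(4))
    return data_file.read(length)
```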

Hmm... Interesting question.

I believe that most widely used database management systems rely on the operating system's mechanisms for memory management, so when physical memory runs out, in-memory tables go to swap.