Let's say my application creates, stores and retrieves a very large number of records (hundreds of millions). Each record holds a varying amount of data (for instance, some records only have a couple of bytes such as an ID/title, while others might have megabytes of extra data). The basic structure of every record is the same, and it is in XML format.

Records are created and edited (most likely by appending, not rewriting) at random.

Would it make sense to keep records as separate files in a file system and store the necessary sets of indexes in the DB, versus storing everything in a DB?

It really depends on how you are going to use it. Databases can handle more records in a table than most people think, especially with proper indexing. On the other hand, if you aren't going to be using the functionality a relational database provides, there may not be much point in using one.

OK, enough generalizing. Considering that a database eventually boils down to "files on disk" anyway, I wouldn't worry too much about what "the right thing to do" is. If the primary purpose of the database is just to efficiently retrieve these files, it would be perfectly fine to keep the DB records small and look up file paths rather than the actual data - especially since your file system should be pretty efficient at retrieving data given a specific location.
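For concreteness, here's a minimal sketch of that pattern using SQLite (the table layout and helper names are my own illustration, not from the question): the XML payload lives on disk, and the DB row holds only the ID, a little metadata and the path.

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("records")
DATA_DIR.mkdir(exist_ok=True)

db = sqlite3.connect("index.db")
db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, title TEXT, path TEXT)")

def store(record_id: str, title: str, xml: bytes) -> None:
    path = DATA_DIR / f"{record_id}.xml"
    path.write_bytes(xml)  # the payload goes to the file system
    db.execute("INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
               (record_id, title, str(path)))  # only metadata goes to the DB
    db.commit()

def fetch(record_id: str) -> bytes:
    row = db.execute("SELECT path FROM records WHERE id = ?",
                     (record_id,)).fetchone()
    return Path(row[0]).read_bytes()
```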

In case you're interested, this is actually a common data storage pattern for search engines - the index stores the indexed data along with a pointer to the stored data on disk, rather than storing everything in the index.

I'd definitely keep the data on the file system and store a hash of the path in the DB.
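As a hedged illustration of what that could look like (the fan-out scheme is my assumption, not the answerer's): hashing the ID or path also gives you an even directory fan-out, so no single directory ends up holding millions of files.

```python
import hashlib

def path_for(record_id: str) -> str:
    # Derive a fanned-out path from the record ID; the DB stores this path
    # (or just the digest, from which the path can be rebuilt).
    digest = hashlib.sha1(record_id.encode()).hexdigest()
    return f"records/{digest[:2]}/{digest[2:4]}/{digest}.xml"

# e.g. path_for("42") -> "records/xx/yy/<full-digest>.xml"
```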

Well, depending on your budget, MS SQL Server has what's called a "Primary XML Index" that can be created, even on unstructured data. This allows you to write XQuery to search down the columns, and the database will assist you.

If there's any coherence at all within the data, or it can be put into a schema, you might see a benefit to this.
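A rough sketch of what that looks like (table and column names are hypothetical, and it assumes the pyodbc driver): a primary XML index requires the table to have a clustered primary key, and XQuery predicates such as exist() can then make use of it.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=mydb;Trusted_Connection=yes"
)
cur = conn.cursor()

# Build the primary XML index over the XML column (hypothetical names).
cur.execute("CREATE PRIMARY XML INDEX ix_records_data ON records(data)")
conn.commit()

# An XQuery predicate the index can support.
cur.execute(
    "SELECT id FROM records "
    "WHERE data.exist('/record/title[text()=\"some title\"]') = 1"
)
rows = cur.fetchall()
```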

Might I suggest, if you have large amounts of binary data such as images etc., that you strip these out and put them elsewhere, such as a file system. Or, if you're using SQL Server 2008, there's a type called "Filestream" (cheers @Marc_s) which allows you to index, store and secure all the files you write, and use NTFS APIs to retrieve them (i.e. fast block transfer), but still have them stored as columns in the database.
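A hedged sketch of such a table (the database must already have a FILESTREAM-enabled filegroup, which is not shown, and all names here are hypothetical):

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=mydb;Trusted_Connection=yes"
)
# A FILESTREAM column stores the blob as an NTFS file while it remains
# queryable as a column; the table needs a ROWGUIDCOL for FILESTREAM.
conn.execute(
    "CREATE TABLE blobs ("
    " id UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),"
    " payload VARBINARY(MAX) FILESTREAM"
    ")"
)
conn.commit()
```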

Having the database there may give you a good layer of abstraction and scaling if your application puts large demands on searching through the XML data, meaning your application doesn't have to.

Just my 2c.

A couple of considerations:

  • transaction management
  • backup and recovery.

These are generally easier to manage with a database than with a file system. But probably the hardest thing is to synchronise a file system backup with a database's roll-forward (redo) logging. The more transactional your application, the more these considerations matter.
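To make the synchronisation problem concrete, here's a minimal sketch (my own illustration, assuming the hybrid files-plus-index layout discussed above): the file write and the DB insert are two separate operations, so a crash between them leaves an orphaned file or a dangling row, and cleanup can only be best-effort.

```python
import sqlite3
from pathlib import Path

db = sqlite3.connect("index.db")

def store(record_id: str, xml: bytes) -> None:
    path = Path(f"records/{record_id}.xml")
    path.write_bytes(xml)  # step 1: file system write
    try:
        with db:  # step 2: DB transaction; commits on success, rolls back on error
            db.execute("INSERT INTO records (id, path) VALUES (?, ?)",
                       (record_id, str(path)))
    except sqlite3.Error:
        path.unlink(missing_ok=True)  # best-effort undo; not crash-safe
        raise
```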

It appears from your question that you're not planning to make any use of normal database functionality (relational integrity, joining). In that case, you should give strong consideration to a third option: store your data in the file system and, instead of a database, use a file-based text retrieval engine like Solr (or Lucene), Sphinx, Autonomy, etc.
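A minimal sketch of that third option, assuming Solr and the third-party pysolr client (the core name and fields are hypothetical): the XML stays on disk, and the retrieval engine indexes the searchable fields plus a pointer back to the file.

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/records", always_commit=True)

# Index the searchable fields plus the on-disk path.
solr.add([{
    "id": "42",
    "title": "some record title",
    "path": "records/42.xml",
}])

# Query the index, then read the matching files from disk.
for hit in solr.search("title:record"):
    print(hit["id"], hit["path"])
```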

I would use HDFS (Hadoop Distributed File System) to store the data. The main idea is that you get high availability, scalability and replication. Queries against your application can be written as map-reduce queries. And the main fields can be stored as a distributed index on top of Hadoop using Katta.
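As a hedged sketch (host, port and paths are hypothetical, and it assumes the third-party `hdfs` WebHDFS client):

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="app")

# HDFS handles replication and availability for the stored records.
client.write("/records/42.xml", data=b"<record>...</record>", overwrite=True)

with client.read("/records/42.xml") as reader:
    xml = reader.read()
```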

Try looking into these technologies.