My primary goal is to serve many XML files (> 1bn, each < 1 KB) via a web server. The files can be considered static, as they will only be modified by external code, at relatively low frequency (about 50k updates daily). The files will be requested at high frequency (> 30 req/sec).

The current suggestion from my team is to build a dedicated Java application that implements the HTTP protocol and uses memcached to speed things up, keeping all file data in an RDBMS and eliminating the file system.

On the other hand, I think a tweaked Apache web server or lighttpd should be enough. Caching can be left to the OS or to the web server's default caching. There is no point in keeping the data in a DB when the same output is required and it is only queried by file name. I'm not sure how memcached would help here. Also, updating the external cache (memcached) whenever a file is updated by external code will add complexity.

Another question: if I go with files, is it feasible to store them in a directory hierarchy and access them by file name, e.g. abcd.xml? Or should I put all 1bn files in a single directory (not sure the OS allows that or not)?

This is not a website but an API for an application in a closed network, so a Cloud/CDN is of no use.

I am planning to use CentOS + Apache/lighttpd. Please suggest alternatives and the best possible solution.

This seems to be the only public note available on the subject, and it is a little old too.

1bn files at 1 KB each is about 1 TB of data. Impressive. So it will not fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
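A quick back-of-the-envelope check of those numbers (the 4 KB filesystem block size is an assumption on my part; it happens to be the ext4 default, your filesystem may differ):

```python
# Rough storage estimate for 1 billion files of ~1 KB each.
num_files = 1_000_000_000
file_size = 1024          # bytes, ~1 KB per file
block_size = 4096         # assumed filesystem block size (ext4 default)

raw_bytes = num_files * file_size
# Each file occupies at least one whole block, so small files waste space.
on_disk_bytes = num_files * max(file_size, block_size)

print(raw_bytes / 2**40)      # ~0.93 TiB of actual data
print(on_disk_bytes / 2**40)  # ~3.7 TiB actually consumed with 4 KB blocks
```

So with a stock 4 KB block size the waste is nearly 4x, which is why the file system layout deserves attention here.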

30 requests a second is far less impressive. It is certainly not the limiting factor for the network, nor for any serious web server out there. It might be a small challenge for a slow hard disk.

So my approach would be: put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it if you don't reach 50 files a second. But don't invest in anything unless you have proven that it is a limiting factor.

Possible optimizations are:

  • Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
  • Distribute the files over several hard disks so that they can be accessed in parallel
  • Use faster hard disks
  • Use solid state disks (SSDs). They are expensive, but can easily serve hundreds of files a second.
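On the first point, a minimal sketch of one common sharding scheme: use the first four characters of the file name as nested directory names (the function name and the 4-level depth are my choices for illustration, not a fixed convention; this mirrors the Apache rewrite mentioned further down):

```python
def shard_path(name: str) -> str:
    """Map 'abcdefg.xml' to 'a/b/c/d/efg.xml'."""
    stem, ext = name.rsplit(".", 1)
    # First four characters become nested directories; the rest stays
    # in the file name, exactly like the RewriteRule would split it.
    return "/".join(list(stem[:4]) + [stem[4:] + "." + ext])

print(shard_path("abcdefg.xml"))  # a/b/c/d/efg.xml
```

With four levels over, say, 36 possible characters (a–z, 0–9) you get about 1.7M leaf directories, so 1bn files average out to roughly 600 files per directory, comfortably under the 5,000 guideline above.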

If many of the files are requested several times a day, then even a slow hard disk should be enough, because your OS will have the files in the file cache. With today's file cache sizes, a good deal of your daily deliveries will fit into the cache. Because at 30 requests a second, you serve at most 0.25% of your files each day.
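That percentage checks out, assuming the worst case for the cache where every request hits a distinct file:

```python
requests_per_second = 30
seconds_per_day = 24 * 60 * 60
total_files = 1_000_000_000

# Worst case: every request is for a different file.
requests_per_day = requests_per_second * seconds_per_day
fraction = requests_per_day / total_files

print(requests_per_day)   # 2592000 distinct files per day, at most
print(f"{fraction:.2%}")  # 0.26%
```

And 2.6M files at ~1 KB each is only about 2.5 GiB, so the entire daily working set fits easily in the file cache of a commodity server.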

Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:

RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml

Another thing you could look at is Pomegranate, which seems to be very similar to what you are trying to do.

In my opinion, a dedicated application with everything feeding off a memcache DB would be the best option.