Right now I'm using PHP: I have 11 million domains in text files, loaded into an array, which I then search with regex. To do this I have to raise the memory limit to 2 GB, and it takes about ten seconds to process. I'll soon have 100 million domains and plan to move to a database solution, but still: how do you get good performance when searching through a list of 100 million domains?
I search using a regex like this:
$domains = preg_grep("/store./", $array);
foreach ($domains as $domain) { /* ... */ }
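Once the data is in a database, an anchored prefix search can use an ordinary B-tree index instead of scanning every row the way an unanchored regex does. A minimal sketch in Python with SQLite (the table and column names are made up for illustration, not from the post):

```python
import sqlite3

# In-memory database with a hypothetical single-column table of domains.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (name TEXT)")
conn.executemany("INSERT INTO domains VALUES (?)",
                 [("store24.com",), ("bookstore.net",), ("storefront.org",)])
conn.execute("CREATE INDEX idx_name ON domains (name)")

# LIKE 'store%' is anchored at the start, so the index can narrow the scan;
# an unanchored pattern like '%store%' would still force a full scan.
rows = conn.execute(
    "SELECT name FROM domains WHERE name LIKE 'store%' ORDER BY name"
).fetchall()
print([r[0] for r in rows])  # bookstore.net is not matched (no leading anchor)
```

The same principle applies to MySQL/Postgres at 100 million rows: prefix queries stay fast, but patterns with a leading wildcard defeat the index.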
What about a search engine like Lucene: http://lucene.apache.org/java/docs/index.html
It's intended for this very purpose.
Regex is probably the slowest way to search for something. You might also look at MongoDB if you're dealing with such large volumes of data.
Depends what you mean by "search". Regex? Scanning parts of strings? A database won't help there - indexes are of no use for arbitrary partial matches.
OTOH for exact matching (especially if you store the domain name separately from the top-level part)... I'd expect single-digit ms on any decent hardware.
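To illustrate the exact-match case the answer above describes: a hash-based set gives average O(1) membership tests, so even a very large list answers lookups far below the millisecond range (memory permitting). A minimal sketch, with made-up domain names:

```python
# Hypothetical data; in practice this would be the full domain list,
# loaded once and kept resident in memory.
domains = {"example.com", "store24.com", "bookstore.net"}

# Exact-match lookup is a constant-time hash probe, not a scan.
print("store24.com" in domains)
print("store99.com" in domains)
```

This only covers exact matches; for partial or wildcard matches the earlier caveat stands.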
As for regex - better to load the file into memory ONCE and keep it there. Yes, it needs 2 GB - who cares, 64 GB servers are cheap ;)
If you know what the domain starts with, this might be of interest. You could split your text files into "starts with" files. A domain can start with 36 different characters (that's a-z plus 0-9). Have 36 different files, and keep them maintained that way.
Since your example starts with 's', you'd run it against the 's' file only, and the query would be much faster - there's less to search through.
If you will always know the first N characters of the search (i.e. there's no wildcard at the front), you can get exponentially better results by splitting the files up into groups. Since subsequent characters can include a hyphen, your number of files would be at most 36 * (37 ^ (N - 1)), where N > 1... which is still quite a lot!
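The bucketing scheme above can be sketched in a few lines of Python (file layout replaced by an in-memory dict, and the domain names invented for illustration):

```python
from collections import defaultdict

# Hypothetical sample data; in practice each bucket would be its own file.
domains = ["store24.com", "bookstore.net", "storefront.org", "shop.io"]

# Shard by first character: 36 possible buckets (a-z, 0-9).
buckets = defaultdict(list)
for d in domains:
    buckets[d[0]].append(d)

# A search anchored at "store" only needs to scan the 's' bucket,
# skipping every domain that starts with another character.
hits = [d for d in buckets["s"] if d.startswith("store")]
print(hits)
```

With N-character prefixes the buckets shrink geometrically, which is where the 36 * 37^(N-1) count in the answer comes from.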