I have a MySQL database to which a considerable amount of text is constantly added (about 10 pages of text each hour). The text is stored as plain text in TEXT fields. Every row contains a page or two of text.

I need to run full-text searches (look for a keyword within the text and do complex queries) on this database regularly. I only need to search recently added text, but it is crucial that newly added text becomes searchable almost immediately (within a few minutes).

From what I have read, full-text search in MySQL is extremely inefficient. I know Lucene is an option, but I am not sure yet how quickly it can index new text.

What are my options? Is there a way to make MySQL more efficient? Is Lucene my best solution? Or is something else more appropriate?


You've got a handful of options:

  • Sphinx Search: Can integrate directly with your MySQL DB. Supports real-time indexes, with restrictions.

  • Solr/Lucene: Feed it data via JSON or XML from your DB. Has rich querying capabilities. Current versions aren't real-time without some edge builds: you have to re-index your data and commit it for the changes to appear. Depending on your amount of data, you could do a commit every 10 minutes. This won't be a problem until you have 100K / 1M+ documents, as Lucene is pretty fast at indexing. 10 pages / hour is fairly trivial.

  • ElasticSearch: Java-based like Solr/Lucene, but appears to be truly "near real-time". It is designed from the start to be distributed and to support linear scale-out. You feed it data via JSON and query via JSON.
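To make "feed it JSON, query via JSON" a bit more concrete, here is a rough sketch of what the request bodies for "find a keyword in recently added text" might look like in ElasticSearch's query DSL. The field names `body` and `added_at`, the helper function names, and the one-hour window are my own assumptions, and the code only builds the JSON; it does not talk to a server.

```python
import json


def build_document(text, added_at):
    """Build the JSON document to index (field names are hypothetical)."""
    return {"body": text, "added_at": added_at}


def build_recent_keyword_query(keyword, window="now-1h"):
    """Build a bool query: full-text match on `body`, restricted to recent rows."""
    return {
        "query": {
            "bool": {
                # Scored full-text match on the (assumed) text field
                "must": [{"match": {"body": keyword}}],
                # Unscored date filter on the (assumed) timestamp field
                "filter": [{"range": {"added_at": {"gte": window}}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_recent_keyword_query("lucene"), indent=2))
```

Putting the date restriction in `filter` rather than `must` keeps it out of relevance scoring, which seems like the natural fit when you only care about recent rows.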

It really depends on your needs and capabilities. Sphinx is probably the easiest to get started with. However, its real-time index restrictions may not work for you.

I've benchmarked indexing times for Sphinx & Solr. Sphinx is way ahead of Solr when it comes to indexing (very fast indexing times and small index size).

When you say 10 pages of text, it seems you don't need real-time Sphinx indexing. You can follow the main + delta indexing scheme in Sphinx (you'll find it in the Sphinx documentation). It would be fast and near real-time. If you want more help with this, feel free to ask; I'd be glad to explain.

Solr is great, but when it comes to optimized indexing algorithms, Sphinx rocks! Try Sphinx.

Coming to your questions in the comment: Solr/Lucene supports incremental indexing (known as delta imports in their terminology), and it is quite easy to configure, but it is pretty slow compared to the method used by Sphinx.

Main+delta is fast enough because what you can do is create a temporary table, store your new text in it, and index only that. According to the documentation: Sphinx supports "live" (almost real-time) index updates, which can be implemented using the so-called "main+delta" scheme. The idea is to set up two sources and two indexes, with one "main" index for the data, and one "delta" for the new documents.

Say, for example, you have 10 million records: you can keep those as the main index, and all new documents get added to a new table that acts as the delta. This new table can be indexed from time to time (say every hour), and the data becomes searchable within a few seconds, since you only have 10 pages of text. Once your new records are searchable, you can merge the documents from the main table + delta table, which can be done without interfering with your searches. Once the documents are merged, empty the new table, and after an hour you can repeat the whole process. I hope that makes sense; if not, feel free to ask.
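The cycle described above (index the small delta often, search main and delta together, merge, empty the delta, repeat) can be sketched with a toy in-memory inverted index. This is only an illustration of the idea, not how Sphinx is implemented: the `MainDeltaIndex` class, its method names, and the whitespace tokenizer are all made up for the example.

```python
from collections import defaultdict


class MainDeltaIndex:
    """Toy main+delta index: a big, rarely rebuilt part plus a small, fresh part."""

    def __init__(self):
        self.main = defaultdict(set)   # term -> doc ids (large, merged rarely)
        self.delta = defaultdict(set)  # term -> doc ids (small, rebuilt often)
        self.pending = []              # the "new table": (doc_id, text) rows

    def add(self, doc_id, text):
        """Store a new row; it is not searchable until the delta is indexed."""
        self.pending.append((doc_id, text))

    def index_delta(self):
        """Rebuild the delta index from the new rows (cheap, so run it often)."""
        self.delta = defaultdict(set)
        for doc_id, text in self.pending:
            for term in text.lower().split():
                self.delta[term].add(doc_id)

    def merge(self):
        """Fold the delta into main, then empty the new table."""
        self.index_delta()  # pick up rows added since the last delta build
        for term, ids in self.delta.items():
            self.main[term] |= ids
        self.delta = defaultdict(set)
        self.pending = []

    def search(self, term):
        """Query both indexes so fresh documents show up alongside old ones."""
        term = term.lower()
        hits = self.main.get(term, set()) | self.delta.get(term, set())
        return sorted(hits)


if __name__ == "__main__":
    idx = MainDeltaIndex()
    idx.add(1, "Sphinx indexes fast")
    idx.index_delta()
    print(idx.search("sphinx"))
    idx.merge()
    print(idx.search("sphinx"))
```

A search hits both structures, so a document is visible as soon as the delta has been rebuilt, and the merge does not change what a query returns.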