This is a hard question, but I'll ask it anyway: our task is to feed Microsoft FAST ESP with gigabytes of data. The final amount of indexed data is somewhere in the neighborhood of 50-60 GB.

FAST has a .NET API, but its core components are written in Python (the processing pipelines that index documents). The challenge is to reliably communicate with the system while feeding it gigabytes of data for indexing.
The problems that arise with FAST here are:
- the system misbehaves when it is given too much data at once, because it wants to reindex its data, during which time the system stays unreachable for hours. Unacceptable.
- it is not an option to queue up all the data and serially feed one item at a time, because this would take too long (several days).
- when an item cannot be indexed by FAST, the client has to re-feed the item. For this to work, the system is supposed to call a callback method to inform the client about the failure. However, whenever the system times out, the feeding client cannot react to the timeout because that callback is never called. Hence the client is starving. Data is in the queue but cannot be passed along to the system. The queue collapses. Information is lost. You get the picture.
- feeding an item can take seconds for a small item and up to 5-8 hours for a single large item.
- the items being indexed are both binary and text based.
- the goal is for the full indexing to take "only" 48-72 h, i.e. it must happen over the weekend.
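Since the failure callback cannot be relied on, the client has to enforce its own timeout and re-queue the item itself. A minimal sketch of that idea in Python (the `feed_item` function is a hypothetical stand-in for the real FAST ESP feed call):

```python
import queue
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def feed_item(item):
    """Hypothetical stand-in for the actual FAST ESP feed call."""
    return f"indexed:{item}"

def feed_with_retry(items, timeout_s=5.0, max_retries=2):
    """Feed items, re-queueing anything that times out client-side,
    since the server-side failure callback cannot be relied on."""
    pending = queue.Queue()
    for item in items:
        pending.put((item, 0))
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        while not pending.empty():
            item, attempts = pending.get()
            future = pool.submit(feed_item, item)
            try:
                # client-side timeout, independent of any FAST callback
                results.append(future.result(timeout=timeout_s))
            except TimeoutError:
                if attempts < max_retries:
                    pending.put((item, attempts + 1))  # re-feed the item
                else:
                    failed.append(item)  # give up, record the loss
    return results, failed
```

The point is only that the retry decision lives in the client, so a silent timeout on the server side can no longer starve the queue.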
- The FAST document processing pipelines (Python code) have around 30 stages each. There are a total of 27 pipelines as of this writing.
To sum up:
The main challenge is to feed the system with items, big and small, at just the right speed (not too fast, because it might collapse or run into memory issues; not too slow, because this would take too long), concurrently, in a parallel manner like asynchronously running threads. In my opinion there has to be an algorithm that decides when to feed which items and how many at once. Parallel programming comes to mind.

There could be multiple "queues" where each queue (process) is dedicated to items of a certain size, which are loaded into the queue and then fed one by one (in worker threads).
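To make the size-partitioned-queues idea concrete, here is a small sketch, with `feed_item` as a hypothetical stand-in for the real feed call and the cutoff/worker counts as made-up tuning knobs:

```python
import threading
import queue

def feed_item(item):
    """Hypothetical stand-in for the actual FAST ESP feed call."""
    return len(item)

def run_partitioned_feed(items, size_cutoff=1024,
                         small_workers=8, large_workers=2):
    """Partition items by size into dedicated queues, each drained by
    its own pool of worker threads, so one huge item never blocks the
    stream of small ones."""
    small_q, large_q = queue.Queue(), queue.Queue()
    for item in items:
        (small_q if len(item) < size_cutoff else large_q).put(item)

    done, lock = [], threading.Lock()

    def worker(q):
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return  # this queue is drained
            result = feed_item(item)
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker, args=(small_q,))
               for _ in range(small_workers)]
    threads += [threading.Thread(target=worker, args=(large_q,))
                for _ in range(large_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Giving the small-item queue more workers than the large-item queue is one way to encode the "right speed" trade-off: many cheap feeds run in parallel while a few long-running ones trickle through separately.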
I am curious if anybody has ever done anything like this, or how you would approach a problem like this. Thanks.
EDIT: Again, I am not looking to "fix" FAST ESP or improve its inner workings. The challenge is to use it effectively! Thanks!
First of all, you should use Tasks for this kind of problem.

They can be started synchronously, asynchronously, on a thread pool, and so on, and they are much cheaper in memory than threads with explicit locking.

I think Task.ContinueWith fits your problem perfectly.
The algorithm would look like this:
- Gather a queue with the data you need to post.
- Start a task (or several tasks, if you are feeling risky :) that takes the heaviest object from the queue (and the smallest object from the other end) and starts uploading it.
- Create a continuation for the end of the upload that starts a new task for the next queue item.
- You can use cancellation tokens for the timeouts.
- That way you can tell on which item the system hit an error.
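The steps above can be sketched in Python as an analogue of the .NET approach: `Future.add_done_callback` stands in for `Task.ContinueWith`, and the `upload` function is a hypothetical stand-in for the real FAST ESP upload. One worker pulls from the heavy end of a size-sorted deque, the other from the light end, and each finished upload schedules the next item from its own end:

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def upload(item):
    """Hypothetical stand-in for the actual FAST ESP upload call."""
    return f"up:{item}"

def feed_both_ends(items, on_error=None):
    """Feed the heaviest and the lightest remaining items concurrently;
    each completed upload's continuation launches the next item from
    the same end of the deque."""
    q = deque(sorted(items, key=len))  # lightest at left, heaviest at right
    lock = threading.Lock()
    results, in_flight = [], [0]
    done = threading.Event()
    pool = ThreadPoolExecutor(max_workers=2)

    def launch(take_heavy):
        with lock:
            if not q:
                if in_flight[0] == 0:
                    done.set()  # nothing queued, nothing running
                return
            item = q.pop() if take_heavy else q.popleft()
            in_flight[0] += 1
        fut = pool.submit(upload, item)
        # continuation: the Python analogue of Task.ContinueWith
        fut.add_done_callback(lambda f: finished(f, take_heavy))

    def finished(fut, take_heavy):
        with lock:
            in_flight[0] -= 1
            err = fut.exception()
            if err is None:
                results.append(fut.result())
            elif on_error:
                on_error(err)  # here you know exactly which upload failed
        launch(take_heavy)  # start the next item from the same end

    launch(True)   # heavy-end worker
    launch(False)  # light-end worker
    done.wait()
    pool.shutdown()
    return results
```

Python has no CancellationToken; a `threading.Event` checked inside `upload`, or a client-side timeout around each future, would play that role in this sketch.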