I want to build a large, scalable database holding a very large number of high-dimensional vectors, using LSH. Since I must hold all of the data in RAM for fast querying, the data has to be distributed across multiple servers to fit all of the objects.
A naïve approach would be to spread the objects across the servers and send every query to each server. The server that returns the best result holds the right object.
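To make the naive approach concrete, here is a minimal sketch. It assumes Euclidean distance and models each server as a plain in-memory list; the names `spread` and `query_all` are made up for illustration and do not come from any library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spread(objects, num_servers):
    """Round-robin the objects over num_servers 'servers' (plain lists here)."""
    servers = [[] for _ in range(num_servers)]
    for i, obj in enumerate(objects):
        servers[i % num_servers].append(obj)
    return servers

def query_all(servers, query):
    """Ask every server for its local best match, then keep the global best."""
    local_best = [min(s, key=lambda o: euclidean(o, query)) for s in servers if s]
    return min(local_best, key=lambda o: euclidean(o, query))

objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
servers = spread(objects, num_servers=2)
best = query_all(servers, (4.0, 6.0))
```

Every query touches every server here, which is exactly the cost I would like to avoid.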
I am sure there must be a better solution, in which a query doesn't need to be sent to all server nodes and similar objects are grouped together on a single server.
What would be a good approach for distributed LSH tables? Maybe there are even some existing projects?
Thank you for any hint.
First, you need to think about the keys through which the data will be accessed. It's these keys that you would hash, and, knowing the exact keys you need to access, you can hash them to find out which server to query, eliminating the need to query every server.
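As a rough sketch of the exact-key case (assuming string keys, a fixed number of servers, and servers modeled as in-memory dicts; `server_for_key`, `put` and `get` are hypothetical names, not an existing API):

```python
import hashlib

NUM_SERVERS = 4  # assumed fixed cluster size
shards = [dict() for _ in range(NUM_SERVERS)]  # stand-ins for real servers

def server_for_key(key: str, num_servers: int) -> int:
    """Map a key to a server index with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def put(key, value):
    """Store the value on the one server that owns this key."""
    shards[server_for_key(key, NUM_SERVERS)][key] = value

def get(key):
    """Look the key up on the owning server only; no other server is touched."""
    return shards[server_for_key(key, NUM_SERVERS)].get(key)

put("vector-42", [0.1, 0.2, 0.3])
```

A plain modulo like this reshuffles most keys whenever the number of servers changes; consistent hashing is the usual remedy if the cluster is expected to grow.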
Things get harder if you do not know the exact keys (which I suspect is your situation). The LSH imposes a total ordering on the records, where similar records are likely (although not guaranteed) to get the same hash. I think about this as, for instance, a mapping of hyperplanes to the length of their normal vector from the origin. If you are then looking for a hyperplane similar (but not identical) to one that lies between 4 and 5 units from the origin, a good place to start searching is among the other hyperplanes between 4 and 5 units from the origin. Hence, if 'distance from origin' is your locality-sensitive hash function, you can shard your data by it and thereby reduce load (while increasing worst-case latency) by searching only the shard with a matching 'distance from origin' LSH value. With this specific LSH, where similarity is linearly correlated with the hash, it may be possible to get a definitive result while accessing only a subset of the distributed servers. This is not the case for all LSH functions.
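Here is a minimal sketch of the 'distance from origin' idea, assuming plain vectors, Euclidean distance, and range partitioning on the norm (`SHARD_WIDTH`, `NUM_SHARDS` and the function names are illustrative assumptions). It relies on the reverse triangle inequality, $|\,\|a\| - \|b\|\,| \le \|a - b\|$: any vector within `radius` of the query must have a norm within `radius` of the query's norm, so only the shards overlapping that interval can contain a match.

```python
import math

SHARD_WIDTH = 1.0  # assumed: each shard covers a norm interval of this width
NUM_SHARDS = 10    # assumed: larger norms are clamped into the last shard

def norm(v):
    """Distance of a vector from the origin."""
    return math.sqrt(sum(x * x for x in v))

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shard_of(value):
    """Shard index for a norm value, clamped to the valid shard range."""
    return max(0, min(NUM_SHARDS - 1, int(value // SHARD_WIDTH)))

def shards_to_query(query, radius):
    """Only shards whose norm interval overlaps [|q| - r, |q| + r] can hold a hit."""
    lo = shard_of(max(0.0, norm(query) - radius))
    hi = shard_of(norm(query) + radius)
    return list(range(lo, hi + 1))

shards = [[] for _ in range(NUM_SHARDS)]
for v in [(3.0, 3.0), (4.0, 2.0), (0.5, 0.5), (8.0, 1.0)]:
    shards[shard_of(norm(v))].append(v)

def range_search(query, radius):
    """Exact range search that visits only the candidate shards."""
    hits = []
    for i in shards_to_query(query, radius):
        hits.extend(v for v in shards[i] if dist(v, query) <= radius)
    return hits

candidates = shards_to_query((3.5, 2.5), 1.0)
hits = range_search((3.5, 2.5), 1.0)
```

Note that the norm only prunes shards: two vectors with the same norm can still point in opposite directions, so each visited shard still has to compare its candidates against the query.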
IMHO, everything depends on your LSH function, and which one to choose depends on the details of your application.