I recently launched my humble side project and want to add a "related submissions" section when viewing a submission, similar to what is done here: see the right column, titled "Related".

Considering that every submission has a title and some tags, what is the best (optimal results) and most efficient (fast, memory-friendly) way to query the database for related submissions?

I can think of one way to do this (which I'll post as an answer), but I am very interested to hear what others have to say. Or perhaps there is already a standard way of accomplishing this?

Here's my two-cent solution:
To give the best output, we need to put a "weight" on the query results.

To begin with, each submission in the database is assumed to have a weight of zero. Then, if a submission in the "pool" shares one tag with the current submission, we add +3 to that submission's weight. Likewise, if another submission shares two tags with the current submission, we add +6 to its weight.

Next, we split/tokenize the title of the current submission and remove "stop words".
I have seen a list of stop words from Google, but for the time being I'll define my stop words to be: ["of", "a", "the", "in"]

Title: "The Best Submission of All Times"
Resulting array: ["The", "Best", "Submission", "of", "All", "Times"]
After removing stop words (ignoring case): ["Best", "Submission", "All", "Times"]
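
The tokenizing step above could be sketched roughly like this in Python. The function name `title_keywords` and the regular expression are my own assumptions, not something settled; the stop-word list is the one given above.

```python
import re

# Stop words as listed above; a real list would be much longer.
STOP_WORDS = {"of", "a", "the", "in"}

def title_keywords(title, stop_words=STOP_WORDS):
    """Split a title into words and drop stop words (case-insensitive).

    Hypothetical helper: the name and the tokenizing regex are assumptions.
    """
    tokens = re.findall(r"[A-Za-z0-9']+", title)
    return [t for t in tokens if t.lower() not in stop_words]

print(title_keywords("The Best Submission of All Times"))
# ['Best', 'Submission', 'All', 'Times']
```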

Then we query the database for submissions whose titles contain the remaining title words, and for each result we add +2 to its weight.
Finally, we sort the list in descending order by weight and take the top N results.
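
To make the weighting concrete, here is a minimal in-memory sketch in Python. It is only an illustration under a few assumptions: submissions are plain dicts with hypothetical `id`, `title` and `tags` keys, the +2 is added once per submission that shares at least one title word (the description above leaves open whether it should be per word), and a real version would push the matching into database queries rather than loop over every row.

```python
import re

STOP_WORDS = {"of", "a", "the", "in"}
TAG_WEIGHT = 3    # +3 per shared tag, as described above
TITLE_WEIGHT = 2  # +2 for a title match (assumed: once per submission)

def keywords(title):
    """Lower-cased title words with stop words removed."""
    return {w.lower() for w in re.findall(r"[A-Za-z0-9']+", title)} - STOP_WORDS

def related(current, pool, top_n=5):
    """Return up to top_n (weight, id) pairs, heaviest first."""
    current_words = keywords(current["title"])
    current_tags = set(current["tags"])
    scored = []
    for other in pool:
        if other["id"] == current["id"]:
            continue  # never relate a submission to itself
        weight = TAG_WEIGHT * len(current_tags & set(other["tags"]))
        if current_words & keywords(other["title"]):
            weight += TITLE_WEIGHT
        if weight > 0:
            scored.append((weight, other["id"]))
    # Descending by weight; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]

pool = [
    {"id": 1, "title": "The Best Submission of All Times", "tags": ["python", "sql"]},
    {"id": 2, "title": "Best practices in SQL", "tags": ["sql"]},
    {"id": 3, "title": "A tale of indexing", "tags": ["python", "sql"]},
    {"id": 4, "title": "Unrelated notes", "tags": ["cooking"]},
]
print(related(pool[0], pool))
# [(6, 3), (5, 2)]
```

Submission 3 wins on two shared tags (+6), submission 2 gets one shared tag plus a title match (+3 +2), and submission 4 scores zero and is left out.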

What do you think? (be gentle!)

If I understand correctly, you need a way to determine whether two posts are "similar" to one another. You might want to use a probabilistic model for that:


The idea is to say that if two posts share a lot of "uncommon" words, they are probably talking about the same subject. To find uncommon words, depending on your application, you can use a general table of word frequencies or, perhaps better, build it yourself from the universe of words in your posts (but you will need enough of them to get something relevant).
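
As a rough illustration of the "uncommon words" idea, the sketch below builds the frequency table from the posts themselves and scores two posts by the rarity of the words they share. Note that it swaps in a simple inverse-document-frequency weight, log(N / df), as a stand-in for a full probabilistic model; the function names and the sample posts are made up for the example.

```python
import math
import re
from collections import Counter

def words(text):
    """Set of lower-cased words in a post."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def rarity_weights(posts):
    """Weight each word by log(N / number of posts containing it).

    A word that appears in every post gets weight 0; rare words get more.
    """
    doc_freq = Counter()
    for text in posts:
        doc_freq.update(words(text))
    n = len(posts)
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def similarity(a, b, weights):
    """Sum the rarity weights of the words two posts have in common."""
    return sum(weights.get(w, 0.0) for w in words(a) & words(b))

posts = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the stock market fell",
    "the market rallied",
]
weights = rarity_weights(posts)
print(similarity(posts[0], posts[1], weights))  # shares "cat": log(2)
print(similarity(posts[0], posts[2], weights))  # shares only "the": 0.0
```

With this weighting, common words such as "the" drop out on their own, so a hand-written stop-word list becomes less important once there are enough posts.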

I wouldn't limit myself to the title and tags, but I would give them extra weight in the search.

This kind of idea is very common in spam filtering. Unfortunately I don't have time to write a full review, but a quick search gives:

http://www.aclweb.org/anthology/P/P04/P04-3024.pdf
karlmicha.googlepages.com/acl2004_poster.pdf