I'm extracting 4-grams from binary files in hexadecimal form, which means I can have at most 65535 different grams per item.
I want to associate every item with its grams and their frequencies, but I am puzzled about how to store everything. This is my first data mining experience and I have no clue about best practices and common tools.
I was naively thinking of building a big table in a relational database with a schema like
(ITEM-NAME, GRAM1, GRAM2... GRAM65535) and storing the frequencies in it, but I can see this approach is uber impractical due to the number of columns.
I know there must be better solutions out there, but I don't know where to look.
The best way to store n-grams is a prefix tree (trie), IMHO. It is used in the very powerful LingPipe library.
Example of a tree:
1. gr1
   1. gr2 (item1)
   2. gr3 (item2, item3, item4)
2. gr3 (item1, item2)
3. gr2
   1. gr3 (item5, item6)
   2. gr4 (item1)
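A tree like the one above can be sketched in a few lines of Python. This is only a minimal illustration, not how LingPipe implements it: the names `TrieNode`, `GramTrie`, `insert` and `items_for` are made up for this example, and each path in the tree is assumed to be a sequence of grams ending in the set of items that contain that sequence.

```python
class TrieNode:
    """One node of the prefix tree: child nodes keyed by gram, plus items."""

    def __init__(self):
        self.children = {}   # gram -> TrieNode
        self.items = set()   # items whose gram sequence ends at this node


class GramTrie:
    """Hypothetical prefix tree mapping gram sequences to item names."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, grams, item):
        # Walk down the tree, creating nodes for grams not seen yet.
        node = self.root
        for gram in grams:
            node = node.children.setdefault(gram, TrieNode())
        node.items.add(item)

    def items_for(self, grams):
        # Follow the path; an unknown gram means no item has this sequence.
        node = self.root
        for gram in grams:
            node = node.children.get(gram)
            if node is None:
                return set()
        return set(node.items)


trie = GramTrie()
trie.insert(["gr1", "gr2"], "item1")
trie.insert(["gr1", "gr3"], "item2")
trie.insert(["gr1", "gr3"], "item3")
trie.insert(["gr3"], "item1")

print(trie.items_for(["gr1", "gr3"]))
```

Sequences that share a prefix (here `gr1`) share the same nodes, which is where the space saving comes from.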
The other option is to store the data as an inverted index: n-gram -> items
gr1 (item1, item2)
gr2 (item1, item3)
gr3 (item2, item3)
gr4 (item1, item2)
Note: the second option doesn't store order information, which is crucial for n-grams...
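The inverted-index option can be sketched roughly like this, with the per-item frequency from the question stored next to each item. The helpers `hex_grams` and `build_index` are names invented for this example, and it assumes grams are taken with a sliding window that moves one hex digit at a time; adjust the step if your grams are extracted differently.

```python
from collections import Counter, defaultdict


def hex_grams(data: bytes, n: int = 4):
    """Return all n-character windows of the hex form of data (step of 1)."""
    hex_text = data.hex()
    return [hex_text[i:i + n] for i in range(len(hex_text) - n + 1)]


def build_index(items):
    """Build gram -> {item name: frequency} from a dict of name -> bytes."""
    index = defaultdict(dict)
    for name, data in items.items():
        # Counter gives the frequency of each gram within one item.
        for gram, count in Counter(hex_grams(data)).items():
            index[gram][name] = count
    return index


# Tiny made-up inputs, only to show the shape of the result.
items = {
    "item1": bytes.fromhex("deadbeef"),
    "item2": bytes.fromhex("beefbeef"),
}
index = build_index(items)

print(index["beef"])
```

Only grams that actually occur are stored, so there is no need for a 65535-column table; the same sparse (gram, item, frequency) triples would also fit a three-column relational table.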