I'm trying to train a Naive Bayes classifier on positive/negative words extracted from sentences labeled by sentiment. For example:

I really like this movie :))

I don't like when it rains :(

The idea is that I extract positive or negative sentences according to the emoticons used, in order to train a classifier and persist it to a database.

The problem is that I have more than a million such sentences, so if I train it word by word, the database will take a beating. I want to remove all non-relevant words, for example 'I', 'this', 'when', 'it', so that the number of times I have to query the database is lower.

Please help me solve this problem, or point me to possible ways of doing it.


There are two common approaches:

  1. Compile a stop list.
  2. POS tag the sentences and throw away the parts of speech that you think are not interesting.
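
As a rough sketch of the first approach: the stop list and the tokenizer regex below are made up for illustration, not a recommended list. The tokenizer keeps emoticons as separate tokens, on the assumption that you want to pull them out as labels later.

```python
import re

# Hypothetical, hand-picked stop list; a real one would be tuned to the data.
STOP_WORDS = {"i", "this", "when", "it", "the", "a"}

# Matches a word (optionally with an apostrophe) or a simple emoticon
# such as :) :)) :( ;-)
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[:;]-?[()]+")

def tokenize(sentence):
    """Lowercase the sentence and split it into word and emoticon tokens."""
    return TOKEN_RE.findall(sentence.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not on the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("I really like this movie :))")
filtered = remove_stop_words(tokens)
# filtered == ['really', 'like', 'movie', ':))']
```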

In both cases, determining which words/POS tags are relevant may be done using a measure such as PMI (pointwise mutual information).
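
For what it's worth, here is one way PMI between a word and a sentiment label might be computed. This is only a sketch: it assumes probabilities are estimated from sentence counts (a word counts once per sentence), and the function name and toy data are invented for the example.

```python
import math
from collections import Counter

def pmi_scores(labeled_sentences):
    """Return {(word, label): PMI} from a list of (tokens, label) pairs.

    Probabilities are estimated from sentence counts: a word is counted
    once per sentence it appears in.
    """
    n = len(labeled_sentences)
    word_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    for tokens, label in labeled_sentences:
        label_counts[label] += 1
        for word in set(tokens):
            word_counts[word] += 1
            joint_counts[word, label] += 1

    scores = {}
    for (word, label), joint in joint_counts.items():
        # PMI = log2( P(word, label) / (P(word) * P(label)) )
        scores[word, label] = math.log2(
            joint * n / (word_counts[word] * label_counts[label])
        )
    return scores

# Toy data, invented for illustration only.
data = [
    (["really", "like", "movie"], "pos"),
    (["like", "sunshine"], "pos"),
    (["don't", "like", "rain"], "neg"),
    (["hate", "rain"], "neg"),
]
scores = pmi_scores(data)
```

Words whose PMI is close to zero for every label (here "like", which shows up under both labels) are candidates for the stop list. Pairs that never co-occur are simply absent from the result.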

Mind you: standard stop lists from information retrieval may or may not work in sentiment analysis. I recently read a paper (no reference, sorry) which claimed that ! and ?, commonly removed by search engines, are valuable clues for sentiment analysis. (So may 'I' be, esp. when you also have a neutral category.)

Edit: you can also safely throw away everything that occurs only once in the training set (so-called hapax legomena). Words that occur once have little information value for your classifier, but may take up a lot of space.
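
Dropping hapaxes could look something like this (a minimal sketch, assuming the sentences are already tokenized; the helper name is hypothetical):

```python
from collections import Counter

def drop_hapaxes(tokenized_sentences):
    """Remove every token that occurs exactly once in the whole training set."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    return [
        [t for t in sentence if counts[t] > 1]
        for sentence in tokenized_sentences
    ]

# Toy training set, invented for illustration only.
training = [
    ["really", "like", "movie"],
    ["like", "rain"],
    ["hate", "rain"],
]
pruned = drop_hapaxes(training)
# pruned == [['like'], ['like', 'rain'], ['rain']]
```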

To reduce the amount of data retrieved from your database, you can create a dictionary in the database (a table that maps words* to numbers**) and then retrieve only a vector of numbers for training, plus the complete sentence when you need to mark its sentiment manually.

* No scientific publication comes to mind, but maybe it is enough to use only stems or lemmas instead of words. That would reduce the size of the dictionary.

** If this operation kills your database, you can build the dictionary in a local application that uses a text indexing engine (e.g., Apache Lucene) and store only the result in your database.
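
A possible sketch of such a dictionary table, using SQLite only because it ships with Python. The table name, column names and helper functions are assumptions made for the example, and the same idea should carry over to whatever database you actually use.

```python
import sqlite3

def build_dictionary(conn, tokenized_sentences):
    """Store each distinct word once and return a word -> integer id mapping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dictionary ("
        "id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL)"
    )
    words = sorted({w for sentence in tokenized_sentences for w in sentence})
    # INSERT OR IGNORE skips words that are already in the table.
    conn.executemany(
        "INSERT OR IGNORE INTO dictionary (word) VALUES (?)",
        [(w,) for w in words],
    )
    conn.commit()
    rows = conn.execute("SELECT id, word FROM dictionary")
    return {word: word_id for word_id, word in rows}

def encode(tokens, word_ids):
    """Turn a token list into the list of ids used for training."""
    return [word_ids[t] for t in tokens if t in word_ids]

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
sentences = [["really", "like", "movie"], ["don't", "like", "rain"]]
word_ids = build_dictionary(conn, sentences)
vectors = [encode(s, word_ids) for s in sentences]
```

Training then only needs to pull `vectors` (small integers) from storage, and the word strings are looked up only when a human has to read a sentence.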