# Finding outliers in a sparse distribution?

I need to work out the simplest way to identify outliers. Here is the problem, along with a few things that probably won't work. Say you want to fish some quasi-uniform data out of a dirty varchar(50) column in MySQL. Let's start with an analysis by string length.

```
| strlen |  freq  |
|      0 |   2312 |
|      3 |     45 |
|      9 |     75 |
|     10 |  15420 |
|     11 |    395 |
|     12 |    114 |
|     19 |     27 |
|     20 |   1170 |
|     21 |     33 |
|     35 |      9 |
```
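For anyone who wants to reproduce this kind of histogram outside the database, here is a minimal Python sketch. The function name `length_histogram` and the sample values are made up for illustration, and counting `None` as length 0 is an assumption about how missing values should be treated.

```python
from collections import Counter

def length_histogram(values):
    """Count how often each string length occurs.

    Assumption: a missing value (None) is counted as length 0.
    """
    return Counter(len(v) if v is not None else 0 for v in values)

# Hypothetical sample values, not taken from the real column.
sample = ["0123456789", "9876543210", "", None, "abc", "01234567890"]
hist = length_histogram(sample)

# Print the histogram sorted by length, in the same shape as the table above.
for strlen, freq in sorted(hist.items()):
    print(f"{strlen:>6} | {freq:>6}")
```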

What I need is an algorithm to work out which string lengths have a good chance of being genuinely distinct formats rather than typos or random garbage. This field could turn out to be an "enum" type, so there may be several frequency spikes for valid values. Clearly 10 and 20 are valid, and 0 is simply omitted data. 35 and 3 may be random junk despite being very different in frequency. 19 and 21 may be typos around the 20 format. 11 may be typos for 10, but what about 12?

It seems that simply using occurrence frequency % isn't enough. There need to be zones of higher "just a mistake" probability around the obvious outliers.
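To make the "zones around the spikes" idea concrete, here is a rough Python sketch, not an established technique. It labels a length as a near-miss when a directly neighbouring length is far more frequent, and otherwise falls back to a share-of-total check. The function `classify` and its parameters `radius`, `ratio` and `min_share` are hypothetical, and their default values are guesses chosen for this table.

```python
# Length -> frequency, copied from the table in the question.
FREQ = {0: 2312, 3: 45, 9: 75, 10: 15420, 11: 395,
        12: 114, 19: 27, 20: 1170, 21: 33, 35: 9}

def classify(freq, radius=1, ratio=10, min_share=0.01):
    """Label each length 'valid', 'near-miss' or 'isolated'.

    radius, ratio and min_share are illustrative guesses, not tuned values.
    """
    total = sum(freq.values())
    labels = {}
    for length, count in freq.items():
        # Frequencies of the lengths within `radius` of this one.
        neighbours = [freq[n]
                      for n in range(length - radius, length + radius + 1)
                      if n != length and n in freq]
        biggest = max(neighbours, default=0)
        if biggest >= ratio * count:
            labels[length] = "near-miss"   # dwarfed by an adjacent spike
        elif count / total >= min_share:
            labels[length] = "valid"       # a spike in its own right
        else:
            labels[length] = "isolated"    # rare, with no spike nearby
    return labels

labels = classify(FREQ)
wider = classify(FREQ, radius=2)
for length in sorted(FREQ):
    print(length, labels[length], wider[length])
```

With `radius=1`, length 12 comes out as isolated, and with `radius=2` it becomes a near-miss of 10, so the "what about 12?" question turns into a choice of how far a typo is allowed to drift. Length 0 is labelled valid only because it is frequent, so it would still need to be special-cased as missing data.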

Also, a fixed threshold fails when there are 15 distinct lengths ranging between 5 and 20 chars, each with between 7% and 20% occurrence.

Standard deviation won't work since it depends on the mean. Median absolute deviation probably won't work either, because you'll have a high-frequency outlier that can't be thrown away.
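One way to read that objection is to apply both measures to the frequencies themselves. The sketch below does so with Python's `statistics` module. The cutoff of 2 for the z-score is an arbitrary choice, and the 0.6745 factor with a 3.5 cutoff is the commonly used "modified z-score" convention, assumed here rather than taken from the question.

```python
import statistics

# Length -> frequency, copied from the table in the question.
FREQ = {0: 2312, 3: 45, 9: 75, 10: 15420, 11: 395,
        12: 114, 19: 27, 20: 1170, 21: 33, 35: 9}

counts = list(FREQ.values())
mean = statistics.mean(counts)
std = statistics.pstdev(counts)            # population standard deviation
median = statistics.median(counts)
mad = statistics.median([abs(c - median) for c in counts])

# Classic z-score: distance from the mean in standard deviations.
z_flagged = {length for length, c in FREQ.items()
             if abs(c - mean) / std > 2}

# Modified z-score based on the median absolute deviation (MAD).
mad_flagged = {length for length, c in FREQ.items()
               if 0.6745 * abs(c - median) / mad > 3.5}

print("flagged by z-score:", sorted(z_flagged))
print("flagged by MAD:    ", sorted(mad_flagged))
```

On this table the z-score only singles out length 10, because that one spike inflates both the mean and the standard deviation. The MAD version flags 0, 10 and 20, which are exactly the lengths that should be kept, and says nothing useful about 3, 35 or the typo candidates.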

Yes, there will be other parameters for cleaning the data in the code, but length seems to very quickly pre-filter and classify fields with any amount of structure.

Are there any known techniques that would work well here? I'm not very familiar with Bayesian filters or machine learning, but maybe they can help?

Thanks! Leon

Seems like anomaly detection is the way to go. Anomaly detection is a type of machine learning used to find outliers. It comes in a few varieties, including supervised and unsupervised. In supervised learning, the algorithm is trained using examples of outliers. In unsupervised learning, the algorithm tries to find outliers without any examples. Here are a couple of links to get started:

http://en.wikipedia.org/wiki/Anomaly_detection

http://s3.amazonaws.com/mlclass-resources/docs/slides/Lecture15.pdf

I didn't find any links to readily available libraries. Something like MATLAB, or its free cousin, Octave, might be a nice way to go if you can't find an anomaly detection library in your language of choice. https://goker.wordpress.com/tag/anomaly-detection/