I have a table with 1,000,000+ records and I need to find the most common substring that is at least 5 characters long.
Say I have the following records:
    KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG
    GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD
    SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG
I need to write a SQL statement that selects 1114H as the most common substring. How do I do that?
- The substring doesn't have to be in the same position.
- The substrings should be of length 5.
- The maximum length of each record is 50 characters.
There is no requirement to find the longest substring, and every substring longer than 5 characters always contains a 5-character substring that occurs at least as often, so it ties for the count at worst. Therefore we only need to check substrings of length 5.
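The reasoning above can be checked on the sample records. The following is a small Python sketch for illustration only, not part of the SQL solution; the helper names `records_containing` and `windows` are made up here. It confirms that each 5-character window of a longer substring appears in at least as many records as the longer substring itself.

```python
# Sample records from the question.
records = [
    "KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG",
    "GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD",
    "SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG",
]


def records_containing(sub):
    """Hypothetical helper: number of records that contain `sub`."""
    return sum(sub in record for record in records)


def windows(text, size=5):
    """All contiguous substrings of `text` with exactly `size` characters."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


longer = " 1114H"  # a 6-character substring shared by the records
for piece in windows(longer):
    # Any record that contains `longer` necessarily contains `piece`,
    # so the count for `piece` can never be lower.
    assert records_containing(piece) >= records_containing(longer)
```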
In the sample data more than one string occurs three times, including _1114 and 1114H.
_ is there to show the location of the space.
In this solution master..spt_values is used in place of a numbers table.
    declare @T table
    (
      ID int identity,
      Data varchar(50)
    )

    insert into @T values
    ('KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG'),
    ('GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD'),
    ('SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG')

    -- N.Number is the start position of each 5-character window in a record
    select top 1 substring(T.Data, N.Number, 5) as Word
    from @T as T
      cross apply (select N.Number
                   from master..spt_values as N
                   where N.type = 'P' and
                         N.number between 1 and len(T.Data)-4) as N
    group by substring(T.Data, N.Number, 5)
    -- rank each substring by the number of distinct records that contain it
    order by count(distinct id) desc
    Word
    ------
     1114
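To sanity-check the query outside the database, the same counting can be sketched in Python. This is an illustrative cross-check that assumes the rows fit in memory, not a replacement for the SQL; the function name `five_gram_counts` is made up here. Like `count(distinct id)`, it counts each substring at most once per record.

```python
from collections import Counter


def five_gram_counts(rows, size=5):
    """Count, for each `size`-character substring, how many rows contain it."""
    counts = Counter()
    for row in rows:
        # A set per row means a substring repeated inside one row is counted
        # once, mirroring count(distinct id) in the query.
        counts.update({row[i:i + size] for i in range(len(row) - size + 1)})
    return counts


rows = [
    "KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG",
    "GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD",
    "SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG",
]

counts = five_gram_counts(rows)
best = max(counts.values())
# Collect every substring that shares the top count, not just one of them.
tied = sorted(word for word, n in counts.items() if n == best)
print(best, tied)
```

Because several substrings can share the top count, `top 1` in the query returns only one of them. With records capped at 50 characters, each record contributes at most 46 windows of length 5.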
This doesn't answer your question entirely, but here is an article from a book about advanced search techniques that mentions a user-defined function "LCS" (longest common substring) that might be useful: