I have a table with 1,000,000+ records and I need to find the most common substring that is at least 5 characters long.

Say I have the following records:

KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG
GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD
SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG

I need to write a SQL statement that picks 1114H as the most common substring. How can I do that?

Notes:

  • The substring doesn't have to be in the same position in every record.
  • The substrings should be of length 5.
  • The maximum length of each record is 50 characters.

There is no requirement to find the longest substring, and every substring longer than 5 characters contains a 5-character substring that occurs in at least as many records. So it is enough to check substrings of length 5.
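To illustrate that reasoning, here is a minimal T-SQL sketch that lists the 5-character windows of the 6-character string ' 1114H'. The variable @s and the inline position list P are only there for illustration; they are not part of the question.

```sql
-- Any record that contains the 6-character string also contains both of its
-- 5-character windows, so each window occurs in at least as many records.
declare @s varchar(50) = ' 1114H';

select P.Pos,
       substring(@s, P.Pos, 5) as Word
from (values (1), (2)) as P(Pos)   -- start positions 1 .. len(@s) - 4
order by P.Pos;
```

This returns ' 1114' for position 1 and '1114H' for position 2.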

In the sample data there are three strings that occur three times: _1114H, _1114 and 1114H (_ marks the position of the space).

In this solution, master..spt_values is used in place of a numbers table.

declare @T table
(
  ID int identity,
  Data varchar(50)
)

insert into @T values
('KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG'),
('GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD'),
('SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG')

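-- Slide a 5-character window over every record and count, per window,
-- how many distinct records contain it. top 1 keeps one of the tied winners.
select top 1 substring(T.Data, N.Number, 5) as Word
from @T as T
  -- start positions 1 .. len(Data) - 4, taken from spt_values (type 'P')
  cross apply (select N.Number
               from master..spt_values as N
               where N.type = 'P' and
                     N.number between 1 and len(T.Data)-4) as N
group by substring(T.Data, N.Number, 5)
order by count(distinct T.ID) desc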

Result:

Word
------
 1114
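Because top 1 returns only one of the tied substrings, it can help to see all of them together with their counts. The following is a sketch of a variant, not a tested solution for the full table: it uses top (1) with ties and builds the positions from an inline list of digits instead of master..spt_values. The names Digits, Numbers and Records are my own; the range 1 to 100 relies on the stated maximum length of 50 characters.

```sql
declare @T table
(
  ID int identity,
  Data varchar(50)
);

insert into @T values
('KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG'),
('GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD'),
('SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG');

with Digits as
(
  select V.D
  from (values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) as V(D)
),
Numbers as
(
  -- 1 .. 100, enough for records of at most 50 characters
  select Tens.D * 10 + Ones.D + 1 as Number
  from Digits as Tens
    cross join Digits as Ones
)
select top (1) with ties
       substring(T.Data, N.Number, 5) as Word,
       count(distinct T.ID) as Records   -- number of records containing the window
from @T as T
  inner join Numbers as N
    on N.Number <= len(T.Data) - 4       -- only positions with a full 5-character window
group by substring(T.Data, N.Number, 5)
order by count(distinct T.ID) desc;
```

On the sample data this should list both ' 1114' and '1114H', each found in 3 records.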

This doesn't answer your question completely, but here is an excerpt from a book about advanced search techniques that mentions a user-defined function "LCS" (longest common substring) that might be useful:

http://books.google.com/books?id=wGwVkAt79bEC&pg=PA248&lpg=PA248&dq=sql+full+text+common+substring&source=bl&ots=fveHa8an08&sig=VTWHQDTA6gqSNylY9oR0mPhcP6Y&hl=en&ei=iALcTd_AB-j00gG3iZ3lDw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBoQ6AEwAA#v=onepage&q&f=false