I am looking for a way to optimize this regular expression:

^(?:([^.]+).?){6}_tid

This extracts the sixth field of a dot.separated.string.of.arbitrary.length, up to "_tid".

So if the input looks like this:

mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy.evgen.log.e825_tid431423_0

it will return

e825

Funnily enough, if I simply drop the _tid part and use ^(?:([^.]+).?){6}, I get the performance I was looking for: one to two seconds per million strings checked. With the _tid, it takes up to a few minutes.

Is there a better way of doing this?


EDIT: Ah, I forgot to mention, this is in Apache Pig, so everything has to be within the regex clause.

I'd first split the String on ".", take the sixth part, split that on "_", and take the first part:

s.split("\\.")[5].split("_")[0];

Not tested!
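A minimal, self-contained sketch of the split approach above, using the sample string from the question. The class and method names (SplitExtract, sixthFieldPrefix) are made up for illustration, and returning null for inputs with fewer than six fields is an assumption, not something the question specifies.

```java
class SplitExtract {
    // Returns the text of the sixth dot-separated field up to its first underscore,
    // or null if the input has fewer than six fields (assumed behaviour).
    static String sixthFieldPrefix(String s) {
        String[] fields = s.split("\\."); // "\\." is a literal dot in a Java regex
        if (fields.length < 6) {
            return null;
        }
        return fields[5].split("_")[0];
    }

    public static void main(String[] args) {
        String s = "mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy.evgen.log.e825_tid431423_0";
        System.out.println(sixthFieldPrefix(s)); // prints "e825"
    }
}
```

Note that, unlike the regex, this does not check that the sixth field actually contains "_tid"; it just cuts at the first underscore.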

You forgot to escape the dot; try this:

^(?:([^.]+)\.?){6}_tid

This way your regex has fewer options to try when matching. An unescaped "." matches any character (except line break characters).
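To illustrate the difference between an escaped and an unescaped dot, here is a small Java sketch. The class and method names are hypothetical, and the toy patterns "a.c" and "a\\.c" are chosen only to show the semantics; they are not from the question.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Unescaped dot: matches any single character except line terminators.
    static boolean unescapedMatches(String input) {
        return Pattern.matches("a.c", input);
    }

    // Escaped dot: matches only a literal '.'.
    static boolean escapedMatches(String input) {
        return Pattern.matches("a\\.c", input);
    }

    public static void main(String[] args) {
        System.out.println(unescapedMatches("abc")); // true
        System.out.println(escapedMatches("abc"));   // false
        System.out.println(escapedMatches("a.c"));   // true
    }
}
```

Because the unescaped ".?" can also consume a character that "[^.]+" could have matched, the engine has more ways to split the input between the two, which plausibly explains the extra backtracking when the trailing "_tid" forces retries.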

Another possibility I see is removing the optional dot:

^(?:[^.]+\.){5}([^.]+)_tid

See it here on Regexr
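As a rough check of the pattern with mandatory dots, the following Java sketch applies it to the sample string from the question. The class name SixthFieldRegex and the helper extract are illustrative only, and returning null on no match is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SixthFieldRegex {
    // Five "field." groups, then capture the sixth field up to "_tid".
    private static final Pattern SIXTH =
            Pattern.compile("^(?:[^.]+\\.){5}([^.]+)_tid");

    // Returns the captured sixth-field prefix, or null when the pattern does not match.
    static String extract(String s) {
        Matcher m = SIXTH.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy.evgen.log.e825_tid431423_0";
        System.out.println(extract(s));            // prints "e825"
        System.out.println(extract("a.b.c_tid1")); // prints "null": fewer than six fields
    }
}
```

Since each of the first five fields must end in a literal dot, there is only one way to split the prefix, so a non-matching string should fail quickly instead of triggering long backtracking.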

This one seems to run faster than yours:

^(?:[^.]+\.){5}([^.]+)_tid

This one gives me the best performance results:

    Pattern p = Pattern.compile(".*\\.([^_]+)_tid.*");
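A short usage sketch for that pattern, assuming it is applied with Matcher.matches() against the whole string. The class and helper names are hypothetical, and returning null on no match is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastDotRegex {
    private static final Pattern P = Pattern.compile(".*\\.([^_]+)_tid.*");

    // Returns the text between the last usable dot and "_tid", or null on no match.
    static String extract(String s) {
        Matcher m = P.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy.evgen.log.e825_tid431423_0";
        System.out.println(extract(s));             // prints "e825"
        System.out.println(extract("a.b.x1_tid5")); // prints "x1": fields are not counted
    }
}
```

Be aware that this variant does not count fields: it captures whatever sits between a dot and "_tid", so it is only equivalent to the original regex when the "_tid" part is known to be in the sixth field.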