I am writing a log-analysis application and want to extract Apache log records between two given dates. Assume that the start date is formatted like this: 22/Dec/2009:00:19 (day/month/year:hour:minute)

Currently, I am using a regular expression to replace the month name with its numeric value and strip the separators, so the above date becomes 200912220019, which makes date comparison trivial.. but..

Running a regex on every record of a big file, say one containing a quarter of a million records, is very expensive.. is there any other method that avoids regex substitution?

Thanks in advance

Edit: here's the function doing the conversion/comparison

function dateInRange(t, from, to) {
    sub(/[[]/, "", t);                    # strip the leading "["
    split(t, a, "[/:]");                  # a[1]=day, a[2]=month name, a[3]=year, a[4]=hour, a[5]=minute
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
    a[2] = sprintf("%02d", (RSTART + 2) / 3);   # month name -> zero-padded month number
    s = a[3] a[2] a[1] a[4] a[5];         # yyyymmddHHMM, sortable as a plain string

    return s >= from && s <= to;
}

"from" and "to" would be the times within the aforementioned format, and "t" may be the raw apache log date/time area (e.g [22/12 ,/2009:00:19:36)

I once had the same problem of a very slow AWK program that involved regular expressions. When I converted the whole program to Perl, it ran at much greater speed. I suppose it is because GNU AWK compiles a regular expression every time it interprets the expression, whereas Perl compiles the expression just once.

Here is a Python program I wrote to perform a binary search through a log file based on dates. It could be modified to work for your use case.

It seeks to the middle of the file, then syncs to a newline, reads and compares the date, and repeats the process, splitting the previous half in two, until the date matches (greater than or equal). It then rewinds to make sure there are no earlier lines with the same date, then reads and outputs lines until the end of the desired range. It is very fast.

I have a more elaborate version in the works. Eventually I'll get it finished and post the updated version.

Well, here's an idea, assuming records in a log are ordered by date.

Instead of running a regexp on every line in the file and checking whether that record is in the required range, do a binary search.

Get the total number of lines in the file. Read the line in the middle and check its date. If it is older than your range, then anything before that line can be ignored. Split what remains in two and check the line in the middle again. And so on until you find your range boundaries.
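If the file fits in memory, this line-based version of the idea maps directly onto Python's bisect module (a sketch; it assumes each line begins with the sortable yyyymmddHHMM key discussed in the question):

```python
import bisect

def select_range(lines, frm, to, key_len=12):
    """Return the slice of date-sorted lines whose leading key is in [frm, to]."""
    keys = [line[:key_len] for line in lines]   # e.g. "200912220019"
    start = bisect.bisect_left(keys, frm)       # index of first key >= frm
    stop = bisect.bisect_right(keys, to)        # index of first key > to
    return lines[start:stop]
```

bisect_left/bisect_right give an inclusive range in two O(log n) searches, matching the s >= from && s <= to test from the question.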

Chopping up files just to identify a range sounds a little heavy-handed for such a simple task (binary search is worth considering, though)

here's my modified function, which is clearly considerably faster since the regex is removed

BEGIN {
    months["Jan"] = 1
    months["Feb"] = 2
    months["Mar"] = 3
    months["Apr"] = 4
    months["May"] = 5
    months["Jun"] = 6
    months["Jul"] = 7
    months["Aug"] = 8
    months["Sep"] = 9
    months["Oct"] = 10
    months["Nov"] = 11
    months["Dec"] = 12
}
function dateInRange(t, from, to) {
    t = substr(t, 2);                 # strip the leading "[" without a regex
    split(t, a, "[/:]");              # a[1]=day, a[2]=month name, a[3]=year, a[4]=hour, a[5]=minute
    m = sprintf("%02d", months[a[2]]);
    s = a[3] m a[1] a[4] a[5];        # yyyymmddHHMM
    ok = s >= from && s <= to;
    if (!ok && seen == 1) { exit; }   # past the range: stop reading the file
    return ok;
}

An array is defined once and then used to index months. It also ensures the program does not keep checking records once the date is out of range (the variable seen is set on the first match, elsewhere in the program).
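For completeness, here is one way to wire the regex-free function into a complete filter (a sketch with hypothetical file names; the months array is built in a loop, and the seen early-exit is omitted for brevity):

```shell
# Sample data; in practice access.log is your real Apache log.
printf '%s\n' \
  '1.2.3.4 - - [21/Dec/2009:23:59:00 +0000] "GET /a HTTP/1.0" 200 1' \
  '1.2.3.4 - - [22/Dec/2009:00:19:36 +0000] "GET /b HTTP/1.0" 200 1' > access.log

cat > range.awk <<'EOF'
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= 12; i++) months[names[i]] = i
}
function dateInRange(t, from, to,    a, m, s) {
    t = substr(t, 2)                  # strip the leading "[" without a regex
    split(t, a, "[/:]")
    m = sprintf("%02d", months[a[2]])
    s = a[3] m a[1] a[4] a[5]         # yyyymmddHHMM
    return s >= from && s <= to
}
dateInRange($4, FROM, TO)             # $4 is the Apache timestamp field
EOF

awk -v FROM=200912220000 -v TO=200912229999 -f range.awk access.log
```

The bare expression on the last line of the awk script acts as a pattern with no action, so matching records are printed unchanged.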

Thanks all for your input.