I'm writing a web application that records some 'basic' stats -- page views and unique visitors. I don't like the idea of storing every single view, so I have considered storing totals at an hour/day resolution. For instance, like this:
Tuesday    500 views  200 unique visitors
Wednesday  400 views  210 unique visitors
Thursday   800 views  420 unique visitors
Now, I want to be able to query this data set over chosen periods -- i.e., for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is: how do you determine or estimate unique visitors for any time period without storing every individual hit? Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for each time period!?
I can't seem to find any useful information on the web about this. My initial instinct is that I would need to store two sets of values at different resolutions (i.e. day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
You could store a random subsample of the data, for example 10% of the visitor IDs, and then compare these between days.
The easiest way to do this is to keep a random subsample of each day for future comparisons, but for the current day, temporarily store all of your IDs, compare them to the subsampled historical data, and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full data set for a given day, not comparing two subsamples. It's possible to compare two subsamples and get an estimate for the total, but the math would be a bit trickier.)
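A minimal Python sketch of that comparison, under some assumptions: each past day stores a random subsample of its visitor IDs plus its total unique count, and today's IDs are all held in memory. The function names and the 10% fraction are illustrative only, not part of any library.

```python
import random


def subsample(ids, fraction, rng):
    """Keep each ID independently with probability `fraction`."""
    return {i for i in ids if rng.random() < fraction}


def estimate_unique_two_days(past_sample, past_unique_count, today_ids):
    """Estimate unique visitors across one past day and today.

    past_sample       -- random subsample of the past day's visitor IDs
    past_unique_count -- stored total of unique visitors for that day
    today_ids         -- every visitor ID seen today (kept temporarily)
    """
    today = set(today_ids)
    if not past_sample:
        # Assumption: with nothing to compare against, treat repeats as zero.
        return past_unique_count + len(today)
    # Fraction of the sampled past visitors who came back today.
    repeat_fraction = len(past_sample & today) / len(past_sample)
    # Scale that fraction up to the whole past day.
    estimated_repeats = repeat_fraction * past_unique_count
    return past_unique_count + len(today) - estimated_repeats


# Made-up data: 10,000 visitors yesterday, 10,000 today, 5,000 in common.
rng = random.Random(0)
past_ids = set(range(10_000))
today_ids = set(range(5_000, 15_000))
past_sample = subsample(past_ids, 0.10, rng)
# The true answer is 15,000; the estimate should land near it.
print(estimate_unique_two_days(past_sample, len(past_ids), today_ids))
```

The estimate gets noisier as the sample shrinks, so the sampling fraction is a trade-off between storage and accuracy.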
If you're OK with approximations, I think tom10 is on to something, but his notion of a random subsample is not the right one, or needs clarification. If I have a visitor who comes on day 1 and day 2 but is sampled only on day 2, that is going to introduce a bias in the estimate. What I would do is store full information for a random subsample of visitors (let's say, all visitors whose hash(id)%100 == 1). Then you do the full analysis on the sampled data and multiply by 100. Yes, tom10 said just about that, but there are two differences: he said "for example" sample based on the ID, and I say that's the only way you should sample, because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever, you would sample accordingly.
The quality of the estimate can be assessed using the normal approximation to the binomial if your sample is big enough.
Beyond this, you could try to use a model of user loyalty: for example, you observe that over two days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once, and so on up to a maximum number of days. Unfortunately these numbers can depend on the day of the week and the season, and even after modeling those, loyalty changes over time as the user base matures and changes in composition, and the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.
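Here is a rough Python sketch of the hash-based sampling idea, with a standard error from the normal approximation to the binomial. Everything named here is an assumption made for illustration: the function names, the use of SHA-256 as the stable hash (Python's built-in hash() is randomized per process for strings, so it would not give the same sample from day to day), and the made-up visitor IDs.

```python
import hashlib
import math


def in_sample(visitor_id, buckets=100, bucket=1):
    """True if this visitor belongs to the tracked 1-in-`buckets` sample.

    The decision depends only on the ID, so a visitor is either tracked
    on every day or on none.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == bucket


def estimate_unique(sampled_ids_per_day, buckets=100):
    """Union the stored sampled IDs over a period and scale up.

    Returns (estimate, standard_error).
    """
    seen = set()
    for day_ids in sampled_ids_per_day:
        seen.update(day_ids)
    n = len(seen)
    p = 1 / buckets
    estimate = n * buckets
    # n ~ Binomial(N, p), so Var(n / p) = N (1 - p) / p; plug in N ~ n / p.
    std_err = math.sqrt(n * (1 - p)) / p
    return estimate, std_err


# Made-up traffic: 20,000 distinct visitors over two overlapping days.
day1 = [f"user-{i}" for i in range(12_000)]
day2 = [f"user-{i}" for i in range(8_000, 20_000)]
# Only the sampled IDs would actually be stored for each day.
stored = [[v for v in day if in_sample(v)] for day in (day1, day2)]
estimate, std_err = estimate_unique(stored)
print(f"about {estimate} unique visitors, +/- {1.96 * std_err:.0f} (95%)")
```

Because the same visitors are tracked every day, the union of the stored IDs can be computed for any period, which is what removes the bias described above.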
You don't need to store every single view, just each unique session ID per hour or day, depending on the resolution you need in your stats.
You can keep these log files of session IDs sorted, so that you can count unique visitors quickly by merging multiple hours/days. One file per hour/day, one unique session ID per line.
On *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.
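If the counting has to happen inside the application rather than in a shell, the same merge-and-deduplicate idea can be sketched in Python. This assumes each file is already sorted in plain code-point order (what sort produces under LC_ALL=C) with one session ID per line; count_unique_sorted is a hypothetical helper name.

```python
import heapq
from contextlib import ExitStack
from itertools import groupby


def _ids(handle):
    """Yield the non-empty lines of an open file without the newline."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line


def count_unique_sorted(paths):
    """Count distinct session IDs across several pre-sorted log files.

    The files are merged lazily, so memory use stays small no matter
    how large the logs are.
    """
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, encoding="utf-8")) for p in paths]
        merged = heapq.merge(*(_ids(h) for h in handles))
        # Equal IDs are adjacent after the merge, so each group is one visitor.
        return sum(1 for _key, _group in groupby(merged))
```

Like sort -m, this only gives the right count when every input file is already sorted in the same order.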