Friday, April 6, 2007

Cleaning up data before analysis

Prof. Gauvin from Laval University contacted me to share an interesting challenge for a new study he will be conducting. I'm sharing his inquiry because I think there could be some application in the web analytics field. It also relates, in a way, to my previous post about box and whisker plots, especially when I talked about outliers.

Detecting outliers

Let's say the above chart represents traffic to a site. Something is obviously wrong between time 24 and 28. If we were to draw a box plot, the values over 700 would show up as outliers. As analysts looking at the data, and depending on the context, we might or might not want to exclude those extreme data points from our analysis.
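To make the box plot idea concrete, here is a minimal sketch of the usual "1.5 times the interquartile range" rule that box plots use to draw their whiskers. The traffic numbers are made up for illustration (they are not the data behind the chart), and the function name and the k=1.5 default are my own assumptions.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the box plot fences."""
    # Quartiles of the whole series; "inclusive" treats the data as the full population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical traffic series with a short spike in the middle.
traffic = [300, 320, 310, 305, 330, 315, 900, 950, 920, 310, 325, 300]

for index, value in tukey_outliers(traffic):
    print(f"time {index}: {value} looks like an outlier")
```

Whether those flagged points should actually be dropped is still a judgment call that depends on the context.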

In this example, is the sudden spike a case of server misconfiguration or a software bug? An attack or a bot repeatedly hitting our web site, or maybe the effect of being Digg'ed? Again, the data needs to be put in context in order to provide a good story.

A more complex situation

The other example might be more realistic, but is also much more complex.
Now we have a situation where a trend is suddenly disrupted by something, making the whole base level shift from its regular "State A" to a new level at "State B". We also see some impulses affecting the trend.

The law of relativity

Now imagine we were to do daily analysis as data becomes available. When we reach the 16th data point, it will show up as an outlier. But as we progress toward the 20th data point and beyond, that same point might become valid data. So depending on the data range we use, and even if each analysis is statistically valid, the set of points flagged as outliers might change dramatically. And we don't know what to expect after the 71st data point and beyond... any guess?
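Here is a small sketch of that effect, using the same box plot rule on a growing window. The series is invented for illustration: 15 points around 100, then a jump to around 200 starting at the 16th point. The helper name and the window sizes are assumptions of mine.

```python
import statistics

def is_outlier(window, value, k=1.5):
    """Box plot rule: is `value` outside the fences computed from `window`?"""
    q1, _, q3 = statistics.quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

# Hypothetical series: base level near 100, new level near 200 from the 16th point on.
series = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 102, 98,
          200, 202, 198, 201, 199, 203, 197, 200, 202]

sixteenth = series[15]
for n_seen in (16, 20, 24):
    verdict = is_outlier(series[:n_seen], sixteenth)
    print(f"after {n_seen} points, the 16th point is an outlier: {verdict}")
```

With this made-up data the 16th point is flagged when it first arrives, and stops being flagged once enough points at the new level have accumulated. How quickly the verdict flips depends entirely on the data and on the rule used.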

The challenge

We're wondering if there is a mathematical way to detect state changes and impulses in a data set. And to make it even more complex, how could we use predictive analytics as we move along the data set? Any help would be appreciated.
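As a starting point for discussion, here is one naive way the two problems could be approached. It is only a sketch under my own assumptions, not an answer to Prof. Gauvin's question: impulses are flagged when a point sits far from the median of its neighbours (a Hampel-style check), and a state change is located where the mean of the points before and after differs the most. The function names, window sizes, thresholds and data are all hypothetical.

```python
import statistics

def find_impulses(values, half_window=3, threshold=3.0):
    """Flag indices whose value is far from the median of the surrounding window."""
    flagged = []
    for i, x in enumerate(values):
        window = values[max(0, i - half_window): i + half_window + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        scale = 1.4826 * mad  # MAD rescaled to be comparable to a standard deviation
        # If the window is perfectly flat (scale == 0) we cannot judge, so we skip it.
        if scale > 0 and abs(x - med) > threshold * scale:
            flagged.append(i)
    return flagged

def find_level_shift(values, window=5, min_shift=50.0):
    """Return the index where the 'after' mean differs most from the 'before' mean,
    or None if no difference reaches min_shift."""
    best_index, best_diff = None, 0.0
    for i in range(window, len(values) - window + 1):
        before = statistics.fmean(values[i - window:i])
        after = statistics.fmean(values[i:i + window])
        diff = abs(after - before)
        if diff > best_diff:
            best_index, best_diff = i, diff
    return best_index if best_diff >= min_shift else None

# Hypothetical series: "State A" near 100 with one impulse, then "State B" near 200.
state_a = [100, 102, 98, 101, 99, 103, 97, 160, 102, 98, 101, 99, 100, 102, 98]
state_b = [200, 202, 198, 201, 199, 203, 197, 200, 202, 198, 201, 199, 200, 202, 198]
series = state_a + state_b

print("impulses at:", find_impulses(series))
print("level shift starts at:", find_level_shift(series))
```

This only works in hindsight, since both checks need points on each side of the candidate, so it does not solve the "as we move along the data set" part. The mean-based shift check is also sensitive to large impulses, which is why min_shift has to be chosen with the data in mind.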

P.S. I'm still negotiating with my boss to be able to register for the course on predictive analytics at the eMetrics Summit...