Tuesday, January 16, 2007

Trends and abnormalities

A member of the Web 2.0 Measurement working group raises an interesting question: there are more and more bots and spiders that crawl the web and try to "hide" themselves by faking their user-agent identification. Some of them now execute JavaScript, which makes them smarter and lets them bypass the usual detection built into web analytics solutions to categorize those visits correctly.
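
To make the problem concrete, here is a minimal sketch (in Python, with an illustrative pattern list of my own, not any vendor's actual rule set) of the kind of user-agent check most tools rely on; a crawler that fakes a browser identity and executes JavaScript slips straight past it.

import re

# Illustrative crawler patterns only -- a sketch of the classic user-agent check.
BOT_PATTERNS = re.compile(r"bot|spider|crawl|slurp", re.IGNORECASE)

def looks_like_bot(user_agent):
    """Return True if the user-agent string matches a known crawler pattern."""
    return bool(BOT_PATTERNS.search(user_agent))

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))       # True
# A crawler faking a browser identity is not caught:
print(looks_like_bot("Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0"))  # False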

Since we're in the world of statistics, which, by definition, deals with "randomness and uncertainty modeled by probability theory" (source: Wikipedia), and with what Avinash calls the puzzles and mysteries, I would tend to put this kind of traffic in the "anomalies" category. More often than not we face mysteries, and trying to explain everything and achieve perfect results (like in a puzzle) is usually not reasonable.

The question boils down to the overall significance of this traffic and the "cost" (in terms of detection, human time, computing power, analysis, etc.) of excluding it from our results. If it's not significant (say, less than 5%? 2%?), I would simply not worry about it. We usually look for trends and spot anomalies. If we can explain those anomalies, be it strong emarketing results (good!) or unexpected crawlers (less good!), I think we're in a pretty good position.
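
As a back-of-the-envelope illustration (the visit counts and the 5% cut-off below are made up for the example, not a standard), the decision can be as simple as:

def bot_share(suspect_visits, total_visits):
    """Fraction of visits attributed to suspected crawlers."""
    return float(suspect_visits) / total_visits if total_visits else 0.0

THRESHOLD = 0.05  # the 5% cut-off discussed above -- a judgment call

share = bot_share(suspect_visits=1840, total_visits=97500)
if share < THRESHOLD:
    print("Suspected bot traffic is %.1f%% -- below threshold, not worth the cost." % (share * 100))
else:
    print("Suspected bot traffic is %.1f%% -- significant, explain the anomaly." % (share * 100))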