Friday, May 30, 2008

About bad bots, web analytics and WASP

Simple fact: bad bots can screw up your web analytics data.
Interesting post from Marshall Sponder this morning, himself referencing Jay Harper on SEOmoz and Judah Phillips. So I'll add my two cents from the "bot" side of things, specifically regarding the Web Analytics Solution Profiler (WASP).

No surprise

From a server-side perspective, IT has known since the early days of Yahoo! that crawlers would affect web server logs. If you want some fun historical tidbits, look at these early posts:
"I count the accesses to my page to see if it's being used . Similarly, I browse through the access logs to see _who_'s using the page" then someone replying "This ablity to know *exactly* what someone looks like is going to be very very sigificant down the road." (March 1995: Visitor counts?)
"Yahoo! has links to well over 40,000 different web pages, and over 300,000 people use Yahoo! every day."(August 1995: The *NEW* Yahoo!)
There is very little we can do about badly behaving bots, whether they execute JavaScript or not, but as long as it's not widespread (and even if it is) and you are not the deliberate target of an attack, that shouldn't render your web analytics data useless. Malicious bots don't care about robots.txt exclusion rules, about slowing down your servers, or about screwing up your data. The difference is that more sophisticated bots (crawlers) or session recorders/playback tools can run JavaScript and simulate real user sessions, so they show up in page-tag-based analytics too.
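To make the contrast concrete, here is a minimal sketch (in Python) of what a well-behaved crawler does before fetching a page; the site URL and user agent string are made up, and a malicious bot would simply skip this check altogether.

    # Minimal sketch of a polite crawler honoring robots.txt before fetching.
    # The site URL and user agent string below are hypothetical.
    import urllib.robotparser
    import urllib.request

    SITE = "http://www.example.com"          # hypothetical site
    USER_AGENT = "ExampleBot/1.0"            # hypothetical bot identifier

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(SITE + "/robots.txt")
    robots.read()                            # fetch and parse the exclusion rules

    url = SITE + "/some-page.html"
    if robots.can_fetch(USER_AGENT, url):
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            html = response.read()
    else:
        print("robots.txt disallows", url, "for", USER_AGENT)

Nothing forces a bot to do this; respecting it is purely a matter of the crawler author's goodwill.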

This is not new; it just seems marketing took a long time to find out :)

The web analyst role

Data cleansing and validation are an essential part of the web analyst's job, and whenever there's a spike in traffic you should be able to explain it. As Avinash said a while back: "data quality sucks, let's get over it".

Depending on your experience and skills, and of course on the web analytics solution you are using, it might be fairly easy to identify misbehaving visitors and spot outliers. The next step is to segment the data to exclude what shouldn't be there. Not "delete": "exclude". I recently saw a post suggesting creating Google Analytics filters to completely get rid of non-US traffic for a US-centric website. Don't do that... you still want to know where your unqualified traffic is coming from! Whether it comes from outside your geographic market, from bad keywords or referrers, or from anything else, what you think is "unqualified traffic" can still help you optimize your site and even discover new opportunities. Segment, segment, segment...
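To illustrate "exclude, don't delete", here is a rough sketch that labels hits with a segment instead of throwing them away. The field names, the country check and the bot list are made up for illustration and not taken from any particular tool.

    # Rough sketch of segmenting instead of deleting: keep every hit, but tag
    # the ones that fall outside the target market or look like bot traffic.
    # The field names and bot identifiers are hypothetical.
    KNOWN_BOT_AGENTS = ("ExampleBot", "SomeCrawler")   # hypothetical identifiers

    def segment(hit):
        """Label a hit with a segment instead of dropping it."""
        if any(bot in hit["user_agent"] for bot in KNOWN_BOT_AGENTS):
            return "bot"
        if hit["country"] != "US":
            return "outside-market"        # still useful for spotting new opportunities
        return "qualified"

    hits = [
        {"country": "US", "user_agent": "Mozilla/5.0"},
        {"country": "FR", "user_agent": "Mozilla/5.0"},
        {"country": "US", "user_agent": "ExampleBot/1.0"},
    ]

    for hit in hits:
        hit["segment"] = segment(hit)      # every hit is kept, just labelled

    qualified = [h for h in hits if h["segment"] == "qualified"]

The unqualified segments stay available for reporting; only the "qualified" view is used for conversion analysis.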

Now about WASP

On the other side of the coin, ethics and professionalism (and, I guess, knowledge of how the Web works and experience too) play a big role in how a crawler behaves. For example, the crawling feature of WASP runs JavaScript and could screw up your web analytics data pretty badly. That's why the current version is limited to crawling only 100 pages. However, a test was done on a website with 30,000 pages without any glitch, and WASP is expected to easily handle crawling sites of over 100,000 pages in a single run.

The upcoming version of WASP will include the following options:
  • Abide by the robots.txt rules for excluding areas of your site and reducing the load
  • Modify the user agent string to identify itself as a bot, making it easy to filter out (see the sketch after this list)
  • Show your real IP address before crawling so you can filter that data
  • "Stealth mode", effectively blocking the web analytics request altogether.
Any other ideas?