Wednesday, January 31, 2007

Box plot and whisker plots in Excel 2007

2011-10-05 Update! A new version of this article and a FREE and enhanced Microsoft Excel 2007 example is available in my blog post at Cardinal Path: The math behind web analytics: box plot.

This articles aims to highlight the importance of box plots (or horizontal box, or Whisker plot) in analytics and provides a step by step example to make them in Microsoft Excel 2007.

What are box plots?

A box plot is a visualization method used in statistics. It shows, in a glimpse, several very important elements describing a data sample:
  • the smallest observation
  • the lower quartile (25%)
  • the median (50%)
  • the higher quartile (75%)
  • the largest observation
In addition, a box plot can help visualize two other elements:
  • abnormal data (outliers)
  • average

How can they be used in web analytics?

Boxplots are a quick visualization approach for examining one or more data sets. Because they can easily reveal the limits of acceptable data and any extremes, it becomes very easy to explain trends and abnormalities and communicate the right information.

We have to consider, however, than in term of data visualization, some ways of showing Probability Density Function can also reveal the same information, and more, in a very interesting way. However, drawing them in Excel might be very difficult...

How can I draw my own box plot?

For this example, we will use the fictitious data representing number of Visits to a Web site on a day by day basis, as shown below:

  1. The first thing to do is to sort the values (the Visits column) in ascending order.
  2. Get the following statistical information from your data. Those are the essential ones to build the box plot:
  3. We now need to calculate the width of each box and the lower/higher non-outliers limits:
  4. Select the 25th, 50th and 75th percentile values (yellow cells) and Insert a new 2-D Stacked Bar graph (or press Alt-F1 and then Change Chart Type). Then Switch Row/Colum (under Chart Tools/Design menu). You should get something like this:
  5. Some formatting will get us closer to our goal:
    • Delete the Legend
    • Delete the Horizontal (Value) Axis Major Gridlines
    • Set the 1st box and set it to be totally transparent: right-click, Format Data Series..., Fill, No fill
    • Set the 2nd and 3rd boxes to be transparent with a border: no fill as above, then Border Color, Solid Line
    • Set Vertical axis to cross at maximum value: select the Horizontal (Value) Axis, right-click, Format Axis..., Vertical axis crosses: Maximum axis value
    • Hide the vertical axis bar: select the Vertical (Category) Axis, right-click, Format Axis..., Line Color, No line
    • Set the Category axis label: right-click in the chart, Select Data, pick the Horizontal (Category) Axis labels, click the Edit button and pick the cell containing your data label
  6. Set the lower limit and higher limit ranges
    • Select the first box, click on Chart Tools, Layout. Select Error bars, More error bars options..., pick the Display Direction: Minus, indicate the Error Amount: Custom and click the Specify Value button. Leave the Positive Error Value as is and select the blue cell (Lower limit) for the Negative Error Value.
    • Same as above for the 3rd box, but choose the Display Direction: Plus and pick the Higher limit (green cell) cell for the Positive Error Value.
  7. We're getting close, we will now add the average:
    • Select the cells containing the average label and value (C3:D3). Do a copy (ctrl-c). Simply select the graph and paste this value (ctrl-v) as a new series.
    • It's being added as a new box in our graph. Select it and Change Series Chart Type..., pick the X Y (Scatter) with only markers.
    • This will show the marker, but it won't be positioned correctly in the graph. Remember we switched rows and columns earlier: right-click anywhere in the chart, Select Data... then pick the newly added series (Average). Edit and switch the X and Y values.
    • Pick the left-side Secondary Vertical (Value) Axis and rick-click to Format Axis..., set the Minimum value to -1 and the Maximum to 1. Close the dialog. You can now delete this extra axis scale indicator.
    • The average is now positioned correctly. You might want to change the indicator to a more visible symbol.
  8. Let's add the last touch! We want to show the outliers:
    • Right-click the graph and Select Data...
    • Click the Add button
    • Set the Series name to "Low outliers"
    • For the the Series X, select any value from the sorted data that is lower than the minimum valid range (D2)
    • For the Series Y Values, put ={0} so the new data points are shown correctly on the vertical scale.
    • Use the same technique for the "High outliers" with any value that is larger then the maximum valid range (D7)
  9. Some minor aesthetic adjustments and we're set!

Analysis

  • We see that Visits ranged from about 125 to 250.
  • We see the normal range to be between 165 and about 200.
  • We spot two abnormal values: one too low, probably an indication there was a problem on that day, and one two high, for example a spider or web bot crawling our site.
Comments and suggestions are always appreciated!

Some references

Tuesday, January 30, 2007

February WaW - Quebec city

Ce mois ci, la rencontre aura lieu mardi le 20 février à Ste-Foy afin de donner une chance aux gens de la région de Québec de participer.

La rencontre de la communauté Analytique Web du Québec est un évenement inspiré de la tradition "Web Analytics Wednesday", une rencontre mensuelle, informelle, de gens oeuvrant dans le domaine de l'analytique Web, du design, du SEO et de l'internet en général.

Lieu:
Restaurant Bugatti
3355, rue de la Pérade
Sainte-Foy
418-651-4747

Nous aurons une section privée à partir de 18:00h.

SVP réserver en me faisant parvenir un courriel à l'adresse shamel67(A)gmail.com, en laissant un commentaire ou en ligne

Note: Si vous aimeriez présenter un service ou un produit, être un "sponsor" en offrant une tournée ou un cadeau de présence, ou simplement faire une petite présentation sur le sujet de l'analytique, SVP me contacter.


This month, our meeting will be on Tuesday February 20th in Ste-Foy to give a chance to those in the region of Quebec city to participate.

The meeting of Quebec's Web Analytics community is inspired by the Web Analytics Wednesday tradition, which is an happy gathering of practitioners in the field of web analytics held every month.

Location:
Restaurant Bugatti
3355, rue de la Pérade
Sainte-Foy
418-651-4747

We have a reserved section around 6:00pm.

Please confirm by sending me an email at shamel67(A)gmail.com, by leaving a comment below or online.

Note: If you would like to present a product or service, be a sponsor for a drink or offer a small participation gift, or simply do a small presentation to share your experience on the topic of web analytics, please contact me.

On my way to eMetrics

If everything goes as planned, I will be attending the next eMetrics summit in San-Francisco (May 6-9). This conference is an absolute must for anyone involved in web analytics and covers everything from implementing the web analytics culture within your organization up to specific measurement techniques, optimization tips, marketing, B2C and B2B, etc. Plus, you get to meet the gurus in the field and fellow practitioners!

This event, organized by Jim Sterne's Target Marketing, also includes a special training day from the Web Analytics Association. This is a great opportunity to get some of the excellent courses from the UBC Award of Achievement in Web Analytics.

Not only that, but I'm delighted to say I will be presenting the Web Analytics for site optimization course! This online course will be adapted for in-class presentation, along with examples, exercises and of course, questions and discussions with the attendants.

I have to thank M.Sterne for this opportunity, but also my employer, Desjardins General Insurance Group (DGIG), for allowing me to prepare the course material and attend this event. After all, DGIG was recently named best employer to work for in Quebec and 12th in Canada!

If you are from Quebec and plan to attend the eMetrics Summit, let me know!

Friday, January 26, 2007

Snap Preview Anywhere

This morning I have installed Snap Preview Anywhere code on my blog. It enables you to get a glimpse of the other sites I'm linking to, without having to actually leave my site. By rolling over any link, you get a visual preview of the site, thus eliminating wasted "trips" to linked sites.

I think this cool usability feature might, in fact, influence users to stick longer on my site, increase the overall feel of quality and is just plain cool.

Let me know what you think of this feature!

Thursday, January 25, 2007

Cool Stuff for January

Following a recent discussion with colleagues and a post from Manoj on Web Analytics World, I will summarize a couple of cool discoveries, some quite new, some a bit older but worth taking a look.

Design/Usability

Blogging

  • MyBlogLog: this community service, simply configured trough a couple of JavaScript lines added to your blog template, brings a lot of useful information about your readership and links and affinities between blogs.

Personal

  • LinkedIn: I started using it May 2004, which probably makes me an early adopter of this professional networking tool. The power is in the network, the more people use it, the better it gets.
  • Plaxo: A contact management solution that follows you everywhere. Early on people were suspicious of their privacy protection, but it's getting more widely used. However, I found it automatically establish links with people I only communicated once or very few times.
  • Pandora: the ultimate and coolest music engine! It offers recommendations based on your preferences and results from it's user network.
  • PageFlakes: you know Google Personalized? PageFlakes goes way beyond with a very nice interface, a vast array of widgets to use, and your custom page can be shared.

Culture

  • "Inside the soul of the web": this article from the November 2005 issue of Wired is a fascinating inside look at Google, still worth reading today.
  • "startup.com: the raise and fall of the American dream": check out my review from February 2004 if you want to better understand the craziness of the Internet Bubble era.
  • EPIC: if you have never seen this video about a future where Google and Amazon rule the world, check it out.

Simply cool stuff

  • Amaznode: provides a visualization of the links between books found on Amazon. Pretty cool!
  • TouchGraph: this company offers a tool similar to Amaznode, but is much more powerful (see snapshot). It also offers a visualization tool based on Google relate which is really interesting to find how a site or a topic is being recognized as part of an "ecosystem" by Google.
Cool stuff you want to share? Post a comment!

Wednesday, January 24, 2007

A/B vs multivariate testing

Referring to Wikipedia, multivariate statistical analysis, or multivariate testing (MVT) "describes a collection of procedures which involve observation and analysis of more than one statistical variable at a time". The key is "more than one".

A/B testing generally aims to compare various versions of one statistical element. In it's simplest incarnation, and what most web analytics solutions offers out of the box, A/B testing is based on two time frames.

When we refer to "4 different version's of a website homepage", the element of comparison is the whole page. What if instead we used each sub-elements such as graphics, layouts and text to evaluate the best combination of all those elements to get the winning home page? The first example can't tell if one element of a page is better than another, it tells us if the whole page is better. With the second example, we could find that "text A" combined with "layout B" and "graphics C" is, in fact, much better than "text A" with "layout A" and "graphic A".

Another important difference between A/B and multivariate testing is that MVT provides quantitative data, while A/B testing often a branch of usability testing that also encompass qualitative data. Both quantitative and qualitative information are important to take the best decisions.

Finally, when you want to test two variations of an element, you end up with two tests. Fairly easy! If you want to test three variables (text, layout, graphics), and say each one as two variations, you end with 3^2 tests to do. This quickly gets out of hand as the number of possibilities increases. MVT statistical methods use a smaller number of tests and uses inference analysis to find out which combination of individual elements is "optimal."

The demos at Offermatica are pretty good at simplifying the differences between A/B and multivariate.

Friday, January 19, 2007

January's Montreal WaW

This week was our 4th WaW, 3rd one in Montreal. About a dozen people gathered at Baton Rouge to eat a rack of excellent BBQ ribs, drink a few beers at talk about web analytics. The place was a bit noisy, but this wasn't enough to stop us from talking about our various experiences, some tips and the importance of web analytics in Web (re-) design, continuous improvement, and as a demonstration of Web ROI.

Being informal as its advantages, which allowed us to digress about intranets and picture management solutions. But I take good note of having a minimal agenda, thanks Benoit! :)

We will reach to the Association de Marketing de Montreal (AMM), the Federation d'Informatique du Quebec (FiQ), as well as the Association of Internet Marketing & Sales and the Web Analytics Association to set up a thematic event about web analytics. However, I will have to see how much effort I can dedicate to the organization of this event.

As always, your suggestions, critics and comments are welcome and I look forward to the next WaW. In February it will take place in Quebec city, so spread the word, let your colleagues know about WaW!

Tuesday, January 16, 2007

Trends and abnormalities

One of the Web 2.0 Measurement working group raises an interesting questions: there are more and more bots and spiders that crawl the web and try to "hide" themselves by faking their user agent identification. Some of them now executes JavaScript, making them more intelligent and bypassing the usual detections built in web analytics solutions to categorize those visits correctly.

Since we're in the world of statistics, which, by definition, includes "randomness and uncertainty modeled by probability theory" (source: Wikipedia) and what Avinash calls the puzzles and mysteries, I would tend to put this kind of traffic in the "anomalies" category. We more than often face mysteries, and trying to explain and achieve perfect results (like in a puzzle) is usually not reasonable.

The question bares down to the overall significance of this traffic and the "cost" (in terms of detection, human time, computing power, analysis, etc.) to exclude it from our results. If it's not significant (say, less than 5%? 2%?), I would just not worry with it. We usually look for trends and spot "anomalies". If we can explain those abnormalities, be it strong emarketing results (good!) or unexpected crawlers (less good!), I think we're in a pretty good position.

Friday, January 12, 2007

Coming WaW Montreal on the 17th

I'm very excited because the coming Web Analytics Wednesday, on the 17th, will certainly be the largest to date. Word of mouth is picking up and I have about 15 confirmations and another 7 that might attend.

What's really good is the quality and representation of the various companies in the local web analytics community: some practitionners, some customers, some solution vendors might attend.

Altought they are not all confirmed, here's a short list of those who have expressed interest in being there:
Web analytics users:
Web Analytics Association MemberIf you know other people who might be interested in web analytics, behavior analysis, design and the web in general, pass the word!

Tuesday, January 9, 2007

The challenges of JavaScript Tagging

The post "Web Analytics Technical Implementation Best Practices. (JavaScript Tags)" from Avinash is again right on the spot.
  1. Tag all your pages.
  2. Tags go last (customers come first :).
  3. Tags should be inline.
  4. What’s your unique page definition?
  5. Use cookies intelligently (they are delicious).
  6. Javascript wrapped links might be a issue.
  7. Redirects, be aware of them.
  8. Validate data is being captured correctly.
  9. Don’t forget Flash, Flex, RIA, RSS, Videos etc.
For item #8, there are various techniques to check if the tags are correct. IEWatch and Fiddler are great on MSIE, Firefox has a number of extensions that covers cookies, live HTTP headers and such.

In my experience implementing Omniture, WSS, Coremetrics, GA and others, I found that tweaking and checking if the tracking code matches our requirements might sometimes become a significant part of the technical implementation effort.

That's why I've been working on WASP, the Web Analytics Solution Profiler, a Firefox extension aimed at web analytics implementation specialists, web analysts and savvy web surfers who wants to understand how their behavior is being analyzed.

The early beta brought good feedback and I'm close to releasing beta 0.2, where I nailed down a couple of glitches and improved the detection algorithm, as well as added other WA solutions. I'm also working on other nifty elements such as P3P information and probably Alexa info.

Monday, January 8, 2007

Web Analytics Wednesday - January 2007

Our Web Analytics Wednesday will be exceptionally held on the 3rd Wednesday of January, that is, January 17th, in Montréal. This month get together will be at Bâton Rouge, on Ste-Catherine street (Complexe Desjardins) (Map) around 6PM. A table will be reserved to the name of M.Stéphane Hamel.



So far, 13 confirmations, 4 "maybe".

Web Analytics Wednesday is the world's only distributed networking event for web analytics professionals. Open to everyone, practitioners and vendors alike, Web Analytics Wednesday is a free event allowing you to meet folks with similar work interests.

Please RSVP at by emailing me (Stéphane Hamel).

Wednesday, January 3, 2007

Skip, skim and engage

Today, while browsing my aggregated feed of nearly 50 conversations I noticed Google Reader introduced a new feature called "personalized trends".

Measuring "engagement"

Recent discussions about Web 2.0 measurement touched this very topic: how to measure engagement?

Google's attempt is a good start, but it doesn't tell something very important: when I use Google Reader, if I don't want to view the same posts over and over, I have to tell Google Reader to "mark it as read". This also happens automatically if I stay focused on a post for a couple of seconds. This ends up showing "% Read" of nearly 100% in most cases (see right). Of course, I don't actually "read" everything, I barely "skim" over most posts, stumbling on something I find of interest from time to time. So we face, as on most Web sites, three types of engagements (or attention): skip, skim (or scan), read (or focus).

Skip

As the word implies, we go right over the item (be it a blog post, an RSS item or an element of a page). The attention is diverted to something else in a very short amount of time. This could be the result of a badly chosen title, a good title that doesn't cater to my interests or my current goal, or the presence of another item screaming at me for attention. Clearly, there is no engagement and thus, very little value.

Skim

This happen all the time. While Googling, we fiercely scan the results. While reading a blog, we scroll down a specific article to look for something interesting when looking at a video on Youtube, we watch a few seconds and decide to continue or stop right away, etc. When I use Google Reader, I do that all the time naturally, since articles from various sources appears one below the other. There is no real engagement yet, it will depend on my specific goal and my ability to achieve it. The value might not be immediate.

Engage

On the three dimensions of attention defined by Davenport, this would be "attractive" (as opposed to "aversive"), "voluntary" (as opposed to "captive") and "front-of-mind" (as opposed to "back-of-mind". I'm truly "engaged" and there is a clear value, usually for both parties.

Tuesday, January 2, 2007

My year end review

As the tradition goes, it's now time to post about the past year events.

Web Analytics blogosphere

  • I've been blogging since 2002, but it's only in 2006 that I really took time to build up my presence in the blogosphere, blogging mostly about Web Analytics. I only got my Google Analytics account July, but we can easily see two things:
    1. Overall traffic is on a growing trend
    2. The huge spike is the Digg effect on my post entitled "Google growing larger than Microsoft"
  • The other topics that raised interest is WASP, the Web Analytics Solution Profiler, a Firefox extension aimed at web analytics implementation specialists, web analysts and savvy web surfers who wants to understand how their behavior is being analyzed.
  • I also took time to get involved in the Web Analytics community:
    • I organized 3 Web Analytics Wednesdays to foster interests in the local web analytics community, more to come in 2007 in Montréal and Québec city!
    • The Google Co-Op specialized search encompass over 100 specialized sources: blogs, solutions, references
    • I aggregated over 45 blog feeds for easier tracking of the Web Analytics blogosphere
  • I also got involved in the Web 2.0 measurement group, and some of the Web Analytics committees, as well as Forrester's Web Analytics Peer Research Panel, but my attention span is getting more limited and I don't spend as much time as I would like.

Personal views on 2006

What to look for in 2007

  • WASP will definitely by my main point of interest
  • I also want to continue to post about Web Analytics in general, including the translation and revamp of two papers I wrote in the course of my MBA