Basic charting in Excel
The very first thing to do is to show the data as a simple line graph. For this post, I simply used visits to my blog in January of 2011. After some minor visual adjustments we end up with something like this:Figure a: simple Excel charting |
Figure b: time series (visits) |
Mean
The mean [wikipedia] is often referred to as the "average", which, in reality, is the "arithmetic mean". This is very simple math: add all the numbers and divide by the number of data points.
Look at Figure c - what can you tell about the red line crossing the whole graph? In a time series like daily visits for a month, honestly... we can't tell much! Yet, only averages are reported by most web analytics tools - so please, don't even bother saying "the average number of visits this month was X"!
Learning point: The average is rarely a good indicator in a time series such as those found in web analytics because it is influenced by extreme values (known as outliers [wikipedia]). At best, in the case above, one might want to calculate the mean for weekdays and the mean for weekends. As a rule of thumb, if you have less than 30 data points, use the median.
Look at Figure c - what can you tell about the red line crossing the whole graph? In a time series like daily visits for a month, honestly... we can't tell much! Yet, only averages are reported by most web analytics tools - so please, don't even bother saying "the average number of visits this month was X"!
Figure c: showing mean, trend, min & max and control limits. |
Figure d: descriptive statistics |
Median and mode
The median [wikipedia] is the middle value. The mode [wikipedia], on the other end, is the value appearing the most frequently. Again, in a time series, where the spread of values (the standard deviation explained below) is large, those descriptive statistics [wikipedia] (Figure d) are usually of little interest.Min & max
The min and max values are... well.. the maximum and minimum values in a time series. Those could be qualified as "anecdotes" - we could be thrilled we've got so much traffic on a single day, or deceived by a poorly performing day, but knowing that has absolutely no value if we can't explain why.In the time series used in this example, the min value is 93 visits on Saturday, January 1st. What can we tell about that? Obviously, people were busy doing something else than visiting my blog. What happened during the 4th week, around January 25? I shared my views about our little web analytics community and recounted my contributions. In both cases, we have very plausible explanations and the min & max values were useful only because they made us ask "why?".
Trend
To me, the linear trend [wikipedia] (shown as a dotted line in Figure b) is one of the interesting modeling stats because it marks the begining of our regression analysis [wikipedia] capabilities - our ability to explain the why's and "this, therefore that". Basically, it can help us do some predictive analytics (albeit very simple). Remember y = mx + b? That is, the position of a point on the y axis (the visits) depends on a factor of x (the day) plus a starting baseline. I can tell, based on historical data, that I should get approximately 350 visits next Tuesday.Standard deviation
If we do max - min we get the range [wikipedia], another descriptive statistic. Interesting at best. What's much more interesting is the standard deviation [wikipedia] - the variability of the data. As we've seen, the average isn't of much use because it is largely influenced by outliers. Standard deviation gives an appreciation of the spread of values around the mean, or if you prefer, the variation in a distribution of values.Why is this important?
Figure e: control limits at +/- 1.5 sigma |
Secondly, a large variation is an indication of an unstable process (think conversion rate), or low reproductibitiliy (anecdotal campaign success), or if you prefer, a larger standard deviation reduces our ability to predict the value of Y given a certain X. Basically, as analysts, we want to explain the past, but we also want to provide insight on how to fix issues and seize opportunities - we eventually want to be able to predict outcomes of our recommendations.
Coming up: normal distribution, histogram and box-plots
In the next installment we'll look at what is an histogram as well as normal distribution and their impact on our analysis. Also, although nifty spinning 3D-shadowed-shiny-Flash graphs are impressive... we'll look at box plots elegant simplicity yet powerful and under-used visualization tool.What do you think of this series so far? What would you like to see discussed or any examples you would like to see?