Thursday, May 21, 2015

You are never too young (or too old) to be thinking about data visualization!

Over the last couple of weeks, I have been working with my son on his first science fair project. It has been lots of fun working with him to develop a question, a prediction, and a meaningful experimental design. Collecting the data was also great - as you can (hopefully) see in the pictures below, we were investigating how rubber balls bounce at different temperatures. When it came time to "write" up his results, we decided that the best approach would be to plot all his data (using stickers to represent the heights of the first bounce of balls dropped from a height of 2 m). While that may seem obvious, many scientists have a great aversion to plotting raw data. It would be far more likely to see a "professional" version of this experiment present its results using bar plots of mean values (with either SE, SD, or 95% CI error bars). This is very unfortunate, as bar plots forsake a great deal of useful visual information about the distribution of the data. By coincidence, this was also (one of) the take-home message(s) of a recent paper:
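To illustrate what a bar plot of means can hide, here is a minimal sketch (in Python with NumPy; the data and group labels are invented for illustration): two hypothetical groups with nearly identical means - and thus nearly identical bars - but radically different distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two invented groups with the same mean "bounce height" (cm):
# one tight and symmetric, one strongly bimodal.
symmetric = rng.normal(loc=100, scale=2, size=50)
bimodal = np.concatenate([rng.normal(80, 2, 25), rng.normal(120, 2, 25)])

# A bar plot of means would draw these two groups at the same height,
# even though the raw data tell very different stories.
for name, data in [("symmetric", symmetric), ("bimodal", bimodal)]:
    mean = data.mean()
    sd = data.std(ddof=1)
    print(f"{name}: mean = {mean:.1f} cm, SD = {sd:.1f} cm")
```

Plotting the raw points (or a boxplot) immediately reveals the two clusters in the second group; identical bars of means would not, and even SD error bars would only hint that something was off.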

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biology, 13(4): e1002128.

that we chose to read in this week's Long Lab journal club. Overall, I thought the authors did a commendable job, and it is evident from the paper's metadata that their message is reaching a large audience. While I am in favour of anything that turns the tide against bar plots, I do wish they had given boxplots as much publicity as the univariate scatterplots that were heavily featured in the manuscript. I suspect this emphasis arose because the studies in the literature they were surveying (physiology) tended to have small sample sizes. According to the authors, "the minimum and maximum sample sizes for any group shown in a figure ... were 4... and 10 respectively". These results are presented in panel C of supplemental figure S2.*

I have nothing against univariate scatterplots. In fact, for small sample sizes (say, <30 elements/group), directly plotting the data reveals a great deal about its distribution. However, beyond a certain point the usefulness of this approach starts to wane, as there is more and more overlap among points. In such cases, a boxplot is a more desirable solution. Not only is it aesthetically clean, it also clearly conveys meaningful visual information to the reader about the centrality, skew, and spread of the data. *I suspect that is why, when Weissgerber et al. presented the data from their own survey of hundreds of figures, they did so using a boxplot.
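For what it's worth, the five-number summary that a boxplot draws can be sketched in a few lines (Python with NumPy; the right-skewed sample here is invented, and the 1.5 × IQR whisker rule is the common Tukey convention):

```python
import numpy as np

rng = np.random.default_rng(1)
# An invented right-skewed sample, n = 500 -- far too many points
# for a readable univariate scatterplot.
sample = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# The five-number summary a boxplot encodes:
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
inliers = sample[(sample >= q1 - 1.5 * iqr) & (sample <= q3 + 1.5 * iqr)]
whiskers = (inliers.min(), inliers.max())  # Tukey's 1.5 * IQR rule
outliers = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]

print(f"median = {median:.2f}, IQR = ({q1:.2f}, {q3:.2f})")
# Skew is visible at a glance: the upper quartile sits farther from
# the median than the lower quartile does.
print(f"(q3 - median) - (median - q1) = {(q3 - median) - (median - q1):.2f}")
print(f"{len(outliers)} points flagged beyond the whiskers")
```

Five numbers (plus flagged outliers) replace 500 overlapping dots, while still showing the reader where the middle of the data sits and which way it leans.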

"let me tell you about the wonders of data visualization"

Tuesday, April 14, 2015

Statistically, it's mayhem*

Below is a letter to the editor I recently wrote in response to potential flaws in the analysis that forms the basis of the Waterloo Region Record's recent article "Police call records reveal region's trouble hot spots", which can be read here ->

One of the first things that I emphasize to the students in my biostatistics class at Wilfrid Laurier University is that statistics are a powerful tool. Used carefully and properly, statistics can provide valuable insight into the factors that shape the world around us - but used or interpreted incorrectly, statistics can potentially lead to conclusions that are unjustified or altogether incorrect. Your recent "analysis" of police call data seems to fall into the latter category, due to problems both with your data set and with the conclusions drawn from it.

First, let's consider your data set. Of the ~903,000 calls in your initial data set, almost half were excluded from the analysis for a variety of reasons. Whenever data are dropped, there is a strong possibility that what remains is a non-random (and thus biased) sample. Furthermore, the remaining data points "do not measure crime" (as belatedly stated in the 30th inch of the story) - but instead capture a wide variety of incidents (including "enforcement of traffic laws" and "attend at collisions") that are not necessarily linked to the residents of that region. It should go without saying that if your data do not contain variables that are relevant to the question, then the conclusions drawn from them will be suspect.

Using this questionable data set, you draw the conclusion that "the poorer the zone, the more often people call police and the more time police spend there, responding to distress" without any consideration of potentially confounding factors. There are potentially dozens of other factors besides average household income that differ between the patrol zones and that may ultimately be responsible for the observed patterns. For instance, a cursory search on Google Maps seems to indicate that the regions with the highest frequencies of calls to the police also have a greater density of Tim Hortons locations - but you would not (hopefully) conclude that their presence is responsible for "where trouble lives".

Generations of statisticians have warned that "correlation does not imply causation", but that message seems to have been ignored in the construction of this article, to the detriment of your readership. 


Tristan A.F. Long

*The title for this post is taken from one of the hyperbolic statements made in the article. I think that, ironically, this statement is an apt description of the statistics used in the analysis.