Tuesday, April 14, 2015

Statistically, it's mayhem*

Below is a letter to the editor I recently wrote in response to potential flaws in the analysis that forms the basis of the Waterloo Region Record's recent article "Police call records reveal region's trouble hot spots", which can be read here: http://goo.gl/DDQEg0

One of the first things that I emphasize to the students in my biostatistics class at Wilfrid Laurier University is that statistics are a powerful tool. Used carefully and properly, statistics can provide valuable insight into the factors that shape the world around us - but used or interpreted incorrectly, they can lead to conclusions that are unjustified or altogether wrong. Your recent "analysis" of police call data seems to fall into the latter category, owing to problems both with your data set and with the conclusions drawn from it.

First, let's consider your data set. Of the ~903,000 calls in your initial data set, almost half were excluded from the analysis for a variety of reasons. Whenever data are dropped, there is a strong possibility that what remains is a non-random (and thus biased) sample. Furthermore, the remaining data points "do not measure crime" (as belatedly stated in the 30th inch of the story) - instead, they capture a wide variety of incidents (including "enforcement of traffic laws" and "attend at collisions") that are not necessarily linked to the residents of a given region. It should go without saying that if your data do not contain variables that are relevant to the question, then the conclusions drawn from them will be suspect.

Using this questionable data set, the conclusion that "the poorer the zone, the more often people call police and the more time police spend there, responding to distress" is drawn without any consideration of potentially confounding effects. There are potentially dozens of other factors besides average household income that differ between the patrol zones and that may ultimately be responsible for the observed patterns. For instance, a cursory search on Google Maps suggests that the regions with the highest frequencies of calls to the police also have a greater density of Tim Hortons locations - but you would not (hopefully) conclude that their presence is responsible for "where trouble lives".

Generations of statisticians have warned that "correlation does not imply causation", but that message seems to have been ignored in the construction of this article, to the detriment of your readership. 

Sincerely,

Tristan A.F. Long

*The title for this post is taken from one of the hyperbolic statements made in the article. I think that, ironically, this statement is an apt description of the statistics used in the analysis.


Sunday, September 28, 2014

Long Lab (summer fun edition)

L->R: Arnold Mak, David Filice, Katie Plagens, Tristan Long, Thiropa Balasubramaniam, Mireille Golemiec, Emily Martin

BONUS: GIF!

How to (truly) randomly assign mating treatments - an elegant R approach

This week in the lab we'll be setting up some experiments using our 45 inbred lines of flies (lines were inbred by mating single male-female pairs derived from the IV line, and subsequently selecting a single brother and sister from the resulting offspring to found the next generation; this process was repeated for >10 generations).

In the experiment we want to randomly pair males and females from different lines, which in R seems pretty simple, as you can use the code

inbred.lines <- 1:45
random.mates <- sample(inbred.lines, 45, replace = FALSE)
combos <- cbind(inbred.lines, random.mates)

This will (most of the time) end up randomly pairing males and females from different lines...
...However, as this is a random process, there is a chance that R might choose pairs of males and females from the same line. This may not seem like a big issue, as you could use a simple logical test such as

inbred.lines==random.mates

to make sure that there are no matches, and re-run the random sampling if any TRUE values are returned by the last command, until all pairs are different.

This "brute force" approach is OK, I guess, but becomes much less efficient if we want to place two (or more) males, each selected from a different line in with females. Now we might get a TRUE value if we had a match between the female and "male 1" ,  a match between the female and "male 2", or  a match between "male 2" and "male 1". You can imagine how this problem can get more difficult as the number of combinations increases with the number of males and females in each vial (see Handshake Problem).

So, as I could not find a solution online, I have developed some R code to quickly and elegantly solve this problem.
Let's begin as above

inbred.lines <- 1:45

Here is my solution

repeat {
  treatmentBB <- sample(inbred.lines, 45)
  if (!any(treatmentBB == inbred.lines)) break
}

As you can see in this code, I am creating a new vector "treatmentBB" that I populate with 45 randomly sampled numbers (without replacement) from the inbred.lines vector. The next step is to ask R to check whether any of the rows match. If they do, then we ask R to start all over again; if there are no matches, treatmentBB is left alone.

Now let's expand this to see if we wanted to add a second (random) male into each treatment, by creating a column of values called treatmentE1. As you can see below, I have taken into account potential matches between females and males from treatmentBB, between females and males from treatmentE1, and between males from treatmentE1 and males from treatmentBB.

repeat {
  treatmentE1 <- sample(inbred.lines, 45)
  if (!any(treatmentE1 == inbred.lines | treatmentE1 == treatmentBB)) break
}

and if I wanted to add a third, I would need to make sure that all possible matches are accounted for...

repeat {
  treatmentE2 <- sample(inbred.lines, 45)
  if (!any(treatmentE2 == inbred.lines | treatmentE2 == treatmentBB | treatmentE2 == treatmentE1)) break
}
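If you wanted to generalize this to any number of males per vial, the same rejection-sampling logic can be wrapped in a small helper function. This is just a sketch of my own (the function name and arguments are illustrative, not part of the code above):

random.treatments <- function(lines, k) {
  # start with the female's line in the first column
  combos <- matrix(lines, ncol = 1)
  for (j in 1:k) {
    repeat {
      draw <- sample(lines)
      # compare the new draw against every existing column, row by row;
      # accept it only if no row would contain a repeated line
      if (all(rowSums(combos == draw) == 0)) break
    }
    combos <- cbind(combos, draw)
  }
  colnames(combos) <- c("female", paste0("male", 1:k))
  combos
}

For example, random.treatments(inbred.lines, 3) returns a 45 x 4 table in which no vial contains two flies from the same line.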

Hope you find this useful!
TL


Wednesday, June 25, 2014

Setting up for a hemiclone assay




Several of the assays we are conducting in the lab this summer examine the genetic basis of female mating behaviours. There are many ways to do these assays, but (in my opinion) one of the best ways to measure standing genetic variation is using hemiclonal analysis. Both Mireille and Arnold (above) have been learning the ropes of this technique, and this week set up the first (of many) assays.

The first step involves mating clone males (which carry a randomly-sampled haploid genome, and a set of translocated autosomal chromosomes) with many wild-type virgin females collected from one of our outbred populations. These mated flies are placed into half-pint collectors outfitted with 35mm (diameter) Petri dishes containing a grape-juice/agar medium for no more than 18h.

During that time, eggs are laid on the surface (see above).
As fruit fly development can be strongly influenced by variation in larval density, it is essential that our assay vials are standardized. This is done by counting exact numbers of eggs (under the microscope, using a paintbrush)...
...which are then gently cut from the surface of the juice "cookie"....
...and transferred to vials containing ~10ml of our lab's standard banana Drosophila media!
Due to the nature of the cross between the clone males and the wild-type females, there is 50% mortality (due to chromosomal imbalances). Thus, for every 100 eggs we transfer into a vial, only ~50 will hatch into larvae. To further establish standard developmental conditions in the vials, we add an additional 50 eggs from one of our other populations, which carries a recessive brown-eyed marker, yielding a total of ~100 larvae per vial (this mimics the typical developmental conditions experienced in our lab populations).
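As a quick sanity check of that arithmetic in R, using only the numbers quoted above:

hemi.eggs <- 100   # hemiclone eggs transferred per vial
survival  <- 0.5   # ~50% of hemiclone eggs die from chromosomal imbalances
bw.eggs   <- 50    # brown-eyed competitor eggs added per vial
hemi.eggs * survival + bw.eggs   # expected larvae per vial: ~100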

Now that our vials are set up, we'll incubate them, and then collect hemiclonal (red-eyed) females as they eclose starting 9d later. Check back for updates!

Sunday, June 22, 2014

More great news!

I'm a little slow to post this, but last week was also Spring Convocation here at Laurier. It was great to see all the lab alumni (Ashley Guncay, Leah DeJong and Conor Delar) walk across the stage. After the ceremony, I had the honour of presenting Leah DeJong with the 2013-2014 Rick Playle Award for Best Thesis! Congratulations Leah!
Incidentally, Leah is the third student from the lab to win this award (following Erin Sonser in 2012-2013, and Jessie Young in 2011-2012)!

Bac(teria) to the Future


This week, the Long Lab was very busy examining the relationship between immunocompetence and parental characteristics (using fruit flies as a model organism). This work, which began in the Fall with Ashley Guncay (co-supervised with Dr. Joel Weadge), is being continued by two of our lab's summer students, Katie Plagens & Thiropa Balasubramaniam (seen here with their flies).

So far this summer, Thiropa & Katie have been busy, first learning the ropes of fly pushing, then running behavioural assays and setting up immunocompetence assays. All of this has led up to the events of this week - a massive bacterial plating assay. It was hard work (but very rewarding), and would not have been possible without the concerted effort of everyone in the lab (as well as many kind expert volunteers from the Weadge lab)!
Briefly, from each vial of flies (top), 5 (virgin) individuals were collected. They were then homogenized in LB media to release their bacteria into solution. The next step is to create a serial dilution of this solution (transferring 10ul in succession into vials containing 90ul of LB media). These dilutions are then plated onto Petri dishes (below), which are checked the following day for bacterial growth. The number of colonies on each plate (below) is counted, and from this we can estimate the bacterial load of the original flies.
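To make that back-calculation explicit, here is a toy example in R. The numbers are made up for illustration (they are not data from our assay), and each 10ul-into-90ul transfer is a 1:10 dilution:

colonies    <- 42        # hypothetical colony count on one plate
n.dilutions <- 3         # plate came from the 3rd tube in the 1:10 series
plated.vol  <- 10/1000   # assumed plating volume, in ml (10 ul)
(colonies / plated.vol) * 10^n.dilutions   # estimated CFU per ml of homogenate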

Ready for action!


Sunday, June 1, 2014

How to create a multi-panelled line graph with error bars in ggplot2

#Below you will find some code that I cobbled together from various sources to create a multi-panelled line graph with error bars

#first I need to load the file and the ggplot2 package
library(ggplot2)
#FYI I am using ggplot2 v 0.8.9 on R 2.13.1 (I know, I should upgrade!)
BI393<-read.csv(file.choose())
#if you look at the file
# <https://www.dropbox.com/s/39vhky49xu3v4l6/393_FILE.csv>
# you will see the mean (and sd), and median scores on a standardized student survey for each of the last 3 years of my BI393 Biostatistics class (useful data for preparing a tenure application file). As the QUESTION column is a bit long & unwieldy, I will force it to break every 70th character (this helps with the formatting of the strip text in ggplot)
BI393$groupwrap = unlist(lapply(strwrap(BI393$QUESTION, width=70, simplify=FALSE), paste, collapse="\n"))

#now, to business: I want to plot the mean scores for each question by year (note that each line below ends with "+" so that R reads the whole chain as a single expression)
printing <- ggplot(BI393, aes(x=as.factor(YEAR), y=MEAN, colour=QUESTION)) +
  #add on some error bars
  geom_errorbar(aes(ymin=MEAN-SD, ymax=MEAN+SD), width=.1) +
  #connect the points with colour-coded lines
  geom_line(aes(group=QUESTION)) +
  #and make the points clear to see
  geom_point(shape=21, fill="white") +
  #adjust the range of the y-axis, and customize the x-label
  ylim(1,7.5) + xlab("Academic Year") +
  #change the background colours
  theme_bw() +
  #wrap the scores by the questions
  facet_wrap(~ groupwrap, ncol=3) +
  #and get rid of a pesky legend (as the question is in the strip text)
  opts(legend.position="none") +
  #make an informative y-axis label
  ylab(expression(paste("BI393 Scale Grade (Mean" %+-% "1SD)")))

#and now print it off (ggsave opens and closes its own graphics device, so no dev.off() is needed)
ggsave(printing, file="396.png", width=10, height=10)
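One caveat if you are on a newer ggplot2 than the v0.8.9 used above: opts() was deprecated in the 0.9.x releases in favour of theme(), so on a current installation the legend line would instead be written as:

theme(legend.position = "none")   # modern replacement for opts(legend.position="none")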