Pennsylvania's Motorcycle Experiment
In 2003, the Pennsylvania legislature repealed the state's motorcycle helmet law, creating a unique opportunity for data exploration. Has repealing the law led to an increase in motorcycle fatalities?
The Pennsylvania Department of Transportation (PennDOT) provides publicly available crash data from 1997 through 2016, tracking every reported crash and attempting to detail the factors involved. Let's look at what those numbers show.
From 1997 to 2003, before the helmet law repeal, an average of 129 people died in motorcycle crashes per year. After the repeal, from 2004 to 2016, that average rose about 54% to 199 fatalities per year.
But those numbers can be misleading. Registered motorcycles in Pennsylvania also increased by about 65% during that same time period, so fatalities per 1,000 registered motorcycles have stayed fairly flat since 1997. Watch the video below to learn more about that trend.
Use the graph below to see how fatalities have changed over the years, and how many involved drivers wearing a helmet or not. You can click on a box to turn that variable on and off. Note: Before the helmet law repeal in 2003, PennDOT did not regularly track whether drivers in motorcycle accidents were wearing helmets.
But still, dozens more people have died each year since the helmet law was lifted. Some riders value the freedom to choose whether to wear a helmet. But emergency room doctors and first responders are often confronted with the consequences of those decisions.
You can explore the rate of motorcycle fatalities across Pennsylvania by clicking on a county in the map below.
Many bikers blame distracted driving for the increase in deaths. But an analysis of PennDOT crash statistics shows only 4-5% of fatal crashes involve distracted driving, and that number hasn't changed significantly since 1997. Is distracted driving underreported? Maybe, but about 12% of all crashes in Pennsylvania involve distracted driving, which is consistent with nationwide estimates from the National Highway Traffic Safety Administration.
But if distracted driving isn't a factor in fatal motorcycle crashes, what is?
To answer that question, I looked more closely at the data.
PennDOT's crash datasets contain more than 100 variables tracking information like vehicle type, injury level and driver age. Many of these are recorded as yes/no values, each with its own column within a large spreadsheet.
Using a Pearson's chi-squared test, I analyzed which yes/no variables in the dataset were most strongly related to fatal motorcycle crashes. Whether there was a fatality in the crash is one of the spreadsheet columns, so I used that as my target variable. That means the test looks at each of the other crash factors to see whether it is related to a fatal crash involving a motorcycle.
I removed a few factors that would not provide much additional information. For example, the fact that there was an injury in an accident is highly correlated to a fatal accident, but it does not explain the circumstances of the crash, so I removed it before the analysis. I asked the test to return five variables, which allows for several factors to be selected while narrowing the data.
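For readers who want to try this kind of selection themselves, here is a minimal Python sketch of the idea. It is not the code behind this analysis: the data is a made-up set of eight crashes, the column names are hypothetical, and the function simply computes Pearson's chi-squared statistic for each yes/no column against the fatal column, then keeps the top scorers.

```python
from collections import Counter


def chi_squared_2x2(feature, target):
    """Pearson's chi-squared statistic for two yes/no (0/1) columns."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    stat = 0.0
    for f in (0, 1):
        for t in (0, 1):
            row_total = sum(observed[(f, x)] for x in (0, 1))
            col_total = sum(observed[(x, t)] for x in (0, 1))
            expected = row_total * col_total / n
            if expected:  # skip cells that cannot occur
                stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


def top_k_factors(columns, target, k):
    """Rank factor columns by chi-squared score against the target; keep k."""
    scores = {name: chi_squared_2x2(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up toy data: 8 crashes, 1 = yes, 0 = no. Column names are hypothetical.
fatal = [1, 1, 1, 0, 0, 0, 0, 0]
crash_factors = {
    "alcohol_related": [1, 1, 1, 0, 0, 0, 0, 0],  # moves in step with fatal
    "wet_road":        [1, 0, 1, 0, 1, 0, 1, 0],  # roughly unrelated to fatal
    "phantom_vehicle": [0, 0, 0, 1, 1, 1, 0, 1],  # mostly the opposite of fatal
}

selected = top_k_factors(crash_factors, fatal, k=2)
print(selected)
```

Note that the chi-squared score only says how strongly a factor is related to the target, not in which direction. In this toy data, both the column that matches the fatal column and the column that mostly opposes it score highly.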
Correlation tests check to see if variables are related as opposed to being truly independent. Items can be positively or negatively correlated. Correlation is measured on a scale of 1 to -1 (known as a correlation coefficient). A coefficient of 1 means variables are fully positively correlated and -1 means they're fully negatively correlated. If a variable is positively correlated, that means as that factor increases, the target variable increases. In negative correlation, as a factor increases, the target variable decreases. Let's take a look at how that applies to my analysis results.
After looking at more than 96,000 crashes in a 20-year period, I found one positively correlated factor:
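To make the scale concrete, here is a small Python sketch that computes a Pearson correlation coefficient for yes/no columns. The columns are made up for illustration and are not PennDOT data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up yes/no columns (1 = yes, 0 = no), not PennDOT data.
fatal    = [1, 1, 0, 0, 1, 0]
same     = [1, 1, 0, 0, 1, 0]   # identical to fatal
opposite = [0, 0, 1, 1, 0, 1]   # always the reverse of fatal
mixed    = [1, 0, 0, 1, 1, 0]   # agrees with fatal in 4 of 6 rows

print(pearson_r(fatal, same))
print(pearson_r(fatal, opposite))
print(round(pearson_r(fatal, mixed), 3))
```

The identical column scores 1, the reversed column scores -1, and the column that only partly agrees lands in between at about 0.33.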
Alcohol related: At least one driver or pedestrian involved in the crash had reported or suspected alcohol use.
That means, for example, that as drunk driving increases, fatal motorcycle crashes increase.
And these four variables were negatively correlated with fatal crashes:
Drugged driver: At least one driver involved in the crash had drug use reported or suspected.
65 mph speed limit zone: The crash happened in a stretch of road with a 65 mph speed limit. This is the only speed limit variable in the dataset, and indicates a high-speed crash.
Phantom vehicle involved: A phantom vehicle is one that was not struck, but whose actions contributed to a crash. Think of a car pulling out into an intersection that causes another driver to swerve and crash.
Hazardous truck involved: At least one heavy truck carrying hazardous material was involved in the crash.
Since these are negative correlations, a crash in a 65 mph zone is less likely to involve a motorcycle fatality than a crash elsewhere. A similar relationship applies to all four variables. So, for example, as fatal motorcycle crashes increase, drugged driver crashes should decrease slightly.
These correlations can be visualized using a correlation matrix.
Reading a correlation matrix
The matrix shows the strength of correlation through color density and circle size. The darker the color and the bigger the circle, the more correlated the variables are. It's a visual representation of the correlation coefficient, with red representing negative coefficients and blue showing positive ones.
The matrix doesn't just show the relationship between the target variable (FATAL) and the selected variables, but also how each variable is related to the others. For example, alcohol related and phantom vehicle have a strong negative correlation, while alcohol related and drugged driver have a weak positive correlation, meaning that an increase in drugged driving would go along with an increase in alcohol-related fatal motorcycle crashes.
Let's look at the correlation coefficients for these factors. In my analysis, all the correlation numbers were fairly weak, with the highest positive correlation with fatal crashes being alcohol related. But the correlation is only 0.14 out of a possible score of 1. The highest negatively correlated variable is phantom vehicle with a coefficient of -0.298 out of a possible score of -1.
So what can these variables tell us?
Another way to analyze this selected group of data is through a test called a logistic regression. Logistic regressions allow us to classify binary data (in this case, was it a fatal accident? 0 means no, 1 means yes). We can check the importance of these variables by seeing how well they predict that an accident will be fatal.
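Here is a bare-bones Python sketch of what a logistic regression does under the hood. It is a teaching example, not the model used in this analysis: the eight crashes are made up, only two hypothetical factors are used, and the fitting is plain gradient descent rather than a statistics package.

```python
from math import exp


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1 / (1 + exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by plain batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (fatal) when the modeled probability is at least 0.5."""
    return int(sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5)


# Made-up toy data. Each row is (alcohol_related, phantom_vehicle); 1 = yes.
rows = [(1, 0), (1, 0), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (0, 1)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
predictions = [predict(weights, bias, x) for x in rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(weights, bias, accuracy)
```

In this toy data, the model learns a positive weight for the alcohol column and a negative weight for the phantom vehicle column. Accuracy here is measured on the same rows the model was fit on, which flatters it; a real analysis would hold out some crashes for testing.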
These five variables together can predict if a motorcycle accident is a fatality with more than 90% accuracy. That accuracy level might not be enough if you're making parts for an airplane engine, but it does allow us to learn more about the factors in a fatal motorcycle crash. No one variable has a strong relation to fatalities, but the five together paint a picture of what is involved (or not involved) in fatal motorcycle accidents.
The chart below shows those results. It's called a confusion matrix, and shows how the logistic model performed. The blue boxes show times the model predicted that an accident would be fatal or not fatal based on our five variables and was correct. The red boxes show the times it was incorrect. For example, the top right box shows false positives. That is, the model predicted an accident would be fatal based on the criteria, but the data showed that the crash was not fatal. The graph shows that the model is very good at predicting when an accident won't be fatal, but not as good at accurately predicting fatalities.
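The four boxes of a confusion matrix are simple counts. The following Python sketch tallies them for ten made-up crashes (not the model's real output) and shows how a model can look accurate overall while still missing most fatalities.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_negative": counts[(0, 0)],
        "false_positive": counts[(0, 1)],
        "false_negative": counts[(1, 0)],
        "true_positive": counts[(1, 1)],
    }


# Made-up labels for ten crashes: 1 = fatal, 0 = not fatal.
actual    = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["true_positive"] + cm["true_negative"]) / len(actual)
# Recall: share of the truly fatal crashes that the model caught.
recall = cm["true_positive"] / (cm["true_positive"] + cm["false_negative"])
print(cm, accuracy, round(recall, 2))
```

In this example the model is right on 7 of 10 crashes, but it catches only 1 of the 3 fatal ones. When non-fatal crashes far outnumber fatal ones, overall accuracy mostly reflects how well the model handles the non-fatal cases.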
But do we need all the variables chosen?
I asked the initial test to choose five variables, so that's what was returned, but some variables might play a stronger role in our logistic regression than others. We can run another test to see the importance that our model has placed on each factor.
The importance can be affected by the number of instances of a variable, how strong the correlation is or how much the data varies. There are only 33 motorcycle crashes involving a hazardous truck, so its overall importance is low. Drugged driver had a very weak negative correlation, but the instances do not vary much, so it is the second most important factor in the model.
So what do we get if we run a logistic regression using just the two most important variables?
The model using just two factors, drugs and alcohol, is actually slightly more accurate than the one that uses five. It is even better at predicting non-fatalities but worse at predicting fatal crashes. The fatal crash predictions dropped from 451 in the previous model to 429. So depending on what you need the model to do, it might be better to stay with the one using five variables.
So how can Pennsylvania prevent more motorcycle fatalities? Changing the level of any of the variables would change their importance in relation to fatalities, but it doesn't mean overall deaths would decrease. And since alcohol and drugged driving are the two most important variables, those would seem to be the ones to focus on.
But drugged driving has a negative correlation to fatal crashes. Should we ask people to drive under the influence of drugs more since that would ostensibly lead to fewer motorcycle fatalities? Not exactly. The correlation between drugs and fatalities is weak, and such a prescription seems irresponsible, so that leaves us to focus on alcohol.
More than 8,500 crashes in the dataset are alcohol related, by far the largest of our five factors. According to the correlation statistics, lowering that number should lower fatalities. Continuing efforts to curb drunk driving can save motorcyclists' lives.
There's more to dig into with helmets and how they can save lives. PennDOT also keeps crash statistics based on helmet types. Do full face helmets decrease fatalities more than smaller helmets? Does whether the helmet is PennDOT certified make a difference?
I'm also interested in seeing if the make and model the motorcyclist is riding is related to accidents. Harley-Davidson riders sometimes frown on riders of smaller sport bikes. PennDOT has make and model data that would make it possible to explore this.
I would also like to use the chi-squared test to explore the variables in all vehicle accidents. I had trouble running such a large dataset, but in looking at all vehicle crashes in York County, alcohol and drugs were not a factor. Aggressive driving and older drivers, however, were factors. Do those factors change in a statewide analysis?
There are many relationships between these factors to explore. One interesting similarity a coworker and I found while working with motorcycle data is the strong correlation between counties in Pennsylvania that had the most registered motorcycles and counties that voted for Trump in the 2016 election.