I came across an article in PNAS (the Proceedings of the National Academy of Sciences) with the catchy title ‘Female Hurricanes are deadlier than male hurricanes’. It is doing the rounds in the international media, with the explicit conclusion that our society suffers from gender bias because it does not sufficiently urge precautions when a hurricane gets a female name. Intrigued, and skeptical from the outset, I made the effort of looking up the article and took a closer look at the statistical analysis. I can safely say that the editor and the referees were asleep for this one as they let through a real shocker. The gist of the story is that female hurricanes are no deadlier than male ones. Below, I pick the statistics of this paper apart.
The authors support their pretty strong claims mainly on the basis of historical analyses of the death toll of 96 hurricanes in the US since 1950 and partially on the basis of hypotheticals asked of 109 respondents to an online survey. Let’s leave the hypotheticals aside, since the respondents for that one are neither representative nor facing a real situation, and look at the actual evidence on female versus male hurricanes.
One problem is that the hurricanes before 1979 were all given female names as the naming conventions changed after 1978 so that we got alternating names. Since hurricanes have become less deadly as people have become better at surviving them over time, this artificially makes the death toll of the female ones larger than the male ones. In their ‘statistical analyses’ the authors do not, however, control adequately for this, except in end-notes where they reveal most of their results become insignificant when they split the sample into a before and an after period. For the combined data though, the raw correlation between the masculinity in the names and the death toll is of the same order as the raw correlation between the death toll and the number of years ago that the hurricane occurred (ie, 0.1). Hence the effects of gender and years are indeed likely to come from the same underlying improvement in safety over time.
Using the data of the authors, I calculate that the average hurricane before 1979 killed 27 people, whilst the average one after 1978 killed 16, with the female ones killing 17 per hurricane and the male ones killing 15.3 per hurricane, a very small and completely insignificant difference. In fact, if I count ‘Frances’ as a male hurricane instead of a female one, because its ‘masculinity index’ is smack in the middle between male and female, then male and female hurricanes after 1978 are exactly equally deadly with an average death toll of 16.
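To make the composition effect concrete, here is a minimal sketch of the arithmetic. The two averages (27 before 1979, 16 after 1978 for both genders) are the ones I calculated above; the hurricane counts per group are purely hypothetical, picked only so that they sum to 96, and the function name `pooled_mean` is my own.

```python
def pooled_mean(groups):
    """Weighted mean over (count, mean_deaths) pairs."""
    total_count = sum(count for count, _ in groups)
    total_deaths = sum(count * mean for count, mean in groups)
    return total_deaths / total_count

# Averages taken from the text above.
PRE_1979_MEAN = 27.0    # all hurricanes before 1979 had female names
POST_1978_MEAN = 16.0   # same average for female and male names after 1978

# HYPOTHETICAL group sizes, for illustration only (not taken from the paper).
n_pre_female = 40
n_post_female = 30
n_post_male = 26

# Pooling the two eras mixes the deadlier early period into the female average only.
female_pooled = pooled_mean([(n_pre_female, PRE_1979_MEAN),
                             (n_post_female, POST_1978_MEAN)])
male_pooled = pooled_mean([(n_post_male, POST_1978_MEAN)])
gap = female_pooled - male_pooled

print(f"pooled female mean: {female_pooled:.1f}")
print(f"pooled male mean:   {male_pooled:.1f}")
print(f"gap created by era composition alone: {gap:.1f}")
```

Whatever positive count one plugs in for the pre-1979 group, the pooled female average ends up above 16 even though names make no difference within the later era. The gap is produced entirely by which era the hurricanes come from.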
It gets worse. Even without taking account of the fact that the male hurricanes are new ones, the authors do not in fact find an unequivocal effect at all. They run 2 different specifications that allow for the naming of the hurricanes and in neither do they actually find an effect unequivocally in the ‘right direction’ (their Table S3).
In their first, simple specification, the authors allow for effects of the severity of a hurricane in the form of the minimum air pressure (the lower, the more severe the hurricane) and the economic damage (the higher, the more severe the hurricane). Conditional on those two, they find an insignificant effect of the naming of the hurricanes!
Undeterred and seemingly hell-bent to get a strong result, the authors then add two interaction terms between the masculinity of the name of the hurricane and both the economic damage and the air pressure. The interaction term with the economic damage goes the way the authors want, ie hurricanes with both more economic damage and more feminine names have higher death tolls than hurricanes with less damage and male names. That is what their media release is based on, and their main text makes a ‘prediction graph’ out of that interaction term.
What is completely undiscussed in the main text of the article however is that the interaction with the minimum air pressure goes the opposite way: the lower the air pressure, the lower the death toll from a more feminine-named hurricane! So if the authors had made a ‘prediction graph’ showing the predicted death toll for more feminine hurricanes when the hurricanes had lower or higher air pressures, they would have shown that the worse the hurricane, the lower the death toll if the hurricane had a female name!
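A short piece of algebra shows why the two interaction terms cannot be read separately. This is only a sketch in my own notation, assuming a log-linear count regression; the symbols F (femininity of the name), P (minimum air pressure), D (economic damage) and the β's are my labels, not the authors'.

```latex
% Second specification, written as a log-linear model (assumed form):
\log E[\text{deaths}] = \beta_0 + \beta_1 F + \beta_2 P + \beta_3 D
                        + \beta_4 (F \times P) + \beta_5 (F \times D)

% Effect of a more feminine name on the (log) predicted death toll:
\frac{\partial \log E[\text{deaths}]}{\partial F} = \beta_1 + \beta_4 P + \beta_5 D

% Signs as described in the text: \beta_5 > 0 (damage interaction)
% and \beta_4 > 0 (lower pressure lowers the toll for feminine names).
```

A severe hurricane has both high D and low P, so the term with β5 pushes the name effect up while the term with β4 pushes it down. A prediction graph that varies only D therefore shows just one half of what the model says about severe hurricanes.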
The editors and the referees were thus completely asleep for this pretty blatant act of deception-by-statistics. Apparently, one can hoodwink the editors of PNAS by combining the following tricks: add correlated interaction terms to a regression of which one discusses only the coefficients that fit the story one wants to sell; then make a separate graph out of the parameter one needs in the main text, whilst putting technical-sounding information in parentheses to throw editors, reviewers, and readers off the scent.
And the hoodwinking in this case is not small either. In order to accentuate what really is a non-result, the authors in the main text claim that “changing a severe hurricane’s name from Charley (MFI=2.889, 14.87 deaths) to Eloise (MFI=8.944, 41.45 deaths) could nearly triple its death toll.” This, whilst in the years since 1979 the average death toll for their included hurricanes is 16 for both ‘female hurricanes’ and ‘male hurricanes’ (own calculations)! The authors conveniently forgot to mention in their dramatic result that Charley would have had to have been a hurricane that did immense economic damage but that had a very high minimum air pressure, ie was actually a very weak hurricane. Only for such an ‘impossible hurricane’ would their own model predict the increase in deaths from a female name. Put differently, I could have claimed that if the hurricane was very strong in terms of low air pressure, changing the name from Charley to Eloise would have halved the death toll!
The authors also quite willingly pretend to have found things they have not in fact researched. They thus write “Feminine-named hurricanes (vs. masculine-named hurricanes) cause significantly more deaths, apparently because they lead to a lower perceived risk and consequently less preparedness” and the conclusions even speak of “gender biases”! Where do they try and measure this supposed bias in actual preparations? You guessed it, nowhere. PNAS should really clean up its act and not allow this sort of article, with its fairly blatant statistical artefacts, to slip through the cracks.
Let me explain the trickery in a bit more depth for the interested reader: air pressure and economic damage are highly related (the correlation is apparently -0.56), which means that one gets a strongly significant interaction between femininity and economic damage only because one simultaneously has added the interaction with minimum air pressure. One then talks about the interaction that goes the way one wants and happily neglects to mention the other one. And one needs both interactions at the same time to get the desired result on the interaction between the names and economic damage: without this interaction with minimum air pressure, what you get is a whole shift upwards of the male death prediction and a loss of significance on the interaction term with economic damage. You see this in the ‘additional analyses’ run by the authors, in very small font after the conclusions, wherein the whole thing becomes insignificant for the first period and the reduced coefficient for the later period on the interaction with air pressure coincides with a halving of the coefficient on the interaction with economic damage as well. Hence, without including both interactions you would probably get that the female hurricanes are predicted to be less deadly than the male ones when the economic damage is small and more deadly when the damage is large (to an insignificant extent). So you need the interaction that is almost invisible in the main text and the conclusions to ‘get’ the result that the headlines are based on.
There is another, even more insidious trick played in this article. You see, with only 96 hurricanes to play with, which really only includes 26 to 27 ‘male’ hurricanes, the authors are asking rather a lot from their data in that they want to estimate 5 parameter coefficients, three of which are based on names. If you then only use a simple indicator for whether or not a hurricane has a male name, you have the problem that you don’t have enough variation to get significance on anything.
So what did the authors do? Ingeniously, they decided to increase the variation in their names by having people judge just how ‘masculine’ their names were. Hence many of the ‘female’ hurricanes were ‘re-badged’ as ‘somewhat male hurricanes’. So the female hurricanes of the pre-1979 era had an average “masculinity index” of 8.42, whilst those of the new post-1979 era had an average of 9.01. Simply put, according to the authors the female hurricanes ‘of old’, which were of course more deadly as they occurred earlier, were also more masculine, contributing to the headline ‘results’.
Supposedly masculine female names included “Ione”, “Beulah”, and “Babe”. And who judges whether these are masculine names? Why, apparently this was done by 9 ‘independent coders’, by which one presumes the authors meant colleagues sitting in the staff room of their university in 2013! Now, even supposing that they were independent, one cannot help but notice that the coders will have been relatively unaware of the naming conventions in the 1950s and 1960s. How is someone born in 1970 sitting in a staff room in 2013 supposed to judge how ‘masculine’ the name ‘Ione’ was perceived to be in 1950? These older names probably just sounded unusual and hence got rated as ‘more probably male’. Similarly, it is beyond me why ‘Hugo’ would be rated as less masculine than ‘Jerry’ or ‘Juan’.
The authors’ own end-notes called ‘additional analyses’ indeed show that you get insignificant results without this additional variation begotten from making the names continuous. So the authors need to fiddle with the names of the hurricanes, pool two eras together whilst not controlling for era, and add two strongly correlated and opposing interaction terms in the same analyses to get the results they want. It is what economists refer to as ‘torturing the data until it confesses’.
Finally, for the observant, there is the following anomaly telling you something about the judgements made in this research: the masculinity of names is judged on a 1 to 11 scale (only integers) by 9 raters. Yet the averages reported in the authors’ appendices include such values as 1.9444444 (Isaac) and 9.1666666 (Ophelia). Note that if it were true that there were indeed nine raters, then all values should be an exact multiple of one-ninth, ie 0.11111111. The discrepancy indicates that either there were not always nine raters, or else that not all coded values were integers (an impossibility according to the main text). The 9.16666 for instance is a multiple of one-sixth and thus suggests only 6 raters were used for ‘Ophelia’. The 1.9444444 is a multiple of one-eighteenth, suggesting that there were twice as many raters for ‘Isaac’. Alternatively, in both cases, there were nine raters but one of the nine raters picked two values simultaneously (one even and one uneven) and thus added 0.055555 to a multiple of one-ninth in the displayed average. It is not a big thing as this kind of judgement is made all the time but I can’t find the footnote that owns up to this in the paper.
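For anyone who wants to check this arithmetic, here is a minimal sketch. It assumes every rater gives an integer score and all scores are weighted equally; the function name `smallest_rater_count`, the cap of 30 raters and the tolerance (needed because the reported averages are cut off after seven decimals) are my own choices.

```python
def smallest_rater_count(mean, max_raters=30, tol=1e-6):
    """Smallest number of integer ratings whose average could equal `mean`.

    Returns None if no count up to `max_raters` reproduces the value
    within `tol`.
    """
    for n in range(1, max_raters + 1):
        # Nearest average achievable with n integer ratings.
        nearest = round(mean * n) / n
        if abs(nearest - mean) <= tol:
            return n
    return None

# Averages quoted above from the authors' appendices.
reported = {"Isaac": 1.9444444, "Ophelia": 9.1666666}

for name, value in reported.items():
    n = smallest_rater_count(value)
    print(f"{name}: {value} needs at least {n} integer ratings (9 claimed)")
```

The function reports the smallest count only, so a result of 6 is also compatible with 12 or 18 raters. What matters is that neither value can come from exactly nine integer scores.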