I also argued that the rating scales of fact checkers like PolitiFact and The Fact Checker are valuable, but they conflate soundness and validity, which causes their ratings to be vague. As usual, I syndicated the post on the Daily Kos. Kossack Ima Pseudonym provided valuable constructive criticism, which we'll consider today.
The aptly titled comment by Ima Pseudonym was,
This is Internet commenting at its best: constructive, well-reasoned, and mainly correct. Let's address the comment point by point.
"Validity is a nice standard for mathematics and logic but it is not often found in public discourse."
I can't agree more. This unfortunate fact should not, however, discourage us from specifying and enumerating the logical fallacies that public figures commit. It should encourage us to do so, as it has encouraged the establishment of the fact checking industry.
"Even scientific conclusions are rarely (if ever) backed by valid reasoning as they typically rely on induction or inference to the best explanation."
I agree that scientists stray from valid (and sound) argumentation more often than they should. I do not, however, agree that scientists rarely if ever make sound or valid arguments. I also agree that scientists often use inductive reasoning. Scientists will continue to do so as Bayesian statistical methods proliferate. I do not, however, agree that inductive inference is immune to the assessment of soundness and, by inclusion, validity.
Inductive inference is probabilistic. For instance, a statistical syllogism (following Wikipedia's example) could go,
90% of humans are right-handed.
Joe is a human.
Therefore, the probability that Joe is right-handed is 90% (therefore, if we are required to guess [one way or the other] we will choose "right-handed" in the absence of any other evidence).
You can assess the validity of this statistical syllogism by considering whether the steps in the argument follow logically from one another. You can assess its soundness by furthermore considering whether its premises are true. Are 90% of humans right-handed? Is Joe a human? Inductive logic is still logic.
"Not every claim is an argument. An argument must offer evidence intended to support a conclusion. I can claim 'I am hungry' without thereby offering any sort of argument (valid, inductive, fallacious or otherwise) in support of that claim. One cannot test the validity of a single proposition."
I agree that not every claim is an argument, either in the formal or informal sense. Every claim is, however, a premise. In such cases, we can simply determine whether or not the premise is true. Furthermore, many claims that fact checkers care about imply or support an informal (or even formal or legal) argument. In such cases, you can assess the implied informal argument's validity. Lastly, in any case where a public figure makes a claim that ties vaguely to an informal argument, that public figure deserves to be criticized for committing the ambiguity fallacy. Many politicians commit the ambiguity fallacy. As much as possible, we should call them on it whenever they do it.
"No need to check for 'both' soundness and validity. If you check for soundness, then you have already checked for validity as part of that. Perhaps you meant to say you would check for both truth of basic premises and validity of reasoning."
Correct. To be sound, an argument must be valid. What I should have said is that fact checkers conflate truth with validity.
"It depends a bit on which notion of fallacy you are working with, but arguments can fail to be valid without committing a common named fallacy. A far simpler check for validity is simply to find counterexamples to the reasoning (logically possible examples in which the basic premises of the argument are all true and in which the conclusion of the argument is false)."
I hope that Ima Pseudonym will elaborate on the logical counterexample part of this statement. If it's a viable shortcut, I'm all for it. That said, I suspect that there are many logical fallacies that do not yet have a name. Perhaps Malark-O-Meter's future army of logicians will name the unnamed!
Thank you again, Ima Pseudonym. Your move if you wish to continue playing. I like this game because you play it well. I encourage constructive criticism from you and all of Malark-O-Meter's readers. Cry 'Reason,' and let slip the dogs of logic.
Whether or not fact checkers wield it as a "moral weapon", they certainly use the "language of truth and falsehood", and some of them attempt to define "
categories". This is most true for PolitiFact and The Fact Checker, which give clear-cut, categorical rulings on the statements that they cover, and whose rulings currently form the basis of Malark-O-Meter's malarkey score, which rates the average factuality of individuals and groups.
The language of truth and falsehood does "invoke ideas of journalists as almost scientific fact-finders." But it isn't just the language of truth and falsehood that bestows upon the art of fact checking an air of science. Journalists who specialize in fact checking do many things that scientists do (but not always). They usually cover falsifiable claims, flicking a wink into Karl Popper's posthumous cup of tiddlies. They always formulate questions and hypotheses about the factuality of the claims that they cover. They usually test their hypotheses against empirical evidence rather than unsubstantiated opinion.
Yet fact checkers ignore a lot of the scientific method. For instance, they don't replicate (then again, neither do many scientists). Moreover, fact checkers like PolitiFact and The Fact Checker use rating scales that link only indirectly and quite incompletely to the logic of a claim. To illustrate, observe PolitiFact's description of its Truth-O-Meter scale.
Sometimes, fact checkers specify in the essay component of their coverage the logical fallacies that a claim perpetrates. Yet neither the Truth-O-Meter scale nor The Fact Checker's Pinocchio scale specifies which logical fallacies were committed or how many. Instead, PolitiFact and The Fact Checker use a discrete, ordinal scale that combines accuracy in the sense of correctness with completeness in the sense of clarity.
By obscuring the reasons why something is false, these ruling scales make it easy to derive factuality metrics like the malarkey score, but difficult to interpret what those metrics mean. More importantly, PolitiFact and The Fact Checker make themselves vulnerable to the criticism that their truth ratings are subject to ideological biases because...well...because they are. Their apparent vagueness makes them so. Does this make the Truth-O-Meter and Pinocchio scales worthless? Probably not. But we can do better. Here's how.
When evaluating an argument (all claims are arguments, even if they are political sound bites), determine if it is sound. To be sound, all of an argument's premises must be true, and the argument must be valid. To be true, a premise must adhere to the empirical evidence. To be valid, an argument must commit no logical fallacies.
The problem is that the ruling scales of fact checkers conflate soundness and validity. The solution is to stop doing that.
When and if Malark-O-Meter grows into a fact checking entity, it will experiment with rating scales that specify and enumerate logical fallacies. It will assess both the soundness and the validity of an argument. I have an idea of how to implement this on the web that is so good, I don't want to give it away just yet.
There are thousands of years of formal logic research that stretch into the modern age. Hell, philosophy PhD Gary N. Curtis publishes an annotated and interactive taxonomic tree of logical fallacies on the web.
Stay tuned to Malark-O-Meter, where I'm staging a fact check revolution.
There's a lot of talk this week about Marco Rubio, who is already being vetted as a possible front runner in the 2016 presidential campaign...in 2012...right after the 2012 presidential campaign. In answer to the conservatives' giddiness about the Senator from Florida, liberals have been looking for ways to steal Rubio's...er...storm clouds on the horizon that could lead to potential thunder maybe in a few years? I dunno. Anyway, one example of this odd little skirmish involves a comment that Senator Rubio made in answer to a GQ interviewer's question about the age of the Earth.
" say my fellow liberals (
). Ross Douthat, conservative blogger at the New York Times (among other places), argues that it was a "politician's answer" to a politically contentious question, but rightly asks why Rubio answered in a way that fuels the "conservatives vs. science" trope that Douthat admits has a basis in reality. Douthat writes that Rubio could have said instead:
So why didn't Rubio say that instead of suggesting wrongly, and at odds with overwhelming scientific consensus, that the age of the Earth is one of the greatest mysteries?
A more important question for the fact checking industry, which Malark-O-Meter studies and draws on to measure politicians' factuality, is why statements like this aren't featured in fact checking reports. The answer probably has something to do with one issue Rubio raised in his answer to GQ, and something that pops up in Douthat's wishful revision.
"I think the age of the universe has zero to do with how our economy is going to grow." (Rubio)
"...I'm not running for school board..." (Douthat)
You can easily associate these statements with a key constraint of the fact checking industry. As Glenn Kessler stated in a recent panel discussion about the fact checking industry, fact checkers are biased toward newsworthy claims that have broad appeal (PolitiFact's growing state-level fact checking effort notwithstanding). Most Americans care about the economy right now, and few Americans have ever thought scientific literacy was the most important political issue. Fact checkers play to the audience on what most people think are the most important issues of the day. I could not find one fact checked statement that a politician made about evolution or climate change that wasn't either a track record of Obama's campaign promises, or an assessment of how well a politician's statements and actions adhere to their previous positions on these issues.
What does the fact checker bias toward newsworthiness mean for Malark-O-Meter's statistical analyses of politicians' factuality? Because fact checkers aren't that interested in politicians' statements about things like biology and cosmology, the malarkey score isn't going to tell you much about how well politicians adhere to the facts on those issues. Does that mean biology, cosmology, and other sciences aren't important? Does that mean that a politician's scientific literacy doesn't impact the soundness of their legislation?
The scientific literacy of politicians is salient to whether they support particular policies on greenhouse gas reduction, or stem cell research, or education, or, yes, the economy. After all, although economics is a soft science, it's still a science. And if you watched the recent extended debate between Rubio and Jon Stewart on the Daily Show, and you also read the Congressional Research Report that debunks the trickle down hypothesis, and you've read the evidence that we'd need a lot of economic growth to solve the debt problem, you'd recognize that some of Rubio's positions on how to solve our country's economic problems do not align well with the empirical evidence.
But does that mean that Rubio is full of malarkey? According to his Truth-O-Meter report card alone, no. The mean of his simulated malarkey score distribution is 45, and we can be 95% certain that, if we sampled another incomplete report card with the same number of Marco Rubio's statements, his measured malarkey score would be between 35 and 56. Not bad. By comparison, Obama, the least full of malarkey among the 2012 presidential candidates, has a mean simulated malarkey score of 44 based on his Truth-O-Meter report card, and his score is 95% likely to fall between 41 and 47. The odds that Rubio's malarkey score is greater than Obama's are only 3 to 2, and the difference between their malarkey score distributions averages only one percentage point.
How would a more exhaustive fact checking of Rubio's scientifically relevant statements influence his malarkey score? I don't know. Is this an indictment of truthfulness metrics like the ones that Malark-O-Meter calculates? Not necessarily. It does suggest, however, that Malark-O-Meter should look for ways to modify its methods to account for the newsworthiness bias of fact checkers.
ever come to fruition, I'd like it to be at the forefront of the following changes to the fact checker industry:
Measure the size and direction of association between the topics that fact checkers cover, the issues that Americans currently think are most important, and the stuff that politicians say.
Develop a factuality metric for each topic (this would require us to identify the topic(s) relevant to a particular statement).
Incorporate (and create) more fact checker sites that provide information about a politician's positions on topics that are underrepresented by the fact checker industry. For example, one could use a Truth-O-Meter-like scale to rate the positions that individuals have on scientific topics, which are often available at sites like
So it isn't that problems like these bring the whole idea of factuality metrics into question. It's just that the limitations of the fact checker data instruct us about how we might correct for them with statistical methods, and with new fact checking methods. Follow Malark-O-Meter and tell all your friends about it so that maybe we can one day aid that process.
Malark-O-Meter's mission is to statistically analyze fact checker rulings to make comparative judgments about the factuality of politicians, and to measure our uncertainty in those judgments. Malark-O-Meter's methods, however, have a serious problem. To borrow terms made popular by Nate Silver's new book, Malark-O-Meter isn't yet good at distinguishing the signal from the noise. Moreover, we can't even distinguish one signal from another. I know. It sucks. But I'm just being honest. Without honestly appraising how well Malark-O-Meter fulfills its mission, there's no way to improve its methods.
Note: if you aren't familiar with how Malark-O-Meter works, I suggest you visit the
The signals that we can't distinguish from one another are the real differences in factuality between individuals and groups, versus the potential ideological biases of fact checkers. For example, I've shown that Malark-O-Meter's analysis of the 2012 presidential election could lead you to believe either that Romney is between four and 14 percent more full of malarkey than Obama, or that PolitiFact and The Fact Checker have on average a liberal bias that gives Obama a four to 14 percentage point advantage in truthfulness, or that the fact checkers have a centrist bias that shrinks the difference between the two candidates to just six percent of what frothy-mouthed partisans believe it truly is. Although I've verbally argued that fact checker bias is probably not as strong as either conservatives or liberals believe, no one has adequately measured the influence of political bias on fact checker rulings.
I have briefly considered some methods to measure, adjust, and reduce political bias in fact checking. Today, let's discuss a different problem with Malark-O-Meter's methods: we can't tell signal from noise. The problem is a bit different from the one Silver describes in his book, which is that people have a tendency to see patterns and trends when there aren't any. Instead, the problem is how a signal might influence the amount of noise that we estimate.
Again, the signal is potential partisan or centrist bias. The noise comes from sampling error, which occurs when you take an incomplete sample of all the falsifiable statements that a politician makes. Malark-O-Meter estimates the sampling error of a fact checker report card by randomly drawing report cards from a sampling distribution, which describes the probability distribution of the proportion of statements in each report card category. Sampling error is higher the smaller your sample of statements. The greater your sampling error, the less certain you will be in the differences you observe among individuals' malarkey scores.
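The sample size effect can be sketched in a few lines of Python. This is only an illustration under loudly stated assumptions, not Malark-O-Meter's actual code: it assumes five report card categories, a Dirichlet distribution with a uniform prior over the category proportions, and a hypothetical scoring rule (`WEIGHTS`) that maps the categories to 0 through 100. The report card counts are made up.

```python
import random

# ASSUMPTION: hypothetical category weights, 0 = true ... 100 = false.
# This is an illustrative scoring rule, not the published malarkey formula.
WEIGHTS = [0, 25, 50, 75, 100]

def draw_proportions(counts, rng, prior=1.0):
    """Draw category proportions from a Dirichlet(counts + prior) distribution."""
    gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]

def simulate_scores(counts, n_sims=2000, seed=0):
    """Simulate scores of new report cards with the same number of statements."""
    rng = random.Random(seed)
    n = sum(counts)
    scores = []
    for _ in range(n_sims):
        probs = draw_proportions(counts, rng)
        # Resample a report card of the same size, then average its weights.
        sample = rng.choices(WEIGHTS, weights=probs, k=n)
        scores.append(sum(sample) / n)
    return sorted(scores)

def interval_width(scores, level=0.95):
    """Width of the central interval that holds `level` of the simulated scores."""
    lower = scores[int((1 - level) / 2 * len(scores))]
    upper = scores[int((1 + level) / 2 * len(scores)) - 1]
    return upper - lower

# Made-up report cards with similar proportions but different sample sizes.
small_card = [4, 4, 5, 4, 4]        # 21 statements
large_card = [52, 52, 64, 52, 52]   # 272 statements
small_width = interval_width(simulate_scores(small_card))
large_width = interval_width(simulate_scores(large_card))
print(f"21 statements: width {small_width:.1f}; 272 statements: width {large_width:.1f}")
```

Under these assumptions the smaller report card yields a much wider interval, which is the sample size effect described above.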
To illustrate the sample size effect, I've reproduced a plot of the simulated malarkey score distributions for Obama, Romney, Biden, and Ryan, as of November 11th, 2012. Obama and Romney average 272 and ~140 rated statements per fact checker, respectively. Biden and Ryan average ~37 and ~21 statements per fact checker, respectively. The difference in the spread of their probability distributions is clear from the histograms and the differences between the upper and lower bounds of the labeled 95% confidence intervals.
The trouble is that Malark-O-Meter's sampling distribution assumes that the report card of all the falsifiable statements an individual ever made would have similar proportions in each category as the sample report card. And that assumption implies another one: that the ideological biases of fact checkers, whether liberal or centrist, do not influence the probability that a given statement of a given truthfulness category is sampled.
In statistical analysis, this is called selection bias. The conservative ideologues at PolitiFactBias.com (and Zebra FactCheck, and Sublime Bloviations; they're all written by at least one of the same two guys, really) suggest that fact checkers could bias the selection of their statements toward more false ones made by Republicans, and more true ones made by Democrats. Fact checkers might also be biased toward selecting some statements that make them appear more left-center so that they don't seem too partisan. I'm pretty sure there are some liberals out there who would agree that fact checkers purposefully choose a roughly equal number of true and false statements by conservative and liberal politicians so that they don't seem partisan. In fact, that's a common practice for at least one fact checker,
. The case for centrist bias isn't as clear for PolitiFact or The Fact Checker.
I think it will turn out that fact checkers' partisan or centrist biases, whether in rating or sampling statements, are too weak to swamp the true differences between individuals or groups. It is, however, instructive to examine the possible effects of selection bias on malarkey scores and their sampling errors. (In contrast, the possible effects of ideological bias on the observed malarkey scores are fairly obvious.)
My previous analysis of the possible liberal and centrist biases of fact checkers was pretty simple. To estimate the possible partisan bias, I simply compared the probability distribution of the observed differences between the Democratic and Republican candidates to ones in which the entire distribution was shifted so that the mean difference was zero, or so that the difference between the parties was reversed. To estimate possible centrist bias, I simply divided the probability distribution that I simulated by the size of the difference that frothy-mouthed partisans would expect, which is large. That analysis assumed that the width of the margin of error in the malarkey score, which is determined by the sampling error, remained constant after accounting for fact checker bias. But that isn't true.
There are at least two ways that selection bias can influence the simulated margin of error of a malarkey score. One way is that selection bias can diminish the efficiency of a fact checker's search for statements to fact check, leading to a smaller sample size of statements on each report card. Again, the smaller the sample size, the wider the margin of error. The wider the margin of error, the more difficult it is to distinguish among individuals, holding the difference in their malarkey scores constant. So the efficiency effect of selection bias causes us to underestimate, not overestimate, our certainty in the differences in factuality that we observe. The only reason to worry about this effect is that it would diminish our confidence in observed differences in malarkey scores, which might be real even though we don't know the reason (bias versus real differences in factuality) that those differences exist.
The bigger problem, of course, is that selection bias influences the probability that statements of a given truthfulness category are selected into an individual report card. Specifically, selection bias might increase the probability that more true statements are chosen over less true statements, or vice versa, depending on the partisan bias of the fact checker. Centrist selection bias might increase the probability that more half true statements are chosen, or that more equal numbers of true and false statements are chosen.
The distribution of statements in a report card definitely influences the width of the simulated margin of error. Holding sample size constant, the more even the statements are distributed among the categories, the greater the margin of error. Conversely, when statements are clumped into only a few of the categories, the margin of error is smaller. To illustrate, let's look at some extreme examples.
Suppose I have an individual's report card that rates 50 statements. Let's see what happens to the spread of the simulated malarkey score distribution when we change the spread of the statements across the categories from more even to more clumped. We'll measure how clumped the statements are with something called the Shannon entropy. The Shannon entropy is a measure of uncertainty, typically measured in bits (binary digits that can be 0 or 1). In our case, entropy measures our uncertainty in the truthfulness category of a single statement sampled from all the statements that an individual has made. The higher the entropy score, the greater the uncertainty. Entropy (thus uncertainty) is greatest when the probabilities of all possible events are equal to one another.
We'll measure the spread of the simulated malarkey score distribution by the width of its 95% confidence interval. The 95% confidence interval is the range of malarkey scores that we can be 95% certain would result from another report card with the same number of statements sampled from the same person, given our beliefs about the probabilities of each statement.
We'll compare six cases. First is the case when the true probability of each category is the same. The other five cases are when the true probability of one category is 51 times greater than the probabilities of the other categories, which would define our beliefs of the category probabilities if we observed (or forced through selection bias) that all 50 statements were in one of the categories. Below is a table that collects the entropy and confidence interval width from each of the six cases, and compares them to the equal statement probability case, for which the entropy is greatest and the confidence intervals are widest. Entropies are rounded to the nearest tenth, confidence interval widths to the nearest whole number, and comparisons to the nearest tenth. Here are the meanings of the column headers.
: self explanatory
: Absolute entropy of assumed category probabilities
: Entropy of assumed category probabilities compared to the case when the probabilities are all equal, expressed as a ratio
: Width of 95% confidence interval
Comp. CI width: Width of 95% confidence interval compared to the case when the probabilities are all equal, expressed as a ratio
And here is the table:
For all the clumped cases, the entropy is 20% of the entropy for the evenly distributed case. In fact, the entropy is the same in all the clumped cases because the calculation of entropy doesn't care about which categories are more likely than others. It only cares whether some categories are more likely than others.
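The entropy comparison can be checked with a short Python sketch. It assumes five report card categories (inferred from the six cases described above) and takes the clumped probabilities to be 51/55 for the favored category and 1/55 for each of the others, which is one way to read "51 times greater"; the function name is my own.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N_CATEGORIES = 5  # ASSUMPTION: five report card categories

even = [1 / N_CATEGORIES] * N_CATEGORIES
even_entropy = shannon_entropy(even)

# One clumped case per category: that category is 51 times as likely as each other one.
ratios = []
for favored in range(N_CATEGORIES):
    clumped = [51 / 55 if i == favored else 1 / 55 for i in range(N_CATEGORIES)]
    ratios.append(shannon_entropy(clumped) / even_entropy)

# Every ratio is identical, and each rounds to 0.2 (about 20% of the even case).
print([round(r, 1) for r in ratios])
```

Because the formula sums over the probabilities without regard to their labels, moving the favored category around leaves the entropy unchanged.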
The lower entropy in the clumped cases corresponds to small confidence intervals relative to the even case, which makes sense. The more certain we think we are in the probability that any one statement will be in a given report card category, the more certain we should be in the malarkey score.
This finding suggests that if fact checker bias causes oversampling of statements in certain categories, Malark-O-Meter will overestimate our certainty in the observed differences if the true probabilities within each category are more even. This logic could apply to partisan biases that lead to oversampling of truer or more false statements, or to centrist biases that oversample half true statements. The finding also suggests that a centrist bias that leads to artificially equivalent probabilities in each category will cause Malark-O-Meter to underestimate the level of certainty in the observed differences.
Another interesting finding is that the confidence interval widths that we've explored follow a predictable pattern. Here's a bar plot of the comparative CI widths from the table above.
The confidence interval is widest in the equal probability case. From there, we see a u-shaped pattern, with the narrowest confidence intervals occurring when we oversample half true statements. The confidence intervals get wider for the cases when we oversample mostly true or mostly false statements, and wider still for the cases when we oversample true or false statements. The confidence interval widths are equivalent between the all true and all false cases, and between the all mostly true and all mostly false cases.
What's going on? I don't really know yet. We'll have to wait for another day, and a more detailed analysis. I suspect it has something to do with how the malarkey score is calculated, which results in fewer malarkey score possibilities when the probabilities are more closely centered on half true statements.
Anyway, we're approaching a better understanding of how the selection bias among fact checkers can influence our comparative judgments of the factuality of politicians. Usefully, the same logic applies to the effects of fact checkers' rating biases in the absence of selection bias. You can expect Malark-O-Meter's honesty to continue. We're not here to prove any point that can't be proven. We're here to give an honest appraisal of how well we can compare the factuality of individuals using fact checker data. Stay tuned.
Glenn Kessler, Fact Checker at The Washington Post, gave two out of four Pinocchios to Barney Frank, who claimed that GOP gerrymandering allowed Republicans to maintain their House majority. Kessler would have given Frank three Pinocchios, but Frank publicly recanted his statement in a live television interview. Here at Malark-O-Meter, we equate a score of three Pinocchios with a PolitiFact Truth-O-Meter score of "Mostly False". Kessler was right to knock off a Pinocchio for Barney's willingness to publicly recant his claim. I'll explain why Kessler's fact check was correct, and why he was right to be lenient on Frank.
Frank was wrong because, as a Brennan Center for Justice report suggests, the Democrats wouldn't have won the House majority even under the district lines that predated the 2010 redistricting. Although the Republicans clearly won the latest redistricting game, redistricting doesn't fully explain how they maintained their majority. The other factor is geography. Dan Hopkins at The Monkey Cage cited a study by Chen and Rodden showing that Democrats are clustered inefficiently in urban areas. Consequently, they get big Congressional wins in key urban districts, but at the cost of small margin losses in the majority of districts. (And no, fellow fans of the Princeton Election Consortium, it doesn't matter that the effect is even bigger than the one Sam Wang predicted; it's still not only because of redistricting.)
So why was Kessler right to knock off a Pinocchio for Barney's willingness to recant? At Malark-O-Meter, we see fact checker report cards as a means to measure the overall factuality of individuals and groups. If an individual recants a false statement, that individual's marginal factuality should go up in our eyes for two reasons. First, that person made a statement that adheres to the facts. Second, the act of recanting a falsehood is a testament to one's adherence to the facts.
Regardless of its causes and no matter what Barney's malarkey score ends up being because of his remarks about it, what do we make of the disparity between the popular vote and the House seat margin, which has occurred only three other times in the last century? Should we modify U.S. Code, Title 2, Chapter 1, Section 2c (2 USC § 2c), which became law in 1967 and requires states with more than one apportioned Representative to be divided into one-member districts? Should we instead go with a system that gives all House seats to the party that wins a state's popular vote? Is there some sensible middle ground? (Of course there is.)
The answer to these questions depends critically on the role we want the geographic distribution of the U.S. population to play in determining the composition of the House. The framers of the Constitution meant for the House of Representatives to be the most democratic body of the national government, which is why we apportion Representatives based on the Census, and why there are more Representatives than Senators. Clearly, it isn't democratic for our redistricting rules to be vague enough that a party can benefit simply by holding the House majority in a Census year. Is it also undemocratic to allow the regional geography of the United States to determine the House composition?
I don't think so. Instead, the geographic distribution of people in the United States should determine the House composition. There are a bunch of redistricting algorithms out there that would help this happen. The underlying theme of the best algorithms is that Congressional districts should have comparable population size. Let's just pick an algorithm and do it already. And if we're not sure which of these algorithms is the best one, let's just do them all and take the average.
In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang have made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.
The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed outcome (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes: it equals one for perfect forecasts and zero for a model that always forecasts a 50% chance. The higher the normalized Brier score, the greater the predictive accuracy.
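Here is a minimal Python sketch of both scores as described above. The function names and the example forecasts are hypothetical; they are not taken from any prognosticator's actual predictions.

```python
def brier_score(forecasts, outcomes):
    """Average squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)

def normalized_brier_score(forecasts, outcomes):
    """One minus four times the Brier score, so higher means more accurate."""
    return 1 - 4 * brier_score(forecasts, outcomes)

# Hypothetical forecast probabilities that a candidate wins three states,
# and what happened (1 = won, 0 = lost).
forecasts = [0.99, 0.80, 0.10]
outcomes = [1, 1, 0]

print(brier_score(forecasts, outcomes))
print(normalized_brier_score(forecasts, outcomes))
```

A forecaster who says 50% for every state scores 0.25 on the Brier score, which is why multiplying by four and subtracting from one puts that forecaster at zero.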
Because the Brier score (and its normalized cousin) measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I've discovered research in the weather forecasting community about a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.
) notes that the Brier score doesn't deal adequately with very improbable or very probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probability of winning. In addition to its poor performance scoring highly improbable and probable events, the Brier score doesn't perform well when scoring very poor forecasts (
; sorry for the pay wall).
These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually its logarithm.
Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood heavily penalizes forecasts that are very certain and turn out wrong, giving the lowest possible score to a model that was perfectly certain of an outcome that did not occur.
To compare the accuracy of two models, simply take the difference in their log likelihoods. To calculate model weights, first subtract the maximum log likelihood across all the models from the log likelihood of each model. Then exponentiate the difference you just calculated. Then divide the exponentiated difference of each model by the sum of those values across all the models. Voilà. A model averaging weight.
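A rough sketch of the log likelihood score and the weights it implies, in Python. It assumes that a higher log likelihood is better, so each model is measured against the best (maximum) score before exponentiating; that makes each weight proportional to the model's likelihood. The function names and example numbers are hypothetical.

```python
import math


def log_likelihood(forecasts, outcomes):
    """Sum of log probabilities that the forecasts assigned to what happened."""
    total = 0.0
    for p, o in zip(forecasts, outcomes):
        prob_of_observed = p if o == 1 else 1.0 - p
        if prob_of_observed <= 0.0:
            # The model was perfectly certain of an outcome that did not occur.
            return -math.inf
        total += math.log(prob_of_observed)
    return total


def likelihood_weights(log_likelihoods):
    """Model averaging weights proportional to each model's likelihood."""
    best = max(log_likelihoods)
    if best == -math.inf:
        raise ValueError("every model assigned zero probability to the outcomes")
    relative = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(relative)
    return [r / total for r in relative]


# Two hypothetical models forecasting the same three states.
outcomes = [1, 0, 1]
model_a = log_likelihood([0.9, 0.2, 0.6], outcomes)
model_b = log_likelihood([0.7, 0.4, 0.5], outcomes)

print(model_a - model_b)                       # positive favors model A
print(likelihood_weights([model_a, model_b]))  # weights sum to one
```

Subtracting the best score before exponentiating doesn't change the final weights. It just keeps the exponentials from underflowing when the log likelihoods are large and negative.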
Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that simpler models are better than complex models, all else being equal. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation method based on the information criterion scores of models that have the same number of variables. I argue that we can ignore Occam's razor for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first-order election prognosticators to decide which parameters they include in their model. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
A funny short story about the triumph and perils of endless recursions in meta-analysis. NOT a critique of meta-analysis itself.
Once upon a time, there was a land called the United States of America, which was ruled by a shapeshifter whose physiognomy and political party affiliation were recast every four years by an electoral vote, itself a reflection of the vote of the people. For centuries, the outcome of the election had been foretold by a cadre of magicians and wizards collectively known as the Pundets. Gazing into their crystal balls at the size of crowds at political rallies, they charted the course of the shapeshifting campaign. They were often wrong, but people listened to them anyway.
Then, from the labyrinthine caves beneath the Marginuvera Mountains emerged a troglodyte race known as the Pulstirs. Pasty of skin and snarfy in laughter, they challenged the hegemony of the Pundet elite by crafting their predictions from the collective utterances of the populace. Trouble soon followed. Some of the powerful new Pulstir craftsmen forged alliances with one party or another. And as more and more Pulstirs emerged from Marginuvera, they conducted more and more puls.
The greatest trouble came, unsurprisingly, from the old Pundet guard in their ill-fated attempts to merge their decrees with Pulstir findings. Unable to cope with the number of puls, unwilling to so much as state an individual pul's marginuvera, the Pundets' predictions confused the people more than they informed them.
Then, one day, unbeknownst to one another, rangers emerged from the Forests of Metta Analisis. Long had each of them observed the Pundets and Pulstirs from afar. Long had they anguished over the amount of time the Pundets spent bullshyting about what the ruler of America would look like after election day rather than discussing in earnest the policies that the shapeshifter would adopt. Long had the rangers shaken their fists at the sky every time Pundets with differing loyalties supported their misbegotten claims with a smattering of gooseberry-picked puls. Long had the rangers tasted vomit at the back of their throats whenever the Pundets at Sea-en-en jabbered about it being a close race when one possible shapeshifting outcome had been on average trailing the other by several points in the last several fortnights of puls.
Each ranger retreated to a secluded cave, where they used the newfangled signal torches of the Intyrnet to broadcast their shrewd aggregation of the Pulstirs' predictions. There, they subsisted on a diet of espresso, Power Bars, and drops of Mountain Dew. Few hours they slept. In making their predictions, some relied only on the collective information of the puls. Others looked as well to fundamental trends of prosperity in each of America's states.
Pundets on all (by that, we mean both) sides questioned the rangers' methods, scoffed at the certainty with which the best of them predicted that the next ruler of America would look kind of like a skinny Nelson Mandela, and would support similar policies to the ones he supported back when he had a bigger chin and lighter skin, was lame of leg, and harbored great fondness for elegantly masculine cigarette holders.
On election day, it was the rangers who triumphed, and who collectively became known as the Quants, a moniker that was earlier bestowed upon another group of now-disgraced but equally pasty rangers who may have helped usher in the Great Recession of the early Third Millennium. The trouble is that the number of Quants had increased due to the popularity and controversy surrounding their predictions. While most of the rangers correctly predicted the physiognomy of the president, they had differing levels of uncertainty in the outcome, and their predictions fluctuated to different degrees over the course of the lengthy campaign.
Soon after the election, friends of the Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Quants had aggregated the puls to form better predictions about the outcome of the election, we could aggregate the aggregates to make our predictions yet more accurate.
Four years later, the Meta-Quants broadcast their predictions alongside those of the original Quants. Sure enough, the Meta-Quants predicted the outcome with greater accuracy and precision than the original Quants.
Soon after the election, friends of the Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Quants had aggregated the Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates to make even better predictions.
Four years later, the Meta-Meta-Quants broadcast their predictions alongside those of the Quants and the Meta-Quants. Sure enough, the Meta-Meta-Quants predicted the outcome with somewhat better accuracy and precision than the Meta-Quants, but not as much better as the Meta-Quants had over the Quants. Nobody really paid attention to that part of it.
Which is why, soon after the election, friends of the Meta-Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Meta-Quants had aggregated the Meta-Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates of the aggregates to make even better predictions.
One thousand years later, the (Meta x 253)-Quants broadcast their predictions alongside those of all the other types of Quants. By this time, 99.9999999% of Intyrnet communication was devoted to the prediction of the next election, and the rest was devoted to the prediction of the election after that. A Dyson Sphere was constructed around the sun to power the syrvers necessary to compute and communicate the prediction models of the (Meta x 253)-Quants, plus all the other types of Quants. Unfortunately, most of the brilliant people in the Solar System were employed making predictions about elections. Thus the second-rate constructors of the Dyson Sphere accidentally built its shell within the orbit of Earth, blocking out the sun and eventually causing the extinction of life on the planet.
I've written a lot recently about the promise of combining the results from the different election prediction models that have cropped up over the last decade. (
Here's a scroll of those articles
in reverse chronological order.) One suggestion I've made is to average the results of the election prediction models. The marginoferror blog made the same suggestion, noting that
the averaged aggregator performs better than any individual aggregator
(that they included in their sample of aggregators).
Today, I present suggestions for how to calculate averaging weights for a given prediction of the winner of the presidency in each state, and of the percent popular vote in that state. These suggestions were inspired by the reporting of Brier scores and other prediction accuracy statistics by Simon Jackman, Sam Wang, and Drew Linzer.
State-level outcomes (thus EV outcomes)
To calculate the model weight for a given model at a given point in time, start with Christopher A. T. Ferro's sample size adjusted Brier score (
see equation 8
, which depends on equation 3 and the first expression in section 2.a) comparing
observed state-level outcomes to the probability estimated from
of the years that an aggregator has made predictions at the specified calendar distance from election day.
Ferro's adjusted Brier score is best because it accounts for the effects of sample size on the Brier score.
Next, subtract that Brier score from one, which is the highest possible value for a Brier score. The result is an absolute score that increases as the Brier score decreases. Recall that the Brier score is larger when there is greater distance between the predicted and observed values.
Next, we repeat that process for all aggregators that have made predictions at that distance from election day.
Next, we normalize all the absolute scores by the summed absolute scores to give each model a relative weight.
Finally, we weight each model by its relative weight when averaging.
This method could easily be modified to give models weights corresponding to entire prediction histories, and/or to predictions within a given time interval at a given distance from the election. It could also be extended to deal with one-off forecasts that are never updated. Because state-level outcomes largely determine the electoral vote, I propose that the same model weight calculated as above could be used when averaging electoral vote distributions.
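The steps above can be sketched in a few lines of Python. This sketch takes each model's Brier score as given; it does not implement Ferro's sample size adjustment, so treat the inputs as stand-ins for the adjusted scores. The model names and numbers are hypothetical.

```python
def brier_weights(brier_scores):
    """Turn Brier scores into relative weights: (1 - score), normalized to sum to one."""
    absolute = [1.0 - b for b in brier_scores]
    total = sum(absolute)
    if total <= 0.0:
        raise ValueError("absolute scores must sum to a positive number")
    return [a / total for a in absolute]


def weighted_average(predictions, weights):
    """Average the models' predictions, weighting each model by its relative weight."""
    return sum(p * w for p, w in zip(predictions, weights))


# Hypothetical Brier scores for three aggregators at the same distance from election day.
hypothetical_brier = {"model_a": 0.02, "model_b": 0.05, "model_c": 0.10}
weights = brier_weights(list(hypothetical_brier.values()))

# Hypothetical forecast probabilities that Obama wins one state, one per model.
hypothetical_forecasts = [0.95, 0.90, 0.80]
combined = weighted_average(hypothetical_forecasts, weights)

print(dict(zip(hypothetical_brier, weights)))
print(combined)
```

Note that because Brier scores for decent models are all close to zero, the absolute scores are all close to one, and the resulting weights differ only slightly from a simple average.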
State-level shares of popular vote
The method is identical to what I described above, except we replace Ferro's adjusted Brier score with the sample normalized root mean squared error, which measures the average percentage point difference between the observed and expected popular vote outcomes. Simply calculate one minus the sample normalized root mean squared error of a given model, and divide the difference by the sum of the same quantity across all the models. Then, calculate a weighted average.
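A minimal Python sketch of the popular vote version. It assumes vote shares are expressed as fractions between zero and one, so that the root mean squared error also falls between zero and one; that is my stand-in for "sample normalized" here, and other normalizations are possible. All numbers are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed vote shares."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need one observation per prediction, and at least one")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_weights(errors):
    """Relative weights from (1 - RMSE), normalized to sum to one."""
    scores = [1.0 - e for e in errors]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical observed Obama vote shares in three states (as fractions).
observed = [0.60, 0.41, 0.50]
# Hypothetical predictions from two models for the same states.
model_a = [0.59, 0.42, 0.51]
model_b = [0.56, 0.45, 0.47]

errors = [rmse(model_a, observed), rmse(model_b, observed)]
print(rmse_weights(errors))
```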
These methods have a lot of nice features:
They result in weights that are easily interpreted.
The weights can also be decomposed into different components because they are based on the Brier score and root mean squared error. For example, the Brier score can be decomposed to examine calibration and uncertainty effects. The mean squared error can be decomposed into bias and variance components.
The methods are flexible enough to accommodate any scope of predictive power that interests researchers.
UPDATE: Edited out some two-am-induced errors.
Now that we've established that people who analyze polling data might have something there, let's devise ways to compare and contrast the different models. Drew Linzer at
already described his strategy for checking how well his model worked, and started
some of his post hoc
. So did
. As of this moment, Micah Cohen at Nate Silver's FiveThirtyEight blog says "
." Darryl Holman is busy
, but I suspect we'll see some predictive performance analysis from him soon, too.
Tonight (okay, this morning), I want to compare the predictions that three of the modelers made about the electoral vote count to show you just how awesome these guys did, but also to draw some contrasts in the results of their modeling strategy. Darryl Holman, Simon Jackman, and Sam Wang all shared the probability distribution of their final electoral vote predictions for Obama with me. Here are the three probability distributions in the same plot for what I think is the first time.
The first thing to notice is that the two most likely outcomes in each of the models are 303 and 332. These two outcomes together are about 15%, 30%, and 36% likely for Holman, Jackman, and Wang, respectively.
Three hundred and three votes happens to be the number of votes Obama currently has secured. Three hundred and thirty-two votes would be the number Obama would have if 29 electoral votes from the remaining toss-up state, Florida, went to him. As most of you know, Obama won the popular vote in Florida, but by a small margin. That's the power of well designed and executed quantitative analysis.
Note, however, that the distributions aren't identical. Jackman's and Wang's distributions are more dispersed, more kurtotic (peaked), and more skewed than Holman's distribution. If you look at Silver's distribution, it is also more dispersed and kurtotic than Holman's. The models also differ in the relative likelihood they give to the two most likely outcomes. Another difference is that Jackman's distribution (and Silver's) has a third most likely outcome favorable to Obama that is much more distinguishable from the noise than it is for Holman's model.
that differences like these are important, if not on election eve, then earlier in the campaign. I've
that all of these models together might better predict the election in aggregate than they do on their own. So let's see what these models had to say in aggregate in their final runs before the election. It might seem silly to do this analysis after the election is already over, but, hey, they're still counting Florida.
Here is the average probability distribution of the three models.
Whoopdeedoo. It's an average distribution. Who cares, right? Well that histogram shows us what the models predicted in aggregate for the 2012 election. The aggregate distribution leads to more uncertainty regarding the two most likely outcomes than for some models (especially Holman), but less uncertainty for others (especially Wang). If we had added Drew Linzer's model and Nate Silver's model, which both predicted higher likelihood of 332 than 303 electoral votes, perhaps the uncertainty would have decreased even more in favor of 332. That third outcome also shows up as important in the aggregate model.
Model averaging and model comparison like this would have been helpful earlier in the campaign because it would have given us a sense of what all the models said in aggregate, but also how they differed. The more models we average, and the better we estimate the relative weights to give the models when calculating that average, the better.
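For the curious, averaging electoral vote distributions is simple. Here is a sketch in Python, assuming each model's distribution is a mapping from an electoral vote count to its probability. The toy distributions below are made up; they are not the modelers' actual numbers.

```python
def average_distributions(distributions, weights=None):
    """Average several electoral vote distributions, optionally with model weights."""
    if not distributions:
        raise ValueError("need at least one distribution")
    if weights is None:
        weights = [1.0] * len(distributions)  # equal weights by default
    if len(weights) != len(distributions):
        raise ValueError("need one weight per distribution")
    total_weight = sum(weights)
    averaged = {}
    for dist, weight in zip(distributions, weights):
        for electoral_votes, prob in dist.items():
            # An outcome missing from a model counts as probability zero for that model.
            averaged[electoral_votes] = (
                averaged.get(electoral_votes, 0.0) + (weight / total_weight) * prob
            )
    return averaged


# Toy distributions over Obama's electoral vote count for three hypothetical models.
model_1 = {303: 0.10, 332: 0.05, 290: 0.85}
model_2 = {303: 0.15, 332: 0.15, 347: 0.70}
model_3 = {303: 0.20, 332: 0.16, 281: 0.64}

average = average_distributions([model_1, model_2, model_3])
print(average[303], average[332])
```

Passing a list of weights instead of leaving the default gives the weighted version, which is where better estimates of the relative weights would come in.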
Anyway, the outcome that truly matters has already been decided. I admit that I'm happy about it.
I've recently noticed that my early writing about comparisons of malarkey scores, which I report as ratios, is easily misinterpreted. For example, I might say that candidate X spews 17% more malarkey than candidate Y, or that PolitiFact might have a 17% bias against party Z over party W. What I really mean is that the malarkey score is 17% larger for candidate X or party Z, not that there is a difference of 17 points along the malarkey scale. The confusion arises from the fact that I report comparisons as ratios. I'm not going to do that anymore. Why? Because if I report comparisons as differences instead, it makes more sense given that the malarkey score ranges from 0 to 100, and could be interpreted as the percentage of one's utterances that are malarkey-laden. I'll make the changes to the side bar reports sometime this week.
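To see how far apart the two ways of reporting can be, here is a tiny Python illustration. The malarkey scores of 48 and 41 are invented for the example.

```python
def ratio_comparison(score_x, score_y):
    """Percent by which score_x is larger than score_y (the old way of reporting)."""
    return (score_x / score_y - 1.0) * 100.0


def difference_comparison(score_x, score_y):
    """Difference in points along the 0 to 100 malarkey scale (the new way)."""
    return score_x - score_y


# Invented malarkey scores for candidate X and candidate Y.
candidate_x, candidate_y = 48, 41

print(ratio_comparison(candidate_x, candidate_y))       # about 17 percent larger
print(difference_comparison(candidate_x, candidate_y))  # 7 points apart
```

The same pair of scores reads as "17% more malarkey" under the ratio and as a gap of only 7 points under the difference.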