posted on Jun, 26 2012 @ 09:29 PM
Detecting Statistical Data Anomalies in Republican Primary Election 2012 Results (v. 1.0, 06/24/2012)
< continued, part 3>
2. The hypergeometric distribution lets us make probabilistic statements about the properties of a random sample given the entire population, or draw
inferences about the entire population from a specific instance of a random sample. For example, suppose that the total vote tally (population) in
the state is 100, and candidate A actually got 60 votes. Let’s assume that we have drawn a random sample without replacement from this population,
which is somewhat similar to an exit poll (assuming that it is perfectly random and truthful). If our random sample has size 10, then the probability
that exactly 6 respondents voted for candidate A is 0.2643, the probability that at most 6 respondents voted for candidate A is 0.6258, and the
probability that at least 6 respondents voted for candidate A is 0.6386. Alternatively, one can infer the number 60 from the above sample of size 10
with 6 votes for A using the following formula: floor(6 * (100 + 1) / 10) = floor(60.6) = 60. But this is just a point estimate. There are at least two methods to construct the
confidence sets (similar to confidence intervals) around point estimates for the hypergeometric distribution: “test-method” and
“likelihood-method”. These methods are important for analyzing the exit polls results, but they are not the focus of this report for now. However,
even the point estimate of the vote counts for each candidate can show potential fraud-based bias in favor of or against one or several candidates.
Specifically, if the precincts are ordered by the vote tally, and the population point estimate of vote counts keeps increasing for one candidate
while it decreases or stays flat for the other candidates, then this serves as another indication of the suspicious positive correlation, but viewed
from a different angle. For each precinct (starting from the first one with at least one cumulative vote for each candidate), we can compute the
deviation of these point-estimate vote percentages from the “official” results and then average the deviations. These averages are rough corrections
of the official results towards the actual results.
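
The numbers in item 2 can be checked with a short Python sketch. This is only an illustration, not the code used for the report: the function names, the argument order (borrowed from Excel's HYPGEOM.DIST: sample successes, sample size, population successes, population size) and the five demo precincts are all assumptions made up for this example.

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """P(X = k): k votes for A among n sampled without replacement
    from N total votes, K of which are for A."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_cdf(k, n, K, N):
    """P(X <= k) under the same setup."""
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k + 1))

def point_estimate(k, n, N):
    """Point estimate of the population count: floor(k * (N + 1) / n)."""
    return k * (N + 1) // n

def mean_deviation(precincts):
    """Average deviation (in percentage points) of the cumulative point
    estimates from the official percentage, for one candidate.
    precincts: (votes_for_candidate, total_votes) pairs. Hypothetical helper;
    assumes the candidate has at least one vote somewhere."""
    ordered = sorted(precincts, key=lambda p: p[1])  # order by vote tally
    N = sum(total for _, total in ordered)
    official_pct = 100 * sum(votes for votes, _ in ordered) / N
    k = n = 0
    deviations = []
    for votes, total in ordered:
        k += votes
        n += total
        if k == 0:  # start at the first row with a cumulative vote
            continue
        deviations.append(100 * point_estimate(k, n, N) / N - official_pct)
    return sum(deviations) / len(deviations)

# The example from the text: population 100, 60 votes for A, sample of 10.
p_exact = hypergeom_pmf(6, 10, 60, 100)
p_at_most = hypergeom_cdf(6, 10, 60, 100)
p_at_least = 1 - hypergeom_cdf(5, 10, 60, 100)
print(round(p_exact, 4), round(p_at_most, 4), round(p_at_least, 4))
# 0.2643 0.6258 0.6386
print(point_estimate(6, 10, 100))  # 60

# Made-up precincts (votes for A, total votes); not real election data.
demo_precincts = [(3, 5), (4, 10), (9, 15), (14, 20), (30, 50)]
print(mean_deviation(demo_precincts))
```

In the made-up data candidate A's cumulative share dips in the middle precincts, so the average deviation comes out negative. Real data would need one such pass per candidate.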
3. Finally, let’s look at the third method to detect the same data anomaly. This time we run a series of one-sided hypothesis tests on the vote
percentages for each candidate. If we run these tests on the precincts that are sorted either randomly or alphabetically by county and/or precinct
name, then the anomaly is not detected. However, if the precincts are sorted by the vote tally, then the anomaly is extremely pronounced in favor of
one and the same candidate across states and counties. The following list describes the steps to reproduce this analysis:
a. After ordering precincts by the vote tally, compute precinct cumulative sums of vote counts and vote percentages for each candidate and for the
whole vote tally. You may think of these sums as incremental exit polls results.
b. Run two hypothesis tests for each candidate at each ordered precinct row with the cumulative counts. Use the Excel function “HYPGEOM.DIST” to run
these tests. The following example illustrates the point. Suppose that the state-wide vote tally is 100, and candidate Bad “officially” got 40
votes, while candidate Good “officially” got 30 votes. We do not know how many votes these two candidates actually got. Let’s assume that we
added up all “official” votes from the 35% of precincts with the smallest vote tallies. These precincts have only 20 votes cast, and 10 of them
were for Mr. Good, while only 5 of them were for Mr. Bad. Evidently, Mr. Bad has to catch up in order to get his 40%, since he has only 25% so far.
Meanwhile, we can run a hypothesis test on Mr. Bad: the “null” hypothesis is that he will eventually get at least 40 votes (40%), and the
alternative hypothesis is that he will get less than 40% of the votes. Since Mr. Bad has a long way to go to catch up, a large enough gap lets us
reject the “null” hypothesis (say, at the 99% confidence level) and conclude that Mr. Bad actually got fewer than 40 votes in total. This is called
an upper-tail hypothesis test. We can run a lower-tail hypothesis test for Mr. Good, who was a victim of vote flipping. In this case, the “null”
hypothesis is that Mr. Good’s vote count was less than or equal to 30, and a large enough surplus lets us reject it (say, at the 99% confidence
level). Obviously, both tests should and can be applied to both candidates.
c. Finally, we compute the percentage of “null” hypothesis rejections for both test types for all candidates across all precincts. If we
observe that rejection occurred in 97% of precincts for one candidate and in 0.4% of precincts for another one (or the other way around), then we can
make a statistical inference with respect to these candidates’ election results. The inference can be reinforced by running the test on different
states and counties and observing the same anomaly over and over again for the same candidate.
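
Steps a through c can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the spreadsheet used for the report: the function names, the dictionary layout of a precinct, the "Other" bucket and the three demo precincts are hypothetical. The helper hypergeom_cdf plays the role of HYPGEOM.DIST with its cumulative flag set to TRUE and takes its arguments in the same order. Each test is evaluated at the boundary of its "null" hypothesis, i.e. with the population count set to the official total.

```python
from math import comb

def hypergeom_cdf(k, n, K, N):
    """P(X <= k): at most k votes for a candidate among n counted so far,
    when N votes were cast in total and K of them are for the candidate."""
    lo = max(0, n - (N - K))
    hi = min(k, n, K)
    if hi < lo:
        return 0.0
    # Exact integer arithmetic; slow for very large tallies.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1)) / comb(N, n)

def p_null_at_least(k, n, official, N):
    """Null: true total >= official. Few votes so far count against it."""
    return hypergeom_cdf(k, n, official, N)

def p_null_at_most(k, n, official, N):
    """Null: true total <= official. Many votes so far count against it."""
    return 1.0 - hypergeom_cdf(k - 1, n, official, N)

def rejection_rates(precincts, alpha=0.01):
    """precincts: list of {candidate: votes} dicts with identical keys.
    Returns, per candidate and test, the share of cumulative rows rejected."""
    ordered = sorted(precincts, key=lambda p: sum(p.values()))  # step a
    candidates = list(ordered[0])
    official = {c: sum(p[c] for p in ordered) for c in candidates}
    N = sum(official.values())
    running = dict.fromkeys(candidates, 0)
    hits = {c: {"at_least": 0, "at_most": 0} for c in candidates}
    for p in ordered:
        for c in candidates:
            running[c] += p[c]  # cumulative counts
        n = sum(running.values())
        for c in candidates:  # step b: two tests per candidate per row
            if p_null_at_least(running[c], n, official[c], N) < alpha:
                hits[c]["at_least"] += 1
            if p_null_at_most(running[c], n, official[c], N) < alpha:
                hits[c]["at_most"] += 1
    rows = len(ordered)
    return {c: {t: h / rows for t, h in hits[c].items()} for c in candidates}  # step c

# The example from step b: 20 of 100 votes counted, 5 for Bad, 10 for Good.
p_bad = p_null_at_least(5, 20, 40, 100)
p_good = p_null_at_most(10, 20, 30, 100)
print(p_bad, p_good)

# Made-up precincts that add up to the same official totals; not real data.
demo = [
    {"Good": 10, "Bad": 5, "Other": 5},
    {"Good": 8, "Bad": 12, "Other": 10},
    {"Good": 12, "Bad": 23, "Other": 15},
]
rates = rejection_rates(demo)
print(rates)
```

By my arithmetic the two p-values for the toy numbers are roughly 0.10 and 0.03. With only 20 votes counted, neither "null" is rejected at the 99% level, and rejections require larger cumulative counts or larger gaps. The last cumulative row always reproduces the official totals, so it can never produce a rejection.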