1. The graphs below show the 500-abstract rolling mean, standard deviation, skewness, and first-order autocorrelation for the initial ratings of the Consensus Project. I bootstrapped the data 10,000 times* to find the mean and the 95% confidence interval. The empirical statistics should lie above the upper bound in 2.5% of all cases and fall below the lower bound in another 2.5%.
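
    For those who want to check the numbers, here is a minimal sketch of this kind of bootstrap, not the code I actually ran. It assumes the initial ratings sit in a one-dimensional array (here called ratings, a hypothetical name), uses pandas for the rolling statistics, and uses fewer replications than the 10,000 reported above to keep the run time manageable.

    ```python
    import numpy as np
    import pandas as pd

    def rolling_stats(x, w):
        """Rolling mean, standard deviation, skewness and lag-1 autocorrelation."""
        s = pd.Series(np.asarray(x, dtype=float))
        return pd.DataFrame({
            "mean": s.rolling(w).mean(),
            "std": s.rolling(w).std(),
            "skew": s.rolling(w).skew(),
            "acf1": s.rolling(w).corr(s.shift(1)),
        }).dropna()

    def exceedance(ratings, w=500, reps=1000, seed=0):
        """Share of windows where the empirical statistic lies outside the 95% bootstrap band."""
        rng = np.random.default_rng(seed)
        emp = rolling_stats(ratings, w)
        boot = np.stack([
            rolling_stats(rng.choice(ratings, size=len(ratings), replace=True), w).to_numpy()
            for _ in range(reps)
        ])
        lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
        return pd.DataFrame({
            "above": (emp.to_numpy() > hi).mean(axis=0),
            "below": (emp.to_numpy() < lo).mean(axis=0),
        }, index=emp.columns)
    ```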

    UPDATE1: 100-abstract rolling statistics added. UPDATE2: 50-abstract rolling statistics added. UPDATE3: Cluster results added.

    500
    The mean exceeds the upper bound in 13.0% of cases and falls below the lower bound in 10.7%. The standard deviation exceeds the upper bound in 7.6% of cases and falls below the lower bound in 12.2%. The skewness exceeds the upper bound in 3.3% of cases and falls below the lower bound in 8.0%. The autocorrelation exceeds the upper bound in 25.5% of cases and falls below the lower bound in 0.0%.



    100
    The mean exceeds the upper bound in 6.9% of cases and falls below the lower bound in 7.8%. The standard deviation exceeds the upper bound in 4.7% of cases and falls below the lower bound in 5.6%. The skewness exceeds the upper bound in 2.6% of cases and falls below the lower bound in 6.4%. The autocorrelation exceeds the upper bound in 7.6% of cases and falls below the lower bound in 1.4%.


    50
    The mean exceeds the upper bound in 4.9% of cases and falls below the lower bound in 6.4%. The standard deviation exceeds the upper bound in 4.0% of cases and falls below the lower bound in 5.0%. The skewness exceeds the upper bound in 2.7% of cases and falls below the lower bound in 5.3%. The autocorrelation exceeds the upper bound in 6.1% of cases and falls below the lower bound in 1.5%.
    Clustering
    Rolling statistics are not intuitive to everyone. I therefore computed three additional statistics: the minimum distance between identical ratings, the maximum distance, and the average distance. I again bootstrapped the data 10,000 times to compute the 90% confidence interval. The results are shown below. Because these are single statistics, we cannot compare the empirical exceedance frequencies with the theoretical ones. The graphs show that the sixes are placed further apart than you would expect by chance, while the sevens are too close together.

    [Figures: average, minimum and maximum distance between identical ratings]
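
    A rough sketch of how these distance statistics can be computed, again not the original code. I take "distance between identical ratings" to mean the gap between consecutive occurrences of a given rating value in the rating sequence; that reading is an assumption.

    ```python
    import numpy as np

    def gap_stats(ratings, value):
        """Minimum, average and maximum gap between consecutive occurrences of `value`."""
        pos = np.flatnonzero(np.asarray(ratings) == value)
        gaps = np.diff(pos)                 # assumes `value` occurs at least twice
        return gaps.min(), gaps.mean(), gaps.max()

    def gap_band(ratings, value, reps=10_000, seed=0):
        """90% bootstrap band for the three gap statistics."""
        rng = np.random.default_rng(seed)
        x = np.asarray(ratings)
        sims = np.array([gap_stats(rng.choice(x, size=len(x), replace=True), value)
                         for _ in range(reps)])
        return np.percentile(sims, [5, 95], axis=0)
    ```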


    Thanks to Brandon Shollenberger for pointing out that I initially had somehow messed up the data.

    *I know, I know. I could have invoked stationarity and saved my computer lots of runs. Brute force is easier to explain to the non-initiated, though.

  2. Dear Professor Høj,

    I was struck by a recent paper published in Environmental Research Letters with John Cook, a University of Queensland employee, as the lead author. The paper purports to estimate the degree of agreement in the literature on climate change. Consensus is not an argument, of course, but my attention was drawn to the fact that the headline conclusion had no confidence interval, that the main validity test was informal, and that the sample contained a very large number of irrelevant papers while simultaneously omitting many relevant papers.

    My interest piqued, I wrote to Mr Cook asking for the underlying data and received 13% of the data by return email. I immediately requested the remainder, but to no avail.

    I found that the consensus rate in the data differs from that reported in the paper. Further research showed that, contrary to what is said in the paper, the main validity test in fact invalidates the data. And the sample of papers does not represent the literature. That is, the main finding of the paper is incorrect, invalid and unrepresentative.

    Furthermore, the data showed patterns that cannot be explained by either the data gathering process as described in the paper or by chance. This is documented at https://docs.google.com/file/d/0Bz17rNCpfuDNRllTUWlzb0ZJSm8/edit?usp=sharing

    I asked Mr Cook again for the data so as to find a coherent explanation of what is wrong with the paper. As that was unsuccessful, even after a plea to Professor Ove Hoegh-Guldberg, the director of Mr Cook’s workplace, I contacted Professor Max Lu, deputy vice-chancellor for research, and Professor Daniel Kammen, journal editor. Professors Lu and Kammen succeeded in convincing Mr Cook to release first another 2% and later another 28% of the data.

    I also asked for the survey protocol, but none seems to exist, in violation of all codes of practice. The paper and data do hint at what was really done. There is no trace of a pre-test. Rating training was done during the first part of the survey, rather than prior to the survey. The survey instrument was altered during the survey, and abstracts were added. Scales were modified after the survey was completed. All this introduced inhomogeneities into the data that cannot be controlled for, as they are undocumented.

    The later data release reveals that what the paper describes as measurement error (in either direction) is in fact measurement bias (in one particular direction). Furthermore, there is drift in measurement over time. This makes a greater nonsense of the paper.


    I went back to Professor Lu once again, asking for the remaining 57% of the data. In particular, I asked for rater IDs and time stamps. Both may help to understand what went wrong.

    Only 24 people took the survey. Of those, 12 quickly dropped out, so that the survey essentially relied on just 12 people. The results would be substantially different if only one of the 12 were biased in one way or the other. The paper does not report any test for rater bias, an astonishing oversight by authors and referees. If rater IDs are released, these tests can be done.

    Because so few took the survey, these few answered on average more than 4,000 questions. The paper is silent on the average time taken to answer these questions and, more importantly, on the minimum time. Experience shows that interviewees find it difficult to stay focused if a questionnaire is overly long. The questionnaire used in this paper may have set a record for length, yet neither the authors nor the referees thought it worthwhile to test for rater fatigue. If time stamps are released, these tests can be done.

    Mr Cook, backed by Professors Hoegh-Guldberg and Lu, has flatly refused to release these data, arguing that a data release would violate confidentiality. This reasoning is bogus.

    I don’t think confidentiality is relevant. The paper presents the survey as a survey of published abstracts, rather than as a survey of the raters. If these raters are indeed neutral and competent, as claimed by the paper, then tying ratings to raters would not reflect on the raters in any way.

    If, on the other hand, this was a survey of the raters’ beliefs and skills, rather than a survey of the abstracts they rated, then Mr Cook is correct that their identity should remain confidential. But this undermines the entire paper: It is no longer a survey of the literature, but rather a survey of Mr Cook and his friends.

    If need be, the association of ratings to raters can readily be kept secret by means of a standard confidentiality agreement. I have repeatedly stated that I am willing to sign an agreement that I would not reveal the identity of the raters and that I would not pass on the confidential data to a third party either on purpose or by negligence.

    I first contacted Mr Cook on 31 May 2013, requesting data that should have been ready when the paper was submitted for peer review on 18 January 2013. His foot-dragging, condoned by senior university officials, does not reflect well on the University of Queensland’s attitude towards replication and openness. His refusal to release all data may indicate that more could be wrong with the paper.

    Therefore, I hereby request, once again, that you release rater IDs and time stamps.

    Yours sincerely,




    Richard Tol

  3. According to Cook et al., abstracts were presented in random order to the raters. The figure below shows the distribution of the year of publication (older papers at the bottom, newer ones at the top) in sequence of rating, for blocks of 1000 ratings (early ratings to the left, later ratings to the right). The pattern is indeed uniform, except for a lot of recent papers at the end of the process.
    The figure below shows the distribution of the distance between the first and second rating (distance between ratings on the horizontal axis, in bins of 500; number of rating distances on the vertical axis). The distribution is as expected, except for a bunch of abstracts that were rated in close succession. This is consistent with the figure above.
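
    For reference, a sketch of how such rating distances can be computed from the released ratings file; the column name paper_id is my assumption, not necessarily the name used in the file.

    ```python
    import pandas as pd

    def first_second_gaps(df, id_col="paper_id"):
        """Gap, in number of ratings, between each abstract's first and second rating."""
        d = df.reset_index(drop=True)
        d["pos"] = d.index                      # position in the rating sequence
        d["k"] = d.groupby(id_col).cumcount()   # 0 = first rating, 1 = second, ...
        wide = d.pivot(index=id_col, columns="k", values="pos")
        return (wide[1] - wide[0]).dropna()     # drops abstracts rated only once

    # e.g. first_second_gaps(ratings).plot.hist(bins=range(0, 27000, 500))
    ```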

    UPDATE: The figure below shows, in bins of 500, when the first rating (red), the second rating (blue) and the third rating (green) took place. Fourth and fifth ratings make up the difference (except for the final bin; there are 26848 ratings). Clearly, the ratings took place in sequence. This has implications for the results, as the population of raters changed over time, and the ratings were subject to discussion and reinterpretation at the beginning of the rating process.


  4. According to Cook et al., each abstract was assessed by at least 2 and at most 3 raters. In fact, 33 abstracts were seen by only one rater, 167 by four raters, and 5 by five. If the initial ratings disagreed, as they did in 33% of cases, abstracts were revisited by the original raters. In 15.9% of cases, this led to agreement. In 17.1% of cases, a third rater broke the tie.
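
    These counts can be reproduced from a ratings file with one row per rating; a sketch, with paper_id again an assumed column name:

    ```python
    import pandas as pd

    def ratings_per_abstract(ratings, id_col="paper_id"):
        """How many abstracts received 1, 2, 3, 4 or 5 ratings."""
        return ratings.groupby(id_col).size().value_counts().sort_index()
    ```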

    A reported error rate of 33%, with 2 ratings and 7 categories, implies that 18.5% of individual ratings were incorrect. About 0.6% of abstracts received two identical but wrong ratings. Some 2.9% of ratings are still wrong after reconciliation, and 3.2% are wrong after re-rating. In total, 6.7% of the reported data are in error.
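
    The 18.5% can be reproduced with a bit of arithmetic: with two independent ratings, a per-rating error probability p, and errors spread evenly over the 6 wrong categories, two raters agree with probability (1-p)^2 + p^2/6. A quick check -- my reconstruction of the calculation, not code from the paper:

    ```python
    from math import sqrt

    disagree = 0.33                 # reported initial disagreement rate
    # solve (1 - p)**2 + p**2 / 6 = 1 - disagree for p
    a, b, c = 7 / 6, -2.0, disagree
    p = (-b - sqrt(b * b - 4 * a * c)) / (2 * a)
    both_wrong_same = p ** 2 / 6    # both raters wrong, yet agreeing
    print(round(p, 3), round(both_wrong_same, 3))   # 0.185 and 0.006
    ```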

    The figure shows that the corrections to the ratings through reconciliation and re-rating were, on balance, towards rejection of the hypothesis of human-made climate change. In other words, these were not errors in either direction, but rather biases in one direction.

    Assuming that the 6.7% of erroneous data should be corrected in the same way, the consensus rate falls from 98% to 91%.



  5. Mr Cook has released more data. Unfortunately, a lot of data is still missing. In particular, rater IDs and time stamps are not available. This means that we cannot test for systematic differences between raters. It also means that we cannot compute average and minimum rating times.

    Dan Kammen, the editor of Environmental Research Letters, has explicitly endorsed Cook's refusal to release all data, against the journal policy. No word yet from the University of Queensland or the Institute of Physics.

    Three new data sets were released. Ratings 4a and b are now available. This reveals that they had another look at 1000 papers, re-rated 5, and scaled this up to 40 for the entire sample. That's fine.

    Paper IDs were released too.

    Most importantly, the ratings themselves are now there. The data are in order of rating. Each record has the paper ID, the rater's rating, the rater's topic, the final rating and the final topic. The data are organized in such a way that I cannot do much with them using the software on my laptop, so a full analysis will have to wait. A number of things are striking, however (and explain why the data resist analysis). The paper says that each abstract was rated at least twice and at most thrice. In fact, some abstracts were rated only once, and others were rated four or five times, which implies that there are discrepancies between the supposedly final ratings.

    For the final 3196 ratings (some 1500 abstracts), there are no original ratings at all -- only final ratings, with discrepancies.

    In my previous comment, I examined the data -- ordered by year and title -- for inexplicable patterns. Some have argued that ordering by year and title induces the patterns found. I am not convinced this is true. Be that as it may, if we take the paper at its word, abstracts were presented in random order to the raters. Assuming that is true (I have yet to check), characteristics of the abstracts cannot induce a pattern in the newly released data. Even if there is a trend towards greater endorsement over time (a claim that the paper makes but fails to support), that trend would be destroyed by randomization. If there is autocorrelation induced by title (a rather bizarre suggestion by some commentators), that too would be removed by randomization.

    So, there should be no pattern in the new data. The next three graphs show, in the top panel, the 50-abstract rolling mean rate (in blue), the 500-abstract rolling mean rate (in red) and the trend (in black); in the middle panel the standard deviation; and in the bottom panel the skew. Is that a pattern I see?
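
    A sketch of how such panels can be produced, assuming the initial ratings are in a numeric array in rating order; this is an illustration, not the original plotting code, and the window for the lower panels is a guess.

    ```python
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    def rolling_panels(ratings):
        s = pd.Series(np.asarray(ratings, dtype=float))
        fig, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, figsize=(8, 9))
        ax1.plot(s.rolling(50).mean(), "b", label="50-abstract mean")
        ax1.plot(s.rolling(500).mean(), "r", label="500-abstract mean")
        ax1.plot(s.index, np.polyval(np.polyfit(s.index, s, 1), s.index), "k", label="trend")
        ax1.legend()
        ax2.plot(s.rolling(500).std(), "r")     # 500-abstract window assumed
        ax2.set_ylabel("standard deviation")
        ax3.plot(s.rolling(500).skew(), "r")
        ax3.set_ylabel("skewness")
        return fig
    ```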





  6. Google graced me with computing power stronger than an 80286, so I recovered all data from the Tol Poll. See initial discussion.

    There were 2603 valid answers and 12531 answers from bots. We now know there were two bots, one green and one brown.

    Update of results
    Some people have better name recognition than others.

    Some people are more popular than others.
    Some people are popular with some, but less so with others.
    Some people are popular across the climate policy spectrum, others are not.
    New results
    The order in which questions are asked is always an issue. On August 3 & 4, Judy and Tamsin swapped position. On August 4, Dana and Steve swapped position.

    This had a statistically significant effect on the opinions about Judy, Steve and Tamsin, but not on those about Dana, Gavin and Joe. Tamsin shows the strongest effect: people think differently of her when she comes after Dana than when she comes after James.
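
    For anyone who wants to redo the order-effect test on the released data, one straightforward option (a sketch, not necessarily the exact test used here) is to compare the ratings a person received under the two question orderings with a Mann-Whitney U test.

    ```python
    from scipy.stats import mannwhitneyu

    def order_effect(ratings_order_a, ratings_order_b):
        """Compare the 1-5 ratings one person received under two question orderings."""
        stat, pval = mannwhitneyu(ratings_order_a, ratings_order_b, alternative="two-sided")
        return stat, pval
    ```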




    All valid data, and aggregate data for the bots, can be found here.





  7. The Tol Poll is a direct result of the series of op-eds in the Guardian, organized by Alice Bell, on the relationship between the environmental movement and environmental science, and particularly of Tamsin Edwards’ call for experts to talk only about their area of expertise. In the ensuing discussion, many noted just how nasty the climate debate has become, and Chairman Al, the Climate Chimp, suggested a poll on nastiness.

    So I did, as a joke. Putting together an internet poll is trivial. (Designing a good poll is a lot of work.)

    The poll itself is simple: rate 12 people who are prominent in the British online climate debate on a scale of 1 to 5, where 1 stands for “very nasty” and 5 stands for “very friendly”. There is a bonus question that places the respondent on the political spectrum, asking them to rate themselves on a scale of 1 to 5 where 1 stands for “very worried about the impacts of climate policy” and 5 for “very worried about the impacts of climate change”. (Some people argued these are different things, which of course they are, but I was not after identifying the agony aunts who worry about everything.)

    The expected result: Some people are either loved or loathed, depending on the (in)congruence of their political position and that of the respondent, whereas other people are accepted by both sides of the debate.

    The best I hoped for were some giggles, and perhaps a data set that could be used for a class in forensic statistics (as the framing of the poll invites dishonest answers).

    I had not counted on Anthony Watts pushing the poll. I had not counted on someone writing a bot to flood the poll with fake results pushing a particular position, and someone else writing a bot to support the opposite position. Or maybe it was the same bot, as its author realized people saw through the ruse. The software I used, Google Docs, is not really suited for handling this amount of data.

    As a courtesy to all those who took the time to fill out the poll and who discussed it (in grave, jocular or puzzled tones), here are some of the results.
