1. A full reconstruction of Cook's 97% nonsensus is still lacking. However, Sou of Bundanga may have unraveled one further mystery.

    In the data that Cook made available, abstract IDs run from 1 to 12,876. The paper says that 12,465 abstracts were downloaded, of which 11,944 were used.* So, 411 abstracts (12,876 minus 12,465) are unaccounted for. 411 is a large number relative to the number of papers that drive the alleged consensus, but, not knowing why the 411 were missing, I only included a puzzled footnote in my Energy Policy paper.

    Here is an explanation. Cook downloaded the abstracts from the Web of Science in two batches. The first batch was the largest. After they had rated all that, there were a good few more recent publications, so a second batch of abstracts was downloaded.

    There was overlap between the first and second batch, and 411 duplicates were removed. So far, so uncontroversial.

    However, if Sou is to be believed, duplicates were removed from the FIRST batch, already rated, rather than from the second batch.

    The missing abstracts are indeed disproportionately concentrated among the lower IDs. This is consistent with Sou's explanation: the default data dump from the Web of Science presents the more recent papers first, and more recent papers are much more likely to appear in both of Cook's data dumps.

    By removing already rated abstracts, Cook created more work and denied an opportunity to test data quality.

    More importantly, Cook replaced ratings from the first rating period with ratings from the second rating period. These ratings are markedly and significantly different. It appears that Cook moved some 40 abstracts from category "3" to "4", again reducing the consensus rate to 97%.
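    To see the mechanics, here is a rough sketch with round, hypothetical counts (not Cook's published totals), assuming the headline rate is computed as endorsements divided by endorsements plus rejections:

    ```python
    # Hypothetical illustration only: round numbers, not Cook's data.
    # Assumes headline rate = endorsements / (endorsements + rejections).
    endorse, reject = 3900, 120

    before = endorse / (endorse + reject)
    # Moving 40 abstracts from category 3 (implicit endorsement) to 4 (neutral)
    # removes them from both the numerator and the denominator.
    after = (endorse - 40) / (endorse - 40 + reject)

    print(f"before: {before:.2%}, after: {after:.2%}")  # both hover around 97%
    ```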

    UPDATE
    Sou offers an alternative explanation. Apparently, Cook queried WoS and downloaded the data in chunks. Some chunks were downloaded twice, or perhaps they were uploaded twice into Cook's database. For this explanation to work, we would have to believe that the data chunks were as small as 342 abstracts, or even 63 abstracts, or maybe even 4. Recall that Cook had 12,000+ abstracts.

    Alternatively, Cook may have split his query, e.g. by discipline. This would lead to sizable data chunks, but the pattern of overlap would be random. Cook's overlaps are concentrated. According to Sou, the missing IDs are (a quick check of these counts appears after the list):
    • IDs 5 to 346 inclusive = 342
    • IDs 1001 to 1004 inclusive = 4
    • IDs 2066 to 2128 inclusive = 63
    • Total = 409; the other two are probably isolated somewhere.
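    As a sanity check on the arithmetic, a few lines of Python (using only the ID ranges quoted above; everything else is illustrative) confirm the counts:

    ```python
    # Count the missing abstract IDs in the ranges reported by Sou.
    missing_ranges = [(5, 346), (1001, 1004), (2066, 2128)]  # inclusive bounds

    counts = [hi - lo + 1 for lo, hi in missing_ranges]
    print(counts)       # [342, 4, 63]
    print(sum(counts))  # 409 -- two short of the 411 unaccounted-for abstracts
    ```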

    I find all this implausible. If true, it would explain why my query returns 13,431 papers, but Cook has only 12,465 papers in his data: Some data chunks were downloaded but not uploaded.

    *Brandon Shollenberger drew my attention to the discrepancy.




  2. Now almost two years old, John Cook’s 97% consensus paper has been a runaway success. Downloaded over 300,000 times, voted the best 2013 paper in Environmental Research Letters, frequently cited by peers and politicians from around the world, with a dedicated column in the Guardian, the paper seems to be the definitive proof that the science of climate change is settled.

    It isn’t.

    Consensus has no place in science. Academics agree on lots of things, but that does not make them true. Even so, agreement that climate change is real and human-caused does not tell us anything about how the risks of climate change weigh against the risks of climate policy. But in our age of pseudo-Enlightenment, having 97% of researchers on your side is powerful rhetoric for marginalizing political opponents. All politics ends in failure, however. Chances are the opposition will gain power well before the climate problem is solved. Polarization works in the short run, but is counterproductive in the long run.

    In their paper, Cook and colleagues argue that 97% of the relevant academic literature endorses that humans have contributed to observed climate change. This is unremarkable. It follows immediately from the 19th century research by Fourier, Tyndall and Arrhenius. In popular discourse, however, Cook’s finding is often misrepresented. The 97% refers to the number of papers, rather than the number of scientists. The alleged consensus is about any human role in climate change, rather than a dominant role, and it is about climate change rather than the dangers it might pose.

    Although there are large areas of substantive agreement, climate science is far from settled. Witness the dozens of alternative explanations of the current, 18 year long pause in warming of the surface atmosphere. The debate on the seriousness of climate change or what to do about it ranges even more widely.

    The Cook paper is remarkable, though, for its lack of quality. Cook and colleagues studied some 12,000 papers, but did not check whether their sample is representative of the scientific literature. It isn’t. Their conclusions are about the papers they happened to look at, rather than about the literature. Attempts to replicate their sample failed: A number of papers that should have been analysed were not, for no apparent reason.

    The sample was padded with irrelevant papers. An article about TV coverage on global warming was taken as evidence for global warming. In fact, about three-quarters of the papers counted as endorsements had nothing to say about the subject matter.

    Cook enlisted a small group of environmental activists to rate the claims made by the selected papers. Cook claims that the ratings were done independently, but the raters freely discussed their work. There are systematic differences between the raters. Reading the same abstracts, the raters reached remarkably different conclusions – and some raters all too often erred in the same direction. Cook’s hand-picked raters disagreed with each other about what a paper said 33% of the time. In 63% of cases, they disagreed with a paper’s own authors about its message.

    The paper’s reviewers did not pick up on these things. The editor even praised the authors for their “excellent data quality”, although neither he nor the referees had had the opportunity to check the data. Then again, that same editor thinks that climate change is like the rise of Nazi Germany. Two years after publication, Cook admitted that data quality is indeed low.

    Requests for the data were met with evasion and foot-dragging, a clear breach of the publisher’s policy on validation and reproduction, yet defended by an editorial board member of the journal as “exemplary scientific conduct”.

    Cook hoped to hold back some data, but his internet security is on par with his statistical skills, and the alleged hacker was not intimidated by the University of Queensland’s legal threats. Cook’s employer argued that releasing rater identities would violate a confidentiality agreement. That agreement does not exist.

    Cook first argued that releasing time stamps would serve no scientific purpose. This is odd. Cook’s raters essentially filled out a giant questionnaire. Survey researchers routinely collect time stamps, and so did Cook. Interviewees sometimes tire and rush through the last questions. Time stamps reveal that.

    Cook later argued that time stamps were never collected. They were. They show that one of Cook’s raters inspected 675 abstracts within 72 hours, a superhuman effort.

    The time stamps also reveal something far more serious. Data were collected for 8 weeks, followed by 4 weeks of data analysis, and then 3 more weeks of data collection. The same people collected and analysed the data. After more analysis, the paper classification scheme was changed and yet more data collected.

    Cook thus broke a key rule of scientific data collection: Observations should never follow from the conclusions. Medical tests are double-blind for good reason. You cannot change how to collect data, and how much, after having seen the results.

    Cook’s team may, perhaps unwittingly, have worked towards a given conclusion. And indeed, the observations are different, significantly and materially, between the three phases of data collection. The entire study should therefore be dismissed.

    This would have been an amusing how-not-to tale for our students. But Cook’s is one of the most influential papers of recent years. The paper was vigorously defended by the University of Queensland (Cook’s employer) and the editors of Environmental Research Letters, with the Institute of Physics (the publisher) looking on in silence. Incompetence was compounded by cover-up and complacency.

    Climate change is one of the defining issues of our times. We have one uncontrolled, poorly observed experiment. We cannot observe the future. Climate change and policy are too complex for a single person to understand. Climate policy is about choosing one future over another. That choice can only be informed by the judgement of experts – and we must have confidence in their learning and trust their intentions.

    Climate research lost its aura of impartiality with the unauthorised release of the email archives of the Climate Research Unit of the University of East Anglia. Its reputation of competence was shredded by the climate community’s celebration of the flawed works of Michael Mann. Innocence went with the allegations of sexual harassment by Rajendra Pachauri and Peter Gleick’s fake memo. Cook’s 97% nonsensus paper shows that the climate community still has a long way to go in weeding out bad research and bad behaviour. If you want to believe that climate researchers are incompetent, biased and secretive, Cook’s paper is an excellent case in point.

    An edited version appeared in the Australian on March 24, 2015

  3. Ben Dean succeeded where I failed: He got a comment on Cook's 97% published at ERL. Dean's trick is simple: Ask a question. He has been following the discussion, so he knew the answer, but by phrasing it the way he did, he avoided criticizing the authors and editors (which, as we know, is not allowed at ERL).

    Dean expressed his surprise that the original paper did not report inter-rater reliability. Cook and Cowtan replied: with a two-year delay, Cook reports that Cohen's kappa is 0.35.

    Kappa is not a particularly rigorous statistic. Rules of thumb, however, hold that kappa should be at least 0.40, and preferably above 0.75 or 0.80.
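    For reference, Cohen's kappa compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch, using made-up ratings purely for illustration:

    ```python
    from collections import Counter

    def cohens_kappa(r1, r2):
        """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
        n = len(r1)
        p_o = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
        c1, c2 = Counter(r1), Counter(r2)
        p_e = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / n**2  # chance agreement
        return (p_o - p_e) / (1 - p_e)

    # Made-up endorsement categories (1-7) for two raters, for illustration only.
    rater_a = [4, 4, 3, 4, 2, 4, 3, 4, 4, 5]
    rater_b = [4, 3, 3, 4, 4, 4, 2, 4, 4, 4]
    print(round(cohens_kappa(rater_a, rater_b), 2))
    ```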

    Cook thus admits that his data are unreliable, something that others (including me) have been saying all along.

    Cook omits that his is a biased estimate of kappa: Kappa measures the rate of agreement between two INDEPENDENT raters. There is no reason to assume that Cook's raters were independent. They were selected from a small group of people who know each other well, and chatted a lot about their ratings with each other.

    Dependent raters would tend to agree more than independent raters would. Therefore, Cook's estimate of kappa is biased UPWARD. The true kappa is smaller than the reported kappa, and Cook's data is even more unreliable than he has now finally admitted.
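    A small simulation illustrates the direction of the bias, under assumed rating behaviour rather than Cook's actual process: if two raters assign categories at random and independently, kappa should be near zero; if the second rater simply copies the first some of the time, kappa is inflated even though nothing about the rating task has improved.

    ```python
    import random
    from sklearn.metrics import cohen_kappa_score

    random.seed(1)
    CATEGORIES = [1, 2, 3, 4, 5, 6, 7]
    N = 10_000

    # Two raters assigning categories at random and independently: kappa ~ 0.
    rater_a = [random.choice(CATEGORIES) for _ in range(N)]
    rater_b = [random.choice(CATEGORIES) for _ in range(N)]

    # A "dependent" rater copies rater A's rating half the time.
    rater_c = [a if random.random() < 0.5 else random.choice(CATEGORIES)
               for a in rater_a]

    print(round(cohen_kappa_score(rater_a, rater_b), 2))  # close to 0
    print(round(cohen_kappa_score(rater_a, rater_c), 2))  # clearly positive
    ```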

  4. The Guardian has written a series of articles about me and my work, a veritable smear campaign.

    I complained and was surprised by the foot-dragging of Guardian staff until, tada, the Press Complaints Commission (PCC) ceased to exist and the Guardian became self-regulated.*

    Anyone with an inkling of understanding of markets and corporations knows that self-regulation is a really, really bad idea. Adam Smith had something to say on that.

    Although the Guardian had agreed with the PCC to take on all unfinished business, I of course had to start again. After more foot-dragging by the Guardian, there is now a final ruling, which somehow fails to refer to all the material I had submitted to the PCC.

    There is a surprise: On one minor point, they ruled in my favour. Huzzah!

    Less surprisingly, the Guardian ruled against Nafeez Ahmed, who had already been fired from the Guardian.

    Note
    * The Guardian did not join the Independent Press Standards Organization (IPSO).

    Shub Niggurath alerted me to the second ratings of rater 4194. In 88% of cases, his second ratings are equal to his first ratings. Averaged over all raters, that ratio is only 68%. What's more, in his first ratings, rater 4194 deviates more from the average than any other rater but one.
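    These self-agreement rates are straightforward to compute from the ratings table. A minimal sketch, assuming a simple list of (rater, first rating, second rating) records; the toy records below are made up:

    ```python
    from collections import defaultdict

    # Toy records: (rater_id, first_rating, second_rating). Values are made up;
    # the real data would have one record per re-rated abstract.
    records = [
        (4194, 4, 4), (4194, 3, 3), (4194, 4, 4), (4194, 2, 3),
        (873, 4, 3), (873, 5, 4), (873, 3, 3), (873, 4, 4),
    ]

    same, total = defaultdict(int), defaultdict(int)
    for rater, first, second in records:
        total[rater] += 1
        same[rater] += (first == second)

    for rater in total:
        print(rater, f"{same[rater] / total[rater]:.0%}")  # per-rater self-agreement

    print(f"all raters: {sum(same.values()) / sum(total.values()):.0%}")
    ```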

    It is as if rater 4194 realized that he was an outlier -- or perhaps he was told -- and just copied the first ratings in his second ratings. If so, this would belie Cook's claim of independent ratings.

    However, rater 873 deviates even more in the first ratings, and stays the course in the second ratings.

    Indeed,
    • testing first ratings against the average,
    • testing second ratings against the average,
    • testing first ratings against second ratings, and
    • testing deviations in the first ratings against deviations between first and second ratings
    all show that Cook's raters seem to have wandered about aimlessly (a sketch of one such test appears below). In particular, rater 4194's 88% is not significantly different from the average of 68%.
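    The third test in the list could, for example, be run as a paired comparison of a rater's first and second ratings. A minimal sketch with made-up ratings, using a Wilcoxon signed-rank test as one plausible choice (the actual analysis may have used a different test):

    ```python
    from scipy.stats import wilcoxon

    # Toy paired ratings for a single rater: first and second rating of the same
    # abstracts. Values are made up; categories 1-7 as in Cook's scheme.
    first  = [4, 3, 4, 4, 2, 4, 3, 5, 4, 4, 3, 4]
    second = [4, 4, 4, 3, 3, 4, 4, 4, 4, 5, 3, 4]

    # Paired test of whether second ratings systematically differ from the first
    # (abstracts rated identically both times drop out of the test).
    stat, p_value = wilcoxon(first, second)
    print(p_value)
    ```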

    The figure below illustrates that. It shows average ratings, first and second, for each individual rater as well as the average first and second ratings. The figure reveals no pattern.

    However, the figure does reveal the quality of the data: 19 out of 21 individuals significantly deviate from the average on the first ratings, 21 out of 21 for the second ratings, and 14 out of 20 individuals rated things differently in the first and second period. Recall that abstracts were allocated randomly, so at the 5% significance level we would expect to see only about 1 spurious deviation out of 20.

    It seems that Cook's data are essentially random numbers.


