In the data that Cook made available, abstract IDs run from 1 to 12,876. The paper says that 12,465 abstracts were downloaded, of which 11,944 were used.* So, 12,876 minus 12,465 = 411 abstracts are unaccounted for. That is a large number relative to the number of papers that drive the alleged consensus, but, not knowing why the 411 were missing, I included only a puzzled footnote in my Energy Policy paper.
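The gap is easy to reproduce. Below is a minimal sketch, assuming the released data sit in a CSV with an integer ID column; the file name and column name are my stand-ins, not necessarily what Cook's release uses.

```python
# Minimal sketch: list the abstract IDs absent from the released data.
# "cook_ratings.csv" and the column "id" are hypothetical stand-ins.
import csv

with open("cook_ratings.csv", newline="") as f:
    present = {int(row["id"]) for row in csv.DictReader(f)}

missing = sorted(set(range(1, 12_877)) - present)
print(len(missing))  # 411 if the file holds the 12,465 downloaded abstracts
```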
Here is an explanation. Cook downloaded the abstracts from the Web of Science in two batches. The first batch was the largest. After all of those had been rated, a good number of more recent publications had appeared, so a second batch of abstracts was downloaded.
There was overlap between the first and second batch, and 411 duplicates were removed. So far, so uncontroversial.
However, if Sou is to be believed, the duplicates were removed from the FIRST batch, which had already been rated, rather than from the second batch.
The missing abstracts are indeed disproportionately concentrated among the lower IDs. This is consistent with that explanation: the default data dump from the Web of Science lists the most recent papers first, and more recent papers are much more likely to appear in both of Cook's data dumps.
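Continuing the sketch above, a crude test of that claim is to count how many of the missing IDs fall in the lower half of the ID range; a random pattern of removals would put roughly half of them there.

```python
# Crude concentration check, reusing `missing` from the sketch above:
# random removals would leave about half of the 411 gaps below the
# midpoint of the 1..12,876 range.
low_half = sum(1 for i in missing if i <= 12_876 // 2)
print(f"{low_half} of {len(missing)} missing IDs lie below ID 6,438")
```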
By removing already rated abstracts, Cook created more work for himself and denied himself an opportunity to test data quality: double-rated abstracts could have served as a consistency check between the two rating periods.
More importantly, Cook replaced ratings from the first rating period with ratings from the second rating period. These ratings are markedly and significantly different. It appears that Cook moved some 40 abstracts from category 3 (implicit endorsement) to category 4 (no position), again reducing the consensus rate to 97%.
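This cannot be verified from the public file, because the first-period ratings were overwritten. If both sets of ratings for the overlapping abstracts ever surfaced, the shift could be tallied along these lines; `first` and `second` are hypothetical mappings from abstract ID to endorsement category, and the entries shown are placeholders.

```python
# Hypothetical tally of rating changes between the two rating periods.
# `first` and `second` map abstract ID -> endorsement category (1-7);
# neither is recoverable from the public data, so the values are placeholders.
from collections import Counter

first = {101: 3, 102: 3, 103: 2}
second = {101: 4, 102: 3, 103: 2}

moves = Counter(
    (first[i], second[i])
    for i in first.keys() & second.keys()
    if first[i] != second[i]
)
print(moves[(3, 4)])  # abstracts moved from implicit endorsement to no position
```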
*Brandon Shollenberger drew my attention to the discrepancy.
UPDATE
Sou offers an alternative explanation. Apparently, Cook queried WoS and downloaded the data in chunks. Some chunks were downloaded twice, or perhaps they were uploaded twice into Cook's database. For this explanation to work, we would have to believe that the data chunks were as small as 342 abstracts, or even 63 abstracts, or maybe even 4. Recall that Cook had 12,000+ abstracts.
Alternatively, Cook may have split his query, e.g. by discipline. This would lead to sizable data chunks, but the pattern of overlap would be random. Cook's overlaps are concentrated. According to Sou, the missing IDs are (run lengths checked in the sketch after this list):
- IDs 5 to 346 inclusive = 342
- IDs 1001 to 1004 inclusive = 4
- IDs 2066 to 2128 inclusive = 63
- Total = 409; the other two are probably isolated somewhere.
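A quick check of those run lengths, assuming Sou's inclusive ranges are quoted correctly:

```python
# Verify the sizes of the three contiguous runs of missing IDs.
runs = [(5, 346), (1001, 1004), (2066, 2128)]
sizes = [hi - lo + 1 for lo, hi in runs]
print(sizes, sum(sizes))  # [342, 4, 63] 409, leaving 2 of the 411 unplaced
```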
I find all this implausible. If it were true, though, it would explain why my query returns 13,431 papers while Cook has only 12,465 papers in his data: some chunks were downloaded but never uploaded.