1. In my first post on rater bias, I noted that Cook's ratings were done in two periods with a break in between. Ratings differ before and after the break, and the raters had the opportunity to inspect the results of the first period during the break. This would invalidate Cook's data.

    However, Steve McIntyre and Brandon Shollenberger protested that the second rating period was dominated by tie-break ratings. Tie-breaks are a select part of the data sample, so their results should differ. Indeed, if we plot the chi-squared statistic against time, testing the first/second/third ratings on a particular day against all first/second/third ratings (a sketch of this test follows the figures), nothing untoward appears in the later period of active rating.
    [Figures: daily chi-squared statistic over time for the first ratings, the second ratings, and the third (tie-break) ratings]
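    A minimal sketch of that daily test, assuming a pandas DataFrame df with hypothetical columns date and rating (the 1-7 code); each day's rating counts are set against those of all other days in a chi-squared test of independence:

    ```python
    import pandas as pd
    from scipy.stats import chi2_contingency

    def daily_chi2(df: pd.DataFrame) -> pd.Series:
        """Chi-squared statistic per day: that day's ratings vs. all other days'.

        Column names 'date' and 'rating' are assumptions, not Cook's.
        """
        stats = {}
        for day, day_df in df.groupby('date'):
            rest = df.loc[df['date'] != day, 'rating']
            # 2 x K contingency table of counts per rating level
            table = pd.DataFrame({
                'day': day_df['rating'].value_counts(),
                'rest': rest.value_counts(),
            }).fillna(0).T
            table = table.loc[:, table.sum() > 0]  # drop unused rating levels
            stat, p, dof, expected = chi2_contingency(table)
            stats[day] = stat
        return pd.Series(stats).sort_index()
    ```

    Plotting the resulting series against the date gives figures like the ones above.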
    That said, the tie-break ratings are not without blemish (apart from the fact that 7 out of 46 days are above the 99th percentile). Comparing the first ratings that were not challenged against those that were, I find that their distributions differ (chi2 = 80, p < 0.01). Ditto for the second ratings (chi2 = 29, p < 0.01). This is as it should be. However, comparing the unchallenged ratings to the tie-breaks, a much larger difference appears (chi2 = 393, p < 0.01). That is, the tie-breaks (in the second period) moved away from the original ratings (in the first period). Indeed, in 74 cases the third rating lies outside the bracket of the first and second ratings (a check sketched below). And some abstracts were re-rated even though the first two ratings agreed.
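    The bracket check is straightforward. A minimal sketch, assuming a DataFrame reruns with one row per re-rated abstract and hypothetical columns first, second, and third:

    ```python
    import pandas as pd

    def out_of_bracket(reruns: pd.DataFrame) -> int:
        """Count tie-breaks outside the range spanned by the first two ratings."""
        lo = reruns[['first', 'second']].min(axis=1)
        hi = reruns[['first', 'second']].max(axis=1)
        return int(((reruns['third'] < lo) | (reruns['third'] > hi)).sum())
    ```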

    In particular, the tie-break ratings counted 44% and 25% fewer rejections of the hypothesis of anthropogenic warming than the first and second ratings, respectively, but 8% fewer and 16% more endorsements than the first and second ratings, respectively.

    Recall that the tie-break ratings took place after the raters had had the opportunity to look at their results.



  2. I wrote earlier about the latest data release from the Consensus Project, highlighting the frantic ratings by one of Cook's helpers, the lack of inter-rater reliability, and the systematic differences in ratings between days. Below, I explore the latter a little further.

    Cook's original ratings run from 1 to 7, with 4 neutral. I rescaled these to +3 to -3, with 0 neutral; that is, score = 4 - rating. Adding up all rescaled scores, we find 11594. The number is positive because Cook et al. found that more papers support (+3 to +1) than reject (-1 to -3) the hypothesis that human activity contributed to the observed global warming.
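    The rescaling and total are a one-liner, assuming a pandas Series ratings holding the original 1-7 codes:

    ```python
    import pandas as pd

    # 1 -> +3, 4 -> 0, 7 -> -3: endorsements positive, rejections negative
    def total_score(ratings: pd.Series) -> int:
        return int((4 - ratings).sum())
    ```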

    Now that we have date stamps, we can compute the same score per day. This is shown in the figure below. The number goes up and down with the number of abstracts rated on that particular day.

    I bootstrapped the daily data and computed the same score and its 95% confidence interval. This, too, is shown in the figure below, with negative deviations in brown (a bias towards rejection) and positive deviations in green (a bias towards endorsement).
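    A sketch of the bootstrap, under my assumption that the null distribution for a day is obtained by repeatedly drawing that day's number of ratings from the pooled rescaled scores:

    ```python
    import numpy as np

    def daily_null_ci(pool, n_day, n_boot=10_000, seed=0):
        """95% interval for a day's total score if its n_day ratings were
        drawn at random from the pooled rescaled scores (the null)."""
        rng = np.random.default_rng(seed)
        totals = rng.choice(pool, size=(n_boot, n_day), replace=True).sum(axis=1)
        return np.percentile(totals, [2.5, 97.5])
    ```

    A day whose observed total falls outside this interval deviates from the average day more than chance would suggest.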

    For the first period, the observed scores move about in their confidence intervals, sometimes higher than expected and sometimes lower. Then there is a period in which few papers were rated -- followed by a third period in which paper ratings systematically deviated towards endorsement.

    The second figure confirms this. It shows the histogram of ratings for the first period (January-April, including the quiet month of April) and the third period (May-June). UPDATE: The hypothesis that the two distributions are the same, or identical to the joint distribution, is rejected at the 1% significance level.
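    A sketch of the update's test against the joint distribution, assuming arrays early and late (names mine) of rating counts per level for the two periods; the two-sample version of the test is the same chi-squared test of independence used above.

    ```python
    import numpy as np
    from scipy.stats import chisquare

    def period_vs_joint(early, late):
        """Goodness-of-fit of each period's rating histogram to the pooled one."""
        early, late = np.asarray(early, float), np.asarray(late, float)
        joint = early + late
        out = {}
        for name, counts in (('early', early), ('late', late)):
            expected = joint / joint.sum() * counts.sum()  # rescale to period size
            out[name] = chisquare(counts, f_exp=expected)
        return out
    ```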
    UPDATE 2: The story continues.


  3. The saga of the 97% consensus continues. My re-analysis was based on a partial data release. Notably, rater IDs and time stamps were missing. The former are needed to test inter-rater reliability, the latter to test for fatigue.

    Thanks to Brandon Shollenberger, we now have rater IDs and date stamps.

    The rater IDs may or may not be protected by a confidentiality agreement. If so, U Queensland has a problem with internet security. If not, U Queensland has a problem with telling the truth.

    The table below shows a test for inter-rater reliability; a code sketch of the test follows the table. The first column has the number of abstracts rated (which ranges from 2 to almost 4000). The second column has the fraction of ratings ignored in the final assessment (which ranges from 0 to 78%). The next seven columns have the fraction of ratings at each endorsement level, by rater. The second column from the right has the chi-squared statistic for the test of equality of proportions between the respective rater and all raters. The rightmost column has the level of significance: the null hypothesis that a particular rater equals the average rater is rejected at the 1% level (***) for 15 raters, at the 5% level (**) for 2 raters, and at the 10% level (*) for 1 rater. That is, only 6 of the 24 raters do not deviate from the norm.

    Number   Ignored              Endorsement level
    rated    ratings     1      2      3      4      5      6      7     Chi2   Sig.
    2208 3.62% 0.91% 7.38% 31.39% 59.74% 0.50% 0.00% 0.09% 51.251 ***
    966 6.73% 0.83% 18.74% 28.05% 51.24% 0.93% 0.21% 0.00% 155.690 ***
    2 50.00% 0.00% 0.00% 0.00% 100.00% 0.00% 0.00% 0.00% 1.084
    2671 12.32% 1.05% 7.00% 22.69% 68.44% 0.49% 0.26% 0.07% 21.049 ***
    31 6.45% 3.23% 16.13% 22.58% 58.06% 0.00% 0.00% 0.00% 4.947
    3791 7.62% 0.29% 4.46% 15.77% 79.00% 0.40% 0.03% 0.05% 338.960 ***
    615 7.48% 1.95% 11.22% 23.25% 63.09% 0.33% 0.00% 0.16% 18.684 ***
    60 0.00% 0.00% 1.67% 23.33% 73.33% 0.00% 1.67% 0.00% 12.296 *
    1945 6.53% 0.41% 6.22% 25.81% 66.89% 0.21% 0.31% 0.15% 24.090 ***
    2940 3.84% 0.44% 7.35% 27.11% 64.18% 0.61% 0.20% 0.10% 14.519 **
    1707 5.98% 0.88% 11.19% 30.40% 56.65% 0.53% 0.18% 0.18% 54.096 ***
    1266 9.48% 0.95% 10.58% 22.27% 65.64% 0.32% 0.16% 0.08% 12.646 **
    22 13.64% 0.00% 9.09% 13.64% 77.27% 0.00% 0.00% 0.00% 2.042
    9 77.78% 0.00% 77.78% 22.22% 0.00% 0.00% 0.00% 0.00% 57.291 ***
    155 5.81% 1.29% 13.55% 36.13% 48.39% 0.65% 0.00% 0.00% 19.696 ***
    1739 3.05% 0.92% 10.01% 35.02% 53.48% 0.46% 0.06% 0.06% 110.504 ***
    314 11.78% 1.91% 7.96% 42.99% 45.22% 1.59% 0.32% 0.00% 70.223 ***
    93 5.38% 0.00% 20.43% 21.51% 58.06% 0.00% 0.00% 0.00% 18.502 ***
    6 50.00% 0.00% 16.67% 33.33% 50.00% 0.00% 0.00% 0.00% 0.947
    3968 3.33% 0.76% 7.06% 22.28% 68.88% 0.68% 0.25% 0.10% 33.751 ***
    17 29.41% 0.00% 17.65% 35.29% 47.06% 0.00% 0.00% 0.00% 3.525
    2 0.00% 0.00% 0.00% 0.00% 100.00% 0.00% 0.00% 0.00% 1.084
    2191 21.82% 1.60% 12.78% 23.82% 60.43% 0.96% 0.37% 0.05% 84.415 ***
    111 47.75% 0.90% 13.51% 50.45% 32.43% 2.70% 0.00% 0.00% 59.313 ***
    26829 7.67% 0.81% 8.44% 25.07% 64.85% 0.56% 0.18% 0.09%
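    (The bottom row is the all-rater average.) A sketch of the test behind the table, assuming a DataFrame df with hypothetical columns rater and rating. The post compares each rater to the all-rater average; the sketch compares each rater to all other raters, which avoids counting a rater against him- or herself but is otherwise the same idea.

    ```python
    import pandas as pd
    from scipy.stats import chi2_contingency

    def rater_tests(df: pd.DataFrame) -> pd.DataFrame:
        """Chi-squared test of each rater's rating proportions vs. the rest."""
        rows = []
        for rater, sub in df.groupby('rater'):
            table = pd.DataFrame({
                'rater': sub['rating'].value_counts(),
                'others': df.loc[df['rater'] != rater, 'rating'].value_counts(),
            }).fillna(0).T
            table = table.loc[:, table.sum() > 0]  # drop unused rating levels
            stat, p, dof, expected = chi2_contingency(table)
            sig = '***' if p < 0.01 else '**' if p < 0.05 else '*' if p < 0.10 else ''
            rows.append({'rater': rater, 'n': len(sub), 'chi2': stat, 'sig': sig})
        return pd.DataFrame(rows)
    ```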

    Time stamps are still unavailable; John Cook has written both that they do exist and that they don't. Date stamps are less informative, but useful nonetheless. The heat map of the number of ratings per day and per rater is shown below. One rater read and classified 765 abstracts in the course of 72 hours.

    Combining date stamps and abstract ratings, the figure below shows the chi-squared test statistic for whether the ratings on a particular day deviate from the average. Abstracts were rated on 76 days. For 16 days, the null hypothesis that ratings are average is rejected at the 1% level of significance; by chance, we would expect this on about 1 day only (76 x 0.01 = 0.76). For 9 days, the null hypothesis is rejected at the 5% level, and for another 4 days at the 10% level. Peculiar rating days are more common towards the end of the rating period. Recall that the raters, survey designers, and analysts are the same people.
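    How unlikely are 16 rejections at the 1% level across 76 days? Treating the days as independent (an assumption), the count of rejections under the null is roughly binomial:

    ```python
    from scipy.stats import binom

    n_days, alpha = 76, 0.01
    print(n_days * alpha)               # about 0.8 days expected by chance
    print(binom.sf(15, n_days, alpha))  # P(16 or more rejections): vanishingly small
    ```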


  4. To Peter Sutherland, Chairman of the London School of Economics and Political Science


    Dear Mr Sutherland,

    One of the employees of the London School of Economics, Mr Robert ET Ward BSc, has been waging a smear campaign against me. The campaign consists of insinuations, half-truths, and outright lies, and takes the form of tweets, blog posts, and letters to journal editors, civil servants, and elected politicians. This campaign has been going on since October 2013.

    I have repeatedly asked Mr Ward to end his campaign and suggested that he instead focus on his job, which is to promote the research of the Grantham Research Institute on Climate Change and the Environment.

    When that failed to produce the desired result, I contacted Professor Nicholas Stern, Lord Stern of Brentford. Lord Stern denied any responsibility for Mr Ward's behaviour, even though the LSE website lists him as the chair of the institute that employs Mr Ward.

    I then contacted the Director of the LSE, Professor Dr Craig Calhoun. I never reached Professor Calhoun, but was stonewalled by his Chief of Staff, Mr Hugh Martin.

    I last tried to contact Professor Calhoun on 2 May 2014.

    Although I never reached anyone in authority at the LSE, Mr Ward fell silent, so I assumed that I had achieved one of my goals and let the matter rest.

    However, on 8 July 2014, Mr Ward resumed his campaign with a letter to the Rt Hon Lamar Smith, chair of the US House of Representatives Committee on Science, Space and Technology.

    I therefore hereby request that you
    1. inform Professor Calhoun that complaints about the behaviour of LSE staff do require his attention;
    2. stop Mr Ward's campaign of smear and character assassination; and
    3. publicly distance the London School of Economics from Mr Ward's campaign and apologise for the damage and distress caused.

    Looking forward to your timely reply, I remain,

    Yours sincerely



    Richard Tol
