Monday, March 23, 2009

Cheating with publication metrics on SSRN

With the advent of online bibliographic databases like RePEc and SSRN, with the increasing importance of readership statistics in the evaluation of researchers, and with said evaluation becoming more and more competitive, it is no surprise that some researchers try to find ways to manipulate these statistics. There is plenty of anecdotal evidence that such manipulations are taking place successfully at SSRN, through self-downloads and in particular through teachers telling their classes to download their works. Some even use social forums like Fark or Facebook to ask total strangers to increase their counts. And it is commonplace to send emails to friends and family encouraging downloads. But that is only anecdotal evidence. Benjamin Edelman and Ian Larkin now provide better evidence that this is taking place. Their paper is on SSRN, so go and increase their download count...

Their strategy is to identify when in the career of a researcher increased downloads would matter and to test whether downloads indeed increase at that time. Most of the cheating revolves around SSRN's "top ten" lists. There are many of those, and thus it is relatively easy to get on one with a little help, especially when one is already close to the 10th position. Edelman and Larkin also find that one or two years before a career move, downloads "mysteriously" tend to increase. However, they find no evidence of such an increase ahead of tenure decisions. But I think this underestimates the problem: the study identifies fraud from the historical SSRN logs, using the downloads that SSRN subsequently flagged as suspicious under its new rules. Those rules are very imperfect, so the study cannot detect manipulation that evaded them, which anecdotal evidence shows is happening.

Fortunately, in Economics people do not rely on SSRN statistics, maybe because of this perception that they can be fraudulent (though SSRN claims to have cleaned up its act). RePEc statistics are followed much more closely, not least because RePEc is perceived as the "good guy" (whereas SSRN is profit seeking), because it has been open about its statistics from the start, because many safeguards have been put in place to prevent fraud from being successful, as explained in a RePEc blog post, and because downloads are not the only relevant metric. In particular, RePEc does not push its download statistics on every page. And the fact that SSRN statistics are updated in real time is just too tempting (refresh an SSRN page to see what I mean).


Anonymous said...

I find it indeed interesting that I regularly get emails enticing me to download stuff on SSRN, but never for RePEc, yet everybody takes RePEc rankings much more seriously. You would think if RePEc stats are more important, people would try to cheat there, but no.

Anonymous said...

I remember the Fark thread where we brought the SSRN server down. Good times...

More seriously, SSRN needs to deemphasize these top 10 lists and prevent the counters from incrementing live. This is just giving the wrong incentives.

Anonymous said...

Seriously, who in their right mind would believe any of the statistics SSRN spouts?

Anonymous said...

RePEc statistics are quite unreliable as well. If I compare the download statistics of the RePEc archives I am running ( among them with the RePEc statistics, these differ by a factor of 100 or so.

There are several reasons for that:

1. RePEc counts only those downloads going through some of the RePEc services (EconPapers, IDEAS...), yet by far the most downloads occur through Google. The purpose of RePEc statistics is to measure the usage of RePEc services, not total downloads.

2. RePEc counting is much more restrictive than standard counting that follows the "COUNTER" standard. But whatever standard you use, you have the problem that downloads by two different persons may be counted as one (because they use the same proxy), or that two downloads by one person are counted as two because they are made from different computers, or from the same computer using a different Internet address, or are separated by a sufficiently large time span. Whatever you do, download statistics can be manipulated.
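
The ambiguity in point 2 can be illustrated with a minimal sketch. It is not the actual rule used by RePEc, SSRN, or COUNTER; it only assumes a simple filter that ignores repeat requests for the same paper from the same Internet address within a fixed window. The function name, the 24-hour window, the example addresses, and the paper identifier are all hypothetical.

```python
from datetime import datetime, timedelta


def count_downloads(events, window=timedelta(hours=24)):
    """Count downloads per paper, skipping repeats from the same
    address within `window` of the last counted download.

    `events` is an iterable of (timestamp, address, paper_id) tuples.
    This is an assumed rule for illustration, not any service's real one.
    """
    last_counted = {}  # (address, paper_id) -> time of last counted download
    counts = {}
    for when, address, paper in sorted(events):
        key = (address, paper)
        previous = last_counted.get(key)
        if previous is not None and when - previous < window:
            continue  # treated as a repeat by the same reader
        last_counted[key] = when
        counts[paper] = counts.get(paper, 0) + 1
    return counts


t0 = datetime(2009, 3, 23, 9, 0)
events = [
    # Reader 1 behind a university proxy: counted.
    (t0, "192.0.2.1", "paper-A"),
    # Reader 2 behind the same proxy: wrongly dropped as a repeat.
    (t0 + timedelta(minutes=5), "192.0.2.1", "paper-A"),
    # Reader 1 again, from a home address: counted a second time.
    (t0 + timedelta(minutes=10), "198.51.100.7", "paper-A"),
    # Reader 1 again from the proxy, two days later: counted a third time.
    (t0 + timedelta(days=2), "192.0.2.1", "paper-A"),
]

print(count_downloads(events))
```

Here two readers make four requests and the filter reports three downloads: it undercounts the shared proxy and overcounts the reader who changes address or simply waits. Tightening or loosening the window only shifts the error from one case to the other.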

As a matter of etiquette, a professor who asks his students to read his paper should give the direct address of the server where the paper is located, rather than the RePEc address, so that the reading is not counted as a download.

For hiring decisions, only the citations may be of relevance, and here RePEc is quite good, as it covers a large range of publications. Google Scholar is even more comprehensive, but you don't get nice statistics. Thomson (Web of Science) is, in my view, quite problematic.

Unfortunately, hiring committees commit the same mistake banks made in buying assets according to the recommendations of the rating agencies, rather than looking for themselves. Our rating agencies seem to be the journals and these doubtful statistics.