When it comes to plagiarism in writing assignments, college students turn more frequently to news sources and paper mills, a comparative study of writing assignments reveals.

The white paper, “Plagiarism and the Web: A Comparison of Internet Sources for Secondary and Higher Education Students,” was released this week by iParadigms, a company that offers plagiarism detection services for educators and publishers.

The study classified 128 million content matches discovered in 33 million student writing assignments submitted to Turnitin between June 2010 and June 2011.

Social and content-sharing sites topped the list for both secondary and college papers in terms of total content matches, with homework and academic sites close behind. Wikipedia still leads as the most popular website for text appropriation, representing 8% of all sources for secondary students and nearly 11% for college students. Yahoo!Answers came in second for both groups.

College students, not surprisingly, show proportionally more use of news sources and paper mills than secondary school students, the study revealed.

The report, which is accompanied by an infographic, is sparse on methodological details, so I contacted Chris Harrick, vice president of marketing at Turnitin. Specifically, I was interested in how they defined a content match and determined whether it was a valid case of plagiarism.

According to Harrick, content matches are made using the company’s proprietary algorithm, meaning that a precise definition could not be provided. The software is designed to detect patterns, not just contiguous word sequences, which allows it to identify passages where the author has edited a piece of text copied from another source.

Not everyone agrees. Economist David E. Harrington argues that Turnitin software can be gamed by what he calls “copying, cloaking, and pasting,” a practice in which simple words and phrases are edited to obscure the original source. If students can edit well enough to escape software detection, then the true frequency of text appropriation is understated in the report.

The report does not specify the distribution of content matches, but with 128 million content matches in 33 million papers, the overall average is somewhere around four matches per paper. We don’t know what percentage of papers contain zero matches, or what percentage are complete copies. It is also unclear how much content is appropriated in each case, or what the typical patterns of appropriation look like within the text. This kind of information would help educators identify students at risk, develop better writing assignments, and train students on the appropriate use of text.

The title of the white paper, “Plagiarism and the Web,” uses the infamous “P-word,” yet the report makes no attempt to distinguish a content match that includes attribution to the original source from one that does not. Text within quotation marks and block quotes is treated like any other text, and there is no attempt to search for citations, footnotes, endnotes or bibliographies in a paper.

For example, the following text, although presented as a block quote and followed by a citation, would be classified as a case of plagiarism for the simple reason that it exists in Wikipedia as well as on other free websites:

Was this fair paper, this most goodly book,
Made to write “whore” upon? What committed?
Othello (IV, ii)

Without attempting to verify whether a piece of matched text is devoid of attribution, it is impossible to separate good scholarship from academic misconduct, which is why I am sensitive to using the word “plagiarism” in the context of this study.

Harrick concedes this point, but notes that sources like Yahoo!Answers, paper mills, and cheat sites are not usually cited as reputable sources.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist. https://phil-davis.com/


6 Thoughts on "Cheat Sites: Where Students Turn to Crib Papers"

I’m fairly sure that unless TurnItIn uses a different algorithm from iThenticate (the version for journal articles), it can distinguish quotes from regular text, as there’s an option ‘exclude quotes’ in the report settings menu. It should also be able to exclude the bibliography, but in my experience with iThenticate the references get accidentally detected as copied from other sources about 10-15% of the time, even when the ‘exclude bibliography’ option is checked. Since counting all the references as plagiarism can make it look like ~20% of a paper is borrowed, and I doubt that anyone went through the 33 million papers to check for this, it may be that the numbers in the report considerably overestimate the actual distribution of plagiarism among student papers.

In any case, considering that this paper was released by iParadigms itself, it seems unlikely that it would put a lot of effort into eliminating false positives (although I’d be very happy to be corrected).

I was on the phone with Chris Harrick and a software programmer yesterday and they confirmed that they could not exclude quotations, in-text references, footnotes and bibliographies, so perhaps they are using a different search algorithm.

However, it is not necessary to have human eyes check 33 million papers when a random sample of a few hundred papers would do. That would give them a sense of how many false positives are included in their results.
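
As a rough illustration of why a few hundred papers would be enough, here is a minimal sketch of the standard margin-of-error calculation for an estimated proportion. It rests on several assumptions of mine that are not in the report: the papers are drawn by simple random sampling, each sampled paper is judged either to contain a false positive or not, and the normal approximation at 95% confidence is acceptable. The function name `margin_of_error` is hypothetical, and the default proportion of 0.5 is the worst case.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for a proportion estimated from a simple random sample.

    Assumptions (not from the report): simple random sampling,
    a yes/no judgement per paper, z=1.96 for roughly 95% confidence,
    and proportion=0.5 as the worst case.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


if __name__ == "__main__":
    # Compare a few candidate sample sizes.
    for n in (100, 400, 1000):
        print(n, round(margin_of_error(n), 3))
```

Under these assumptions, a sample of 400 papers pins down the false-positive rate to within about five percentage points, and the size of the full collection (33 million) makes almost no difference to that figure.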

iThenticate and Turnitin use the same algorithm.

End users of both applications can exclude quotes and bibliographies in similarity reports. If quotes are not indented or do not use quotation marks, they will not be regarded as quotes, so they may appear in the Similarity Report even if “Exclude Quotations” is checked.

This applies to the “Exclude Bibliography” filter as well. The system needs a “tell” that it is reading a bibliography so there needs to be a label that says “Bibliography”, “References”, “Works Cited”, etc. That explains why there might be some properly cited matches appearing in a Similarity Report that is using these filters.

As Phillip states in his piece, this study includes all matches from papers, regardless of filter. So the 128 million matches will contain content that is properly sourced but still matches to the Turnitin database.

Hi Chris, thanks for the clarification. It does seem strange that while iParadigms would surely advocate great care when evaluating TurnItIn or iThenticate reports and deciding whether to accuse an author of plagiarism, the bar for false positives is much lower in your own reports. Promotional pieces are all very well, but I’d actually be much more intrigued to see something that takes your amazing dataset and analyses it with the diligence that the problem of academic misconduct deserves.
