[visionlist] Plagiarism checks in Empirical Manuscripts

Alex Holcombe alex.holcombe at sydney.edu.au
Tue Jul 11 15:42:07 -05 2017


Elsevier’s and other publishers’ efforts to detect “self-plagiarism” are an instance of text mining the world’s scientific literature. Tom points out that this is also useful for detecting data duplication and fraud. Unfortunately it cannot easily be used for that: Elsevier shuts down independent researchers who use their journal subscriptions to investigate fraud (http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ ; http://www.nature.com/news/text-mining-block-prompts-online-response-1.18819). Text mining the scientific literature could yield thousands of discoveries, about both fraud and new connections between molecules, genes, and diseases, but it can’t be done because major journal publishers own the content and are trying to monetize it all for themselves (https://blogs.ch.cam.ac.uk/pmr/2017/07/11/text-and-data-mining-overview).

“Self-plagiarism” also puts publishers at legal risk, because they publish all our articles under restrictive copyright: it can be a copyright violation for a publisher to print text identical to an earlier paper by the same author that appeared with a different publisher. In an email from the journal to Peter Tse, the issue was framed as protecting the author, but there was also this sentence: “Another issue to be borne in mind is the matter of copyright in extensive text duplication.”

Thus the traditional system of publishers owning the copyright to our work is both preventing new discoveries (which will have to wait until the publishers find a way to use text mining to maintain or increase their profits) and creating ridiculous busywork for ourselves. Yesterday I attended a university press publishing conference where Kevin Stranack demoed Open Journal Systems version 3, which has already been released and looks significantly easier to use than ScholarOne/Manuscript Central, the system that expensive subscription journals use. OJS3 allows the creation of journals at very low cost (it already underpins thousands of journals, such as Glossa, which flipped from Elsevier). Unfortunately I seemed to be the only researcher at the conference, but I’m tweeting about it and will add some related information to FairOA.org.

Alex




On 7/12/17, 03:03, "visionlist on behalf of Ghuman, Avniel" <visionlist-bounces at visionscience.com on behalf of ghumana at upmc.edu> wrote:

    Dear all and Tom,
    
    I would argue that there is a very big difference between borrowing previous text that you've written and data duplication. There is little question regarding the ethics of data duplication: it is not ethical, unless you are talking about new analyses of previous data that are clearly flagged as previously published. The ethics of "self-plagiarism", on the other hand, I would argue are still being debated, at least with regard to reusing brief passages, particularly in the introduction or methods sections (indeed, this kind of self-plagiarism is not considered research misconduct by US agencies; see https://ori.hhs.gov/avoiding-plagiarism-self-plagiarism-and-other-questionable-writing-practices-guide-ethical-writing and https://ori.hhs.gov/plagiarism-16a). Thus, we should be careful not to conflate the two.
    
    One question I have often had is why there is such a strong dislike of simply using quotation marks, particularly if you are just quoting yourself, because then you are not even borrowing ideas from others. Or, for example, in a methods section one could just add a sentence that says something like "Below we restate the methods from [paper X] with the specific passages modified to reflect this study." It seems to me that the major concern with this kind of self-plagiarism is making sure the reader understands that the sections aren't entirely new, so why not just have an explicit callout to that fact?
    
    Best wishes,
    Avniel
    
    Avniel Ghuman, Ph.D.
    Laboratory of Cognitive Neurodynamics
    Director of MEG research
    Assistant Professor of Neurological Surgery, Neurobiology, Psychiatry, and Psychology
    Faculty in the Center for the Neural Basis of Cognition and the Center for Neuroscience
    University of Pittsburgh
    http://www.lcnd.pitt.edu
    
    
    
    From: Tom Wallis <tsawallis at gmail.com>
    Date: Tuesday, July 11, 2017 5:53 AM
    To: Jim Ferwerda <jaf at cis.rit.edu>, "Peter.U.Tse at dartmouth.edu" <Peter.U.Tse at dartmouth.edu>
    Cc: "visionlist at visionscience.com" <visionlist at visionscience.com>
    Subject: Re: [visionlist] Plagiarism checks in Empirical Manuscripts
    
    Hi all,
    
    I'm sympathetic to Peter's desire to avoid "busywork" in re-writing parts of introductions, and of course it's pointless to re-write standard methods (as in Malte's original comment). However, I don't think the guidelines against self-plagiarism are so easily dismissed ("I can't steal from myself").
    
    In my mind they exist to reduce the risk of CV-padding by re-using or "salami-slicing" research work into multiple outputs. Further to Jim's comment, this CV-padding via (self-)plagiarism is by no means limited to students at unknown universities trying to get ahead – see for example the recent highly-publicised implosion of Brian Wansink's Cornell Food and Brand Lab (detailed here: http://www.timvanderzee.com/the-wansink-dossier-an-overview/). Wansink's work includes numerous cases of blatant self-plagiarism, both in text (some articles sharing up to 50% of their text) and in data duplication. Not only does this practice disadvantage scientists who don't engage in it ("candidate A has many more papers than candidate B!"), it can also create a false impression of the empirical support for some theory or guideline ("over 50 studies show that X does Y!"). Without guidelines against self-plagiarism, there would be no way to explicitly police these practices.
    
    While I think it's important that these guidelines exist, I agree with others that they (and automated plagiarism-detection software) should be applied with sufficient editorial common sense. Re-used Materials and Methods, and a paragraph in the introduction (with appropriate citation), seem fine when the bulk of the paper presents new results and ideas.
    
    Best
    
    Tom
    
    --
    Thomas Wallis, PhD
    Project Leader, SFB 1233 Robust Vision
    AG Bethge
    Center for Integrative Neuroscience
    Eberhard Karls Universität Tübingen
    Otfried-Müller-Str 25
    72076 Tübingen
    Germany
    www.tomwallis.info
    
    
    On Tue, Jul 11, 2017 at 3:31 AM, Jim Ferwerda <jaf at cis.rit.edu> wrote:
    I too had a “funny” experience with journal plagiarism.
    
    A few years back I wanted to look at a paper I had presented and published several years before at Human Vision and Electronic Imaging. Rather than dig through my computer, I googled the paper title: “Three Varieties of Realism in Computer Graphics”. The paper came up as the first hit, but several hits down was a paper titled “Hi-Fidelity Computer Graphics”, published in the “International Journal of Innovative Research in Technology”. Intrigued, I downloaded the paper and started reading. After a page of introduction, the text became strangely familiar. The “authors” of the paper had cut-and-pasted four pages of the eight-page document directly from my paper!
    
    After some further investigation I found that the three authors (M.S. students at an obscure Indian university) had done this several times, “borrowing” from different published papers and switching up the author order, with the goal of padding their CVs. Some further reading revealed that, sadly, this is a widespread practice in some countries and fields, and that there are many “pay-to-play” journals that will publish whatever they are given as long as the publishing fee is received.
    
    So who knows, you, like me, might have more co-authors than you think!
    
    Cheers
    
    -Jim
    
    p.s. In case you're interested in padding /your/ CV, the website for the journal is
    
    http://ijirt.org/
    
    Submission is easy. Make sure to have your credit card ready!
    
    
    On Jul 10, 2017, at 3:46 PM, Robert Sekuler <sekuler at brandeis.edu> wrote:
    
    I had a funny experience being flagged for possible plagiarism.
    
    The journal's very thorough plagiarism detector reported that a high proportion of my submission was duplicated from a document that it found on the web.
    
    No surprise, though. Turns out the duplicated article was an early draft of the submitted article that I had posted on my website. And the editor immediately understood what had happened. No harm done.
    
    Bob Sekuler
    
    ----------------------------
    Professor of Neuroscience
    F & L Salvage Professor of Psychology
    Brandeis University
    
    
    On Jul 10, 2017, at 14:46, Horowitz, Todd (NIH/NCI) [E] <todd.horowitz at nih.gov> wrote:
    
    To be blunt, I would like to know the name of the journal. Any journal with such shoddy editorial practices should be avoided.
    
    Thanks
    Todd Horowitz
    
    
    From: "Persike, Malte" <persike at uni-mainz.de<mailto:persike at uni-mainz.de>>
    Date: Monday, July 10, 2017 at 10:59 AM
    To: "visionlist at visionscience.com<mailto:visionlist at visionscience.com>" <visionlist at visionscience.com<mailto:visionlist at visionscience.com>>
    Subject: [visionlist] Plagiarism checks in Empirical Manuscripts
    
    Dear Vision Community,
    
    During the publishing of a recent manuscript, I received a request from the editorial office to alter a number of sections in said manuscript. The request was triggered by an automated plagiarism check using CrossCheck. The whole process left me so puzzled that I thought I’d share my experience here, combined with a humble request for a broader debate about the issue of plagiarism in empirical research.
    
    First, what happened? The report contained a whopping 24 different items, each asserting plagiarism of the works of others. The email from the editorial office was phrased accordingly. It asked us to “amend the affected sections by either identifying the fact that it has been reproduced or by using original words”, thus presuming all 24 instances of supposed plagiarism to be veridical. Most of them were not.
    
    After a very thorough debate about all 24 items with my co-authors and an expert on good scientific conduct at our university’s library, 22 of the 24 items were discarded. The remaining 2 items were far from verbatim copies of whole sections; they were small parts of larger sentences, accompanied by explicit citations of the sources from which those parts were derived. The other 22 items were discarded not for subjective reasons but because of obvious glitches in the plagiarism-checking algorithms. That amounts to a false-positive rate of 91.7% (22 of 24). I’ll describe some of the sillier instances at the end of this text, but that is not the reason for my posting here.
    
    Instead, the point I would very much like to discuss with you is the handling of possible plagiarism in empirical studies. Do we have an agreed code of conduct for authoring pieces of empirical science? Let me highlight only a few points.
    
    1) How do we treat Materials and Methods? The Stimuli section will necessarily contain similar phrasings when reporting on research that uses identical paradigms. The Apparatus section will also be quite similar between related studies, as will the Participants section. The same holds for the Ethics Statement and the Measures and Analysis. Is it really desirable that we come up with ever-so-slightly different formulations for identical things, only to avoid verbatim copies? Are there not limits to how a temporal 2-AFC task can be described with appropriate brevity? And would it – particularly for Methods and Results – perhaps even be prudent to stick to a rather formulaic language protocol in order to make comprehension easier? I for one would certainly not wish to read a Methods section that goes “Stimuli were created according to XY (2004). Handling of participants was similar to XY (2010). Apparatus was as described in XY (1998). Task was taken from XY (2001). Analysis and measures are according to XY (1992).” That does not help me efficiently understand what was done.
    
    2) How do we handle self-citations? Many of us work on the same topics over long stretches of time, sometimes decades. Good scientific research usually means advancing present knowledge step by step, pulling only very few levers at once for each new experiment. Is it not to be expected that at some point we arrive at concise, well-formulated, and maximally comprehensible ways to verbally introduce specific concepts? Is it really necessary that we find ever new ways to phrase the exact same ideas?
    
    3) Is it the prime virtue of empirical research to be phrased originally? Is it not first and foremost the results and their implications that define original and interesting work? Even if we set high standards of originality for the prose in empirical articles, how should brief verbatim copies be handled? Let me give one example. Suppose the abstract of a paper reads “We used faces, non-face objects, Gabors, and colored Gaussian blobs to investigate the role of stimulus complexity in visual processing.” I find it highly questionable that a later sentence like “XY (2001) used faces, non-face objects, Gabors, and colored Gaussian blobs to investigate the role of stimulus complexity.”, written by another author, should qualify as plagiarism or require quotation marks. Why should the latter author attempt to rephrase something that had already been so concisely summarized by the original authors?
    
    4) Is there a consensus that automated plagiarism checking without editorial oversight is the yardstick against which to evaluate the originality of scientific manuscripts?
    
    I’d very much like to have an informed discussion with you, in part because I imagine that the plagiarism report I received may turn out to be the rule rather than the exception, in which case we might all face hours of checking and re-checking during future publishing attempts.
    
    Kind regards to all of you
      Malte Persike
    
    --
    
    And here are some of the highlights from the plagiarism report issued to me.
    
    (i) My institutional address “Johannes Gutenberg University Mainz, Wallstr. 3, D-55122 Mainz, Germany” and the immediately following heading “Abstract” were flagged as plagiarism.
    
    (ii) The e-mail addresses of the authors were flagged as plagiarism.
    
    (iii) Citations and year numbers, e.g. “(Persike et al., 2015)”, were included in the word count for multiple items. This had ridiculous consequences. To name only two of the most blatant: one plagiarism item was defined by the words “Author et al., 1993 […] et al. […] the […] et al., 1997” (with a few unflagged words in between); another item was defined by the words “Author1 and Author2, 1994; Author3 and Author4, 2001) and”.
    
    (iv) Mathematical symbols, brand names, and notational terms were included in the report. One item therefore consisted almost entirely of parts of a mathematical formula, the product names “ViSaGe” and “ColorCal colorimeter”, the brand name “Cambridge Research Systems LLC”, the term “Michelson contrast”, and the phrase “were run in Matlab”.
    
    (v) Many of the report items contained common phrases used in neuroscience research, some of them counted multiple times. For example, one item was defined by the mere phrase “to the ability of the visual system”, counted twice, plus a reference. A quick Google search turned up more than 150,000 hits for this exact phrase, and Google Scholar yields more than a hundred authors who have used it in their works.
    
    (vi) The CrossCheck system invented false positives. One item contained the phrase “V2 neurons are highly selective”; another referred to the phrase “to a particular combination of line components”. These phrases were not copied from anywhere but are original. In fact, they are so original that Google Scholar yields precisely zero search results for each of them. The sources from which these phrases were claimed to be derived do not contain such sequences of words anywhere in their entire texts.
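    
    For readers wondering how glitches like (iii) and (v) can arise at all, here is a minimal sketch in Python of the kind of n-gram (“shingle”) matching that checkers of this family are commonly assumed to use. CrossCheck’s actual algorithm is proprietary, so this is an illustrative assumption rather than its implementation, and both example sentences are invented; the point is only that citation strings and stock phrases match just as strongly as copied prose unless they are filtered out.
    
        import re
    
        def shingles(text, n=5):
            # All n-word sequences ("shingles") in the text.
            words = re.findall(r"[\w.,()]+", text.lower())
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    
        def overlap(submission, source, n=5):
            # Fraction of the submission's shingles also found in the source.
            a, b = shingles(submission, n), shingles(source, n)
            return len(a & b) / max(len(a), 1)
    
        # A reference plus a common phrase, with no copied prose at all
        # (both sentences are made up for illustration):
        sub = "Thresholds speak to the ability of the visual system (Persike et al., 2015)."
        src = "Acuity is linked to the ability of the visual system (Persike et al., 2015)."
        print(f"flagged overlap: {overlap(sub, src):.0%}")  # about 78%, yet nothing was copied
    
    Unless such a matcher excludes references, addresses, and phrases that are common across the literature, it will keep producing items like the ones above.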
    
    
    --
    Dr. Malte Persike
    
    Department for Statistical Methods
    Psychological Institute
    Johannes Gutenberg University Mainz
    Wallstr. 3
    D-55122 Mainz
    
    phone:  +49 (6131) 39 39260
    fax:    +49 (6131) 39 39186
    mobile: +49 (1525) 4223363
    
    
    _______________________________________________
    visionlist mailing list
    visionlist at visionscience.com
    http://visionscience.com/mailman/listinfo/visionlist_visionscience.com
    


