A Probabilistic Framework for Plagiarism Detection
- Type of thesis: Master's thesis / Diplom thesis
- Internal supervisor: Dr. Thomas Gottron
Plagiarism is the "use or close imitation of the language and thoughts of another author and the representation of them as one's own original work" (Random House Compact Unabridged Dictionary, 1995). Systems for plagiarism detection aim to recognise plagiarised texts automatically. The most common setting is extrinsic analysis, in which a reference corpus is given from which a suspicious document may have plagiarised text fragments to varying extents. In this setting a system must first select candidate documents from the reference corpus, then analyse in detail which parts have actually been plagiarised, and finally clean up the detected fragments in a post-processing step.
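To make the three steps concrete, the following is a minimal sketch of an extrinsic detection pipeline. It is deliberately a heuristic baseline built on shared word n-grams, not the probabilistic framework this thesis is meant to develop, and all function names, parameters and thresholds (`select_candidates`, `detailed_analysis`, `post_process`, `n`, `min_shared`, `max_gap`) are assumptions made purely for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams occurring in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def select_candidates(suspicious, corpus, n=3, min_shared=1):
    """Step 1: keep reference documents sharing at least min_shared n-grams."""
    grams = word_ngrams(suspicious, n)
    return [doc_id for doc_id, text in corpus.items()
            if len(grams & word_ngrams(text, n)) >= min_shared]


def detailed_analysis(suspicious, reference, n=3):
    """Step 2: word positions in suspicious where a shared n-gram starts."""
    ref_grams = word_ngrams(reference, n)
    words = suspicious.lower().split()
    return [i for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) in ref_grams]


def post_process(positions, n=3, max_gap=2):
    """Step 3: merge nearby matches into (start, end) word spans, end exclusive."""
    spans = []
    for pos in positions:
        if spans and pos - spans[-1][1] <= max_gap:
            # Match overlaps or nearly touches the previous span: extend it.
            spans[-1] = (spans[-1][0], pos + n)
        else:
            spans.append((pos, pos + n))
    return spans


# Toy data, invented for this example only.
corpus = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "an entirely unrelated sentence about cooking pasta",
}
suspicious = "we saw that the quick brown fox jumps quite high"

for doc_id in select_candidates(suspicious, corpus):
    matches = detailed_analysis(suspicious, corpus[doc_id])
    print(doc_id, post_process(matches))
```

In a probabilistic framework, each of the hard decisions in this sketch (the overlap threshold, the exact n-gram match, the gap-based merging) would presumably be replaced by a likelihood estimate, which is the kind of uniform treatment the thesis is asking for.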
Most systems use heuristic approaches and a mixture of methods to solve the individual steps of the plagiarism detection task. The objective of this thesis is to develop and evaluate a sound probabilistic framework for plagiarism detection. The core component would be an existing method for modelling the likelihood that one text fragment is a plagiarised version of another.
In more detail, the work should cover:
- Complete probabilistic framework for plagiarism detection
- Adaptation of the framework to the steps of candidate selection, detailed analysis and post-processing
- Implementation of a reference system
- Evaluation on corpora for plagiarism detection
Requirements:
- Good programming skills
- Good knowledge of basic probabilistic maths
- Knowledge of Information Retrieval techniques is an advantage
- Management of large data sets will be necessary