Plagiarism is the "use or close imitation of the language and thoughts of another author and the representation of them as one's own original work" (Random House Compact Unabridged Dictionary, 1995). Systems for plagiarism detection aim to recognise plagiarised texts automatically. The most common setting is extrinsic analysis, in which a reference corpus is given from which a suspicious document may have plagiarised text fragments to varying degrees. In this case a system must first select candidate documents from the reference corpus, then analyse in detail which parts have actually been plagiarised, and finally clean up the detected fragments in a post-processing step.
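To make the three steps of extrinsic analysis concrete, the following minimal sketch shows one possible heuristic baseline in Python. It is an illustration only and not the probabilistic framework this thesis aims at: all function names, the bag-of-words cosine similarity, the word n-gram matching, and the threshold values are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenisation (a deliberately simple assumption)."""
    return text.lower().split()


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca if t in cb)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(suspicious, corpus, threshold=0.2):
    """Step 1: keep reference documents that are globally similar enough."""
    s = tokenize(suspicious)
    return [doc_id for doc_id, text in corpus.items()
            if cosine(s, tokenize(text)) >= threshold]


def detailed_analysis(suspicious, source, n=4):
    """Step 2: token spans of the suspicious text whose word n-gram occurs in the source."""
    s, src = tokenize(suspicious), tokenize(source)
    src_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    return [(i, i + n) for i in range(len(s) - n + 1)
            if tuple(s[i:i + n]) in src_ngrams]


def merge_fragments(spans, max_gap=0):
    """Step 3: post-processing that merges overlapping or nearby spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "completely unrelated words appear here",
    }
    suspicious = "yesterday the quick brown fox jumps over a fence"
    for doc_id in select_candidates(suspicious, corpus):
        spans = merge_fragments(detailed_analysis(suspicious, corpus[doc_id]))
        print(doc_id, spans)
```

Each stage here relies on a separate hand-tuned heuristic (a similarity threshold, an n-gram length, a merge gap), which is the kind of method mixture that a single probabilistic model could replace.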
Most systems use heuristic approaches and a mixture of methods to solve the individual steps of the plagiarism detection task. The objective of this thesis is to develop and evaluate a sound probabilistic framework for plagiarism detection. The core component would be an existing method that models the likelihood that one text fragment is plagiarised from another.
In more detail, the work should cover: