Optimizing Caching and Crawling Strategies for Stream-based SchemEX Computation
- Type of thesis: Master's thesis / Diplom thesis
- Internal supervisors: Dr. Thomas Gottron, Prof. Dr. Ansgar Scherp
SchemEX is a stream-based approach to computing a schema index over Linked Open Data (LOD). The data stream is generated by an RDF crawler that harvests triples from the Semantic Web. So far, SchemEX uses a FIFO queue as a cache on the stream of RDF triples to extract schema information from the crawled resources. The crawling strategy of the RDF crawler has not been considered at all so far.
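To make the role of the FIFO cache concrete, the following is a minimal sketch, not the actual SchemEX implementation. It assumes that triples arrive as `(subject, predicate, object)` tuples of strings and that the schema information for a resource is simply its set of `rdf:type` classes plus its set of properties. The function name `stream_schema` and the `capacity` parameter are hypothetical.

```python
from collections import OrderedDict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def stream_schema(triples, capacity):
    """Yield (subject, types, properties) whenever a subject leaves the FIFO cache.

    Assumption: schema information is reduced to a type set and a property set;
    the real SchemEX index structure is richer than this.
    """
    # subject -> (set of types, set of properties), kept in first-arrival order
    cache = OrderedDict()
    for s, p, o in triples:
        if s not in cache:
            if len(cache) >= capacity:
                # FIFO eviction: the subject that entered first leaves first
                old_s, (types, props) = cache.popitem(last=False)
                yield old_s, frozenset(types), frozenset(props)
            cache[s] = (set(), set())
        types, props = cache[s]
        if p == RDF_TYPE:
            types.add(o)
        else:
            props.add(p)
    # End of stream: flush the remaining subjects in arrival order
    while cache:
        old_s, (types, props) = cache.popitem(last=False)
        yield old_s, frozenset(types), frozenset(props)
```

With a cache that is too small, a subject whose triples are spread across the stream is evicted before all of its triples have been seen and is emitted several times with incomplete schema information. This is one way in which the caching strategy can affect index quality.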
Different caching strategies on a given data stream influence the quality of the resulting schema index. Likewise, guiding the crawler or providing a more suitable crawling strategy might improve index quality. The task is to develop, implement and evaluate different strategies for caching and crawling in the SchemEX scenario.
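As an illustration of how the choice of caching strategy can matter, the sketch below compares FIFO eviction with least-recently-used (LRU) eviction on a toy stream of subject identifiers. LRU is used here only as a familiar example of an alternative, not as a strategy proposed by the SchemEX authors. The function `count_fragments` and its parameters are hypothetical; the count it returns is the number of cache entries that would be emitted, so a value above the number of distinct subjects means some subject was split into several incomplete entries.

```python
from collections import OrderedDict


def count_fragments(subjects, capacity, refresh_on_hit):
    """Count emitted cache entries for a stream of subject IDs.

    refresh_on_hit=False models FIFO eviction; True models LRU eviction.
    """
    cache = OrderedDict()
    emitted = 0
    for s in subjects:
        if s in cache:
            if refresh_on_hit:
                # LRU: a subject seen again becomes the most recent entry
                cache.move_to_end(s)
            continue
        if len(cache) >= capacity:
            # Evict the entry at the front of the ordering
            cache.popitem(last=False)
            emitted += 1
        cache[s] = True
    # Entries still cached at the end of the stream are emitted as well
    return emitted + len(cache)


stream = ["a", "b", "a", "c", "a"]
fifo_entries = count_fragments(stream, capacity=2, refresh_on_hit=False)
lru_entries = count_fragments(stream, capacity=2, refresh_on_hit=True)
print(fifo_entries, lru_entries)  # prints "4 3"
```

On this toy stream with three distinct subjects, FIFO emits four entries because subject `a` is evicted and re-inserted, while LRU emits three. Whether such an effect carries over to real crawled data is exactly the kind of question the evaluation would have to answer.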
In more detail, the work should cover:
- Development of caching strategies
- Development of crawling strategies/guidance
- Incorporation of the strategies into the existing system used for computing SchemEX
- Evaluation on a suitable corpus
Requirements:
- Good programming skills
- Knowledge of Semantic Web techniques is an advantage
- Management of large data sets will be necessary
Mathias Konrath, Thomas Gottron, and Ansgar Scherp. SchemEX -- Web-Scale Indexed Schema Extraction of Linked Open Data. http://www.cs.vu.nl/~pmika/swc/submissions2011/swc2011_submission_5.pdf