Exercise for Information Extraction

Entity and relationship extraction
--------------------------

(1)
Find named entities in the text
- free tools: GATE ANNIE demo, LingPipe, Stanford Named Entity Recognizer
- compare them with the entities present in Wikipedia
- can you find the same entities on the Web?

On-line
http://gate.ac.uk/annie/

http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_news_muc6/textInput.html

Download and run locally
http://nlp.stanford.edu/software/CRF-NER.shtm

Use 10 articles from bbc.com or cnn.com (2 of each kind)
- sports
- politics
- tech
- health
- travel

(you have to copy and paste the original text into the tool)

What entities can you find using these tools?
Do all people get recognized correctly? Check whether they exist in Wikipedia
- what is the ratio of recognized people who have a Wikipedia article?
What about places and organizations?
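
The ratio asked for above can be computed as in the following minimal sketch. It assumes you have already collected the names returned by a tool and noted which of them have a Wikipedia article; the function name wikipedia_ratio and the sample data are hypothetical, not real tool output.

```python
def wikipedia_ratio(recognized, in_wikipedia):
    """Fraction of recognized entity names that also appear in Wikipedia.

    recognized   -- names returned by the NER tool (duplicates are ignored)
    in_wikipedia -- names for which you found a Wikipedia article
    """
    unique = {name.strip().lower() for name in recognized}
    known = {name.strip().lower() for name in in_wikipedia}
    if not unique:
        return 0.0
    return len(unique & known) / len(unique)


# Hypothetical sample data: "J. Smithe" stands for a wrongly recognized name.
people = ["Audrey Hepburn", "Mel Ferrer", "Mel Ferrer", "J. Smithe"]
found = ["Audrey Hepburn", "Mel Ferrer"]
ratio = wikipedia_ratio(people, found)  # 2 of 3 distinct names
```

The same function can be reused for places and organizations, one entity type at a time.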


--------------------------


Use of Google for finding relationships and their support

Another idea: TextRunner - finding patterns
  http://www.cs.washington.edu/research/textrunner/

--------------------------


(2)
How to recognize a country?
What is a pattern and its support?

Try a few countries and find common patterns

examples:
Germany 	country
Great Britain 	country
France 		country
USA 		country

Write your patterns (3-4 should be sufficient)

... we'll test it later ...
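
One possible reading of "pattern" and "support": a pattern is a text template with a slot for the candidate, and its support is the number of times the filled-in template occurs in a corpus. The sketch below follows that reading. The three templates are illustrative guesses, not a reference answer, and the corpus is a handful of invented sentences standing in for search-engine results.

```python
import re

# Candidate patterns with a {x} slot; these are illustrative guesses.
PATTERNS = [
    "countries such as {x}",
    "{x} and other countries",
    "the capital of {x}",
]


def support(candidate, corpus):
    """Count, per pattern, how many sentences contain the filled-in pattern."""
    counts = {}
    for pattern in PATTERNS:
        phrase = pattern.format(x=candidate)
        regex = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        counts[pattern] = sum(1 for sentence in corpus if regex.search(sentence))
    return counts


# Invented sentences, used only to exercise the patterns.
corpus = [
    "European countries such as Germany and France signed the treaty.",
    "Berlin is the capital of Germany.",
    "Germany and other countries agreed.",
    "Paris is the capital of France.",
]
germany = support("Germany", corpus)
paris = support("Paris", corpus)
```

Here every pattern has support 1 for "Germany" and 0 for "Paris". Note that "countries such as France" is not found in the first sentence, because France is the second item of the list; patterns for enumerations need extra care.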

--------------------------

(3)

What is the relationship between the following people?
How can you spot it in the text?

Mel Ferrer 		Audrey Hepburn
Elliott Gould		Barbra Streisand
Christie Brinkley	Billy Joel
Dyan Cannon		Cary Grant
Michael Douglas		Catherine Zeta-Jones
Connie Booth		John Lahr
Connie Booth		John Cleese
Uma Thurman		Gary Oldman
Jeanne Coyne		Gene Kelly

Write your patterns for checking


... we'll test it later ...
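
A minimal sketch of how such patterns could be checked, assuming the relationship you found is marriage. The templates and the test sentences are invented for illustration; real text will need more variants (dates between the names, "ex-wife", passive forms, and so on).

```python
import re

# Illustrative templates; {a} and {b} are the two names of a pair.
TEMPLATES = [
    r"{a} (?:was )?married (?:to )?{b}",
    r"{a} and (?:his|her) (?:wife|husband),? {b}",
    r"{a}'s (?:wife|husband|spouse),? {b}",
]


def matching_templates(a, b, text):
    """Return the templates that match the pair in either order."""
    hits = []
    for template in TEMPLATES:
        for first, second in ((a, b), (b, a)):
            regex = template.format(a=re.escape(first), b=re.escape(second))
            if re.search(regex, text, re.IGNORECASE):
                hits.append(template)
                break
    return hits


# Invented test sentences, written only to exercise the patterns.
text = ("Billy Joel married Christie Brinkley. "
        "Gene Kelly and his wife, Jeanne Coyne, attended.")
joel = matching_templates("Christie Brinkley", "Billy Joel", text)
kelly = matching_templates("Jeanne Coyne", "Gene Kelly", text)
none = matching_templates("Uma Thurman", "Gary Oldman", text)
```

Trying both orders of the pair matters because the list above does not say which name comes first in a sentence.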

--------------------------


(4)

Geo locations.
How to find places that are located near certain places?

Start with tourist locations.

France -- Loire valley castles
Egypt -- Sphinx, pyramids
Spain -- castles: Alhambra, Alcázar of Seville, Castle of Pedraza ...
Italy -- Roman Empire remains (Colosseum, Roman Forum)

Are phrases like "located in" or "placed nearby" useful?
Are there any common patterns/rules to extract it?
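
To try out phrases like "located in", a small regular expression can be run over sample sentences. The sketch below is one possible rule, not a complete answer: it assumes that place names are runs of capitalized words and only covers "located/situated in/near". The two sample sentences were written for this example.

```python
import re

# A capitalized phrase: one or more capitalized words, e.g. "Roman Forum".
NAME = r"[A-Z][\w'-]*(?: [A-Z][\w'-]*)*"

LOCATED = re.compile(
    r"(?:[Tt]he )?(?P<place>" + NAME + r"),? (?:is |are )?"
    r"(?:located|situated) (?P<rel>in|near) (?:the )?(?P<anchor>" + NAME + r")"
)


def extract_locations(text):
    """Return (place, relation, anchor) triples found in the text."""
    return [
        (m.group("place"), m.group("rel"), m.group("anchor"))
        for m in LOCATED.finditer(text)
    ]


text = ("The Alhambra is located in Granada. "
        "The Colosseum is located near the Roman Forum.")
triples = extract_locations(text)
# triples == [("Alhambra", "in", "Granada"),
#             ("Colosseum", "near", "Roman Forum")]
```

Names with lowercase parts such as "Castle of Pedraza" are not covered by NAME as written; extending it is a good way to see where simple rules break.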

--------------------------


(5)

Timing events on the web

Find patterns for defining people's year of birth.

Take sample people (you may reuse the list from exercise (3))
- check articles in Wikipedia
  -- is there a common pattern?
  -- can we reuse information from infoboxes?
     (to look at the structure, go to edit mode)

- do the same patterns hold on the Web (outside of Wikipedia)?

Write the language patterns.
Write the pattern(s) with templates.

... we'll test it later ...
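
A minimal sketch of both kinds of pattern: a language pattern for running text and a template pattern for infobox wikitext. It assumes the infobox uses a birth_date field filled with a {{birth date|YYYY|M|D}} style template, which is common on Wikipedia but not universal. The person in the samples is invented.

```python
import re

# Language pattern: "(born March 3, 1950)" or "was born in 1950".
TEXT_PATTERN = re.compile(
    r"\bborn\b[^.;)]{0,40}?\b(1[0-9]{3}|20[0-9]{2})\b", re.IGNORECASE
)

# Template pattern: "| birth_date = {{birth date|1950|3|3}}",
# optionally with named parameters such as df=yes before the year.
INFOBOX_PATTERN = re.compile(
    r"birth_date\s*=\s*\{\{\s*birth[ _]date(?: and age)?\s*\|"
    r"(?:[a-z]+=[^|}]*\|)*\s*(\d{4})",
    re.IGNORECASE,
)


def birth_year(source):
    """Return the birth year found in wikitext or prose, or None."""
    for pattern in (INFOBOX_PATTERN, TEXT_PATTERN):
        match = pattern.search(source)
        if match:
            return int(match.group(1))
    return None


# Invented samples for a fictional person.
prose = "Jane Example (born March 3, 1950) is a fictional engineer."
wikitext = "| birth_date = {{birth date and age|df=yes|1950|3|3}}"
```

The infobox pattern is tried first because structured data is less ambiguous than prose; outside Wikipedia only the language pattern applies.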

--------------------------


(6)

Key people in companies

based on Wikipedia templates + checking text (if needed)

How to find the CEO of a big company?
How to find VPs?

IBM
Microsoft
Apple Inc.
Boeing
Moss Bros Group
Compaq
Intel
Amazon.com
Google


What can you find in the template?
Is it a simple attribute -> value mapping?
Are there any patterns there? RegEx?
Does visiting related pages help? (more complex retrieval) How?

What is your algorithm for recognizing such people?
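
One possible starting point for such an algorithm, as a sketch only: read the key_people field of the company infobox, strip the wiki links, and split each entry into a name and a role. The sample wikitext is invented; real infoboxes vary (lists, nested templates, roles without parentheses), so expect to extend the regular expressions.

```python
import re


def strip_links(wikitext):
    """Turn [[Target|Label]] into Label and [[Target]] into Target."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)


def key_people(infobox):
    """Return (name, role) pairs from the key_people field of infobox wikitext."""
    field = re.search(
        r"\|\s*key_people\s*=\s*(.*?)(?=\n\s*\||\n\}\}|\Z)", infobox, re.DOTALL
    )
    if not field:
        return []
    people = []
    # Entries are assumed to be separated by <br> tags or bullet lines.
    for entry in re.split(r"<br\s*/?>|\n\*", field.group(1)):
        entry = strip_links(entry).strip(" *\n")
        match = re.match(r"(?P<name>[^()]+?)\s*\((?P<role>[^()]*)\)", entry)
        if match:
            people.append((match.group("name"), match.group("role")))
    return people


# Invented infobox for a fictional company.
sample = """{{Infobox company
| name = Example Corp
| key_people = [[Jane Doe]] ([[Chief executive officer|CEO]])<br />[[John Roe]] (Chairman)
| industry = Software
}}"""
people = key_people(sample)
ceos = [name for name, role in people if "CEO" in role]
```

For the sample, people is [("Jane Doe", "CEO"), ("John Roe", "Chairman")]. VPs are often missing from the infobox, which is where checking the article text or related pages comes in.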

(6a)

How to recognize that a resource is a company?
Does the Wikipedia category hierarchy help? Try traversing it up/down/sideways

Write what to look for in the hierarchy - paths / intermediate nodes

... we'll test it later on other companies ...
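
Traversing the hierarchy upward can be sketched as a breadth-first search that returns the path of intermediate categories. The graph below is a hypothetical fragment typed in by hand; in practice you would fill it from the categories listed at the bottom of each Wikipedia page.

```python
from collections import deque

# Hypothetical fragment of a category graph: node -> parent categories.
PARENTS = {
    "Example Corp": ["Software companies of Exampleland"],
    "Software companies of Exampleland": ["Software companies",
                                          "Economy of Exampleland"],
    "Software companies": ["Companies by industry"],
    "Companies by industry": ["Companies"],
    "Jane Doe": ["Living people"],
}


def path_to_category(start, target, parents, max_depth=5):
    """Breadth-first search upward; return the shortest path to target or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if len(path) > max_depth:
            continue  # do not expand paths with more than max_depth nodes
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


company_path = path_to_category("Example Corp", "Companies", PARENTS)
person_path = path_to_category("Jane Doe", "Companies", PARENTS)
```

The depth limit is there because the real category graph is large and contains cycles; the returned path shows which intermediate nodes to write down.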

--------------------------