Exercise for Information Extraction
Entity and relationship extraction
--------------------------
(1)
Find named entities in the text
- free tools: GATE ANNIE demo, LingPipe, Stanford Named Entity Recognizer
- compare them with the ones present in Wikipedia
- can you find the same entities on the Web?
On-line
http://gate.ac.uk/annie/
http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_news_muc6/textInput.html
Download and run locally
http://nlp.stanford.edu/software/CRF-NER.shtm
Use 10 articles from bbc.com or cnn.com (2 of each kind)
- sports
- politics
- tech
- health
- travel
(you have to copy-paste the original text into the tool)
What entities can you find using these tools?
Do all people get recognized correctly? Check whether they exist in Wikipedia
- what share of the recognized people have a Wikipedia article?
What about places and organizations?
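The ratio asked for above can be computed once the tool output has been copied into a list. A minimal sketch, assuming the recognized names and the names confirmed in Wikipedia are collected by hand; the sample names are placeholders, not real tool output.

```python
def wikipedia_ratio(recognized, in_wikipedia):
    """Fraction of distinct recognized entities that have a Wikipedia article."""
    distinct = {name.strip() for name in recognized if name.strip()}
    if not distinct:
        return 0.0
    confirmed = {name for name in distinct if name in in_wikipedia}
    return len(confirmed) / len(distinct)


# Placeholder PERSON entities as copied from one tool for one article.
recognized_people = ["Jane Example", "John Sample", "Jane Example", "Mr Smithe"]
# Names for which an article was found by hand (assumed for illustration).
found_in_wikipedia = {"Jane Example", "John Sample"}

ratio = wikipedia_ratio(recognized_people, found_in_wikipedia)
print(f"{ratio:.2f} of the recognized people exist in Wikipedia")
```

The same function can be reused for places and organizations by passing the corresponding lists.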
--------------------------
Use Google to find relationships and their support
Another idea: TextRunner - finding patterns
http://www.cs.washington.edu/research/textrunner/
--------------------------
(2)
How to recognize a country?
What is a pattern and its support?
Try a few countries -- find common patterns
examples:
Germany country
Great Britain country
France country
USA country
write your patterns (3-4 should be sufficient)
... we'll test it later ...
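One possible reading of "support" is the number of texts (or search hits) in which a pattern, filled with a candidate name, actually occurs. The sketch below assumes that reading; the three patterns and the snippets are made up for illustration and are not the expected answer.

```python
import re

# Hypothetical candidate patterns; {x} is the slot for the country name.
PATTERNS = [
    "countries such as {x}",
    "{x} is a country",
    "the government of {x}",
]


def support(pattern, candidate, snippets):
    """Count the snippets in which the filled-in pattern occurs."""
    phrase = re.escape(pattern.format(x=candidate))
    regex = re.compile(r"\b" + phrase + r"\b", re.IGNORECASE)
    return sum(1 for snippet in snippets if regex.search(snippet))


# Made-up snippets standing in for search results.
snippets = [
    "European countries such as Germany and France signed the treaty.",
    "Germany is a country in Central Europe.",
    "The government of France announced new rules.",
    "Paris is a city, and France is a country.",
]

for pattern in PATTERNS:
    for country in ("Germany", "France"):
        print(pattern, country, support(pattern, country, snippets))
```

A pattern with high support for known countries and low support for non-countries (cities, people) is a good candidate to keep.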
--------------------------
(3)
What is the relationship between the following people?
How to spot it in the text?
Mel Ferrer -- Audrey Hepburn
Elliott Gould -- Barbra Streisand
Christie Brinkley -- Billy Joel
Dyan Cannon -- Cary Grant
Michael Douglas -- Catherine Zeta-Jones
Connie Booth -- John Lahr
Connie Booth -- John Cleese
Uma Thurman -- Gary Oldman
Jeanne Coyne -- Gene Kelly
Write your patterns for checking
... we'll test it later ...
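Relationship patterns can be written as templates with two person slots and checked against a sentence in both orders. A minimal sketch, assuming the relationship turns out to be marriage; the templates and the test sentences are illustrative guesses, not a reference solution.

```python
import re

# Hypothetical templates; {a} and {b} are the two person slots.
TEMPLATES = [
    "{a} married {b}",
    "{a} and (?:his|her) (?:wife|husband) {b}",
    "{a} was married to {b}",
    "{a} divorced {b}",
]


def matching_templates(a, b, text):
    """Return the templates that match the pair in either order."""
    hits = []
    for template in TEMPLATES:
        for first, second in ((a, b), (b, a)):
            pattern = template.format(a=re.escape(first), b=re.escape(second))
            if re.search(pattern, text, re.IGNORECASE):
                hits.append(template)
                break  # one match per template is enough
    return hits


# Made-up test sentences.
print(matching_templates("Christie Brinkley", "Billy Joel",
                         "Billy Joel married Christie Brinkley."))
print(matching_templates("Connie Booth", "John Cleese",
                         "Connie Booth and her husband John Cleese arrived."))
```

Counting how many of the listed pairs each template matches in real text gives the support of that template.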
--------------------------
(4)
Geo locations.
How to find places that are located near certain places?
Start with tourist locations.
France -- Loire valley castles
Egypt -- Sphinx, pyramids
Spain -- castles: Alhambra, Alcázar of Seville, Castle of Pedraza ...
Italy -- Roman Empire remains (Colosseum, Roman Forum)
Are phrases like "located in" or "placed nearby" useful?
Are there any common patterns/rules to extract them?
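To try out phrases such as "located in", a single regular expression can capture the place, the relation word, and the anchor location. This is a rough sketch under the assumption that both names are capitalized word sequences; the verbs and the test sentences are illustrative only.

```python
import re

# A name is assumed to be one or more capitalized words.
NAME = r"[A-Z][\w'-]*(?: [A-Z][\w'-]*)*"
LOCATED = re.compile(
    rf"(?P<place>{NAME}),? (?:is |are )?(?:located|situated|placed) "
    rf"(?P<rel>in|near|nearby) (?P<anchor>{NAME})"
)


def extract_locations(text):
    """Return (place, relation, anchor) triples found in the text."""
    return [
        (m.group("place"), m.group("rel"), m.group("anchor"))
        for m in LOCATED.finditer(text)
    ]


# Made-up test sentences.
print(extract_locations("The Alhambra is located in Granada, Spain."))
print(extract_locations("Chambord is situated near Blois."))
```

The capitalization assumption misses anchors introduced by a lowercase article ("near the Colosseum"), which is one of the limits worth noting when answering the questions above.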
--------------------------
(5)
Timing events on the web
Find patterns for defining people's year of birth.
Take sample people (you may reuse the list from exercise (3))
- check articles in Wikipedia
-- is there a common pattern?
-- can we reuse information from infoboxes?
(to look at the structure, go to edit mode)
- do the same patterns hold on the Web (outside of Wikipedia)?
Write the language patterns.
Write the pattern(s) with templates.
... we'll test it later ...
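The two kinds of pattern can be sketched side by side: a language pattern for running text ("(born ...") and a template pattern for infobox wikitext. The sketch assumes the `{{birth date|YYYY|M|D}}` family of Wikipedia templates; the person in the examples is fictional.

```python
import re

# Language pattern: "(born 1 January 1950" or "(born January 1, 1950".
BORN_TEXT = re.compile(r"\(\s*born\b[^)]*?\b(\d{4})\b", re.IGNORECASE)
# Template pattern: {{birth date|1950|1|1}}, optionally "and age" or df=yes.
BORN_TEMPLATE = re.compile(
    r"\{\{\s*birth[ _]date(?: and age)?\s*\|(?:[a-z]+=\w+\|)*\s*(\d{4})",
    re.IGNORECASE,
)


def birth_year(text):
    """Return the year of birth as an int, or None if no pattern matches."""
    for regex in (BORN_TEMPLATE, BORN_TEXT):
        match = regex.search(text)
        if match:
            return int(match.group(1))
    return None


# Fictional examples.
print(birth_year("Jane Example (born 1 January 1950) is a fictional actress."))
print(birth_year("| birth_date = {{birth date|1950|1|1}}"))
```

Outside Wikipedia only the language pattern applies, so comparing the hit rate of the two patterns answers the question about the Web environment.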
--------------------------
(6)
Key people in companies
Based on Wikipedia templates + checking the text (if needed)
How to find the CEO of a big company?
How to find VPs?
IBM
Microsoft
Apple Inc.
Boeing
Moss Bros Group
Compaq
Intel
Amazon.com
Google
What can you find in the template?
Is it a simple attribute -> value mapping?
Are there any patterns there? RegEx?
Does visiting related pages help? (more complex retrieval) How?
What is your algorithm for recognizing such people?
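One way to start is to treat the infobox as attribute-value lines and then apply a regular expression to the value. The sketch below assumes a `key_people` field with entries separated by `<br />` and roles in parentheses; real infoboxes vary (lists, nested templates), and the wikitext here is a made-up example.

```python
import re

FIELD = re.compile(r"^\|\s*key_people\s*=\s*(.*)$", re.MULTILINE)
# [[target|label]] or [[label]]; keep only the label.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def key_people(wikitext):
    """Return (name, role) pairs from the key_people field; role may be None."""
    match = FIELD.search(wikitext)
    if not match:
        return []
    people = []
    for entry in re.split(r"<br\s*/?>", match.group(1)):
        plain = LINK.sub(r"\1", entry).strip()
        with_role = re.match(r"(.+?)\s*\((.+)\)\s*$", plain)
        if with_role:
            people.append((with_role.group(1), with_role.group(2)))
        elif plain:
            people.append((plain, None))
    return people


# Made-up infobox for a fictional company.
SAMPLE = """{{Infobox company
| name = Example Corp
| key_people = [[Jane Example]] ([[Chief executive officer|CEO]])<br />[[John Sample]] (VP)
| industry = Software
}}"""

print(key_people(SAMPLE))
```

Filtering the pairs by role ("CEO", "VP") then gives the people asked for; when the field is missing, the article text or linked pages have to be checked instead.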
(6a)
How to recognize that a resource is a company?
Any help from the Wikipedia category hierarchy? Try traversing it up/down/sideways
Write what to look for in the hierarchy - paths / intermediate nodes
... we'll test it later on other companies ...
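Traversing the hierarchy upward can be sketched as a breadth-first search from the resource to a target category, which also records the intermediate nodes on the path. The category graph below is a toy example with invented names, and the target category "Companies" is an assumption; the real hierarchy has to be collected from Wikipedia.

```python
from collections import deque


def path_to_category(start, target, parents, max_depth=5):
    """Breadth-first search upward; return the path to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) > max_depth:
            continue  # do not expand paths with more than max_depth nodes
        for parent in parents.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Toy graph: page or category -> parent categories (all names invented).
PARENTS = {
    "Example Corp": ["Software companies of Exampleland"],
    "Software companies of Exampleland": ["Software companies", "Companies of Exampleland"],
    "Software companies": ["Companies by industry"],
    "Companies of Exampleland": ["Companies by country"],
    "Companies by industry": ["Companies"],
    "Companies by country": ["Companies"],
    "Jane Example": ["Living people"],
}

print(path_to_category("Example Corp", "Companies", PARENTS))
print(path_to_category("Jane Example", "Companies", PARENTS))
```

The depth limit matters on the real hierarchy, where long upward paths tend to connect almost everything to everything else.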
--------------------------