Named Entity Recognition: What is NE? What isn’t NE?

Date: 29.09.2018

Named Entity Recognition

What is NE?
What isn’t NE?
Problems and solutions with NE task definitions
Problems and solutions with NE task
Some applications

Why do NE Recognition?

Key part of Information Extraction system
Robust handling of proper names essential for many applications
Pre-processing for different classification levels
Information filtering
Information linking

NE Definition

NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.
Three universally accepted categories: person, location and organisation
Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.
Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

What NE is NOT

NE is not event recognition.
NE recognises entities in text, and classifies them in some way, but it does not create templates, nor does it perform co-reference or entity linking, though these processes are often implemented alongside NE as part of a larger IE system.
NE is not just matching text strings with pre-defined lists of names. It only recognises entities which are being used as entities in a given context.
NE is not easy!

Problems in NE Task Definition

Category definitions are intuitively quite clear, but there are many grey areas.
Many of these grey areas are caused by metonymy.
Person vs. Artefact: “The ham sandwich wants his bill.” vs. “Bring me a ham sandwich.”
Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England”.
Company vs. Artefact: “shares in MTV” vs. “watching MTV”
Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”

Solutions

The task definition must be very clearly specified at the outset.
The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and “logic” behind the intuition.
MUC essentially adopted the simplistic approach of disregarding metonymic uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. in the football domain, where “England” usually denotes a team).
Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.

Basic Problems in NE

Variation of NEs – e.g. John Smith, Mr Smith, John.
Ambiguity of NE types

John Smith (company vs. person)
May (person vs. month)
Washington (person vs. location)
1945 (date vs. time)

Ambiguity with common words, e.g. “may”

List Lookup Approach

System that recognises only entities stored in its lists (gazetteers).
Advantages: simple, fast, language independent, easy to retarget
Disadvantages: collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
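
The trade-off above can be seen in a minimal sketch of list lookup. The gazetteer contents and the function name `list_lookup` are illustrative assumptions, not part of the original slides.

```python
import re

# Hypothetical toy gazetteer: entity type -> stored names (assumed for illustration).
GAZETTEER = {
    "person": {"John Smith"},
    "location": {"Washington", "England"},
    "month": {"May"},
}

def list_lookup(text):
    """Return (start, end, name, type) for every stored name found in text."""
    matches = []
    for ne_type, names in GAZETTEER.items():
        for name in names:
            # Whole-word, case-sensitive match of the stored string only.
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                matches.append((m.start(), m.end(), name, ne_type))
    return sorted(matches)

found = list_lookup("Mr Smith met John Smith in Washington in May.")
print([(name, ne_type) for _, _, name, ne_type in found])

# The modal verb at the start of a sentence is wrongly reported as a month.
print(list_lookup("May I help?"))
```

The variant “Mr Smith” is missed because only the stored string “John Smith” is known, and “May I help?” yields a month: the lookup has no notion of context, which is exactly the weakness listed above.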

Shallow Parsing Approach

Internal evidence – names often have internal structure. These components can be either stored or guessed.
Location:
CapWord + {City, Forest, Center}
e.g. Sherwood Forest
CapWord + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
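
As a rough illustration, the two internal-evidence patterns above can be written as one regular expression. The definition of CapWord used here (one upper-case ASCII letter followed by lower-case letters) and the function name `find_locations` are assumptions for this sketch.

```python
import re

# Key words copied from the two patterns above.
LOCATION_KEYS = ["City", "Forest", "Center",
                 "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# Assumed definition of CapWord: an upper-case letter plus lower-case letters.
CAP_WORD = r"[A-Z][a-z]+"

LOCATION_PATTERN = re.compile(
    r"\b(" + CAP_WORD + r") (" + "|".join(LOCATION_KEYS) + r")\b"
)

def find_locations(text):
    """Return every 'CapWord + key word' string found in text."""
    return [m.group(0) for m in LOCATION_PATTERN.finditer(text)]

print(find_locations("They walked from Sherwood Forest to Portobello Street."))
print(find_locations("The Road ahead"))
```

The second call returns “The Road”: a sentence-initial capital is enough to trigger the pattern, which anticipates the capitalisation difficulties discussed later.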

Shallow Parsing Approach

External evidence - names are often used in very predictive local contexts
Location:
“to the” COMPASS “of” CapWord
e.g. to the south of Loitokitok
“based in” CapWord
e.g. based in Loitokitok
CapWord “is a” (ADJ)? GeoWord
e.g. Loitokitok is a friendly city
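
The three context patterns above can be sketched in the same way. The word lists for COMPASS and GeoWord, and the treatment of the optional adjective as any lower-case word, are assumptions made for this example.

```python
import re

CAP_WORD = r"\b([A-Z][a-z]+)"
COMPASS = r"(?:north|south|east|west)"   # assumed compass words
GEO_WORD = r"(?:city|town|village)"      # hypothetical GeoWord list

CONTEXT_PATTERNS = [
    # "to the" COMPASS "of" CapWord
    re.compile(r"\bto the " + COMPASS + r" of " + CAP_WORD),
    # "based in" CapWord
    re.compile(r"\bbased in " + CAP_WORD),
    # CapWord "is a" (ADJ)? GeoWord, with ADJ approximated by any lower-case word
    re.compile(CAP_WORD + r" is a (?:[a-z]+ )?" + GEO_WORD + r"\b"),
]

def locations_from_context(text):
    """Return CapWords that occur in one of the location contexts."""
    found = []
    for pattern in CONTEXT_PATTERNS:
        for m in pattern.finditer(text):
            if m.group(1) not in found:
                found.append(m.group(1))
    return found

print(locations_from_context("The firm is based in Loitokitok."))
```

Unlike list lookup, this finds a name the system has never seen before, because the evidence comes from the words around it.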

Difficulties in Shallow Parsing Approach

Ambiguously capitalised words (first word in sentence)
[All American Bank] vs. All [State Police]
Semantic ambiguity
“John F. Kennedy” = airport (location)
“Philip Morris” = organisation
Structural ambiguity
[Cable and Wireless] vs. [Microsoft] and [Dell]
[Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].

Technology

JAPE (Java Annotation Patterns Engine)
Based on Doug Appelt’s CPSL (Common Pattern Specification Language)
Reimplementation of NE recogniser from LaSIE

NE System Architecture

Modules

Tokeniser

segments text into tokens, e.g. words, numbers, punctuation

Gazetteer lists

NEs, e.g. towns, names, countries, ...
key words, e.g. company designators, titles, ...

Grammar

hand-coded rules for NE recognition

JAPE

Set of phases consisting of pattern/action rules
Phases run sequentially and constitute a cascade of FSTs over annotations
LHS - annotation pattern containing regular expression operators
RHS - annotation manipulation statements
Annotations matched on LHS referred to on RHS using labels attached to pattern elements
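
The idea of phases running in sequence over a shared set of annotations can be sketched in Python. This is not JAPE itself: the `Annotation` class, the stand-in tokeniser and the two phases are hypothetical and only show why the order of phases matters.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)

def tokenise(text):
    """Stand-in tokeniser: one Token annotation per run of ASCII letters."""
    return [
        Annotation(m.start(), m.end(), "Token",
                   {"orth": "upperInitial" if m.group()[0].isupper() else "lowercase"})
        for m in re.finditer(r"[A-Za-z]+", text)
    ]

def title_phase(text, annotations):
    """Phase 1: annotate known titles (a tiny assumed list)."""
    return [Annotation(a.start, a.end, "Title")
            for a in annotations
            if a.type == "Token" and text[a.start:a.end] in {"Mr", "Mrs", "Dr"}]

def person_phase(text, annotations):
    """Phase 2: a Title followed by an upper-initial Token becomes a Person."""
    tokens = sorted((a for a in annotations if a.type == "Token"),
                    key=lambda a: a.start)
    people = []
    for title in (a for a in annotations if a.type == "Title"):
        following = [t for t in tokens if t.start >= title.end]
        if following and following[0].features["orth"] == "upperInitial":
            people.append(Annotation(title.start, following[0].end,
                                     "Person", {"rule": "TitlePerson"}))
    return people

def run_cascade(text, phases):
    """Run phases in order; each phase sees all annotations added so far."""
    annotations = tokenise(text)
    for phase in phases:
        annotations = annotations + phase(text, annotations)
    return annotations

result = run_cascade("Mr Smith left", [title_phase, person_phase])
people = [(a.start, a.end) for a in result if a.type == "Person"]
print(people)
```

Running the phases in the opposite order produces no Person annotation, because the second phase depends on annotations created by the first.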

Tokeniser

Set of rules producing annotations
LHS is regular expression matched on input
RHS describes annotations to be added to AnnotationSet
(UPPERCASE_LETTER) (LOWERCASE_LETTER)* >
Token; orth = upperInitial; kind = word
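
A minimal Python sketch of a rule-driven tokeniser follows. Only the first rule corresponds to the rule shown above; the other rules, the ASCII-only character classes and the dictionary layout of a Token are assumptions, and whitespace is simply skipped rather than annotated.

```python
import re

# (pattern, features) pairs, tried in order at each position.
TOKEN_RULES = [
    # The rule above: an upper-case letter followed by lower-case letters.
    (re.compile(r"[A-Z][a-z]*"), {"kind": "word", "orth": "upperInitial"}),
    # Assumed additional rules, for illustration only.
    (re.compile(r"[a-z]+"), {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"[0-9]+"), {"kind": "number"}),
    (re.compile(r"[^\sA-Za-z0-9]"), {"kind": "punctuation"}),
]

def tokenise(text):
    """Return a list of Token annotations (dictionaries) for text."""
    annotations = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for pattern, features in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                annotations.append({"type": "Token", "start": m.start(),
                                    "end": m.end(), "string": m.group(),
                                    **features})
                pos = m.end()
                break
        else:
            # Not reached with the rules above; guards against an
            # infinite loop if the rule list is changed.
            pos += 1
    return annotations

for token in tokenise("Sherwood Forest, 1945"):
    print(token["string"], token["kind"], token.get("orth"))
```

Each rule pairs a left-hand-side pattern with the features to attach, mirroring the LHS/RHS split described above.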

Gazetteer

Set of lists compiled into Finite State Machines
Each list has attributes MajorType and MinorType (and optionally, Language)
city.lst: location: city
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
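
The three definition lines above can be read into a small index, as sketched below. The list contents in `LIST_ENTRIES` are hypothetical (the slides do not show them), and a plain dictionary stands in for the compiled finite state machines.

```python
# The definition lines shown above: list file, MajorType, MinorType.
DEFINITIONS = """\
city.lst: location: city
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
"""

def parse_definitions(text):
    """Map each list file to its majorType, minorType and optional language."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = [p.strip() for p in line.split(":")]
        index[parts[0]] = {
            "majorType": parts[1],
            "minorType": parts[2] if len(parts) > 2 else None,
            "language": parts[3] if len(parts) > 3 else None,
        }
    return index

# Hypothetical list contents, assumed for illustration.
LIST_ENTRIES = {
    "city.lst": ["Sheffield", "Manchester"],
    "currency_prefix.lst": ["$"],
    "currency_unit.lst": ["dollars"],
}

def lookup(word, index, entries):
    """Return one Lookup record per list that contains word."""
    return [dict(index[name], string=word)
            for name in sorted(entries) if word in entries[name]]

index = parse_definitions(DEFINITIONS)
print(lookup("Sheffield", index, LIST_ENTRIES))
print(lookup("dollars", index, LIST_ENTRIES))
```

The grammar never needs to know which file a word came from; it tests only the majorType and minorType carried by the resulting Lookup.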

Named entity grammar

hand-coded rules applied to annotations to identify NEs
annotations from format analysis, tokeniser and gazetteer modules
use of contextual information
rule priority based on pattern length, rule status and rule ordering
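
One possible reading of the priority scheme is sketched below: among competing matches, prefer the longest, then the highest explicit priority, then the rule that appears first in the grammar. Modelling “rule status” as a numeric priority is an assumption, as are the `Candidate` type and the rule names.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    rule: str
    start: int
    end: int
    priority: int   # assumed numeric stand-in for "rule status"
    order: int      # position of the rule in the grammar file

def pick_winner(candidates):
    """Longest match wins; ties go to higher priority, then to the earlier rule."""
    return max(candidates,
               key=lambda c: (c.end - c.start, c.priority, -c.order))

competing = [
    Candidate("ShortHighPriority", 0, 8, 50, 0),
    Candidate("LongMidPriority", 0, 15, 10, 1),
    Candidate("LongLowPriority", 0, 15, 5, 2),
]
print(pick_winner(competing).rule)
```

Here the short match loses despite its high priority, because length is compared first.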

Example of JAPE Grammar rule

Rule: Location1
Priority: 25
(
  ( { Lookup.majorType == loc_key, Lookup.minorType == pre }
    { SpaceToken } )?
  { Lookup.majorType == location }
  ( { SpaceToken }
    { Lookup.majorType == loc_key, Lookup.minorType == post } )?
)
:locName -->
  :locName.Location = { kind = "gazetteer", rule = "Location1" }
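
To make the behaviour of the rule concrete, the following Python sketch emulates it over a list of annotations given in text order. It is an approximation for illustration, not how JAPE executes rules; the annotation dictionaries and the example input are assumptions.

```python
def is_lookup(ann, major, minor=None):
    """True if ann is a Lookup with the given majorType (and minorType)."""
    return (ann["type"] == "Lookup"
            and ann["majorType"] == major
            and (minor is None or ann.get("minorType") == minor))

def apply_location1(anns):
    """Emulate Location1: (pre-key Space)? location (Space post-key)?"""
    results = []
    i = 0
    while i < len(anns):
        j = i
        # Optional prefix: loc_key/pre followed by a SpaceToken.
        if (j + 1 < len(anns)
                and is_lookup(anns[j], "loc_key", "pre")
                and anns[j + 1]["type"] == "SpaceToken"):
            j += 2
        # Mandatory core: a Lookup with majorType location.
        if j < len(anns) and is_lookup(anns[j], "location"):
            end = j
            # Optional suffix: SpaceToken followed by loc_key/post.
            if (j + 2 < len(anns)
                    and anns[j + 1]["type"] == "SpaceToken"
                    and is_lookup(anns[j + 2], "loc_key", "post")):
                end = j + 2
            results.append({"type": "Location",
                            "start": anns[i]["start"],
                            "end": anns[end]["end"],
                            "kind": "gazetteer",
                            "rule": "Location1"})
            i = end + 1
        else:
            i += 1
    return results

# Hypothetical annotations for the text "Mount Etna".
anns = [
    {"type": "Lookup", "start": 0, "end": 5,
     "majorType": "loc_key", "minorType": "pre"},
    {"type": "SpaceToken", "start": 5, "end": 6},
    {"type": "Lookup", "start": 6, "end": 10, "majorType": "location"},
]
print(apply_location1(anns))
```

The resulting Location annotation covers the whole matched span, including the optional key word, which is what the `:locName` label achieves in the rule.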

MUSE

MUlti-Source Entity recognition
Named entity recognition from a variety of text types, domains and genres.
2 years from Feb 2000 – 2002
Sponsors: GCHQ

PASTA

Protein Active Site Template Acquisition
Aim: Use of IE techniques to create a database of protein active site data to support protein structure analysis
Partners: Dept. of Computer Science, Information Studies, Mol. Biology and Biotechnology, Univ. of Sheffield
Sponsors: BBSRC-EPSRC Bioinformatics Initiative

Molecular Biology

PASTA System Architecture

Recognition of Biological Terminology

MUMIS

MUltiMedia Indexing and Searching environment
Application of IE technology to multimedia, multilingual video indexing in football domain
2 years: June 2000 - 2002
Partners: CTIT (NL), University of Sheffield (UK), DFKI (D), Max Planck Institute (D), University of Nijmegen (NL), ESTeam (SWE), VDA (NL)

More complex problems in NER

Issues of style, structure, domain, genre etc.

Dept. of Computing and Maths

Manchester Metropolitan University

Manchester

United Kingdom

> Tell me more about Leonardo

> Da Vinci
