Named Entity Recognition: What is NE? What isn’t NE?
Date | 29.09.2018 | Size | 1.13 MB | | #71141 |
|
- What is NE?
- What isn’t NE?
- Problems and solutions with NE task definitions
- Problems and solutions with NE task
- Some applications
Why do NE Recognition?
- Key part of an Information Extraction system
- Robust handling of proper names is essential for many applications
- Pre-processing for different classification levels
- Information filtering
- Information linking
NE Definition
- NE involves identification of proper names in texts, and their classification into a set of predefined categories of interest.
- Three universally accepted categories: person, location and organisation
- Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc.
- Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
What NE is NOT
- NE is not event recognition.
- NE recognises entities in text and classifies them in some way, but it does not create templates, nor does it perform co-reference or entity linking, though these processes are often implemented alongside NE as part of a larger IE system.
- NE is not just matching text strings against pre-defined lists of names. It only recognises entities which are being used as entities in a given context.
- NE is not easy!
Problems in NE Task Definition
- Category definitions are intuitively quite clear, but there are many grey areas.
- Many of these grey areas are caused by metonymy.
- Person vs. Artefact: “The ham sandwich wants his bill.” vs. “Bring me a ham sandwich.”
- Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England”.
- Company vs. Artefact: “shares in MTV” vs. “watching MTV”
- Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”
Solutions
- The task definition must be very clearly specified at the outset.
- The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and the “logic” behind the intuition.
- MUC essentially adopted the simplistic approach of disregarding metonymous uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. in the football domain).
- Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.
Variation of NEs, e.g. John Smith, Mr Smith, John.
Ambiguity of NE types:
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- 1945 (date vs. time)
Ambiguity with common words, e.g. “may”
More complex problems in NER
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing, formatting, ... all have an impact
Manchester Metropolitan University
Manchester
United Kingdom

> Tell me more about Leonardo
> Da Vinci
List Lookup Approach
- System that recognises only entities stored in its lists (gazetteers).
- Advantages: simple, fast, language independent, easy to retarget
- Disadvantages: collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
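A minimal sketch of the list lookup idea, assuming a tiny in-memory gazetteer. The list contents, the `GAZETTEER` name and the `list_lookup` function are illustrative assumptions, not part of any system described here. It also makes the two disadvantages visible: the variant "Mr Smith" is missed, and "Washington" comes back with two categories that the lookup alone cannot choose between.

```python
# Hypothetical toy gazetteer; a real system would load much larger lists from files.
GAZETTEER = {
    "location": ["Sherwood Forest", "England", "Washington"],
    "person": ["John Smith", "Washington"],
}


def list_lookup(text, gazetteer=GAZETTEER):
    """Return (start, end, category, surface) for every exact gazetteer match."""
    matches = []
    for category, names in gazetteer.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                matches.append((start, start + len(name), category, name))
                start = text.find(name, start + 1)
    return sorted(matches)


found = list_lookup("Mr Smith flew from Washington to England.")
for match in found:
    print(match)
```

Because matching is by exact string only, nothing in this sketch can use context to decide whether "Washington" is a person or a location.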
Internal evidence: names often have internal structure. These components can be either stored or guessed.
- location: CapWord + {City, Forest, Center}, e.g. Sherwood Forest
- CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
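The internal-evidence patterns can be approximated with regular expressions. This is a rough sketch in which CapWord is assumed to mean one capitalised ASCII word; the names `CAP_WORD`, `suffix_pattern` and `find_locations` are made up for illustration.

```python
import re

# Assumption: a CapWord is an upper-case letter followed by lower-case letters.
CAP_WORD = r"[A-Z][a-z]+"
LOCATION_SUFFIXES = ["City", "Forest", "Center"]
STREET_SUFFIXES = ["Street", "Boulevard", "Avenue", "Crescent", "Road"]


def suffix_pattern(suffixes):
    """Build a pattern for: CapWord followed by one of the given key words."""
    alternation = "|".join(suffixes)
    return re.compile(rf"\b{CAP_WORD} (?:{alternation})\b")


LOCATION_RE = suffix_pattern(LOCATION_SUFFIXES + STREET_SUFFIXES)


def find_locations(text):
    """Return every substring that matches the internal-evidence pattern."""
    return [m.group(0) for m in LOCATION_RE.finditer(text)]


print(find_locations("We walked from Sherwood Forest to Portobello Street."))
```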
Shallow Parsing Approach
External evidence: names are often used in very predictive local contexts.
Location:
- “to the” COMPASS “of” CapWord, e.g. to the south of Loitokitok
- “based in” CapWord, e.g. based in Loitokitok
- CapWord “is a” (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
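The three external-evidence contexts can be sketched as regular expressions over raw text. The word lists behind `COMPASS` and `GEO_WORD` are assumptions, and `ADJ` is a crude stand-in (any lower-case word) for what a real system would take from a part-of-speech tagger.

```python
import re

CAP_WORD = r"[A-Z][a-z]+"                   # assumption: one capitalised word
COMPASS = r"(?:north|south|east|west)"      # assumed compass terms
GEO_WORD = r"(?:city|town|village|region)"  # hypothetical GeoWord list
ADJ = r"[a-z]+"                             # stand-in for a tagged adjective

CONTEXT_PATTERNS = [
    re.compile(rf"\bto the {COMPASS} of (?P<loc>{CAP_WORD})\b"),
    re.compile(rf"\bbased in (?P<loc>{CAP_WORD})\b"),
    re.compile(rf"\b(?P<loc>{CAP_WORD}) is a (?:{ADJ} )?{GEO_WORD}\b"),
]


def locations_from_context(text):
    """Return the sorted set of CapWords that occur in a location context."""
    found = set()
    for pattern in CONTEXT_PATTERNS:
        for match in pattern.finditer(text):
            found.add(match.group("loc"))
    return sorted(found)


print(locations_from_context("Loitokitok is a friendly city."))
```

The unknown name is classified purely from its surroundings, so no gazetteer entry for "Loitokitok" is needed.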
Difficulties in Shallow Parsing Approach
- Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
- Semantic ambiguity: “John F. Kennedy” = airport (location); “Philip Morris” = organisation
- Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].
Technology
- JAPE (Java Annotation Patterns Engine)
- Based on Doug Appelt’s CPSL
- Reimplementation of NE recogniser from LaSIE
NE System Architecture
Modules
- Tokeniser: segments text into tokens, e.g. words, numbers, punctuation
- Gazetteer lists:
  - NEs, e.g. towns, names, countries, ...
  - key words, e.g. company designators, titles, ...
- Grammar: hand-coded rules for NE recognition
JAPE
- Set of phases consisting of pattern/action rules
- Phases run sequentially and constitute a cascade of FSTs over annotations
- RHS: annotation manipulation statements
- Annotations matched on LHS are referred to on RHS using labels attached to pattern elements
Tokeniser
- Set of rules producing annotations
- LHS is a regular expression matched on input
- RHS describes annotations to be added to AnnotationSet
(UPPERCASE_LETTER) (LOWERCASE_LETTER)* > Token; orth = upperInitial; kind = word
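The rule-based tokeniser idea can be imitated in a few lines of Python. This is only a sketch: each entry pairs a regular expression (the LHS) with the features to attach to the resulting token (the RHS). The rule table, the ASCII-only character classes, the extra `orth`/`kind` values and the `tokenise` function are assumptions for illustration, not the actual rule set.

```python
import re

# (pattern, features) pairs tried in order; the first rule that matches wins.
TOKEN_RULES = [
    (re.compile(r"[A-Z][a-z]*"), {"kind": "word", "orth": "upperInitial"}),
    (re.compile(r"[a-z]+"), {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"[0-9]+"), {"kind": "number"}),
    (re.compile(r"\s+"), {"kind": "space"}),
    # Crude fallback: any other single character is treated as punctuation.
    (re.compile(r"."), {"kind": "punctuation"}),
]


def tokenise(text):
    """Segment text into token annotations with offsets and features."""
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, features in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match:
                token = {"string": match.group(0), "start": pos, "end": match.end()}
                token.update(features)
                tokens.append(token)
                pos = match.end()
                break
    return tokens


tokens = tokenise("John paid 20 pounds.")
for token in tokens:
    print(token)
```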
Gazetteer
- Set of lists compiled into Finite State Machines
- Each list has attributes MajorType and MinorType (and optionally, Language)
city.lst: location: city
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
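As a rough illustration of how such index entries could drive lookup, the sketch below parses the three lines above into MajorType/MinorType attributes and attaches them to list entries. The list contents are invented, and a plain dictionary stands in for the compiled finite state machines, so a string can belong to only one list here.

```python
INDEX_LINES = [
    "city.lst: location: city",
    "currency_prefix.lst: currency_unit: pre_amount",
    "currency_unit.lst: currency_unit: post_amount",
]

# Hypothetical list contents, for illustration only.
LIST_CONTENTS = {
    "city.lst": ["Sheffield", "Manchester"],
    "currency_prefix.lst": ["$", "£"],
    "currency_unit.lst": ["dollars", "pounds"],
}


def parse_index(lines):
    """Map each list file name to its majorType, minorType and optional language."""
    index = {}
    for line in lines:
        fields = [field.strip() for field in line.split(":")]
        index[fields[0]] = {
            "majorType": fields[1],
            "minorType": fields[2] if len(fields) > 2 else None,
            "language": fields[3] if len(fields) > 3 else None,
        }
    return index


def build_lookup(index, contents):
    """Map each list entry to the attributes of the list it came from."""
    return {
        entry: index[filename]
        for filename, entries in contents.items()
        for entry in entries
    }


lookup = build_lookup(parse_index(INDEX_LINES), LIST_CONTENTS)
print(lookup["Sheffield"])
```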
Named entity grammar
- hand-coded rules applied to annotations to identify NEs
- annotations from format analysis, tokeniser and gazetteer modules
- use of contextual information
- rule priority based on pattern length, rule status and rule ordering
Example of JAPE Grammar rule
Rule: Location1
Priority: 25
(
  ( {Lookup.majorType == loc_key, Lookup.minorType == pre} {SpaceToken} )?
  {Lookup.majorType == location}
  ( {SpaceToken} {Lookup.majorType == loc_key, Lookup.minorType == post} )?
):locName
-->
:locName.Location = {kind = "gazetteer", rule = Location1}
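To make the rule's logic concrete, here is a Python approximation of what Location1 matches: an optional "pre" location key word plus space, a required location Lookup, and an optional space plus "post" key word. It is a sketch only. Annotations are modelled as a flat list of dictionaries, which is an assumption and not how JAPE represents them, and the example key words ("Lake", "Forest") are invented.

```python
def is_lookup(ann, major, minor=None):
    """True if ann is a Lookup with the given majorType (and minorType, if set)."""
    return (
        ann.get("type") == "Lookup"
        and ann.get("majorType") == major
        and (minor is None or ann.get("minorType") == minor)
    )


def is_space(ann):
    return ann.get("type") == "SpaceToken"


def match_location1(anns, i):
    """Match the Location1 pattern starting at index i; return the end index or None."""
    j = i
    # Optional: location key word of minor type "pre", followed by a space.
    if j + 1 < len(anns) and is_lookup(anns[j], "loc_key", "pre") and is_space(anns[j + 1]):
        j += 2
    # Required: a gazetteer location.
    if j >= len(anns) or not is_lookup(anns[j], "location"):
        return None
    j += 1
    # Optional: a space followed by a location key word of minor type "post".
    if j + 1 < len(anns) and is_space(anns[j]) and is_lookup(anns[j + 1], "loc_key", "post"):
        j += 2
    return j


def annotate_locations(anns):
    """Scan left to right and emit a Location annotation for each match."""
    results = []
    i = 0
    while i < len(anns):
        end = match_location1(anns, i)
        if end is None:
            i += 1
            continue
        surface = "".join(a["string"] for a in anns[i:end])
        results.append({"type": "Location", "kind": "gazetteer",
                        "rule": "Location1", "string": surface})
        i = end
    return results


SPACE = {"type": "SpaceToken", "string": " "}
annotations = [
    {"type": "Lookup", "majorType": "loc_key", "minorType": "pre", "string": "Lake"},
    SPACE,
    {"type": "Lookup", "majorType": "location", "string": "Geneva"},
    SPACE,
    {"type": "Token", "string": "and"},
    SPACE,
    {"type": "Lookup", "majorType": "location", "string": "Sherwood"},
    SPACE,
    {"type": "Lookup", "majorType": "loc_key", "minorType": "post", "string": "Forest"},
]
locations = annotate_locations(annotations)
print([loc["string"] for loc in locations])
```

The sketch ignores rule priority and overlapping annotations, both of which matter in a real grammar with many competing rules.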
MUSE: MUlti-Source Entity recognition
- Named entity recognition from a variety of text types, domains and genres.
- 2 years from Feb 2000 – 2002
- Sponsors: GCHQ
PASTA: Protein Active Site Template Acquisition
- Aim: Use of IE techniques to create a database of protein active site data to support protein structure analysis
- Partners: Dept. of Computer Science, Information Studies, Mol. Biology and Biotechnology, Univ. of Sheffield
- Sponsors: BBSRC-EPSRC Bioinformatics Initiative
Molecular Biology
PASTA System Architecture
Recognition of Biological Terminology
MUMIS: MUltiMedia Indexing and Searching environment
- Application of IE technology to multimedia, multilingual video indexing in the football domain
- 2 years: June 2000 - 2002
- CTIT (NL), University of Sheffield (UK), DFKI (D), Max Planck Institute (D), University of Nijmegen (NL), ESTeam (SWE), VDA (NL)