Vladimir Alexiev, Ontotext Corp

Multisensor meeting, 2014-10-08, Bonn, Germany
Proudly made in plain text with reveal.js, org-reveal, org-mode and emacs.
There's been a flurry of activity in recent years to represent NLP data as RDF.
NLP data is usually large, so why represent it in RDF?
Intro: Christian Chiarcos, John McCrae, Philipp Cimiano, and Christiane Fellbaum. Towards Open Data for Linguistics: Linguistic Linked Data. In New Trends of Research in Ontologies and Lexical Resources. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg, 2013.
Collaborative bibliography on Linguistic LOD: representing language resources and text annotations as RDF.
Detailed example of annotating one sentence: Turtle, highlighted.
Areas covered include:
Example based on Guardian's article "Goodbye Nuclear Power" with LinguaTec NER: Turtle, highlighted.
Compare to JSONLD or JSONLD without prefixes
We describe briefly the following linguistic ontologies
OLIA includes 34 annotation models (tagsets) for 69 languages
<#Germany-1> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#is-2> nif:oliaLink penn:VBZ; nif:oliaClass penn:BePresentTense.
<#the-3> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
X-link.owl abstracts over X.owl by providing OLIA subclasses/subproperties, eg
<#Germany-1> nif:oliaClass olia:ProperNoun.
OLIA abstraction doesn't work perfectly in all cases, eg
penn:Determiner doesn't have an OLIA mapping: "Not clear whether this corresponds to OLiA/EAGLES determiners"
penn:BePresentTense is mapped to an anonymous class: a unionOf plus a restriction that olia:hasTense has type olia:Present
<#is-2> nif:oliaClass
[a owl:Class; rdfs:subClassOf
[a owl:Restriction; owl:onProperty olia:hasTense; owl:allValuesFrom olia:Present],
[owl:unionOf (olia:FiniteVerb olia:StrictAuxiliaryVerb)]].
| Ontology | Class | ObjProp | DataProp | Description |
| olia_system | 6 | 3 | 6 | Feature, LinguisticAnnotation, Relation, UnitOfAnnotation, hasTag, hasTier |
| olia_top | 62 | | | Top categories of the OLiA model |
| olia | 857 | 50 | | Full OLiA model |
Class: penn:BePresentTense
SubClassOf:
olia:hasTense only olia:Present,
(olia:FiniteVerb or olia:StrictAuxiliaryVerb)
<#Germany-1> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#is-2> nif:oliaLink penn:VBZ; nif:oliaClass penn:BePresentTense.
<#the-3> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
<#work-4> nif:oliaLink penn:NN; nif:oliaClass penn:CommonNoun.
<#horse-5> nif:oliaLink penn:NN; nif:oliaClass penn:CommonNoun.
<#of-6> nif:oliaLink penn:IN; nif:oliaClass penn:PrepositionOrSubordinatingConjunction.
<#the-7> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
<#European-8> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#Union-9> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
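The POS annotations above follow a regular pattern, so they could be generated from tagger output. A minimal Python sketch, assuming the tagger yields (word, position, Penn tag) triples. The names pos_to_nif and PENN_CLASS are illustrative and not part of NIF or OLIA, and the table only covers the tags that occur in this sentence:

```python
# Penn tag -> Penn ontology class, copied from the example sentence above.
# Illustrative only: it covers just the tags used in this one sentence
# (VBZ maps to BePresentTense here because the verb is "is").
PENN_CLASS = {
    "NNP": "ProperNoun",
    "VBZ": "BePresentTense",
    "DT": "Determiner",
    "NN": "CommonNoun",
    "IN": "PrepositionOrSubordinatingConjunction",
}


def pos_to_nif(tokens):
    """Return one Turtle line per (word, position, tag) triple."""
    lines = []
    for word, position, tag in tokens:
        lines.append(
            f"<#{word}-{position}> nif:oliaLink penn:{tag}; "
            f"nif:oliaClass penn:{PENN_CLASS[tag]}."
        )
    return lines


if __name__ == "__main__":
    sentence = [("Germany", 1, "NNP"), ("is", 2, "VBZ"), ("the", 3, "DT")]
    print("\n".join(pos_to_nif(sentence)))
```

A tag missing from PENN_CLASS raises KeyError, which is deliberate in this sketch: a real converter would read the tag-to-class links from the OLIA annotation model rather than from a hand-written table.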
Represent as nif:dependency. All are subclasses of stanford:DependencyLabel
nsubj(horse-5,Germany-1): a NominalSubject<Subject<Argument<Dependent
cop(horse-5,is-2): a Copula<Auxiliary<Dependent
det(horse-5,the-3): a Determiner<Modifier<Dependent
nn(horse-5,work-4): a NounCompoundModifier<Modifier<Dependent
root(ROOT-0,horse-5): a Root
prep(horse-5,of-6): a PrepositionalModifier<Modifier<Dependent
det(Union-9,the-7): a Determiner<Modifier<Dependent
amod(Union-9,European-8): a AdjectivalModifier<Modifier<Dependent
pobj(of-6,Union-9): a ObjectOfPreposition<Object<Complement<Argument<Dependent
stanford:nsubj a stanford:NominalSubject.
stanford:NominalSubject rdfs:subClassOf* stanford:DependencyLabel.
stanford:DependencyLabel rdfs:subClassOf olia_system:Feature.
<#horse-5> nif:dependency <#Germany-1>. <#Germany-1> a stanford:NominalSubject.
<#horse-5> nif:dependency <#is-2>. <#is-2> a stanford:Copula.
<#horse-5> nif:dependency <#the-3>. <#the-3> a stanford:Determiner.
<#horse-5> nif:dependency <#work-4>. <#work-4> a stanford:NounCompoundModifier.
<#ROOT-0> nif:dependency <#horse-5>. <#horse-5> a stanford:Root.
<#horse-5> nif:dependency <#of-6>. <#of-6> a stanford:PrepositionalModifier.
<#Union-9> nif:dependency <#the-7>. <#the-7> a stanford:Determiner.
<#Union-9> nif:dependency <#European-8>. <#European-8> a stanford:AdjectivalModifier.
<#of-6> nif:dependency <#Union-9>. <#Union-9> a stanford:ObjectOfPreposition.
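The step from Stanford's label(governor,dependent) notation to the nif:dependency triples above is mechanical. A minimal Python sketch, assuming dependencies arrive as strings in exactly that notation. The names dependency_to_nif, DEP_CLASS and DEP_PATTERN are illustrative, and the label table holds only the labels used in this sentence:

```python
import re

# Stanford label -> stanford: class, taken from the dependency list above.
DEP_CLASS = {
    "nsubj": "NominalSubject",
    "cop": "Copula",
    "det": "Determiner",
    "nn": "NounCompoundModifier",
    "root": "Root",
    "prep": "PrepositionalModifier",
    "amod": "AdjectivalModifier",
    "pobj": "ObjectOfPreposition",
}

# Matches e.g. "nsubj(horse-5,Germany-1)" -> label, governor, dependent.
DEP_PATTERN = re.compile(r"^(\w+)\(([^,()]+),\s*([^,()]+)\)$")


def dependency_to_nif(dep):
    """Turn one Stanford dependency string into two Turtle statements."""
    match = DEP_PATTERN.match(dep.strip())
    if match is None:
        raise ValueError(f"not a dependency string: {dep!r}")
    label, governor, dependent = match.groups()
    return (
        f"<#{governor}> nif:dependency <#{dependent}>. "
        f"<#{dependent}> a stanford:{DEP_CLASS[label]}."
    )


if __name__ == "__main__":
    for dep in ["nsubj(horse-5,Germany-1)", "root(ROOT-0,horse-5)"]:
        print(dependency_to_nif(dep))
```

As in the triples above, the dependency label is attached as a class of the dependent word, while the governor points to it with nif:dependency.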
Internationalization Tag Set (ITS) Version 2.0 is a fairly big W3C spec
We use only the Text Analysis itsrdf: props
taAnnotatorsRef, taConfidence: which software and what confidence
taClassRef: class of annotated text/entity (eg nerd:Company, nerd:PhoneNumber, nerd:Time)
taIdentRef: URL of annotated entity
taSource (eg "Wordnet3.0"), taIdent (eg "301467919"): for entities that are not yet in RDF/resolvable
Common NER types across semantic annotators
IKS (FISE) was the start of Apache Stanbol, a framework for semantic content annotation and management.
Stanbol is only as good as the underlying engines
Analogs (but the properties are in different nodes!)
| fise:extracted-from | n/a. Points to the word occurrence |
| fise:start | nif:beginIndex |
| fise:end | nif:endIndex |
| fise:selected-text | nif:anchorOf |
| fise:entity-type | itsrdf:taClassRef |
| fise:entity-reference | itsrdf:taIdentRef |
| fise:confidence | itsrdf:taConfidence: number |
| fise:confidence-level | none. owl:Individual: suggestion, uncertain, ambiguous, certain |
| fise:entity-label | eg rdfs:label on the referenced entity |
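The direct analogs in the table can be read as a renaming step. A minimal Python sketch, assuming a FISE annotation has already been flattened into one dict. That is a simplification, since the properties sit on different nodes, and both the function name fise_to_nif and the sample values are made up for illustration:

```python
# FISE property -> NIF/ITS analog. Only the one-to-one rows of the table
# are listed; rows marked "n/a" or "none" have no entry.
FISE_TO_NIF = {
    "fise:start": "nif:beginIndex",
    "fise:end": "nif:endIndex",
    "fise:entity-type": "itsrdf:taClassRef",
    "fise:entity-reference": "itsrdf:taIdentRef",
    "fise:confidence": "itsrdf:taConfidence",
}


def fise_to_nif(annotation):
    """Rename the keys of a flat FISE annotation; drop keys without an analog."""
    return {
        FISE_TO_NIF[key]: value
        for key, value in annotation.items()
        if key in FISE_TO_NIF
    }


if __name__ == "__main__":
    # Hypothetical annotation, not taken from a real Stanbol response.
    sample = {
        "fise:start": 0,
        "fise:end": 7,
        "fise:confidence": 0.9,
        "fise:confidence-level": "certain",  # no NIF/ITS analog, dropped
    }
    print(fise_to_nif(sample))
```

Dropping properties silently keeps the sketch short; a real converter would more likely report them, since fise:confidence-level has no counterpart and its information would otherwise be lost.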
Sentiment/opinion. Aggregates many opinions (with count), about thing/part/feature
Compare to schema.org Review, Rating, AggregateRating
Representation of websites, folders, pages, forums, postings, users
Ontologies
Thesauri (lists of NLP terms):
Lexicon Model for Ontologies: for representing Wordnets, dictionaries, lexica. See Quick Guide
Extend LEMON with additional features. See Cookbook
LemonGrass (formerly lemon2gf): convertor from Lemon lexicon+ontology
W3C community group. Spec draft (wiki, github, html preview).
Modules:
Best practices:
ISO TC37 Data Category Registry (DCR)
curl -L -Haccept:application/rdf+xml http://www.isocat.org/datcat/DC-65
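The same content negotiation can be done from a script. A minimal Python sketch using only the standard library. The function names rdf_request and fetch_rdf are illustrative, and whether the ISOcat server still answers this request is not guaranteed:

```python
import urllib.request


def rdf_request(url):
    """Build a GET request that asks for RDF/XML, like curl's -H accept:..."""
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


def fetch_rdf(url, timeout=10):
    """Fetch the RDF/XML body; urlopen follows redirects, like curl -L."""
    with urllib.request.urlopen(rdf_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Needs network access; the URL is the one from the curl command above.
    print(fetch_rdf("http://www.isocat.org/datcat/DC-65"))
```

Splitting request construction from the network call makes the Accept header easy to check without contacting the server.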
ontology, extends LEMON
lexinfo:abbreviationFor a owl:ObjectProperty ;
  dcr:datcat <http://www.isocat.org/datcat/DC-65> ;
  rdfs:subPropertyOf lexinfo:contractionFor .
Defines 592 entities:
verb, thirdPerson, vulgarRegisterVerb, VerbPOS, VerbPhrase, Tense
substanceHolonym, synonym, translation, tense, voice
pronunciation, romanization, transliteration
languageSpecific, example
gold.ttl defines:
OrthographicSystem, ReferentialVoice, Vowel
geneticallyRelated (HumanLanguageVariety), literalTranslation, writtenRealization
abbreviation, phoneticRep, hasExample
Old UI, new UI at DANS (supports Chrome)
In the following slides we describe large-scale Linguistic resources.
Datasets already integrated in FactForge (but old versions):
WordNet: well-known and prototypical lexical resource
http://www.image-net.org: sample images for WordNet
Crowdsourced dictionaries of >300 languages. Eg ancora#Latin at http://en.wiktionary.org:
Dataset that integrates in LEMON format:
Integrates WordNet, Open Multilingual WordNet, Wikipedia, OmegaWiki, Wikidata, Wiktionary
http://babelnet.org/2.0/data/banca_n_IT
bn:banca_n_IT a lemon:LexicalEntry ;
  rdfs:label "banca"@it ;
  lemon:canonicalForm bn:banca_n_IT/canonicalForm ;
  lemon:language "IT" ;
  lemon:sense bn:banca_IT/s03802146n, bn:banca_IT/s00008371n, bn:banca_IT/s00008364n ;
  lexinfo:partOfSpeech lexinfo:noun .
http://babelnet.org/2.0/data/banca_IT/s03802146n
bn:Bank_%28topography%29_EN/s03802146n lexinfo:translation bn:banca_IT/s03802146n .
bn:Bank_%28sea_floor%29_EN/s03802146n lexinfo:translation bn:banca_IT/s03802146n .
bn:banca_IT/s03802146n a lemon:LexicalSense ;
  bn-lemon:byTrans 1 ;
  dc:source <http://wikipedia.org/> ;
  dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
  lemon:reference bn:s03802146n .
Eg ancora#lat at http://babelnet.org (3.0 just came out)
Babelfy: annotation API based on BabelNet
Another NER/annotation service; based on DBpedia labels. Too eager, low precision: