Learning formal definitions
for biomedical concepts
ALINA PETROVA
EMCL WORKSHOP
18.02.2014
Examples of reasoning
over structured biomedical knowledge
1) Covert et al. 2012: Whole-cell simulation
• computational model of all processes in a bacterium
• 2 years, >1000 articles
2) King et al. 2009: Automation of science
• Adam the Robot Scientist
• generate functional genomic hypotheses about a yeast
• used knowledge bases and ontologies for hypothesis generation and analysis
• experimental validation
The growth of biomedical scientific literature
Tsatsaronis et al. 2013
Existing biomedical ontologies
Ontology     # concepts   Year   Research/production   Definitions
UMLS         1,000,000    1986   R, P                  textual, triples
SNOMED CT    300,000      1965   P                     formal
FMA          75,000       1995   P                     triples
GO           42,000       1998   P                     textual
GALEN        29,000       1991   R                     formal
MeSH         25,000       1963   P                     textual
Great need to convert textual definitions to formal representation!
Formalizing biomedical knowledge
Atelectasis (Lung collapse) example:
Absence of air in the entire or part of a lung, such as an incompletely inflated neonate
lung or a collapsed adult lung. Pulmonary atelectasis can be caused by airway
obstruction, lung compression, fibrotic contraction, or other factors.
vs.
Atelectasis = Disorder of lung ⊓
∃has_associate_morphology(Collapse) ⊓
∃has_finding_site(Lung structure) ⊓
∃has_episodicity(Episodicities) ⊓
∃has_clinical_course(Courses)
…
An example of a MeSH definition
Arthritis is a form of joint disorder that results
from joint inflammation. When bone surfaces
become less well protected by cartilage, bone may be
exposed and damaged.
Is it easy to formalize a definition?
Arthritis is a form of joint disorder that results from joint inflammation.
Arthritis = Joint_Disorder ⊓ ∃results_from.Joint_Inflammation
YES!
Is it easy to formalize a definition?
Temporal logic?
Situation calculus?
When bone surfaces become less well protected by cartilage,
bone may be exposed and damaged.
DL? which one?
Modal logic?
???
NO!
Sources of problems
Conceptual modeling
• Joint_Inflammation or Inflammation – related_to – Joints?
Expressive modeling
• what exactly do we want to model? to what degree of sophistication? using which formalism?
Text mining
• how to establish the dependencies between words in a definition?
The Goal
A is a B that has property C.
A ≡ B ⊓ ∃property.C
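A minimal sketch of the target output format, assuming the parts of the definition (concept, parent, property/filler pairs) have already been extracted. The helper name `to_axiom` is hypothetical and only renders the pattern as a string; it does not do any extraction.

```python
def to_axiom(concept, parent, restrictions):
    """Render 'A ≡ B ⊓ ∃property.C' from already extracted parts."""
    parts = [parent] + [f"∃{prop}.{filler}" for prop, filler in restrictions]
    return f"{concept} ≡ " + " ⊓ ".join(parts)


# The Arthritis example from the earlier slide.
axiom = to_axiom(
    "Arthritis",
    "Joint_Disorder",
    [("results_from", "Joint_Inflammation")],
)
print(axiom)  # Arthritis ≡ Joint_Disorder ⊓ ∃results_from.Joint_Inflammation
```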
How to extract formal definitions?
CONCEPT ANNOTATION
RELATION EXTRACTION
RELATION CLASSIFICATION
Example
Abdominal Wall: the outer margins of the abdomen,
extending from the osteocartilaginous thoracic cage to the
pelvis.
Step 1: Concept annotation
Abdominal Wall: the outer margins of the abdomen,
extending from the osteocartilaginous thoracic cage to the
pelvis.
the abdomen -> ‘Abdomen’
the osteocartilaginous thoracic cage -> ‘Thorax’
the pelvis -> ‘Pelvis’
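Concept annotation can be approximated by a dictionary lookup. The sketch below is an assumption for illustration, not the annotator used in this work: the `LEXICON` holds only the three mappings shown on the slide, and `annotate` is a hypothetical helper doing exact, case-insensitive phrase matching.

```python
# Toy lexicon: surface phrase -> concept label (entries taken from the slide).
LEXICON = {
    "the abdomen": "Abdomen",
    "the osteocartilaginous thoracic cage": "Thorax",
    "the pelvis": "Pelvis",
}


def annotate(text, lexicon=LEXICON):
    """Return sorted (start, end, concept) spans for every lexicon phrase in text."""
    lowered = text.lower()
    spans = []
    for phrase, concept in lexicon.items():
        start = lowered.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase), concept))
            start = lowered.find(phrase, start + 1)
    return sorted(spans)


definition = (
    "the outer margins of the abdomen, extending from the "
    "osteocartilaginous thoracic cage to the pelvis."
)
annotations = annotate(definition)
print([concept for _, _, concept in annotations])  # ['Abdomen', 'Thorax', 'Pelvis']
```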
Step 2: Relation extraction
Abdominal Wall: the outer margins of the abdomen,
extending from the osteocartilaginous thoracic cage to the
pelvis.
“outer margins of” (Abdominal wall, Abdomen)
“that extends from” (Abdominal wall, Thorax)
“that extends to” (Abdominal wall, Pelvis)
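One simple way to obtain relational strings, sketched here under the assumption that concept mentions are already located: take the text between the previous mention and the current one, and pair the head concept with each mention. The function `relational_strings` is hypothetical, and the normalization shown on the slide ("extending from" becoming "that extends from") is left out.

```python
import string


def relational_strings(head, body, mentions):
    """Pair the head with each mention; the relational string is the text
    between the previous mention and the current one."""
    triples = []
    cursor = 0
    for phrase, concept in mentions:
        start = body.find(phrase, cursor)
        if start == -1:
            continue  # mention not found; skip rather than guess
        between = body[cursor:start].strip(string.punctuation + " ")
        triples.append((between, head, concept))
        cursor = start + len(phrase)
    return triples


body = (
    "the outer margins of the abdomen, extending from the "
    "osteocartilaginous thoracic cage to the pelvis."
)
mentions = [
    ("the abdomen", "Abdomen"),
    ("the osteocartilaginous thoracic cage", "Thorax"),
    ("the pelvis", "Pelvis"),
]
triples = relational_strings("Abdominal wall", body, mentions)
for triple in triples:
    print(triple)
# ('the outer margins of', 'Abdominal wall', 'Abdomen')
# ('extending from', 'Abdominal wall', 'Thorax')
# ('to', 'Abdominal wall', 'Pelvis')
```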
Step 3: Relation classification
“outer margins of” (Abdominal wall, Abdomen)
“that extends from” (Abdominal wall, Thorax)
“that extends to” (Abdominal wall, Pelvis)
location(Abdominal wall, Abdomen)
starts(Abdominal wall, Thorax)
ends(Abdominal wall, Pelvis)
How to extract formal definitions?
CONCEPT ANNOTATION
RELATION EXTRACTION
RELATION CLASSIFICATION
SUPERVISED RELATION EXTRACTION
Approach #1: align existing resources
Atelectasis (Lung collapse) example:
Absence of air in the entire or part of a lung, such as an incompletely inflated neonate
lung or a collapsed adult lung. Pulmonary atelectasis can be caused by airway
obstruction, lung compression, fibrotic contraction, or other factors.
vs.
Atelectasis = Disorder of lung ⊓
∃has_associate_morphology(Collapse) ⊓
∃has_finding_site(Lung structure) ⊓
∃has_episodicity(Episodicities) ⊓
∃has_clinical_course(Courses)
…
Results
A – relational string – B
A – relation label – B
Relations: extract 3 SNOMED relations from MeSH textual definitions
Results: 75% success rate for single-label classification
Results
A – relational string – B
A – relation label – B
How to improve 75%?
• add new features
• use resources with consistent modeling
Be data-driven!
Approach #2: annotate a corpus
SemRep:
a rule-based system for biomedical relation extraction
26 relations
a corpus of 500 annotated sentences
1300 relation instances
Top relations:
process_of, location_of, part_of, treats, isa, affects,
causes, interacts_with, uses etc.
SemRep relations
Two key improvements
Consistent modeling
Before: MeSH texts VS. SNOMED CT relations
After: SemRep texts VS. SemRep relations
The use of concept types
Before: lexical features (ngrams)
After: ngrams + concept types of relation arguments
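A minimal sketch of the combined feature representation, assuming character trigrams over the relational string and boolean weights. The function names and the `ng=` / `type_a=` / `type_b=` feature keys are illustrative choices, not the exact feature set used in the experiments.

```python
def char_ngrams(text, n=3):
    """Character n-grams of the lowercased string, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def features(relational_string, type_a, type_b, n=3):
    """Boolean feature dict: n-gram features plus the concept types of both arguments."""
    feats = {f"ng={gram}": 1 for gram in char_ngrams(relational_string, n)}
    feats[f"type_a={type_a}"] = 1
    feats[f"type_b={type_b}"] = 1
    return feats


example = features("causes", "Body Substance", "Anatomical Abnormality")
print(sorted(example))
```

A dictionary of this shape can be passed to any classifier that accepts sparse boolean features.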
Concept types
Motivation: every relation has a domain and a range
→ only specific types of concepts can be used as arguments
UMLS (biggest knowledge source for biomedicine,
thesaurus, upper ontology etc.):
133 semantic types
Tissue, Cell Function, Animal, Behavior, Physical Object,
Molecular Sequence etc.
Hormone – affects – Cell Function
Body Substance – causes – Anatomical Abnormality
Why are concept types useful?
Given concepts A, B
MeSH triple: A “is in some relation with” B
Before:
A – relation R1 – B
A – relation R2 – B
both are candidates!
After:
A → type At, B → type Bt
R1 ⊆ At × Bt
R2 ⊆ Ct × Dt
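The domain/range idea can be sketched as a filter over candidate relations. The `SIGNATURES` table below is a toy assumption holding only the two typed examples from the previous slide; a real table would cover the full relation set and its type signatures.

```python
# Toy signatures: relation -> allowed (domain type, range type) pairs.
SIGNATURES = {
    "affects": {("Hormone", "Cell Function")},
    "causes": {("Body Substance", "Anatomical Abnormality")},
}


def candidate_relations(type_a, type_b, signatures=SIGNATURES):
    """Keep only relations whose signature admits the pair of argument types."""
    return sorted(
        relation
        for relation, pairs in signatures.items()
        if (type_a, type_b) in pairs
    )


candidates = candidate_relations("Hormone", "Cell Function")
print(candidates)  # ['affects']
```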
RESULTS
Before: 424 instances, top 3 relations, 75%
After: 860 instances, top 5 relations, 94%
1144 instances, top 10 relations, 89%
1357 instances, all 26 relations, 83%
Comparison with SemRep
                 SemRep                                       ML method
Quality
  top 5          95%                                          94%
  top 10         94%                                          89%
  all            94%                                          83%
Scalability      not scalable                                 scalable
Training speed   manually annotated corpus + rules = months   annotated corpus + ML = minutes
still rely on the labeled corpus → approach #3
UNSUPERVISED RELATION EXTRACTION
Why is no annotated corpus needed?
Original approach:
term A – relational string – term B
annotation
concept A – formal relation – concept B
Now add the concept types!
The corpus is not manually annotated!
term A – relational string – term B
concept A – relational string – concept B
known from taxonomy/thesaurus!
known from the corpus
concept type A’ – formal relation – concept type B’
Still we use SemRep as a background. Can we do better?
Approach #3: unsupervised relation extraction
Yes!
term A – relational string – term B
no manual annotation
no predefined relations
only taxonomy and annotation needed
semantic clustering
concept A – relational string – concept B
concept type A’ – verb – concept type B’
Cluster examples
{attach, bind}
{cause, produce, induce}
{transmit, convey, carry}
{limit, inhibit, reduce}
{result, lead}
etc.
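The slides do not specify the clustering method, so the following is only one possible sketch: verbs are grouped greedily when the sets of (concept type A', concept type B') pairs they occur with overlap strongly (Jaccard similarity). The observations in `contexts`, the type names, and the threshold are all invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_verbs(contexts, threshold=0.5):
    """Greedy clustering: a verb joins the first cluster whose seed verb has
    sufficiently similar contexts, otherwise it starts a new cluster."""
    clusters = []
    for verb in sorted(contexts):
        for cluster in clusters:
            if jaccard(contexts[verb], contexts[cluster[0]]) >= threshold:
                cluster.append(verb)
                break
        else:
            clusters.append([verb])
    return clusters


# Invented toy observations: verb -> set of (type A', type B') argument pairs.
contexts = {
    "attach": {("Protein", "Receptor"), ("Virus", "Cell")},
    "bind": {("Protein", "Receptor"), ("Virus", "Cell"), ("Drug", "Receptor")},
    "cause": {("Virus", "Disease"), ("Drug", "Symptom")},
    "induce": {("Virus", "Disease"), ("Drug", "Symptom")},
}
clusters = cluster_verbs(contexts)
print(clusters)  # [['attach', 'bind'], ['cause', 'induce']]
```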
Conclusions
decompose the task of formal definition generation
• review of the existing approaches
• adaptation/creation
• implementation
• evaluation
explore non-taxonomic relation extraction
• feature analysis
• performance of 94% on the top 5 relations, on a par with SemRep (95%)
suggest workflow for unsupervised relation extraction
• faster
• less resource dependent
• can be generalized to different domains and applications
Thank you!
QUESTIONS?
Back-up slides
TRIPLE EXTRACTION
Example
Abdominal Wall: the outer margins of the abdomen,
extending from the osteocartilaginous thoracic cage to the
pelvis.
STEP #2: triple extraction
“outer margins of the” (Abdominal wall, Abdomen)
“that extends from the osteocartilaginous” (Abdominal wall, Thorax)
“to the” (Abdominal wall, Pelvis)
Triple extraction steps
1. separate the definition into head and body
2. find the parent term, if there is one
3. group coordinated concepts together
4. organize concepts into concept pairs
5. extract relational string for every pair
6. detect negation
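Steps 1 and 2 can be sketched as follows, assuming that a colon separates head and body and that parent terms appear in an "X is a Y that ..." pattern. Both helpers and the regular expression are hypothetical simplifications; real definitions need more patterns than this.

```python
import re


def split_head_body(definition):
    """'Head: body' -> (head, body); assumes a colon separates the two."""
    head, _, body = definition.partition(":")
    return head.strip(), body.strip()


# Hypothetical pattern for "X is a/an Y that ..." definitions.
IS_A = re.compile(
    r"^(?P<child>[\w\s]+?) is an? (?P<parent>[\w\s]+?) that\b",
    re.IGNORECASE,
)


def find_parent(sentence):
    """Return ('IS_A', child, parent) if the pattern matches, else None."""
    match = IS_A.match(sentence)
    if match is None:
        return None
    child = match.group("child").strip()
    parent = match.group("parent").strip().capitalize()
    return ("IS_A", child, parent)


head, body = split_head_body("Abdominal Wall: the outer margins of the abdomen.")
print(head)  # Abdominal Wall
print(find_parent("Cancer is a disease that spreads."))  # ('IS_A', 'Cancer', 'Disease')
```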
Triple extraction steps
separate the definition into head and body
Head: Abdominal wall
Body: the outer margins of the abdomen…
find the parent term, if there is one
group coordinated concepts together
organize concepts into concept pairs
extract relational string for every pair
detect negation
Triple extraction steps
separate the definition into head and body
find the parent term, if there is one
“Cancer is a disease that…”
→ IS_A(Cancer, Disease)
group coordinated concepts together
organize concepts into concept pairs
extract relational string for every pair
detect negation
Triple extraction steps
separate the definition into head and body
find the parent term, if there is one
group coordinated concepts together
“X causes swelling and rashes”
→ causes(X, Swelling), causes(X, Rash)
organize concepts into concept pairs
extract relational string for every pair
detect negation
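Grouping coordinated concepts can be sketched as splitting the object phrase on commas and "and"/"or", then emitting one triple per conjunct. The helper `expand_coordination` is hypothetical, and mapping the surface forms to concepts (for example "rashes" to Rash) is left out.

```python
import re

# Split on commas (optionally followed by and/or) or on a bare and/or.
COORDINATION = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def expand_coordination(relation, subject, object_phrase):
    """Emit one (relation, subject, object) triple per coordinated object."""
    parts = COORDINATION.split(object_phrase)
    return [(relation, subject, part.strip()) for part in parts if part.strip()]


expanded = expand_coordination("causes", "X", "swelling and rashes")
print(expanded)  # [('causes', 'X', 'swelling'), ('causes', 'X', 'rashes')]
```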
Triple extraction steps
separate the definition into head and body
find the parent term, if there is one
group coordinated concepts together
organize concepts into concept pairs
extract relational string for every pair
“that extends from the osteocartilaginous” (Abdominal wall, Thorax)
detect negation
Triple extraction steps
separate the definition into head and body
find the parent term, if there is one
group coordinated concepts together
organize concepts into concept pairs
extract relational string for every pair
detect negation
“that does not respond to the ordinary” (Refractory anemia, Treatment) → NEGATION
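Negation detection can be approximated by looking for cue words in the relational string. The cue list below is an assumption for illustration; the slides do not say which cues are actually used.

```python
import re

# Hypothetical cue list; word boundaries prevent matches inside longer words.
NEGATION_CUES = re.compile(
    r"\b(?:not|no|without|never|absence of|lack of)\b",
    re.IGNORECASE,
)


def is_negated(relational_string):
    """True if the relational string contains a negation cue."""
    return NEGATION_CUES.search(relational_string) is not None


print(is_negated("that does not respond to the ordinary"))  # True
print(is_negated("that extends from the osteocartilaginous"))  # False
```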
Back-up slides
ANNOTATION
Attribute Alignment Annotator
Problem #1: missing annotations
Problem #2: ambiguity
Back-up slides
DETAILS OF THE APPROACH
Improvements since the last meeting
                           Old approach                                             New approach
Text source                MeSH definitions                                         MEDLINE abstracts
Relation set R source      SNOMED CT                                                UMLS
Feature sources            text of a definition                                     text of a definition + concept types
Feature representations    BoW, token and character ngrams, combination             character ngrams
Weighting schemes          boolean, per-class weights                               boolean
Classification algorithm   SVMs, Random Forests, Logistic Regression, Naïve Bayes   SVMs