Machine Learning for Understanding Biomedical Publications

Grigorios Tsoumakas,
School of Informatics,
Aristotle University of Thessaloniki
with: A. Anagnostou, A. Fachantidis, A. Lagopoulos,
M. Laliotis, N. Markantonatos, Y. Papanikolaou, I. Vlahavas

ENSEMBLE METHODS
Pedro Domingos. 2012. A few useful things to know about
machine learning. Commun. ACM 55
“LEARN MANY MODELS, NOT JUST ONE”
Anthony Goldbloom. Kaggle CEO. Oct 2015.
“As long as Kaggle has been around, it has almost
always been ensembles of decision trees that have
won competitions. It used to be random forest that
was the big winner, but over the last six months a new
algorithm called XGBoost has cropped up, and it’s
winning practically every competition in the
structured data category.”

MULTI-LABEL LEARNING
            X1    X2  …  Xp    Y1  Y2  …  Yq
training    0.12  1   …  12    0   1   …  1
training    2.34  9   …  -5    1   1   …  0
training    1.22  3   …  40    1   0   …  1
unknown     2.18  2   …  8     ?   ?   …  ?
unknown     1.76  7   …  23    ?   ?   …  ?
p input variables (X), q binary output variables (Y), m training examples, followed by unknown instances
Binary Relevance (BR)
• Learns one binary model per label
• Ignores label dependencies
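A minimal sketch of Binary Relevance, assuming a toy nearest-centroid learner in place of a real binary classifier. The names `train_centroid`, `fit_binary_relevance` and `predict_binary_relevance`, and the small data set, are illustrative and do not come from the slides.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def train_centroid(X: List[Vector], y: List[int]) -> Callable[[Vector], int]:
    """Toy binary learner (an assumption): predict the class of the nearest centroid."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    # A label with only one class in the training data gets a constant predictor.
    if not pos or not neg:
        constant = 1 if pos else 0
        return lambda x: constant

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: 1 if dist(x, c_pos) < dist(x, c_neg) else 0


def fit_binary_relevance(X, Y, learner=train_centroid):
    """Train one independent binary model per label column of Y."""
    q = len(Y[0])
    return [learner(X, [row[j] for row in Y]) for j in range(q)]


def predict_binary_relevance(models, x):
    """Each label is predicted on its own; label dependencies are ignored."""
    return [model(x) for model in models]


X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
Y = [[0, 1], [0, 1], [1, 0], [1, 1]]
models = fit_binary_relevance(X, Y)
prediction = predict_binary_relevance(models, [1.0, 0.9])
```

Any binary classifier could replace the toy learner; the point is only that the q models never see each other's labels.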

MULTI-LABEL LEARNING FROM BIOLOGICAL DATA
Annotation of proteins with functions
• FunCat, 6 levels, 492 labels, ~9 on avg.
• GO, 14 levels, 3997 labels, ~35 on avg.
Drug discovery (Johnson & Johnson)
• 743,336 chemical compounds
• ~13 million chemical structure features (sparse)
• 5,069 biomolecular targets (e.g. proteins)

OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction

Literatum is Atypon’s online content hosting and management platform
Atypon is home to more than one-third of the world’s English-language
professional and scholarly journals — more than any other technology
company
Atypon’s clients include Elsevier, IEEE, MIT Press, Oxford University
Press, …
Atypon was acquired by John Wiley & Sons in 2016 for $120,000,000

OUTLINE
• Y. Papanikolaou, G. Tsoumakas, M. Laliotis, N. Markantonatos, I. Vlahavas (2017) Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models, Journal of Biomedical Semantics

CHALLENGE
[Figure: documents per year, 1950 to 2013; vertical axis from 0 to 1,200,000; annotation "x $10"]
PubMed abstracts
• 12,834,585 (20 GB)
MeSH terms
• 27,773, ~13 per abstract on avg.
Online test setting
• 3 phases, 5 weeks per phase
• MeSH terms for 6k to 10k abstracts requested within 21 hours
1 million docs / year ≅ 2,740 docs / day

LABEL FREQUENCY DISTRIBUTION
4.3 million abstracts
213 labels with 1 example
1,680 labels with fewer than 10 examples
4 labels with more than 1 million examples
[Figure: label frequency distribution; axes: label frequency, number of labels]

PROGRESS 2013 – 2017
[Figure: Micro F-measure by year, 2013 to 2017, for AUTH, Fudan and NLM; vertical axis from 0.54 to 0.68]

PRE-PROCESSING
Input: text of title and abstract
Pipeline steps
• parsing
• tokenization, lowercasing, n-gram extraction
• duplicate removal
• journal filtering
• tf-idf computation, unit length normalization
Counts shown in the diagram: 10,876,004; 10,699,707; 3,950,721
Features: unigrams and bigrams with >5 frequency
Output: final feature vectors
Last 12,000 withheld for evaluation
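The feature steps named above (lowercasing, unigram and bigram extraction, frequency filtering, tf-idf, unit length normalization) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: whitespace tokenization, raw term counts, idf = log(N / df), and a frequency filter applied to total corpus counts are all choices made here, and `ngrams` and `tfidf_vectors` are hypothetical names.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_vectors(docs, min_freq=5):
    """Sparse tf-idf vectors (dicts) scaled to unit Euclidean length."""
    counts = [Counter(ngrams(d)) for d in docs]
    corpus_freq, doc_freq = Counter(), Counter()
    for c in counts:
        corpus_freq.update(c)
        doc_freq.update(c.keys())
    # Keep n-grams whose corpus frequency exceeds min_freq (assumed reading of ">5 frequency").
    vocab = {g for g, n in corpus_freq.items() if n > min_freq}
    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {g: tf * math.log(n_docs / doc_freq[g]) for g, tf in c.items() if g in vocab}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {g: w / norm for g, w in vec.items()}
        vectors.append(vec)
    return vectors


docs = [
    "Liposomes improve drug delivery",
    "Drug delivery kinetics",
    "Prognosis of liver disease",
]
# min_freq=0 keeps every n-gram so that the toy corpus is not filtered away.
vectors = tfidf_vectors(docs, min_freq=0)
```

With a real corpus the default `min_freq=5` would drop rare n-grams before weighting.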

LEARNING
Components: Biomedical Document, Label Ranker, Meta Labeler
Label Ranker
• Example label ranks: kinetics 4, prognosis 1, liposomes 2, polymers 3
• Any multi-label learning algorithm that can output a ranking of the labels
• We used a linear SVM per label and considered their unthresholded output
Meta Labeler
• Number of relevant labels: 2
• Regression or (ordinal) classification using original features or label scores/ranks
• We used linear SVM regression based on the original features
Output is: {prognosis, liposomes}
Tang, L., Rajan, S., Narayanan, V.K. “Large scale multi-label classification
via metalabeler”, Proc. 18th Int. Conf. on World Wide Web (WWW '09)
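The way the two components combine can be sketched in a few lines: the ranker orders the labels, the meta labeler supplies how many to keep. The scores below are hypothetical values chosen only to reproduce the ranks in the example (prognosis 1, liposomes 2, polymers 3, kinetics 4), and `metalabeler_predict` is an illustrative name.

```python
def metalabeler_predict(label_scores, n_relevant):
    """Return the n_relevant highest-scoring labels, best first."""
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    return ranked[:max(0, n_relevant)]


# Hypothetical unthresholded per-label scores for one document.
scores = {"kinetics": -1.3, "prognosis": 0.9, "liposomes": 0.4, "polymers": -0.2}
# The meta labeler is assumed to have predicted 2 relevant labels.
output = metalabeler_predict(scores, 2)
```

In practice the predicted count would come from a regressor and would need rounding before it is used as a cutoff.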

TIME AND SPACE
Hardware
• 4 10-core processors at 2.26 GHz, 1 TB RAM and 2.4 TB storage (6 x 600 GB SAS 10k disks in RAID 5)
Parallel learning/use of binary SVMs
• With 40 threads training & saving takes 36h
• With 20 threads loading & prediction takes 45m
Serialization
• Storing the models in 2013 required 406 GB
• 10x compression due to sparsity (L1 regularized models)
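Why sparsity saves space can be illustrated with a toy weight vector: an L1-regularized linear model has many zero weights, so only the non-zero entries need storing. This is a sketch of the idea, not the serialization format used by the authors; `to_sparse` and `sparse_dot` are hypothetical helpers.

```python
def to_sparse(weights, eps=0.0):
    """Keep only the weights whose magnitude exceeds eps, keyed by feature index."""
    return {i: w for i, w in enumerate(weights) if abs(w) > eps}


def sparse_dot(sparse_weights, x):
    """Score a dense feature vector x using only the stored non-zero weights."""
    return sum(w * x[i] for i, w in sparse_weights.items())


dense = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0]
sparse = to_sparse(dense)
ratio = len(dense) / len(sparse)  # stored entries shrink by this factor

x = [1.0] * 10
dense_score = sum(w * v for w, v in zip(dense, x))
sparse_score = sparse_dot(sparse, x)
```

The actual saving on disk also depends on how indices are encoded, which this toy ratio ignores.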

ENSEMBLE APPROACHES: CLASSIFIER SELECTION
Select the model that improves the corresponding F-measure most
• Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-size-fits-all indexing method does not exist: Automatic selection based on meta-learning. JCSE 6(2), 151-160 (2012)
• Not suitable for global non-decomposable evaluation measures, like micro-F
Iteratively select the model that improves micro-F
• Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical report, National Taiwan University (2007)
• Can we trust the evaluation based on only a few positive samples?
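To make "non-decomposable" concrete, the sketch below computes macro-F (an average of per-label F-measures) and micro-F (one F-measure over counts pooled across labels) from per-label true positive, false positive and false negative counts. The counts are invented for illustration; they pair one frequent label with one rare label.

```python
def f1(tp, fp, fn):
    """F-measure from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per label."""
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)  # pooled counts: not an average of per-label values
    return micro, macro


# Hypothetical counts: a frequent label and a rare label.
counts = [(900, 100, 100), (1, 0, 3)]
micro, macro = micro_macro_f1(counts)
```

Because micro-F pools counts, the rare label barely moves it, so choosing the best model label by label does not directly optimize it.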

MULE: MULTI-LABEL ENSEMBLE
1. Determine the globally best model ℎ∗ on a validation set
• The globally best model is determined on the positive samples of all labels
2. Determine for each label which model(s) would lead to an improvement of the global evaluation measure compared to ℎ∗
3. Compare the differences in predictions of each one of these models against the predictions of ℎ∗ using a McNemar test
4. If the null hypothesis (NH) is rejected for one or more models, select the one for which the NH has the lowest probability, otherwise select ℎ∗
• Robustness to uncertainty due to label rarity
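Steps 3 and 4 can be sketched for a single label as follows. Several choices here are assumptions rather than details from the slides: the exact (binomial) form of the McNemar test, the significance level `alpha=0.05`, and the `b > c` check, which stands in for the step 2 filter by only accepting candidates that correct more errors than they introduce. The function names and the toy predictions are illustrative.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def select_for_label(y_true, best_pred, candidates, alpha=0.05):
    """Return the candidate with the lowest significant p-value, else 'h*'."""
    chosen, chosen_p = "h*", alpha
    for name, pred in candidates.items():
        rows = list(zip(y_true, best_pred, pred))
        b = sum(1 for t, h, m in rows if m == t and h != t)  # candidate right, h* wrong
        c = sum(1 for t, h, m in rows if h == t and m != t)  # h* right, candidate wrong
        p = mcnemar_exact_p(b, c)
        if b > c and p < chosen_p:
            chosen, chosen_p = name, p
    return chosen


y_true = [1] * 10 + [0] * 10
h_star = [0] * 20  # the global best model misses every positive of this label
candidates = {"A": [1] * 8 + [0] * 12, "B": [1] * 2 + [0] * 18}
selected = select_for_label(y_true, h_star, candidates)
```

Candidate B also improves on ℎ∗ here, but with only two discordant samples the test cannot reject the null hypothesis, which is the robustness to label rarity the slide refers to.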

EMPIRICAL RESULTS IN BIOASQ
Model          Micro-F  Macro-F
Vanilla SVMs   0.5568   0.4789
Weighted SVMs  0.5665   0.5102
MetaLabeler    0.5855   0.5488
Labeled LDA    0.3698   0.3010
Ensemble         Micro-F       Macro-F
Improve F        0.5584 (all)  0.5339 (MetaLabeler, Weighted SVM)
Improve Micro-F  0.5867 (all)  -
MULE             0.5892 (all)  0.5492 (MetaLabeler, Labeled LDA)

2017 WORK
Improved version of Labeled LDA in both speed and accuracy
• Y. Papanikolaou, G. Tsoumakas, Subset Labeled LDA for Large-Scale Multi-Label Classification, arXiv:1709.05480
Employing word2vec features
More ensembles
• Stacking-based, frequency-based
Deep learning models
• Deep MLP, CNN

OUTLINE
• A. Anagnostou, A. Lagopoulos, G. Tsoumakas, I. Vlahavas (2017) Combining Inter-Review Learning-to-Rank and Intra-Review Incremental Training for Title and Abstract Screening in Systematic Reviews, eHealth Lab of the 8th Conference and Labs of the Evaluation Forum (CLEF)

DIAGNOSTIC TEST ACCURACY (DTA) REVIEWS
Title
• Thromboelastography (TEG) and rotational thromboelastometry (ROTEM) for trauma-induced coagulopathy in adult trauma patients with bleeding
Ovid MEDLINE query
1. (Thrombelastogra$ or Thromboelastogra$ or (thromb$ adj2 elastogra$) or TEG or haemoscope or haemonetics).mp
2. Thrombelastography/
3. (thromboelasto$ or thrombelasto$ or (thromb$ adj2 elastom$) or (rotational adj2 thrombelast) or ROTEM or "tem international").mp.
4. 1 or 2 or 3
5. exp animals/ not humans.sh.
6. 4 not 5
7. limit 6 to yr="1970 -Current"

THE DATA
Training data
• 20 topics, retrieved articles and relevance after abstract/content screening
Test data
• 30 topics and retrieved articles
[Figure: Pos and Neg article counts for each of the 20 training topics; logarithmic vertical axis from 1 to 100,000]

OUR APPROACH
Hybrid classification mechanism
• Inter-topic model, learning title/abstract relevance across different topics
• Intra-topic model, learning title/abstract relevance within a specific topic

INTER-TOPIC MODEL
Learning-to-rank binary classifier
Features assessing the similarity of the document (title, abstract) with the topic (title, Ovid query)
• Common terms
• Levenshtein distance
• Cosine similarity
• BM25
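Three of the listed similarity features can be sketched with the standard library alone. The tokenization (lowercase, whitespace split), the use of raw term counts for cosine similarity, and the function names are assumptions; BM25 is left out because it needs corpus-level statistics, and the example strings are invented.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def common_terms(a, b):
    """Number of distinct terms shared by the two texts."""
    return len(set(tokens(a)) & set(tokens(b)))


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cosine(a, b):
    """Cosine similarity between term-count vectors of the two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_features(document, topic):
    return [common_terms(document, topic), levenshtein(document, topic), cosine(document, topic)]


doc_title = "rotational thromboelastometry in trauma"
topic_title = "thromboelastometry for trauma patients"
features = similarity_features(doc_title, topic_title)
```

Each document and topic field pair (title against title, abstract against query, and so on) would yield its own set of such features for the ranker.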

INTRA-TOPIC MODEL: INITIAL TRAINING
TF-IDF

INTRA-TOPIC MODEL: ITERATIVE TRAINING
Parameters
• Initial size 𝑘 {5, 10}
• 1st batch size {1}
• 1st threshold {200, 300}
• 2nd batch size {50, 100}
• 2nd threshold {1000, 2000}
TF-IDF
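One plausible reading of these parameters, stated here as an assumption because the slide does not spell out the procedure: training starts from the first k ranked documents, the model is retrained after every batch of the 1st batch size until the 1st threshold of screened documents is reached, then after every batch of the 2nd batch size until the 2nd threshold. The hypothetical `retraining_points` function below only lists the training-set sizes at which retraining would occur under that reading.

```python
def retraining_points(n_docs, k=5, s1=1, t1=200, s2=100, t2=2000):
    """Training-set sizes at which the intra-topic model would be retrained.

    Assumed schedule: start with k documents, grow by s1 per step until t1
    documents are screened, then by s2 per step until t2 (or n_docs) is reached.
    """
    seen = min(k, n_docs)
    points = [seen]
    while seen < n_docs and seen < t2:
        step = s1 if seen < t1 else s2
        seen = min(seen + step, n_docs, t2)
        points.append(seen)
    return points


# Hypothetical topic with 3,000 retrieved articles.
schedule = retraining_points(3000, k=5, s1=1, t1=200, s2=100, t2=2000)
```

Under this reading, small early batches let the model adapt quickly while relevant documents are scarce, and larger later batches keep the number of retraining rounds manageable.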

RESULTS
Inter-Topic: eXtreme Gradient Boosting (XGBoost)
Intra-topic: Support Vector Machine (SVM)
k   1st S  1st T  2nd S  2nd T
5   1      200    100    2000
10  1      300    100    2000
10  1      200    100    1000
10  1      200    50     2000
(S: batch size, T: threshold)
AP    Last  R@10  R@20  AUR
0.3   2143  0.66  0.87  0.93
0.29  2124  0.66  0.87  0.92
0.29  2183  0.66  0.87  0.92
0.29  2119  0.66  0.87  0.92

RESULTS
2nd of 14 teams, behind the University of Waterloo

FUTURE WORK
More features, semantic representations
• Word2vec
• GloVe
• LDA
Under-sampling
CLEF eHealth 2018 features the same task again

OUTLINE
• A. Lagopoulos, A. Fachantidis, G. Tsoumakas (2017), Multi-Label Modality Classification for Figures in Biomedical Literature, 30th IEEE International Symposium on Computer-Based Medical Systems (CBMS)

PUBMED CENTRAL (PMC)
More than 4 million figures available
Great source of information for biomedical research, education and clinical decision making
Lack of associated metadata impedes access to this information

30 different modalities
as proposed by ImageCLEF

SIMPLE VS COMPOUND
Simple (60%) Compound (40%)

THE STANDARD APPROACH
Compound Figure Detection
• Simple figure: Multi-class Model
• Compound figure: Figure Separation Algorithm, then Multi-class Model on the subfigures
Figure separation is not perfect (~85%)
Figure isolation ⇒ information loss

OUR APPROACH: MULTI-LABEL CLASSIFICATION
No use of figure separation algorithm
Three different multi-label learning approaches:
• Simple
• Standard
• Extended

SIMPLE MULTI-LABEL APPROACH
[Diagram: one multi-label model, used in both training and prediction, for compound and simple figures]

STANDARD MULTI-LABEL APPROACH
[Diagram: training and prediction pipelines built from Compound Figure Detection, a Multi-class Model and a Multi-label Model, applied to simple and compound figures]

EXTENDED MULTI-LABEL APPROACH
[Diagram: training and prediction pipelines built from Compound Figure Detection, a Multi-class Model and a Multi-label Model, applied to simple and compound figures]

MODEL TRAINING
Feature Extraction from JPEG
• BVLC model, Caffe (http://caffe.berkeleyvision.org/)
• Deep learning (1.2 million images)
• 4096 visual features/figure
Linear Support Vector Machines (SVMs)
• Scikit-learn (http://scikit-learn.org/)
• One-vs-Rest transformation (multiple binary classifiers)

IMAGECLEF 2016 DATASET
20,985 figures
1,568 compound
No simple figures with categories
Extracted subfigures as simple
Split 40% - 60% (compound - subfigures) in order to follow the distribution of PMC

RESULTS
Approach F1-Macro F1-Micro F1-Samples
Standard (100% separation) 0.3569 0.7786 0.7912
Simple multi-label 0.3139 0.7581 0.7215
Standard multi-label 0.3270 0.7667 0.7726
Extended multi-label 0.3309 0.7666 0.7728

THE SYSTEM
Web app
Weekly updates from PMC
Extended multi-label approach
Easy search & filtering by modality
Built with Apache Solr & AngularJS
Available @
atypon.csd.auth.gr/medieval/

FUTURE WORK
Learning approach
• Textual representation based on caption and full text
Medieval system
• Crowdsourcing
• Active learning
• Gamification

OUTLINE

PICO SENTENCE IDENTIFICATION
1st version, ~0.51 F-measure
• 100 abstracts annotated by computer science PhD students
• Sentence representation (w2v > tf-idf), structural features
• MLPs, GaussianNB > SVMs, XGBoost
2nd version
• 120 more abstracts to be annotated by medical experts
• Additional feature engineering
• Semi-supervised learning approaches
• Deep learning approaches revisited

OUTLINE

NEW TASK IN BIOASQ 2017
Challenge tasks
• Full Grant extraction, as combination of Grant ID and Grant Agency
• Grant ID extraction, regardless of the corresponding Grant Agency
• Grant Agency extraction, regardless of the specific Grant ID
104 agencies, as considered in the indexing procedure of NLM
              Timespan     Articles  Grant IDs  Grant Agencies
Training set  2005 – 2013  63k       112k       128k
Dry run set   2013 – 2015  15k       26k        31k
Test set      2015 – 2017  23k       -          -

EVALUATION
Micro recall used as evaluation measure
• Up to 20 items per article
• Up to 4 unique Grant Agencies (without Grant ID info) per article
• Up to 2 unique Grant Agencies per Grant ID
Sample ground truth for an article
  {"pmid": "17082206", "pmcid": "1634735",
   "grantList": [
     {"agency": "Wellcome Trust"},
     {"grantID": "BB/C51320X/1", "agency": "Biotechnology and Biological Sciences Research Council"}]}
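Micro recall pools correct and expected items over all articles before dividing, rather than averaging per-article recall. The sketch below applies it to the Grant Agency task. The first article reuses the sample ground truth above; the second article, its agency name and all predictions are invented for illustration, and `micro_recall` is a hypothetical helper that ignores the per-article item limits.

```python
def micro_recall(gold, predicted):
    """gold, predicted: dicts mapping an article id to a set of items."""
    hits = sum(len(items & predicted.get(pmid, set())) for pmid, items in gold.items())
    total = sum(len(items) for items in gold.values())
    return hits / total if total else 0.0


gold = {
    "17082206": {"Wellcome Trust", "Biotechnology and Biological Sciences Research Council"},
    "00000001": {"Hypothetical Agency"},  # invented second article
}
predicted = {
    "17082206": {"Wellcome Trust"},
    "00000001": {"Hypothetical Agency"},
}
recall = micro_recall(gold, predicted)  # 2 of the 3 gold items were found
```

Since only recall is measured, the caps on items per article are what keep a system from simply returning every known agency.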

RESULTS
AUTH: Simple approach based on regular expressions
Fudan: Regular expressions combined with machine learning
        Grant ID  Grant Agency  Full Grant
Fudan   0.9705    0.9907        0.9526
AUTH    0.9498    0.9862        0.9412
DZG     0.9235    0.9122        0.8443
BioASQ  0.8167    0.8312        0.7174
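As an illustration of what a regular-expression approach to this task can look like, the sketch below matches grant IDs shaped like the sample "BB/C51320X/1" and looks up agency names from a short list. It is not the AUTH system: the pattern, the two-entry agency list and the acknowledgement sentence are all assumptions made for this example.

```python
import re

# Hypothetical pattern: two capital letters, a slash, an alphanumeric code, a slash, digits.
GRANT_ID = re.compile(r"\b[A-Z]{2}/[A-Z0-9]+/\d+\b")

# Hypothetical subset of agency names; the task considers 104 agencies.
KNOWN_AGENCIES = ["Wellcome Trust", "Medical Research Council"]


def extract_grants(text):
    """Return (grant IDs, agency names) found in a piece of text."""
    ids = GRANT_ID.findall(text)
    lowered = text.lower()
    agencies = [a for a in KNOWN_AGENCIES if a.lower() in lowered]
    return ids, agencies


# Invented acknowledgement sentence.
text = "This work was supported by the Wellcome Trust and by BBSRC grant BB/C51320X/1."
ids, agencies = extract_grants(text)
```

A usable system would need one pattern per ID format and a mapping from abbreviations such as BBSRC to full agency names.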

WRAP UP
