MACHINE LEARNING FOR UNDERSTANDING
BIOMEDICAL PUBLICATIONS
Grigorios Tsoumakas,
School of Informatics,
Aristotle University of Thessaloniki
with: A. Anagnostou, A. Fachantidis, A. Lagopoulos,
M. Laliotis, N. Markantonatos, Y. Papanikolaou, I. Vlahavas
ENSEMBLE METHODS
Pedro Domingos. 2012. A few useful things to know about
machine learning. Commun. ACM 55
“LEARN MANY MODELS, NOT JUST ONE”
Anthony Goldbloom. Kaggle CEO. Oct 2015.
“As long as Kaggle has been around, it has almost
always been ensembles of decision trees that have
won competitions. It used to be random forest that
was the big winner, but over the last six months a new
algorithm called XGboost has cropped up, and it’s
winning practically every competition in the
structured data category.”
MULTI-LABEL LEARNING
X1    X2   …   Xp   |  Y1   Y2   …   Yq
0.12  1    …   12   |  0    1    …   1     (training example)
2.34  9    …   -5   |  1    1    …   0     (training example)
1.22  3    …   40   |  1    0    …   1     (training example)
2.18  2    …   8    |  ?    ?    …   ?     (unknown instance)
1.76  7    …   23   |  ?    ?    …   ?     (unknown instance)
p input variables (X), q binary output variables (Y), m training examples
Binary Relevance (BR)
• Learns one binary
model per label
• Ignores label
dependencies
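A minimal sketch of the Binary Relevance transformation, assuming the dataset is held as plain Python lists. The function name `binary_relevance_datasets` is hypothetical, and the values are the three training rows from the table above with the elided columns left out. Each resulting binary dataset could then be passed to any binary learner.

```python
def binary_relevance_datasets(X, Y):
    """Split a multi-label dataset into one binary dataset per label.

    X: list of feature rows, Y: list of 0/1 label rows of equal width.
    Returns a list of (X, y_j) pairs, one per label column j.
    """
    q = len(Y[0])
    return [(X, [row[j] for row in Y]) for j in range(q)]


# Training rows from the slide (columns hidden behind "…" are omitted).
X = [[0.12, 1, 12], [2.34, 9, -5], [1.22, 3, 40]]
Y = [[0, 1, 1], [1, 1, 0], [1, 0, 1]]

for j, (features, targets) in enumerate(binary_relevance_datasets(X, Y), start=1):
    # Each label is learned independently, so label dependencies are ignored.
    print(f"binary problem for label {j}: targets = {targets}")
```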
MULTI-LABEL LEARNING FROM BIOLOGICAL DATA
Annotation of proteins with functions
• FunCat, 6 levels, 492 labels, ~9 on avg.
• GO, 14 levels, 3997 labels, ~35 on avg.
Drug discovery (Johnson & Johnson)
• 743,336 chemical compounds
• ~13 million chemical structure features (sparse)
• 5,069 biomolecular targets (e.g. proteins)
OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
Literatum is Atypon’s online content hosting and management platform
Atypon is home to more than one-third of the world’s English-language
professional and scholarly journals — more than any other technology
company
Atypon’s clients include Elsevier, IEEE, MIT Press, Oxford University
Press, …
Atypon was acquired by John Wiley & Sons in 2016 for $120,000,000
OUTLINE
1. Semantic indexing of biomedical literature
• Y. Papanikolaou, G. Tsoumakas, M. Laliotis, N. Markantonatos, I. Vlahavas (2017) Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models, Journal of Biomedical Semantics
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
CHALLENGE
[Figure: number of documents per year, 1950 to 2013, y-axis from 0 to 1,200,000. Caption: 1 million docs / year ≅ 2,740 docs / day]
PubMed abstracts
• 12,834,585 (20 GB)
MeSH terms
• 27,773, ~13 per abstract on avg.
Online test setting
• 3 phases, 5 weeks per phase
• MeSH terms for 6k to 10k abstracts requested within 21 hours
LABEL FREQUENCY DISTRIBUTION
[Figure: number of labels (y-axis) against label frequency (x-axis)]
4.3 million abstracts
213 labels with 1 example
1,680 labels with fewer than 10 examples
4 labels with more than 1 million examples
PROGRESS 2013 – 2017
[Figure: micro F-measure (y-axis, 0.54 to 0.68) per year (x-axis, 2013 to 2017) for AUTH, Fudan and NLM]
PRE-PROCESSING
• Input: text of title and abstract
• Parsing
• Tokenization, lowercasing, n-gram extraction
• Duplicate removal and journal filtering (document counts: 10,876,004 → 10,699,707 → 3,950,721)
• Unigrams and bigrams with >5 frequency
• tf-idf computation, unit length normalization
• Final feature vectors
• Last 12,000 withheld for evaluation
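The feature pipeline above can be sketched in a few lines of plain Python. This is an illustration only, not the authors' code: whitespace tokenization, the `log(N / df)` form of idf, and counting frequency over the whole collection are all assumptions, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vectors(docs, min_count=5):
    """Return one sparse, unit-length tf-idf vector (a dict) per document.

    Only terms whose collection frequency exceeds min_count are kept
    (assumption: the slide's ">5 frequency" refers to collection frequency).
    """
    term_lists = [ngrams(doc) for doc in docs]
    totals = Counter(term for terms in term_lists for term in terms)
    vocab = {term for term, count in totals.items() if count > min_count}
    doc_freq = Counter(
        term for terms in term_lists for term in set(terms) if term in vocab
    )
    vectors = []
    for terms in term_lists:
        tf = Counter(term for term in terms if term in vocab)
        vec = {t: f * math.log(len(docs) / doc_freq[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Unit length normalization; all-zero vectors are left unchanged.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


# Toy documents; min_count=0 keeps every term so the example stays small.
docs = ["Liposomes kinetics study", "Prognosis of liposomes", "Polymers kinetics"]
vectors = tfidf_vectors(docs, min_count=0)
print(sorted(vectors[0]))
```

Storing each vector as a dict keeps it sparse, which matters at the scale of millions of abstracts mentioned on the slide.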
LEARNING
Biomedical document → Label Ranker → Meta Labeler
Example label ranking: kinetics (4), prognosis (1), liposomes (2), polymers (3)
Number of relevant labels: 2
Output is: {prognosis, liposomes}
Label Ranker
- Any multi-label learning algorithm that can output a ranking of the labels
- We used a linear SVM per label and considered their unthresholded output
Meta Labeler
- Regression or (ordinal) classification using original features or label scores/ranks
- We used linear SVM regression based on the original features
Tang, L., Rajan, S., Narayanan, V.K. “Large scale multi-label classification via metalabeler”, Proc. 18th Int. Conf. on World Wide Web (WWW '09)
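The way the two components combine at prediction time can be sketched as follows. The scores are hypothetical values chosen only to reproduce the ranking on the slide, and `metalabeler_predict` is an illustrative name; the real system obtains the scores from per-label SVMs and the label count from an SVM regressor.

```python
def metalabeler_predict(label_scores, predicted_count):
    """Return the top-k labels, where k is the meta-labeler's (rounded) output.

    label_scores: dict mapping label -> unthresholded score from the ranker.
    predicted_count: number of relevant labels predicted by the meta-labeler.
    """
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    k = max(0, min(len(ranked), round(predicted_count)))
    return ranked[:k]


# Hypothetical scores giving the ranks on the slide:
# prognosis (1), liposomes (2), polymers (3), kinetics (4).
scores = {"kinetics": -1.3, "prognosis": 0.9, "liposomes": 0.4, "polymers": -0.2}
print(metalabeler_predict(scores, 2.2))
```

Because the cut-off is predicted per document, no global score threshold has to be tuned for the ranker.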
TIME AND SPACE
Hardware
• 4 10-core processors at 2.26 GHz, 1 TB RAM and 2.4 TB storage (6 x 600 GB SAS 10k disks in RAID 5)
Parallel learning/use of binary SVMs
• With 40 threads, training and saving takes 36 h
• With 20 threads, loading and prediction takes 45 min
Serialization
• Storing the models in 2013 required 406 GB
• 10x compression due to sparsity (L1-regularized models)
ENSEMBLE APPROACHES: CLASSIFIER SELECTION
For each label, select the model that improves that label's F-measure most
• Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-size-fits-all indexing method does not exist: Automatic selection based on meta-learning. JCSE 6(2), 151-160 (2012)
• Not suitable for global non-decomposable evaluation measures, like micro-F
Iteratively select the model that improves micro-F
• Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical report, National Taiwan University (2007)
• Can we trust the evaluation based on only a few positive samples?
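To see why micro-F does not decompose over labels, the sketch below computes macro-F and micro-F from per-label counts. The counts are made up for illustration (one frequent label and one rare label); the helper names are hypothetical.

```python
def f1(tp, fp, fn):
    """F-measure from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Average of the per-label F-measures: decomposes over labels."""
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(counts):
    """F-measure of the pooled counts: depends on all labels jointly."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) per label: a frequent label and a rare one.
counts = [(900, 100, 100), (1, 0, 9)]
print(round(macro_f1(counts), 4), round(micro_f1(counts), 4))
```

Macro-F can be maximized one label at a time, whereas a change to one label's predictions shifts the pooled counts and therefore interacts with every other label in micro-F.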
MULE: MULTI-LABEL ENSEMBLE
1. Determine the globally best model ℎ∗ on a validation set
• Globally best model has been determined on positive samples of all labels
2. Determine for each label which model(s) would lead to an improvement of the global evaluation measure compared to ℎ∗
3. Compare the differences in predictions of each one of these models against the predictions of ℎ∗ using a McNemar test
4. If the null hypothesis is rejected for one or more models, select the one for which the null hypothesis has the lowest probability, otherwise select ℎ∗
• Robustness to uncertainty due to label rarity
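Steps 3 and 4 can be sketched for a single label as below. This is an illustration under stated assumptions, not the published implementation: it uses the exact (binomial) form of the McNemar test, a significance level of 0.05 that the slide does not specify, and hypothetical names and counts throughout.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: validation examples where h* is correct and the candidate is wrong.
    c: validation examples where the candidate is correct and h* is wrong.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def select_model(best_name, candidates, alpha=0.05):
    """Pick the model to use for one label.

    candidates: dict name -> (b, c) for models that already improve the
    global measure over h* on this label (step 2).
    Returns the candidate with the lowest p-value below alpha, else h*.
    """
    p_values = {name: mcnemar_exact(b, c) for name, (b, c) in candidates.items()}
    significant = {name: p for name, p in p_values.items() if p < alpha}
    if not significant:
        return best_name
    return min(significant, key=significant.get)


# Hypothetical discordant counts for one label.
print(select_model("MetaLabeler", {"Weighted SVM": (1, 9), "Labeled LDA": (4, 6)}))
```

For a rare label the discordant counts are small, the test is rarely significant, and the globally best model is kept, which is the robustness property noted on the slide.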
EMPIRICAL RESULTS IN BIOASQ
Model            Micro-F   Macro-F
Vanilla SVMs     0.5568    0.4789
Weighted SVMs    0.5665    0.5102
MetaLabeler      0.5855    0.5488
Labeled LDA      0.3698    0.3010

Ensemble          Micro-F         Macro-F
Improve F         0.5584 (all)    0.5339 (MetaLabeler, Weighted SVM)
Improve Micro-F   0.5867 (all)    -
MULE              0.5892 (all)    0.5492 (MetaLabeler, Labeled LDA)
2017 WORK
Improved version of Labeled LDA in both speed and accuracy
• Y. Papanikolaou, G. Tsoumakas, Subset Labeled LDA for Large-Scale Multi-Label Classification, arXiv:1709.05480
Employing word2vec features
More ensembles
• Stacking-based, frequency-based
Deep learning models
• Deep MLP, CNN
OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
• A. Anagnostou, A. Lagopoulos, G. Tsoumakas, I. Vlahavas (2017) Combining Inter-Review Learning-to-Rank and Intra-Review Incremental Training for Title and Abstract Screening in Systematic Reviews, eHealth Lab of the 8th Conference and Labs of the Evaluation Forum (CLEF)
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
DIAGNOSTIC TEST ACCURACY (DTA) REVIEWS
Title
• Thromboelastography (TEG) and rotational thromboelastometry (ROTEM) for trauma-induced coagulopathy in adult trauma patients with bleeding
Ovid MEDLINE query
1. (Thrombelastogra$ or Thromboelastogra$ or (thromb$ adj2 elastogra$) or TEG or haemoscope or haemonetics).mp
2. Thrombelastography/
3. (thromboelasto$ or thrombelasto$ or (thromb$ adj2 elastom$) or (rotational adj2 thrombelast) or ROTEM or "tem international").mp.
4. 1 or 2 or 3
5. exp animals/ not humans.sh.
6. 4 not 5
7. limit 6 to yr="1970 -Current"
THE DATA
Training data
• 20 topics, retrieved articles and relevance after abstract/content screening
Test data
• 30 topics and retrieved articles
[Figure: number of positive (Pos) and negative (Neg) articles for topics 1 to 20, on a logarithmic axis from 1 to 100,000]
OUR APPROACH
Hybrid classification mechanism
Inter-topic model, learning title/abstract
relevance across different topics
Intra-topic model, learning title/abstract
relevance within a specific topic
INTER-TOPIC MODEL
Learning-to-rank binary classifier
Features assessing the similarity of the document (title, abstract) with the topic (title, Ovid query)
• Common terms
• Levenshtein distance
• Cosine similarity
• BM25
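The four similarity features listed above can be sketched in plain Python. This is an illustration only: whitespace tokenization, raw term counts for the cosine, and the common BM25 variant with `k1 = 1.2` and `b = 0.75` are assumptions, and the example strings are hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def common_terms(doc, topic):
    """Number of distinct terms shared by document and topic text."""
    return len(set(tokens(doc)) & set(tokens(topic)))


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, one row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cosine(doc, topic):
    """Cosine similarity of raw term-count vectors."""
    a, b = Counter(tokens(doc)), Counter(tokens(topic))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of doc for query, with statistics taken from corpus."""
    doc_tokens = tokens(doc)
    tf = Counter(doc_tokens)
    avg_len = sum(len(tokens(d)) for d in corpus) / len(corpus) or 1.0
    score = 0.0
    for term in set(tokens(query)):
        n = sum(1 for d in corpus if term in tokens(d))
        idf = math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avg_len))
    return score


# Hypothetical topic title and candidate article titles.
topic = "thromboelastography for trauma patients with bleeding"
corpus = ["thromboelastography in adult trauma patients", "liposomes and polymers kinetics"]
for doc in corpus:
    print(common_terms(doc, topic), round(cosine(doc, topic), 3), round(bm25(topic, doc, corpus), 3))
```

Each function yields one number per document and topic pair, so the outputs can be concatenated into the feature vector of a ranking model.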
INTRA-TOPIC MODEL: INITIAL TRAINING
[Diagram: initial training on TF-IDF features]
INTRA-TOPIC MODEL: ITERATIVE TRAINING
[Diagram: iterative training on TF-IDF features]
Parameters
• Initial size k: {5, 10}
• 1st batch size: {1}
• 1st threshold: {200, 300}
• 2nd batch size: {50, 100}
• 2nd threshold: {1000, 2000}
RESULTS
Inter-topic: eXtreme Gradient Boosting (XGBoost)
Intra-topic: Support Vector Machine (SVM)
(S = batch size, T = threshold)
k    1st S   1st T   2nd S   2nd T   AP     Last   R@10   R@20   AUR
5    1       200     100     2000    0.3    2143   0.66   0.87   0.93
10   1       300     100     2000    0.29   2124   0.66   0.87   0.92
10   1       200     100     1000    0.29   2183   0.66   0.87   0.92
10   1       200     50      2000    0.29   2119   0.66   0.87   0.92
RESULTS
Ranked 2nd among 14 teams, behind the University of Waterloo
FUTURE WORK
More features, semantic representations
• Word2vec
• GloVe
• LDA
Under-sampling
CLEF eHealth 2018 features the same task again
OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
• A. Lagopoulos, A. Fachantidis, G. Tsoumakas (2017), Multi-Label Modality Classification for Figures in Biomedical Literature, 30th IEEE International Symposium on Computer-Based Medical Systems (CBMS)
4. PICO sentence identification
5. Funding information extraction
PUBMED CENTRAL (PMC)
More than 4 million figures available
Great source of information for biomedical research, education and clinical decision-making
Lack of associated metadata impedes access to this information
30 different modalities
as proposed by ImageCLEF
SIMPLE VS COMPOUND
Simple (60%) Compound (40%)
THE STANDARD APPROACH
[Diagram: compound figure detection sends simple figures directly to a multi-class model; compound figures go through a figure separation algorithm, and the resulting subfigures go to the multi-class model]
Figure separation is not perfect (~85%)
Figure isolation ⇒ information loss
OUR APPROACH: MULTI-LABEL CLASSIFICATION
No use of figure separation algorithm
Three different multi-label learning approaches:
• Simple
• Standard
• Extended
SIMPLE MULTI-LABEL APPROACH
[Diagram, training and prediction: a single multi-label model for both compound and simple figures]
STANDARD MULTI-LABEL APPROACH
[Diagram, training and prediction panels: compound figure detection, simple and compound figures, multi-class model, multi-label model]
EXTENDED MULTI-LABEL APPROACH
[Diagram, training and prediction panels: compound figure detection, simple and compound figures, multi-class model, multi-label model]
MODEL TRAINING
Feature extraction from JPEG
• BVLC model - Caffe [1]
• Deep learning (1.2 million images)
• 4096 visual features/figure
Linear Support Vector Machines (SVMs)
• Scikit-learn [2]
• One-vs-Rest transformation (multiple binary classifiers)
[1] HTTP://CAFFE.BERKELEYVISION.ORG/
[2] HTTP://SCIKIT-LEARN.ORG/
IMAGECLEF 2016 DATASET
20,985 figures
1,568 compound
No simple figures with categories
Extracted subfigures used as simple figures
Split 40% - 60% (compound - subfigures) in order to follow the distribution of PMC
RESULTS
Approach                     F1-Macro   F1-Micro   F1-Samples
Standard (100% separation)   0.3569     0.7786     0.7912
Simple multi-label           0.3139     0.7581     0.7215
Standard multi-label         0.3270     0.7667     0.7726
Extended multi-label         0.3309     0.7666     0.7728
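The F1-Samples column is an example-based measure: an F1 score is computed per figure from its true and predicted label sets, then averaged over figures. A minimal sketch follows; the modality names are hypothetical, and scoring a figure with empty true and predicted sets as 1.0 is an assumed convention.

```python
def f1_samples(true_sets, pred_sets):
    """Average over examples of the F1 between true and predicted label sets."""
    scores = []
    for true, pred in zip(true_sets, pred_sets):
        if not true and not pred:
            scores.append(1.0)  # assumed convention for two empty sets
        else:
            scores.append(2 * len(true & pred) / (len(true) + len(pred)))
    return sum(scores) / len(scores)


# Hypothetical modalities: a compound figure with two panels, and a simple figure.
true_sets = [{"x-ray", "ultrasound"}, {"flowchart"}]
pred_sets = [{"x-ray"}, {"flowchart"}]
print(round(f1_samples(true_sets, pred_sets), 4))
```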
THE SYSTEM
Web app
Weekly updates from PMC
Extended multi-label approach
Easy search & filtering by modality
Built with Apache Solr & AngularJS
Available @
atypon.csd.auth.gr/medieval/
FUTURE WORK
Learning approach
• Textual representation based on caption and full text
Medieval system
• Crowdsourcing
• Active learning
• Gamification
OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
PICO SENTENCE IDENTIFICATION
1st version, ~0.51 F-measure
• 100 abstracts annotated by computer science PhD students
• Sentence representation (w2v > tf-idf), structural features
• MLPs, GaussianNB > SVMs, XGBoost
2nd version
• 120 more abstracts to be annotated by medical experts
• Additional feature engineering
• Semi-supervised learning approaches
• Deep learning approaches revisited
OUTLINE
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
NEW TASK IN BIOASQ 2017
Challenge tasks
• Full Grant extraction, as combination of Grant ID and Grant Agency
• Grant ID extraction, regardless of the corresponding Grant Agency
• Grant Agency extraction, regardless of the specific Grant ID
104 agencies, as considered in the indexing procedure of NLM
               Timespan      Articles   Grant IDs   Grant Agencies
Training set   2005 – 2013   63k        112k        128k
Dry run set    2013 – 2015   15k        26k         31k
Test set       2015 – 2017   23k        -           -
EVALUATION
Micro recall used as evaluation measure
• Up to 20 items per article
• Up to 4 unique Grant Agencies (without Grant ID info) per article
• Up to 2 unique Grant Agencies per Grant ID
Sample ground truth for an article
{ "pmid": "17082206", "pmcid": "1634735",
  "grantList": [
    {"agency": "Wellcome Trust"},
    {"grantID": "BB/C51320X/1", "agency": "Biotechnology and Biological Sciences Research Council"}]},
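A small sketch of how the sample ground truth could be read and scored. It assumes micro recall means pooled correct items divided by pooled gold items across articles; the official evaluation script may differ in detail, and the function names and the prediction are hypothetical.

```python
import json

# The sample ground-truth record from the slide.
sample = json.loads("""
{"pmid": "17082206", "pmcid": "1634735",
 "grantList": [
   {"agency": "Wellcome Trust"},
   {"grantID": "BB/C51320X/1",
    "agency": "Biotechnology and Biological Sciences Research Council"}]}
""")


def agencies(article):
    """Unique grant agencies listed for an article."""
    return {g["agency"] for g in article["grantList"] if "agency" in g}


def grant_ids(article):
    """Unique grant IDs listed for an article."""
    return {g["grantID"] for g in article["grantList"] if "grantID" in g}


def micro_recall(gold, predicted):
    """Pooled recall: gold and predicted map pmid -> set of items."""
    hits = sum(len(items & predicted.get(pmid, set())) for pmid, items in gold.items())
    total = sum(len(items) for items in gold.values())
    return hits / total if total else 0.0


gold = {sample["pmid"]: agencies(sample)}
predicted = {"17082206": {"Wellcome Trust"}}  # hypothetical system output
print(micro_recall(gold, predicted))
```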
RESULTS
AUTH: Simple approach based on regular expressions
Fudan: Regular expressions combined with machine learning
         Grant ID   Grant Agency   Full Grant
Fudan    0.9705     0.9907         0.9526
AUTH     0.9498     0.9862         0.9412
DZG      0.9235     0.9122         0.8443
BioASQ   0.8167     0.8312         0.7174
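To give a feel for what a regular-expression approach can look like, here is a minimal sketch. The patterns, the agency list and the sentence are all hypothetical illustrations and not the rules used by any of the teams above; the first pattern is written only to match IDs shaped like the sample `BB/C51320X/1`.

```python
import re

# Hypothetical patterns; a real system would need many more.
GRANT_ID_PATTERNS = [
    re.compile(r"\b[A-Z]{2}/[A-Z0-9]+/\d+\b"),        # e.g. BB/C51320X/1
    re.compile(r"\b[A-Z]\d{2}\s?[A-Z]{2}\d{6}\b"),    # e.g. R01 GM123456 (made-up ID)
]

# Hypothetical excerpt of the agency list (the task considers 104 agencies).
KNOWN_AGENCIES = [
    "Wellcome Trust",
    "Biotechnology and Biological Sciences Research Council",
]


def extract_grants(text):
    """Return (grant IDs, agencies) found in a funding statement."""
    ids = []
    for pattern in GRANT_ID_PATTERNS:
        ids.extend(pattern.findall(text))
    lowered = text.lower()
    found_agencies = [a for a in KNOWN_AGENCIES if a.lower() in lowered]
    return ids, found_agencies


# Hypothetical acknowledgement sentence built around the sample ground truth.
text = (
    "This work was supported by the Wellcome Trust and by the Biotechnology and "
    "Biological Sciences Research Council (grant BB/C51320X/1)."
)
print(extract_grants(text))
```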
WRAP UP
1. Semantic indexing of biomedical literature
2. Article screening in systematic reviews
3. Modality classification of biomedical figures
4. PICO sentence identification
5. Funding information extraction
