Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Lifeng Han*, Gareth J. F. Jones*, Alan F. Smeaton^, and Paolo Bolzoni
*ADAPT Research Centre, ^Insight Centre for Data Analytics
Dublin City University, Ireland
Research papers at NoDaLiDa 2021 (the 23rd Nordic Conference on Computational Linguistics) and the MWE-LEX workshop at COLING 2020
Bonus takeaway: AlphaMWE multilingual corpus with MWEs
ADAPT seminar series, June 2021
www.adaptcentre.ie
Content
• Background (motivation of this work)

• Related work (Decomposition4MT)

• Our refined decomposition NMT model with MWEs

• Automatic and Crowd-source Human evaluation

• Experts’ analysis with Examples (new insight)

• => AlphaMWE (multilingual lexicon: corpus with MWEs)
Data: https://github.com/poethan/MWE4MT (subfolder: radical4mt)
Background
Driven by two factors:
• Sub-character NMT:
  • BPE for English and Western languages.
  • What about Asian ideographic scripts?
• NMT bottlenecks (MWEs, low-frequency words, OOV words):
  • How can Multiword Expression (MWE) translation be better addressed?
  • MWEs “display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity”,
  • e.g. kick the bucket, by and large, pull one's leg
[Figure: NMT components. Labels: parallel corpora; trainer (neural networks); source text; encoding (encoder NNs); decoding (decoder NNs); RNN, CNN, + attention; learned model; MT outputs; target language]
[Figure: a bigger view of where this work belongs among the NMT branches within the NMT research paradigm. Labels: linguistic structure & knowledge; learning model; semantics & disambiguation; MWE methodology; dictionary usage; bilingual phrase table; attention; coverage; all attention (Transformer, 2017); BERT (Google, 2018); pretraining language model; bi-directional RNN; syntax structure & dependency; Tree2string (Eriguchi et al., ACL 2016); String2tree (Aharoni & Goldberg, ACL 2017); Tree2Tree? (T2T NN program translation, Chen et al., ICLR 2018); dependency (Wu et al., IJCAI 2017)]
Background
Chinese characters, example
• Semantic part + phonetic part

• Semantics: radicals

• Phonetics: related to the overall pronunciation of the character
[Figure: evolution of the Chinese radical 刀 (dāo, knife) from pictogram to regular script. Periods shown: Shang Dynasty (1600-1046 BC), Western Zhou Dynasty (1045-771 BC), Warring States period (476-221 BC), Han Dynasty (202 BC-220 AD), Eastern Han (from 57 AD on). Script styles shown: oracle bone script, bronze inscriptions, silk (on seal), regular script]
[Figure: level-by-level decomposition (Level-1, Level-2, Level-3, …, full-stroke) of two characters: 鋒 (fēng), split into a semantic part (metal) and a phonetic part (féng), and 劍 (jiàn), split into a phonetic part (qiān) and a semantic part (knife)]
[Table: one Chinese sentence represented at word level, character level, pronunciation and radical level]
Pronunciation: èr shí bā suì chú shī bèi fā xiàn sǐ yú jiù jīn shān yī jiā shāng chǎng
English Ref.: 28-Year-Old Chef Found Dead at San Francisco Mall
Example of MWEs in MT as a challenge:
ZH source: 年年岁岁花相似，岁岁年年人不同。
ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng.
EN reference: The flowers are similar each year, while people are changing every year.
EN MT output: One year spent similar, each year is different
Reference: Han et al. (2020, LREC) MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. https://www.aclweb.org/anthology/2020.lrec-1.363/
Related work
Chinese character decomposition
• Radical embeddings as additional features for Chinese → English and Japanese → Chinese NMT:
  • Our own work, Han and Kuang (2018): a range of encoding models including word+character, word+radical, and word+character+radical (best), with bidirectional RNNs
  • Zhang and Matsumoto (2018): radical embeddings as additional features for character-level LSTM-based NMT on Japanese → Chinese translation
• Bidirectional English ↔ Japanese, English ↔ Chinese and Chinese ↔ Japanese NMT at word, character, ideograph and stroke levels:
  • Zhang and Komachi (2018)
  • Experiments showed that the ideograph level was best for ZH→EN MT, while the stroke level was best for JP→EN MT
  • No testing of intermediate decomposition levels
Han and Kuang (2018) Incorporating Chinese radicals into neural machine translation: deeper than character level. In: 30th European Summer School in Logic, Language and Information (ESSLLI 2018). https://arxiv.org/abs/1805.01565
Refined decomposition model with MWEs
IDS files from CHISE
• CHISE (CHaracter Information Service Environment) project: the IDS files comprise 88,940 Chinese characters from the CJK (Chinese, Japanese, Korean scripts) Unified Ideographs.
• https://github.com/cjkvi/cjkvi-ids
• http://www.chise.org/
[Table: example characters (lì), (jù), (hán) and (yǒng), each with two alternative decompositions tagged by region: [G] / [T], [GTKV] / [J], [GTV] / [JK] and [GTV] / [JK]]
Character construction types: up-down, left-right, inside-outside, embedded
Extraction procedure
• To obtain the decomposition level L representation of a Chinese character α:
  • go through the IDS file L times;
  • each time, search the IDS file character list to match the newly generated smaller characters, and
  • replace them with their decomposed representations, recursively.
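The extraction procedure above can be sketched as follows. This is a minimal illustration, not the released code: the three-entry `TOY_IDS` table, the function name `decompose`, and the choice to drop the ideographic description characters (such as ⿰ and ⿱) by default are all assumptions made for the example. The real IDS files are far larger and may list several region-specific decompositions per character.

```python
from typing import Dict

# Hypothetical toy table: character -> one IDS string.
# The real CHISE/cjkvi IDS files hold tens of thousands of entries.
TOY_IDS: Dict[str, str] = {
    "森": "⿱木林",
    "林": "⿰木木",
    "明": "⿰日月",
}


def is_idc(ch: str) -> bool:
    """True for Ideographic Description Characters (U+2FF0..U+2FFF)."""
    return "\u2ff0" <= ch <= "\u2fff"


def decompose(text: str, ids: Dict[str, str], level: int,
              keep_idc: bool = False) -> str:
    """Return the level-`level` decomposition of every character in `text`.

    Each pass replaces every character that has an entry in `ids` by its
    components; characters without an entry are kept unchanged.
    """
    current = text
    for _ in range(level):
        current = "".join(ids.get(ch, ch) for ch in current)
    if not keep_idc:
        # Assumption: only the components are kept, not the layout operators.
        current = "".join(ch for ch in current if not is_idc(ch))
    return current


if __name__ == "__main__":
    for lvl in (1, 2, 3):
        print(lvl, decompose("森", TOY_IDS, lvl))
```

With this toy table, level 1 turns 森 into 木林 and level 2 into 木木木; a third pass changes nothing because 木 has no entry of its own.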
Examples of decomposition/extraction
Shared bilingual glossaries: https://github.com/poethan/MWE4MT/tree/master/radical4mt
Zh MWE: 高尔夫球 俱乐部 (golf club), 汽车 散热器 (car radiator)
Rxd1: {亠口冋𠂊小二人王求 亻具乐咅阝}, {氵气车 龷攵执灬吅犬吅}
Rxd2: {丶一口冂口𠂊小一一人一土一氺丶 亻且一乐立口阝}, {氵𠂉一乁车 卄一攵扌丸灬口口犬口口}
Rxd3: {丶一口冂口𠂊小一一人一十一一亅丷八丶 亻且一乐亠丷一口阝}, {氵𠂉一乁车 十丨一攵扌九丶灬口口犬口口}
Level 2 generates 土 and 氺; level 3 then decomposes them further.
Adding MWEs and decomposed MWEs
Lifeng Han, Gareth J. F. Jones and Alan F. Smeaton. 2020. MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2970–2979, Marseille, 11–16 May 2020.
Evaluation
[Figure: BLEU scores with increasing learning steps (shown on two slides)]
[Figure: Human Direct Assessment]
Expert Analysis: new research trend
- BLEU has long been criticised for not reflecting real differences between high-performing MT models.
- Crowd-sourced human evaluation is not reliable either: very recent work, largely on WMT data, highlights that professional translators disagree with the crowd-sourced human ranking of MT systems (Freitag et al., 2021).
- BLEU and Crowd-source Human Assessment (CSHA):
  - tend to favour ‘boring’ translations;
  - when lexical diversity improves in the MT output, the scores (BLEU, CSHA) get lower.
- We look at detailed translation examples from the different system outputs at 100K learning steps, examined by a human expert who is a native speaker. The examples reflect the advantages of the decomposition models, e.g. RXD3 (and also RXD1), on MWE translation.
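Why a more lexically diverse output can score lower is easy to see from the clipped n-gram precision that BLEU is built on. The sketch below is illustrative only: it computes a single-order precision (no brevity penalty and no geometric mean, so it is not full BLEU), and the reference and both hypotheses are invented for the example rather than taken from the system outputs.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp: str, ref: str, n: int) -> float:
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    hyp_counts = ngrams(hyp.lower().split(), n)
    ref_counts = ngrams(ref.lower().split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


# Hypothetical sentences, made up for this illustration.
REF = "the chef was found dead at a local mall"
LITERAL = "the chef was found dead at a local mall"
DIVERSE = "the cook was discovered dead inside a nearby shopping centre"

if __name__ == "__main__":
    for name, hyp in (("literal", LITERAL), ("diverse", DIVERSE)):
        print(name,
              clipped_precision(hyp, REF, 1),
              clipped_precision(hyp, REF, 2))
```

The second hypothesis is an acceptable paraphrase, yet it shares only four of its ten words and none of its bigrams with the reference, which is the kind of penalty on lexical diversity the slide refers to.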
Translation examples (two sentences per system):
src: (Chinese source sentences)
ref:
  28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall
  a 28 @-@ year @-@ old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week .
rxd3:
  the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
  a 28 @-@ year @-@ old chef who recently moved to San Francisco has been found dead on a stairwell in a local mall this week .
base:
  the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco
  a 28 @-@ year @-@ old chef who has moved to San Francisco this week was found dead on the stairs of a local mall .
base MWE:
  28 @-@ year @-@ old chef was found dead at a San Francisco mall
  a 28 @-@ year @-@ old chef who recently moved to San Francisco was found dead this week at a local mall .
rxd3 MWE:
  28 @-@ year @-@ old chef was found dead at a San Francisco mall
  a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead this week at a local mall .
rxd1:
  the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
  a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead in a local shopping mall this week .
rxd2:
  the 28 @-@ year @-@ old chef was found dead in a San Francisco mall
  a 28 @-@ year @-@ old San Francisco chef was found dead in a local mall this week .
Expert analysis: insights on MWEs
1) The Chinese MWE 商场 (shāng chǎng) in the first sentence is correctly translated as “mall” by the rxd3 model, but as “shop” by the baseline character sequence model.
2) The MWE 楼梯间 (lóutījiān) in the second sentence is correctly translated as “stairwell” by the rxd3 model, but as “stairs” by the baseline.
3) The MWE 近日 (jìn rì), meaning “recently”, in the second sentence:
- is missed entirely by the original character sequence model, which results in a misleading, ambiguous translation of an even larger part of the content: did the chef move to San Francisco (SF) “recently” or “this week”?
- is correctly translated by the rxd3 model, so the overall meaning of the sentence is clear.
Expert analysis: on Multi-word Expressions
1) The advantage on MWEs is not reflected by BLEU scores:
  1) because of the low percentage of MWEs in the corpus, although MWEs are an important part of the corpus and of human languages/expressions;
  2) because of MWE interpretations, lexical diversity in translation, and the reference corpus.
2) It is not reflected by crowd-sourced human assessment, because the assessors:
  1) were not well trained;
  2) had no clear guidelines in most cases;
  3) did not have a linguistic/translation background;
  4) favour the candidate translation with n-gram matches to the source/reference.
- Whenever Human Expert Assessment is available/possible, do it!
Expert analysis: Diagnose RXD2
1) RXD1 separates a character into semantic + phonetic parts.
2) RXD3 decomposes characters into a more stroke-like sequence, with order.
3) RXD2 generates smaller-sized characters that mislead the language understanding model.
Example (see the earlier decomposition figure):
RXD2 yields the new characters 从 (cóng) and 王 (wáng) from 劍 (jiàn, “sword”) and 鋒 (fēng, “edge/sharp point”) respectively, but they carry no direct meaning from their parent characters; instead they mean “from” and “king” respectively.
Bonus corpus: AlphaMWE, https://github.com/poethan/AlphaMWE
AlphaMWE
[Figure: procedure for constructing AlphaMWE]
Size, coverage, usage (come and join us)
• We extracted all 750 English sentences that include vMWE tags.
  • English source: Walsh et al. (2018), https://gitlab.com/parseme/parseme_corpus_en
• Target languages covered so far: Chinese, German, Polish and Italian, with Spanish/French under editing (why not join the team?!).
• The size is comparable to that of some standard shared task data:
  • the development and test data sets from the annual WMT (Bojar et al., 2017) and from the NIST MT challenges have approximately 2,000 sentences for dev/testing over some years (https://www.nist.gov/programs-projects/machine-translation).
• We plan to submit it for shared tasks: multilingual/bilingual MT, NLP.
Examples of AlphaMWE sentences: EN and DE/PL/ZH/IT
Plain English Corpus:
The chair was comfortable, and the beer had gone slightly to his head.
I was smoking my pipe quietly by my dismantled steamer, and saw them all cutting capers in the light, with
their arms lifted high, when the stout man with mustaches came tearing down to the river, a tin pail in his
hand, assured me that everybody was 'behaving splendidly, splendidly, dipped about a quart of water and
tore back again. (the italic was not annotated in source English)
English MWEs:
gone (slightly) to his head, cutting capers, tearing down, tore back
Target Chinese Corpus:
[sourceVMWE: gone (slightly) to his head][targetVMWE: …(…)…]
[sourceVMWE: cutting capers; tearing down; tore back][targetVMWE: …; …; …]
Target German Corpus:
Der Stuhl war bequem, und das Bier war ihm leicht zu Kopf gestiegen. [sourceVMWE: gone (slightly) to his
head][targetVMWE: (leicht) zu Kopf gestiegen]
Ich rauchte leise meine Pfeife an meinem zerlegten Dampfer und sah, wie sie alle im Licht mit hoch
erhobenen Armen Luftsprünge machten, als der stämmige Mann mit Schnurrbart mit einem Blecheimer in der
Hand zum Fluss hinunterkam und mir versicherte, dass sich alle prächtig, prächtig benahmen, etwa einen
Liter Wasser eintauchte und wieder zurückwankte”. [sourceVMWE: cutting capers; tearing down; tore back]
[targetVMWE: Luftsprünge machten; hinunterkam; zurückwankte]
Target Polish Corpus:
Krzesło było wygodne, a piwo lekko uderzyło mu do głowy. [ sourceVMWE: gone (slightly) to his head]
[targetVMWE: (lekko) uderzyło mu do głowy]
Cicho paliłem swoją fajkę przy zdemontowanym parowcu i widziałem, jak wszyscy pląsają w świetle, z
podniesionymi wysoko ramionami, gdy twardziel z wąsami przyszedł szybkim krokiem do rzeki, blaszany
wiaderko w dłoni, zapewnił mnie, że wszyscy zachowują się wspaniale, wspaniale, nabrał około ćwiartkę wody
i zawrócił szybkim krokiem”. [sourceVMWE: cutting capers; tearing down; tore back][targetVMWE: pląsają;
przyszedł szybkim krokiem; zawrócił szybkim krokiem]
Target Italian Corpus:
La sedia era comoda, e la birra gli aveva leggermente dato alla testa. [ sourceVMWE: gone (slightly) to his
head][targetVMWE: aveva (leggermente) dato alla testa ]
Stavo fumando tranquillamente la pipa vicino al mio piroscafo smontato, e li ho visti tutti giocare
gioiosamente alla luce, con le braccia alzate, quando l'uomo robusto con i baffi è venuto giù al fiume
alacremente, un secchio di latta in mano, mi ha assicurato che tutti si stavano comportando splendidamente,
splendidamente, ha preso circa un litro d'acqua ed è tornato indietro velocemente. [ sourceVMWE: cutting
capers; tearing down; tore back] [targetVMWE: giocare gioiosamente; venuto giù alacremente; tornato
indietro velocemente]
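The bracketed annotations in the examples above can be separated from the sentence text with a small parser. This is a sketch based only on the layout visible on this slide (`[sourceVMWE: …][targetVMWE: …]` with `;` between several MWEs); the released corpus files may be laid out differently, and the function name `split_annotated` is made up for the illustration.

```python
import re
from typing import List, Tuple

# Matches "[sourceVMWE: ...]" or "[targetVMWE: ...]", tolerating the
# stray spaces after "[" that appear in some of the examples.
TAG = re.compile(r"\[\s*(sourceVMWE|targetVMWE)\s*:\s*([^\]]*)\]")


def split_annotated(line: str) -> Tuple[str, List[str], List[str]]:
    """Split an annotated line into (sentence, source MWEs, target MWEs)."""
    source: List[str] = []
    target: List[str] = []
    for name, body in TAG.findall(line):
        # Several MWEs in one tag are separated by ";".
        items = [" ".join(part.split()) for part in body.split(";")]
        items = [item for item in items if item]
        (source if name == "sourceVMWE" else target).extend(items)
    sentence = TAG.sub("", line).strip()
    return sentence, source, target


if __name__ == "__main__":
    example = (
        "Der Stuhl war bequem, und das Bier war ihm leicht zu Kopf gestiegen. "
        "[sourceVMWE: gone (slightly) to his head]"
        "[targetVMWE: (leicht) zu Kopf gestiegen]"
    )
    print(split_annotated(example))
```

Keeping the source and target MWE lists in the order they appear makes it straightforward to pair them up afterwards, assuming both tags list the same number of expressions, as they do in the examples shown here.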
News! https://github.com/poethan/AlphaMWE/releases/tag/V1.0
References
• Our work:
  • AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. In Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX) @COLING-2020, pages 44–57. https://www.aclweb.org/anthology/2020.mwe-1.6/
  • Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), Valencia, Spain, April 4, 2017, pages 114–120. https://www.aclweb.org/anthology/W17-1715/
  • MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. In 12th International Conference on Language Resources and Evaluation (LREC), 11–16 May 2020, Marseille, France (virtual). https://www.aclweb.org/anthology/2020.lrec-1.363/
  • Chinese Character Decomposition for Neural MT with Multi-Word Expressions. In 23rd Nordic Conference on Computational Linguistics. Data available under the subfolder 'radical4mt'. https://www.aclweb.org/anthology/2021.nodalida-main.35/
  • Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods. @NoDaLiDa21. https://ep.liu.se/ecp/179/003/ecp2021179003.pdf
• Based on/refer to:
  • Agata Savary, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In MWE 2017.
  • Abigail Walsh, et al. 2018. Constructing an annotated corpus of verbal MWEs for English. In LAW-MWE-CxG-2018, pages 193–200.
  • Carlos Ramisch, et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In LAW-MWE-CxG-2018.
  • Freitag et al. 2021. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. https://arxiv.org/abs/2104.14478
We endorse the PARSEME shared task events and the corpus!
• MWE:

• Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, Second Edition, pages 267–292. Chapman and Hall.

• Mathieu Constant, et al. 2017. Multiword expression processing: A Survey. Computational Linguistics, 43(4):837–892.

• Ivan A. Sag, et al. 2002. Multiword expressions: A pain in the neck for NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing.

• MWE corpus: 

• Akihiko Kato, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Construction of Large-scale English Verbal Multiword Expression Annotated Corpus. In
LREC.

• Nathan Schneider, et al. 2014. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the LREC.

• Veronika Vincze. 2012. Light verb constructions in the SzegedParalellFX English–Hungarian parallel corpus. In LREC. 

• MT with MWE:

• Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine
translation. In LREC.

• Patrik Lambert and Rafael E. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of
Machine Translation Summit X, pages 396–403, Thailand.

• Xiaoqing Li, Jinghui Yan, Jiajun Zhang, and Chengqing Zong. 2019. Neural name translation improves neural machine translation. In
Machine Translation, pages 93–100, Singapore. Springer. 

• Matīss Rikters and Ondřej Bojar. 2017. Paying Attention to Multi-Word Expressions in Neural Machine Translation. In Proceedings of the 16th Machine Translation Summit.

• Inguna Skadina. 2016. Multi-word expressions in English-Latvian machine translation. Baltic J. Modern Computing, 4:811–825.
• Dankeschön!
• 谢谢!
• Thank you!
• Gracias!
• Grazie!
• Dziękuję Ci!
• Merci!
• Dank je!
• спасибі!
• धन्यवाद!
• Благодаря ти!
quiz: which language do you recognise? 😉
• Go raibh maith agat!
• Tak skal du have
• Takk skal du ha
• Tack
• Kiitos
• Þakka þér fyrir
• Qujan Qujanaq Qujanarsuaq
Further Reading A.I(MWEs)
• [1] Erwan Moreau, Ashjan Alsulaimani, Alfredo Maldonado, Lifeng Han, Carl Vogel and Koel
Dutta Chowdhury. Semantic Re-Ranking of CRF Label Sequences for Verbal Multi-Word
Expression Extraction. Book Chapter. Stella Markantonatou, Carlos Ramisch, Agata Savary,
and Veronika Vincze Volume Editors. Language Science Press (LangSci). pp.1-24. 2018
• [2] Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel Dutta
Chowdhury, Carl Vogel and Qun Liu. Detection of Verbal Multi-Word Expressions via
Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking.
In MWE workshop with EACL 2017, Spain. (one of the three main co-authors)
• Previous related:
• [3] Lifeng Han, Xiaodong Zeng, Derek F. Wong, Lidia S. Chao. Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model. SIGHAN workshop in ACL-IJCNLP. 2015.
• [4] Lifeng Han, Derek F. Wong, Lidia S. Chao, Liangye He, et al. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. Language Processing and Knowledge in the Web - Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology.
• [5] Lifeng Han, Derek F. Wong, Lidia S. Chao. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Proceedings of International Conference of Language Processing and Intelligent Information Systems. IIS 2013, LNCS Vol. 7912, pp. 57-68
Further Reading A.II(MT)
• [6] Lifeng Han, Shaohui Kuang. Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level. In ESSLLI-2018. August 6-17, Sofia, Bulgaria. http://doras.dcu.ie/24732/8/esslli_han_incorperating_.pdf
• Previous related:
• [7]Lifeng Han, Derek F. Wong, Lidia S. Chao, et al. A Universal
Phrase Tagset for Multilingual Treebanks. CCL and NLP-NABD
2014, LNAI 8801, pp. 247 - 258.
• [8]Lifeng Han, Derek F. Wong, et al. Phrase Tagset Mapping for
French and English Treebanks and Its Application in Machine
Translation Evaluation. Language Processing and Knowledge in
the Web - Proceedings of the International Conference of the
German Society for Computational Linguistics and Language
Technology, (GSCL 2013), Darmstadt, Germany, on September
25-27, 2013. LNCS Vol. 8105
Further Reading A.III(MTE)
• [9]Lifeng Han. Machine Translation Evaluation Resources and Methods: A Survey.
Presented in IPRC-2018 (Ireland Postgraduate Research Conference, 8-9 November,
Dublin) pp.1-18. arXiv CS.CL(1605.04515)
• Previous related:
• [10]Lifeng Han, Derek F. Wong, et al. Unsupervised Quality Estimation Model for
English to German Translation and Its Application in Extensive Supervised
Evaluation. The Scientific World Journal. Issue: Recent Advances in Information
Technology. ISSN:1537-744X
• [11]Lifeng Han, Derek F. Wong, et al. Language-independent Model for Machine
Translation Evaluation with Reinforced Factors. MT SUMMIT 2013. pp. 215-222.
• [12]Lifeng Han, Derek F. Wong, et al. A Description of Tunable Machine Translation
Evaluation Systems in WMT13 Metrics Task. In ACL-WMT 2013.
• [13]Lifeng Han, Derek F. Wong, et al. Quality Estimation for Machine Translation
Using the Joint Method of Evaluation Criteria and Statistical Modeling. ACL-WMT
2013.
• [14]Lifeng Han, Derek F. Wong, Lidia S. Chao. LEPOR: A Robust Evaluation Metric for
Machine Translation with Augmented Factors. Proceedings of the 24th International
Conference on Computational Linguistics (COLING 2012): Posters, pages 441-450.

More Related Content

PDF
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
PDF
Apply chinese radicals into neural machine translation: deeper than character...
PDF
Meta-evaluation of machine translation evaluation methods
PDF
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
PDF
PubhD talk: MT serving the society
PDF
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
PDF
Successes and Frontiers of Deep Learning
PDF
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
Apply chinese radicals into neural machine translation: deeper than character...
Meta-evaluation of machine translation evaluation methods
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
PubhD talk: MT serving the society
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Successes and Frontiers of Deep Learning
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...

What's hot (20)

PPTX
Searching for the Best Machine Translation Combination
PDF
Practical machine learning - Part 1
PDF
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
PPT
Opinion mining for social media and news items in Romanian
PPTX
Arabic question answering ‫‬
PDF
Answer Selection and Validation for Arabic Questions
PDF
Question Answering - Application and Challenges
PDF
Natural language processing for requirements engineering: ICSE 2021 Technical...
PPTX
From TREC to Watson: is open domain question answering a solved problem?
PPTX
Detecting and Describing Historical Periods in a Large Corpora
PDF
Multi-modal Neural Machine Translation - Iacer Calixto
PDF
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
PPTX
Web services for supporting the interactions of learners in the social web - ...
PDF
Challenges in transfer learning in nlp
PPT
How useful are semantic links for the detection of implicit references in csc...
PDF
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
PDF
NLP Project Full Cycle
PDF
The VoiceMOS Challenge 2022
PDF
Language Models for Information Retrieval
PDF
SSSW 2013 - Feeding Recommender Systems with Linked Open Data
Searching for the Best Machine Translation Combination
Practical machine learning - Part 1
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Opinion mining for social media and news items in Romanian
Arabic question answering ‫‬
Answer Selection and Validation for Arabic Questions
Question Answering - Application and Challenges
Natural language processing for requirements engineering: ICSE 2021 Technical...
From TREC to Watson: is open domain question answering a solved problem?
Detecting and Describing Historical Periods in a Large Corpora
Multi-modal Neural Machine Translation - Iacer Calixto
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Web services for supporting the interactions of learners in the social web - ...
Challenges in transfer learning in nlp
How useful are semantic links for the detection of implicit references in csc...
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
NLP Project Full Cycle
The VoiceMOS Challenge 2022
Language Models for Information Retrieval
SSSW 2013 - Feeding Recommender Systems with Linked Open Data
Ad

More from Lifeng (Aaron) Han (20)

PDF
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
PDF
Measuring Uncertainty in Translation Quality Evaluation (TQE)
PDF
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
PDF
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
PDF
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
PDF
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
PDF
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
PDF
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
PDF
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
PDF
A deep analysis of Multi-word Expression and Machine Translation
PDF
machine translation evaluation resources and methods: a survey
PDF
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
PPTX
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
PDF
Lepor: augmented automatic MT evaluation metric
PDF
Thesis-Master-MTE-Aaron
PDF
Machine translation evaluation: a survey
PDF
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
PDF
LEPOR: an augmented machine translation evaluation metric
PPTX
Pptphrase tagset mapping for french and english treebanks and its application...
PDF
Pptphrase tagset mapping for french and english treebanks and its application...
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
A deep analysis of Multi-word Expression and Machine Translation
machine translation evaluation resources and methods: a survey
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model

Chinese Character Decomposition for Neural MT with Multi-Word Expressions

  • 1. Chinese Character Decomposition for Neural MT with Multi-Word Expressions. Lifeng Han*, Gareth J. F. Jones*, Alan F. Smeaton^, and Paolo Bolzoni (*ADAPT Research Centre, ^Insight Centre for Data Analytics; Dublin City University, Ireland). Research papers at NoDaLiDa 2021 (the 23rd Nordic Conference on Computational Linguistics) and the COLING 2020 MWE-LEX workshop. Bonus takeaway: AlphaMWE, a multilingual corpus with MWEs. ADAPT seminar series, June 2021.
  • 2. www.adaptcentre.ie Content • Background (motivation of this work) • Related work (Decomposition4MT) • Our refined decomposition NMT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Cartoon: https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm Data: https://github.com/poethan/MWE4MT (subfolder radical4mt)
  • 3. Background: two driving factors • Sub-character NMT: • BPE is used for English and other Western languages. • What about Asian ideographic scripts? • NMT bottlenecks (MWEs, low-frequency words, OOV words): • How can multi-word expression (MWE) translation be better addressed? • MWEs “display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity”, • e.g. kick the bucket, by and large, pull one's leg
  • 4. NMT components [Diagram labels: parallel corpora, trainer, learned model; source text, encoding (encoder NNs), decoding (decoder NNs), MT outputs, target language; neural networks: RNN, CNN, + attention]
  • 5. A bigger view of where this work belongs within the NMT research paradigm and its branches [Diagram labels: NMT; learning model (attention, coverage, all-attention Transformer 2017, BERT 2018 (Google), pretrained language model, bi-directional RNN); linguistic structure & knowledge, syntax structure & dependency (Tree2string, Eriguchi et al. ACL 2016; String2tree, Aharoni & Goldberg ACL 2017; Tree2Tree? T2T NN program translation, Chen et al. ICLR 2018; Dependency, Wu et al. IJCAI 2017); semantics & disambiguation (MWE methodology, dictionary usage, bilingual phrase table, ...); a "here" marker locates this work]
  • 6. Background: Chinese characters, example • Semantic part + phonetic part • Semantics: radicals • Phonetics: related to the overall pronunciation of the character [Figure: evolution of the Chinese radical 刀 (dāo, knife) from pictogram to regular script. Periods shown: Shang Dynasty (1600-1046 BC), Western Zhou Dynasty (1045-771 BC), Warring States period (476-221 BC), Han Dynasty (202 BC-220 AD), Eastern Han (from 57 AD on). Scripts shown: oracle bone script, bronze inscriptions, silk (on seal), regular script]
  • 7. Background [Figure: decomposition levels (Level-1, Level-2, Level-3, ..., full-stroke) for two characters: 鋒 (fēng) splits into a semantic part (metal) and a phonetic part (féng); 劍 (jiàn) splits into a phonetic part (qiān) and a semantic part (knife)] [Table: one sentence at four granularities] • Word level: 28 / 岁 / 厨师 / 被 / 发现 / 死 / 于 / 旧金山 / 一家 / 商场 • Character: 28 岁 厨 师 被 发 现 死 于 旧 金 山 一 家 商 场 • Pronunciation: èr shí bā suì chú shī bèi fā xiàn sǐ yú jiù jīn shān yī jiā shāng chǎng • Radical: 28 followed by the radical-level decomposition of each character • English ref.: 28-Year-Old Chef Found Dead at San Francisco Mall
  • 8. Background: example of MWEs as a challenge in MT • ZH source: 年年岁岁花相似，岁岁年年人不同。 • ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng. • EN reference: The flowers are similar each year, while people are changing every year. • EN MT output: One year spent similar, each year is different • Reference: Han et al. (LREC 2020) MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. https://www.aclweb.org/anthology/2020.lrec-1.363/
  • 9. Content • Background (motivation of this work) • Related work (Decomposition4MT) • Our refined decomposition NMT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Cartoon: https://www.amazon.ca/Tweety-Bird-not- Section: Related work
  • 10. Related work: Chinese character decomposition • Radical embeddings as additional features for Chinese → English and Japanese-Chinese NMT: • Our own work, Han and Kuang (2018): a range of encoding models including word+character, word+radical, and word+character+radical (best) with bidirectional RNNs • Zhang and Matsumoto (2018): radical embeddings as additional features for character-level LSTM-based NMT on Japanese → Chinese translation • Bidirectional English-Japanese, English-Chinese and Chinese-Japanese NMT at word, character, ideograph and stroke levels: • Zhang and Komachi (2018): experiments showing that the ideograph level was best for ZH→EN MT, while the stroke level was best for JP→EN MT • No testing of intermediate decomposition levels • Reference: Han and Kuang (2018) Incorporating Chinese radicals into neural machine translation: deeper than character level. In: 30th European Summer School in Logic, Language and Information (ESSLLI 2018) https://arxiv.org/abs/1805.01565
  • 11. Related work: Chinese character decomposition • Reference: Han and Kuang (2018) Incorporating Chinese radicals into neural machine translation: deeper than character level. In: 30th European Summer School in Logic, Language and Information (ESSLLI 2018) https://arxiv.org/abs/1805.01565
  • 12. Content • Background (motivation of this work) • Related work (Decomposition4MT) • Refined decomposition Neural MT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Data: https://github.com/poethan/MWE4MT (radical4mt)
  • 13. Refined decomposition model with MWEs: IDS files from CHISE • CHISE (CHaracter Information Service Environment) project: 88,940 Chinese characters from the CJK (Chinese, Japanese, Korean script) Unified Ideographs • https://github.com/cjkvi/cjkvi-ids • http://www.chise.org/ • Character construction types: up-down, left-right, inside-outside, embedded [Table: example IDS entries (character, decomposition) for lì, jù, hán and yǒng, each with region-tagged decomposition variants: [G] / [T], [GTKV] / [J], [GTV] / [JK], [GTV] / [JK]]
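A minimal sketch of how one record of an IDS file could be read into a lookup entry. It assumes a tab-separated layout (code point, character, then one or more decomposition sequences, each optionally followed by a region tag such as [GTV]); that layout, the function name `parse_ids_line` and the sample record are illustrative assumptions, not details taken from the slides.

```python
import re

# Trailing region tag such as "[GTKV]" (assumed format).
REGION_TAG = re.compile(r"\[([A-Z]+)\]$")


def parse_ids_line(line):
    """Return (character, [(sequence, regions), ...]) or None for non-data lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(("#", ";")):
        return None  # blank line or comment
    fields = line.split("\t")
    if len(fields) < 3:
        return None  # not a "code point, character, sequence..." record
    character = fields[1]
    variants = []
    for field in fields[2:]:
        match = REGION_TAG.search(field)
        regions = match.group(1) if match else ""
        sequence = REGION_TAG.sub("", field)  # drop the tag, keep the sequence
        variants.append((sequence, regions))
    return character, variants


# Hypothetical record: ASCII letters stand in for real characters and components.
SAMPLE = "U+0000\tX\t⿰AB[GT]\t⿱AB[J]"
print(parse_ids_line(SAMPLE))
```

Keeping every region variant, rather than only the first, leaves the choice of variant (for example the one tagged G for mainland Chinese text) to the caller.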
  • 14. Extraction procedure • To obtain a decomposition level L representation of a Chinese character α: • go through the IDS file L times. • Each time, we search the IDS file character list to match the newly generated smaller characters and • replace them with their decomposed representation recursively.
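The procedure above can be sketched as a repeated table lookup. This is an illustration under stated assumptions rather than the authors' implementation: `decompose` and `TOY_IDS` are hypothetical names, the toy table is rebuilt by hand from the 球 component of the golf-club example in the slides, each character has a single decomposition, and layout operators are dropped from the output.

```python
# Ideographic description characters (U+2FF0..U+2FFB) mark layout, not components.
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Toy table (assumption): one decomposition per character, layout operators omitted.
TOY_IDS = {
    "球": "王求",
    "王": "一土",
    "求": "一氺丶",
    "土": "十一",
    "氺": "亅丷八",
}


def decompose(text, ids, level):
    """Expand every character via the table `level` times; unknown characters stay."""
    sequence = text
    for _ in range(level):
        expanded = "".join(ids.get(ch, ch) for ch in sequence)
        expanded = "".join(ch for ch in expanded if ch not in IDCS)
        if expanded == sequence:
            break  # nothing left to decompose
        sequence = expanded
    return sequence


for level in (1, 2, 3):
    print(level, decompose("球", TOY_IDS, level))
```

Each pass only rewrites characters that have an entry in the table, so components that are already atomic (such as 一) pass through unchanged at deeper levels.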
  • 15. Examples of decomposition/extraction • Shared bilingual glossaries: https://github.com/poethan/MWE4MT/tree/master/radical4mt • Zh MWE: 高尔夫球 俱乐部 (golf club), 汽车 散热器 (car radiator) • Rxd1: {亠口冋𠂊小二人王求 亻具乐咅阝}, {氵气车 龷攵执灬吅犬吅} • Rxd2: {丶一口冂口𠂊小一一人一土一氺丶 亻且一乐立口阝}, {氵𠂉一乁车 卄一攵扌丸灬口口犬口口} • Rxd3: {丶一口冂口𠂊小一一人一十一一亅丷八丶 亻且一乐亠丷一口阝}, {氵𠂉一乁车 十丨一攵扌九丶灬口口犬口口} • Level 2 generates 土 and 氺; level 3 then decomposes them further.
  • 16. Adding MWEs and decomposed MWEs • Reference: Lifeng Han, Gareth J.F. Jones and Alan F. Smeaton. 2020. MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2970–2979, Marseille, 11–16 May 2020
  • 17. Content • Background (motivation of this work) • Related work (Decomposition4MT) • Our refined decomposition NMT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Cartoon: https://images.app.goo.gl/Y6bAfr9oFsswWjUY7 What's happening?
  • 18. Evaluation: BLEU scores with increasing learning steps [Figure]
  • 19. Evaluation: BLEU scores with increasing learning steps [Figure]
  • 21. Content • Background (motivation of this work) • Related work (Decomposition4MT) • Our refined decomposition NMT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Cartoon: https://www.bbc.com/news/ Data: https://github.com/poethan/MWE4MT (radical4mt)
  • 22. Expert analysis: new research trend - BLEU has long been criticised for not reflecting real differences between high-performing MT models. - Crowd-sourced human evaluation is not reliable either: very recent work on WMT data shows that professional translators largely disagree with crowd-sourced human rankings of MT systems (Freitag et al., 2021). - BLEU and crowd-sourced human assessment (CSHA): - tend to favour ‘boring’ translations. - When lexical diversity in the MT output improves, scores get lower (BLEU, CSHA). - We look at detailed translation examples from the different system outputs at 100K learning steps, analysed by a human expert who is a native speaker: the examples reflect the advantages of the decomposition models, e.g. RXD3 (and also RXD1), on MWE translation.
  • 23. Example translations (two sentences per system) • src: [two Chinese source sentences, each beginning with "28"] • ref: (1) 28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall (2) a 28 @-@ year @-@ old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week . • rxd3: (1) the 28 @-@ year @-@ old chef was found dead at a San Francisco mall (2) a 28 @-@ year @-@ old chef who recently moved to San Francisco has been found dead on a stairwell in a local mall this week . • base: (1) the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco (2) a 28 @-@ year @-@ old chef who has moved to San Francisco this week was found dead on the stairs of a local mall . • base MWE: (1) 28 @-@ year @-@ old chef was found dead at a San Francisco mall (2) a 28 @-@ year @-@ old chef who recently moved to San Francisco was found dead this week at a local mall . • rxd3 MWE: (1) 28 @-@ year @-@ old chef was found dead at a San Francisco mall (2) a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead this week at a local mall . • rxd1: (1) the 28 @-@ year @-@ old chef was found dead at a San Francisco mall (2) a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead in a local shopping mall this week . • rxd2: (1) the 28 @-@ year @-@ old chef was found dead in a San Francisco mall (2) a 28 @-@ year @-@ old San Francisco chef was found dead in a local mall this week .
  • 24. Expert analysis: insights on MWEs 1) Chinese MWE 商场 (shāng chǎng) in the first sentence: - correctly translated as {mall} by the {rxd3} model, versus {shop} by the baseline character sequence model. 2) MWE 楼梯间 (lóutījiān) in the second sentence: - correctly translated as {stairwell} by the {rxd3} model, versus {stairs} by the baseline. 3) MWE 近日 (jìn rì), meaning {recently}, in the second sentence: - completely missed by the original character sequence model => this makes the translation of a larger span misleading and ambiguous, i.e. did the chef move to San Francisco (SF) {recently} or {this week}? - MWE 近日 (jìn rì) is correctly translated by the {rxd3} model => the overall meaning of the sentence is clear.
  • 25. Expert analysis: on multi-word expressions 1) The improvement is not reflected by BLEU scores: a) because of the low percentage of MWEs in the corpus, although they are an important part of the corpus and of human languages/expressions; b) because of MWE interpretations, lexical diversity in translation, and the reference corpus. 2) It is not reflected by crowd-sourced human assessment: a) because the assessors were not well trained; b) had no clear guidelines in most cases; c) do not come from a linguistic/translation background; d) favour the candidate translation with n-gram matches to the source/reference. - Whenever human expert assessment is available/possible, do it!
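To make the first point concrete, here is a minimal sketch of the clipped n-gram precision that BLEU builds on, applied to the first example sentence from slide 23. It is a simplified illustration, not the evaluation setup used in the talk: the names `ngram_counts` and `clipped_precision` are hypothetical, tokens are lowercased and split on whitespace, and BLEU's brevity penalty and its combination of several n-gram orders are left out.

```python
from collections import Counter

# First example sentence from slide 23 (reference, rxd3 output, baseline output).
REF = "28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall"
RXD3 = "the 28 @-@ year @-@ old chef was found dead at a San Francisco mall"
BASE = "the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco"


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams found in the reference, clipped by reference counts."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


for name, hyp in (("rxd3", RXD3), ("base", BASE)):
    print(name, clipped_precision(hyp, REF, n=1), clipped_precision(hyp, REF, n=2))
```

A single MWE translation only touches a handful of n-gram counts in a sentence, so in a corpus where few tokens belong to MWEs its effect on a corpus-level score is diluted.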
  • 26. Expert analysis: diagnosing RXD2 1) RXD1 separates a character into its semantic and phonetic parts. 2) RXD3 decomposes it further into an ordered, stroke-like sequence. 3) RXD2 generates smaller characters that can mislead the language understanding model. Example (figure on slide 7): at RXD2, the new characters 从 (cóng) and 王 (wáng) come from 劍 (jiàn, {sword}) and 鋒 (fēng, {edge / sharp point}) respectively, but they carry no direct meaning from their parent characters; instead they mean “from” and “king”. [Figure repeated from slide 7: decomposition of 鋒 (fēng: semantic part metal, phonetic part féng) and 劍 (jiàn: phonetic part qiān, semantic part knife) from Level-1 to full-stroke]
  • 27. Content • Background (motivation of this work) • Related work (Decomposition4MT) • Our refined decomposition NMT model with MWEs • Automatic and crowd-sourced human evaluation • Experts' analysis with examples (new insight) • => AlphaMWE (multilingual lexicon: corpus with MWEs) Bonus corpus: https://github.com/poethan/AlphaMWE
  • 29. AlphaMWE: size, coverage, usage (come and join us) • Extracted all 750 English sentences that include vMWE tags • English source: Walsh et al. (2018) https://gitlab.com/parseme/parseme_corpus_en • Target languages covered so far: Chinese, German, Polish and Italian, with Spanish and French being edited (why not join the team?) • Its size is comparable to data sets used in standard shared tasks: • the development and test sets from the annual WMT (Bojar et al., 2017) and from the NIST MT challenges contain approximately 2,000 sentences in some years (https://www.nist.gov/programs-projects/machine-translation) • Planned: submission to shared tasks (multilingual/bilingual MT, NLP)
  • 30. Examples of AlphaMWE sentences: EN and DE/PL/ZH/IT • Plain English corpus: The chair was comfortable, and the beer had gone slightly to his head. I was smoking my pipe quietly by my dismantled steamer, and saw them all cutting capers in the light, with their arms lifted high, when the stout man with mustaches came tearing down to the river, a tin pail in his hand, assured me that everybody was 'behaving splendidly, splendidly,' dipped about a quart of water and tore back again. (the italic part was not annotated in the source English) • English MWEs: gone (slightly) to his head, cutting capers, tearing down, tore back • Target Chinese corpus: [Chinese sentences and target VMWEs not preserved in the transcript] [sourceVMWE: gone (slightly) to his head] [sourceVMWE: cutting capers; tearing down; tore back] • Target German corpus: Der Stuhl war bequem, und das Bier war ihm leicht zu Kopf gestiegen. [sourceVMWE: gone (slightly) to his head][targetVMWE: (leicht) zu Kopf gestiegen] Ich rauchte leise meine Pfeife an meinem zerlegten Dampfer und sah, wie sie alle im Licht mit hoch erhobenen Armen Luftsprünge machten, als der stämmige Mann mit Schnurrbart mit einem Blecheimer in der Hand zum Fluss hinunterkam und mir versicherte, dass sich alle prächtig, prächtig benahmen, etwa einen Liter Wasser eintauchte und wieder zurückwankte. [sourceVMWE: cutting capers; tearing down; tore back][targetVMWE: Luftsprünge machten; hinunterkam; zurückwankte] • Target Polish corpus: Krzesło było wygodne, a piwo lekko uderzyło mu do głowy. [sourceVMWE: gone (slightly) to his head][targetVMWE: (lekko) uderzyło mu do głowy] Cicho paliłem swoją fajkę przy zdemontowanym parowcu i widziałem, jak wszyscy pląsają w świetle, z podniesionymi wysoko ramionami, gdy twardziel z wąsami przyszedł szybkim krokiem do rzeki, blaszany wiaderko w dłoni, zapewnił mnie, że wszyscy zachowują się wspaniale, wspaniale, nabrał około ćwiartkę wody i zawrócił szybkim krokiem. [sourceVMWE: cutting capers; tearing down; tore back][targetVMWE: pląsają; przyszedł szybkim krokiem; zawrócił szybkim krokiem] • Target Italian corpus: La sedia era comoda, e la birra gli aveva leggermente dato alla testa. [sourceVMWE: gone (slightly) to his head][targetVMWE: aveva (leggermente) dato alla testa] Stavo fumando tranquillamente la pipa vicino al mio piroscafo smontato, e li ho visti tutti giocare gioiosamente alla luce, con le braccia alzate, quando l'uomo robusto con i baffi è venuto giù al fiume alacremente, un secchio di latta in mano, mi ha assicurato che tutti si stavano comportando splendidamente, splendidamente, ha preso circa un litro d'acqua ed è tornato indietro velocemente. [sourceVMWE: cutting capers; tearing down; tore back][targetVMWE: giocare gioiosamente; venuto giù alacremente; tornato indietro velocemente]
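The bracketed annotations in the examples above lend themselves to simple pattern matching. The sketch below is an assumption about how such lines could be read, not a tool shipped with AlphaMWE: `extract_vmwe_pairs` is a hypothetical helper, and it assumes that the items inside a sourceVMWE tag and the following targetVMWE tag are separated by semicolons and listed in the same order.

```python
import re

# A [sourceVMWE: ...] tag directly followed by a [targetVMWE: ...] tag.
PAIR = re.compile(
    r"\[\s*sourceVMWE:\s*(.*?)\s*\]\s*\[\s*targetVMWE:\s*(.*?)\s*\]"
)

# Annotation taken from the German example on the slide.
GERMAN_TAGS = (
    "[sourceVMWE: cutting capers; tearing down; tore back] "
    "[targetVMWE: Luftsprünge machten; hinunterkam; zurückwankte]"
)


def extract_vmwe_pairs(text):
    """Return (source MWE, target MWE) pairs, aligned by position within each tag pair."""
    pairs = []
    for source, target in PAIR.findall(text):
        source_items = [item.strip() for item in source.split(";")]
        target_items = [item.strip() for item in target.split(";")]
        # zip() stops at the shorter list if the two tags disagree in length.
        pairs.extend(zip(source_items, target_items))
    return pairs


for source_mwe, target_mwe in extract_vmwe_pairs(GERMAN_TAGS):
    print(source_mwe, "->", target_mwe)
```

Pairs extracted this way could serve as a small bilingual MWE glossary, for example to check whether an MT output contains the expected target expression.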
  • 32. References • Our work: • AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. In Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX) @COLING-2020, pages 44–57. https://www.aclweb.org/anthology/2020.mwe-1.6/ • Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking. Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), Valencia, Spain, April 4, 2017, pages 114–120. https://www.aclweb.org/anthology/W17-1715/ • MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. In: 12th International Conference on Language Resources and Evaluation (LREC), 11–16 May 2020, Marseille, France (virtual). https://www.aclweb.org/anthology/2020.lrec-1.363/ • Chinese Character Decomposition for Neural MT with Multi-Word Expressions. 23rd Nordic Conference on Computational Linguistics. Data available under the subfolder 'radical4mt'. https://www.aclweb.org/anthology/2021.nodalida-main.35/ • Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods. @NoDaLiDa21. https://ep.liu.se/ecp/179/003/ecp2021179003.pdf • Based on/refer to: • Agata Savary, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In MWE 2017. • Abigail Walsh, et al. 2018. Constructing an annotated corpus of verbal MWEs for English. In LAW-MWE-CxG-2018, pages 193–200. • Carlos Ramisch, et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In LAW-MWE-CxG-2018. • Freitag et al. 2021. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. https://arxiv.org/abs/2104.14478 • We endorse the PARSEME shared task events and the corpus!
  • 33. References • MWE: • Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, Second Edition, pages 267–292. Chapman and Hall. • Mathieu Constant, et al. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892. • Ivan A. Sag, et al. 2002. Multiword expressions: A pain in the neck for NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing. • MWE corpus: • Akihiko Kato, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Construction of Large-scale English Verbal Multiword Expression Annotated Corpus. In LREC. • Nathan Schneider, et al. 2014. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of LREC. • Veronika Vincze. 2012. Light verb constructions in the SzegedParalellFX English–Hungarian parallel corpus. In LREC. • MT with MWE: • Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In LREC. • Patrik Lambert and Rafael E. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of Machine Translation Summit X, pages 396–403, Thailand. • Xiaoqing Li, Jinghui Yan, Jiajun Zhang, and Chengqing Zong. 2019. Neural name translation improves neural machine translation. In Machine Translation, pages 93–100, Singapore. Springer. • Matīss Rikters and Ondřej Bojar. 2017. Paying Attention to Multi-Word Expressions in Neural Machine Translation. In Proceedings of the 16th Machine Translation Summit. • Inguna Skadina. 2016. Multi-word expressions in English-Latvian machine translation. Baltic J. Modern Computing, 4:811–825.
  • 34. • Dankeschön! • 谢谢! • Thank you! • Gracias! • Grazie! • Dziękuję Ci! • Merci! • Dank je! • спасибі! • धन्यवाद! • Благодаря ти! Quiz: which languages do you recognise? 😉 Go raibh maith agat! • tak skal du have • Takk skal du ha • tack • Kiitos • Þakka þér fyrir • Qujan • Qujanaq • Qujanarsuaq
  • 35. Further Reading A.I (MWEs) • [1] Erwan Moreau, Ashjan Alsulaimani, Alfredo Maldonado, Lifeng Han, Carl Vogel and Koel Dutta Chowdhury. Semantic Re-Ranking of CRF Label Sequences for Verbal Multi-Word Expression Extraction. Book chapter. Stella Markantonatou, Carlos Ramisch, Agata Savary, and Veronika Vincze, volume editors. Language Science Press (LangSci). pp. 1-24. 2018 • [2] Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel Dutta Chowdhury, Carl Vogel and Qun Liu. Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking. In MWE workshop with EACL 2017, Spain. (one of the three main co-authors) • Previous related: • [3] Lifeng Han, Xiaodong Zeng, Derek F. Wong, Lidia S. Chao. Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model. SIGHAN workshop in ACL-IJCNLP. 2015. • [4] Lifeng Han, Derek F. Wong, Lidia S. Chao, Liangye He, et al. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. Language Processing and Knowledge in the Web - Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology. • [5] Lifeng Han, Derek F. Wong, Lidia S. Chao. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Proceedings of the International Conference of Language Processing and Intelligent Information Systems. IIS 2013, LNCS Vol. 7912, pp. 57-68
  • 36. Further Reading A.II (MT) • [6] Lifeng Han, Shaohui Kuang. Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level. In ESSLLI-2018. August 6-17, Sofia, Bulgaria. http://doras.dcu.ie/24732/8/esslli_han_incorperating_.pdf • Previous related: • [7] Lifeng Han, Derek F. Wong, Lidia S. Chao, et al. A Universal Phrase Tagset for Multilingual Treebanks. CCL and NLP-NABD 2014, LNAI 8801, pp. 247-258. • [8] Lifeng Han, Derek F. Wong, et al. Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. Language Processing and Knowledge in the Web - Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013), Darmstadt, Germany, September 25-27, 2013. LNCS Vol. 8105
  • 37. Further Reading A.III (MTE) • [9] Lifeng Han. Machine Translation Evaluation Resources and Methods: A Survey. Presented at IPRC-2018 (Ireland Postgraduate Research Conference, 8-9 November, Dublin) pp. 1-18. arXiv cs.CL (1605.04515) • Previous related: • [10] Lifeng Han, Derek F. Wong, et al. Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation. The Scientific World Journal. Issue: Recent Advances in Information Technology. ISSN: 1537-744X • [11] Lifeng Han, Derek F. Wong, et al. Language-independent Model for Machine Translation Evaluation with Reinforced Factors. MT SUMMIT 2013. pp. 215-222. • [12] Lifeng Han, Derek F. Wong, et al. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task. In ACL-WMT 2013. • [13] Lifeng Han, Derek F. Wong, et al. Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling. ACL-WMT 2013. • [14] Lifeng Han, Derek F. Wong, Lidia S. Chao. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441-450.