Constrained text generation to measure reading performance:
A new approach based on constraint programming
Jean-Charles Régin(3)
Joint work with Alexandre Bonlarron(1,3), Aurélie Calabrèse(2), Pierre Kornprobst(1)
(1) Université Côte d’Azur, Inria, France
(2) Aix Marseille Université, CNRS, LPC, Marseille, France
(3) Université Côte d’Azur, I3S, France
Standardized Text (Mansfield et al., 1993)
• Standardized text: sentences that are read at the same speed
• Usability: to assess reading performance
Constrained text generation
This is a problem dominated by rules (constraints).
The MNREAD Chart is one such problem.
MNREAD Rules
• Display Rules, e.g. the sentence must fit inside the chart rectangle
• Lexical Rules, e.g. a vocabulary of 3000 words from CE2 (French 3rd grade) textbooks
• Grammatical Rules, e.g. no punctuation
• Length Rules, e.g. 60 characters, between 9 and 15 words
An example MNREAD sentence is shown on the chart; there are only 38 MNREAD sentences in French.
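To make these rules concrete, here is a minimal rule checker; this is an illustrative sketch, not the official MNREAD specification: the lexicon, the exact character count and the display rule are simplified assumptions.

```python
import string

def satisfies_basic_mnread_rules(sentence: str, lexicon: set) -> bool:
    """Check a few MNREAD-style rules on a candidate sentence (sketch).

    The display rule (the sentence must fit the chart layout) is not
    modeled here, and the 60-character count is taken as exact.
    """
    words = sentence.split()
    # Length rules: 60 characters including spaces, between 9 and 15 words.
    if len(sentence) != 60 or not (9 <= len(words) <= 15):
        return False
    # Grammatical rule: no punctuation (apostrophes are kept since they
    # appear inside French words; an assumption of this sketch).
    forbidden = set(string.punctuation) - {"'"}
    if any(ch in forbidden for ch in sentence):
        return False
    # Lexical rule: every word must belong to the restricted vocabulary.
    return all(w.lower() in lexicon for w in words)
```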
Are there enough sentences?
Questions:
• Are there enough sentences? No: there are only 38 MNREAD sentences in French, while a few thousand are needed to detect and monitor visual pathologies throughout life.
• Is it really difficult to obtain more sentences that respect the rules?
Naive method
Search for MNREAD-like sentences in a corpus.
Filtering a corpus of 2300 books (about 10,000,000 sentences) through the Display, Lexicon, Grammar and Length rules, the candidate set shrinks to 1,000,000, then 10,000, then 6, and finally only 3 sentences survive all the rules.
Problem: this method does not scale up.
Solution: we have to generate them, but how?
How to generate standardized sentences?
LLM-based approach (GPT, BERT) + search: good text quality, but unlikely to find an instance that satisfies the constraints.
These models generate sentences word by word, selecting the next word as the most suitable one (they actually work on tokens rather than words).
Prompt (ChatGPT 3.5): give me a sentence of sixty characters with spaces included
“Elephants march majestically through the savannah at sunset, their presence captivating”
Prompt (ChatGPT 3.5): give me a sentence of sixty characters
“The cat sat on the mat and purred softly”
How to generate standardized sentences?
Ad hoc method: a recent method proposed by the creators of MNREAD, based on hand-defined models (Mansfield et al., 2019).
• Only one "good" sentence out of 8000; memorization bias
• A semi-automatic method designed for the English language
• Non-trivial to extend to Romance languages, e.g. gender agreement in French: mon ami est beau / mon amie est belle ("my friend is handsome / my friend is beautiful")
How to generate standardized sentences?
n-gram based methods (Papadopoulos et al., 2015)
Corpus → n-grams → Generation
When used with a random walk, this produces sentences in the style of an author.
Problems:
• How to integrate constraints?
• How to manage the meaning of the sentences?
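For reference, a random walk over an n-gram table can be sketched as follows (bigrams for brevity; the corpus format and end-of-sentence handling are simplified assumptions):

```python
import random
from collections import defaultdict

def build_bigrams(corpus):
    """Map each word to the list of words that follow it in the corpus.

    `corpus` is a list of tokenized sentences (lists of words).
    """
    successors = defaultdict(list)
    for sentence in corpus:
        for current, nxt in zip(sentence, sentence[1:]):
            successors[current].append(nxt)
    return successors

def random_walk(successors, start, max_len=15):
    """Generate a sentence by repeatedly sampling a successor word."""
    words = [start]
    while len(words) < max_len and successors[words[-1]]:
        words.append(random.choice(successors[words[-1]]))
    return " ".join(words)
```

The walk mimics the corpus style locally, but nothing in it enforces the MNREAD rules or global coherence, which is exactly the problem raised above.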
Multi-Valued Decision Diagram (MDD)
• A generalization of Binary Decision Diagrams (BDDs)
• Each layer represents a variable
• Each path between the root and the terminal node tt is a valid assignment of the variables
• An MDD models all the tuples that satisfy a constraint
• Example: an MDD having 3 solutions: (a,b), (a,a), (b,b)
• Data structure for computing and storing the solutions of a problem in compressed form, using a directed acyclic graph
• Advantage: a powerful modeling tool. With one billion arcs we can represent 10^90 solutions!
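A bare-bones MDD representation and its path enumeration might look like the sketch below (the node and edge encoding is chosen for illustration only; MDDLib's actual data structures differ):

```python
class Node:
    """A node of an MDD; outgoing edges map a label (value) to a child node."""
    def __init__(self, name):
        self.name = name
        self.edges = {}          # label -> child Node

    def add_edge(self, label, child):
        self.edges[label] = child

def enumerate_paths(node, terminal, prefix=()):
    """Yield every label sequence (solution) from `node` down to `terminal`."""
    if node is terminal:
        yield prefix
        return
    for label, child in node.edges.items():
        yield from enumerate_paths(child, terminal, prefix + (label,))

# The 3-solution example from the slide: (a,a), (a,b), (b,b).
root, u, v, tt = Node("root"), Node("u"), Node("v"), Node("tt")
root.add_edge("a", u)    # first variable = a
root.add_edge("b", v)    # first variable = b
u.add_edge("a", tt)      # (a,a)
u.add_edge("b", tt)      # (a,b)
v.add_edge("b", tt)      # (b,b)
print(sorted(enumerate_paths(root, tt)))  # [('a', 'a'), ('a', 'b'), ('b', 'b')]
```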
MDD and compression
• Example: the sum of 3 variables
• Corresponds to an automaton
• The last layer gives the value of the sum
Reduction
• Operation which merges equivalent nodes
• Two nodes are equivalent if they have the same outgoing edges (same destinations and same labels)
• This is the analogue of the minimization of finite automata
[Figure, shown in two steps: an MDD with nodes root, a, b, c, d, e and terminal tt; after reduction, the equivalent nodes c and e are merged into a single node ce.]
Reduction
• Reduction may gain an exponential factor
• Consequence: an MDD can be exponentially smaller than an equivalent automaton
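A bottom-up reduction pass over this layered representation can be sketched as follows (it reuses the illustrative Node class above; the layer list and the signature encoding are assumptions of the sketch):

```python
def reduce_mdd(layers):
    """Merge equivalent nodes, layer by layer from the terminal up to the root.

    Two nodes are equivalent when their outgoing edges carry the same labels
    to the same children; parents are redirected to the surviving node, as in
    the minimization of finite automata.
    """
    representative = {}                      # id(old node) -> surviving node
    for layer in reversed(layers):
        by_signature = {}
        for node in layer:
            # Point the edges at surviving representatives of the layer below.
            node.edges = {lab: representative.get(id(ch), ch)
                          for lab, ch in node.edges.items()}
            sig = tuple(sorted((lab, id(ch)) for lab, ch in node.edges.items()))
            representative[id(node)] = by_signature.setdefault(sig, node)
        layer[:] = list(by_signature.values())   # keep one node per signature

# Usage with the earlier example: reduce_mdd([[root], [u, v], [tt]]).
# Here u and v stay separate because their outgoing edges differ.
```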
Compression gain
• Compression may gain an exponential factor
• It often does!
• Example: an MDD requiring 600,000 edges to represent 10^90 solutions, that is a compression factor of 10^86
• Sometimes the gain can be subtle
Alldiff constraint
• Number of nodes = 2^n
• Number of solutions = n!
• What about the ratio n!/2^n?
• It is exponential
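A quick side calculation (not on the original slide) makes the gain explicit via Stirling's approximation:

```latex
\frac{n!}{2^n} \;\approx\; \frac{\sqrt{2\pi n}\,(n/e)^n}{2^n}
\;=\; \sqrt{2\pi n}\left(\frac{n}{2e}\right)^n
```

so the ratio grows faster than any fixed exponential base as soon as n exceeds 2e ≈ 5.4: the MDD with 2^n nodes stores n! solutions with an exponential compression factor.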
MDD: creation
• An MDD can be created without enumerating the solution set
• It can be built from a Dynamic Programming formulation
• A kind of search compression
• So what? Operations!
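As an illustration of the dynamic-programming construction, the sum-of-variables MDD from the earlier slide can be built layer by layer, keeping one node per reachable partial sum (a sketch reusing the Node class above):

```python
def build_sum_mdd(domains):
    """Build the MDD of all assignments of the variables, without enumerating
    them: the DP state of a layer is the partial sum reached so far."""
    root = Node("root")
    current = {0: root}                       # partial sum -> node of the layer
    for i, domain in enumerate(domains):
        nxt = {}
        for partial, node in current.items():
            for value in domain:
                total = partial + value
                child = nxt.setdefault(total, Node(f"layer{i + 1}_sum{total}"))
                node.add_edge(value, child)
        current = nxt
    return root, current          # the last layer has one node per total sum

# Sum of 3 variables with domain {0, 1}: the last layer has nodes for 0..3.
root3, totals = build_sum_mdd([[0, 1], [0, 1], [0, 1]])
print(sorted(totals))             # [0, 1, 2, 3]
```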
MDD: operations
• Intersection, union, difference, negation, etc.
• Operations are performed without decompression
• Intersecting 2 MDDs is equivalent to taking the conjunction of the 2 constraints represented by the MDDs
• Relation between MDD operations and constraint combination:
  ○ Intersection: conjunction
  ○ Union: disjunction
  ○ Negation: negation
Intersection (worked example)
[Figure, built step by step over several slides: the intersection of a first MDD (nodes root1, a, b, ce, d, terminal tt1) with a second MDD (nodes root2, k, l, terminal tt2), both labelled with values 0, 1, 2. The result is constructed level by level as a product of the two MDDs, creating the nodes root, ak', cel', dl' and the terminal tt; an arc is kept only if its label is allowed by both MDDs at that level.]
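The product construction behind this example can be sketched as follows (written recursively for brevity, whereas the on-the-fly version in the talk proceeds level by level; it reuses the illustrative Node class above):

```python
def intersect(root1, root2):
    """Intersection of two MDDs: an arc is kept only if its label is allowed
    by both MDDs; each result node is a pair of nodes, one from each MDD."""
    pairs = {}                                 # (id, id) -> product node

    def product(n1, n2):
        key = (id(n1), id(n2))
        if key not in pairs:
            node = Node(n1.name + n2.name + "'")
            pairs[key] = node
            for label, child1 in n1.edges.items():
                if label in n2.edges:          # label present in both MDDs
                    node.add_edge(label, product(child1, n2.edges[label]))
        return pairs[key]

    # Dead branches (nodes with no path to the terminal) would still need to
    # be pruned afterwards, as in the slides where some arcs disappear.
    return product(root1, root2)
```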
Intersection
• Be careful: do not assume that the intersection will have fewer nodes/edges
• The resulting MDD can be exponentially larger, because it can be locally decompressed
Operations
• In-place
• On-the-fly (i.e. avoid having to define the MDDs in advance; proceed level by level)
3 Ideas

First idea: store and retrieve n-grams efficiently
Successions Constraint
• All the n-grams of the corpus are inserted in the MDD as solutions
• The MDD is used as a TRIE
• To store and reTRIEve n-grams
• Example query: what is the next word of "The white cat"?
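A simplistic n-gram trie supporting this query could look like the sketch below (plain dictionaries for illustration; the actual successions constraint lives inside the MDD solver):

```python
from collections import defaultdict

class NGramTrie:
    """Store n-grams and retrieve the possible next words of a prefix."""
    def __init__(self):
        self.children = defaultdict(NGramTrie)

    def insert(self, ngram):
        node = self
        for word in ngram:
            node = node.children[word]

    def successors(self, prefix):
        node = self
        for word in prefix:
            if word not in node.children:
                return []
            node = node.children[word]
        return list(node.children)

trie = NGramTrie()
trie.insert(("the", "white", "cat", "sleeps"))
trie.insert(("the", "white", "cat", "purrs"))
print(trie.successors(("the", "white", "cat")))   # ['sleeps', 'purrs']
```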
Second idea: integrate constraints on-the-fly
• Using the first MDD (successions)
• We compile the second one (the sentence MDD)
• Constraints are checked on-the-fly
Example: The girl and the boy walked through the forest under the majestic oak trees
MDD Unfolding (top-down)
The modeling properties of MDDs lead us to solve the problem by representing each rule by an MDD and intersecting them.
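A very rough sketch of this on-the-fly unfolding, under the assumption that each rule exposes a per-prefix check (the real operators work on MDD states and keep the result compressed, not as an explicit list):

```python
def unfold(trie, checks, n_words):
    """Compile candidate sentences level by level (one level per word).

    `trie` provides the n-gram successions; `checks` is a list of functions
    prefix -> bool that prune a partial sentence as soon as a rule (length,
    lexicon, ...) can no longer be satisfied.
    """
    frontier = [()]                                   # level 0: empty prefix
    for _ in range(n_words):
        next_frontier = []
        for prefix in frontier:
            # Successors of the last two words, i.e. 3-gram successions.
            for word in trie.successors(prefix[-2:]):
                candidate = prefix + (word,)
                if all(check(candidate) for check in checks):
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frontier                # prefixes of exactly n_words words
```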
Modeling: from rules to MDDs
• 9 to 15 words → MDD Universel: an arc is a word
• Language restriction (3000 lemmas) → MDD Lexique: an arc is a lemma
• 59 characters → MDD Size: an arc is the number of characters of a word, a state is the running sum
• Corpus → MDD Corpus: an arc is a word, a state is a k-gram
Intersection: from MDDs to sentences
The four MDDs (MDD Universel, MDD Lexique, MDD Taille/Size, MDD Corpus), corresponding to the 9-to-15-word, lemma-restriction, 59-character and corpus rules, are intersected.
Example of partial paths during the intersection: "# Le", "Le sac".
The intersection of the MDDs gives: Le sac noir ("The black bag").
Third idea: use an LLM to select the best sentences
LLM sentence scoring: perplexity
• Transformers (very large context window)
• Perplexity is derived from Shannon entropy
• It quantifies the uncertainty of a model with respect to a sample
• The lower the better; the range is [1, +inf)
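Concretely, perplexity is exp of the average negative log-likelihood of the tokens, PPL = exp(-(1/N) Σ_i log p(t_i | t_<i)). A minimal GPT-2 scoring sketch with the Hugging Face transformers library (model choice and batching kept trivial; this is not the pylia setup used in the experiments):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence):
    """Perplexity of one sentence: exp of the mean cross-entropy per token."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss     # mean negative log-likelihood
    return torch.exp(loss).item()

candidates = [
    "The two men looked at each other in a state of stupefaction",
    "The wolves had for the most part wholly ignorant of warfare",
]
# Rank the generated sentences, lowest perplexity (most fluent) first.
for s in sorted(candidates, key=perplexity):
    print(f"{perplexity(s):8.2f}  {s}")
```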
Experimental conditions
• Input: 443 books from the youth category (FR)
• Input: 75 books from the fiction category (EN)
• Evaluation:
  • MNREAD candidate sentence set (syntax and meaning correct)
  • Ineligible sentence set (syntax and/or meaning problems)
• Software & Hardware:
  • The model is implemented in Java 17 in an MDD solver (MDDLib) at I3S.
  • The LLM used to rank the sentences is GPT-2.
  • Machine: Ubuntu 18.04, Intel(R) Xeon(R) Gold 5222 @ 3.80GHz CPU, 256 GB RAM.
Do we have sentences? Sentences generated with 3-grams
• With 1% of the corpus, i.e. 63,000 sentences, 3-grams yield 9,899 sentences, for example:
• J'aimerais bien que le soleil commence à se rendre au salon ("I wish the sun would start coming into the living room")
• Ils sont morts et les yeux sur le nom de ce que vous croyez
• Mes yeux se posent sur le nom qui lui a dit que vous croyez
• Ses mains n'étaient pas de sa mère dans ses bras et le même
• Aucun de ses pieds nus sur les yeux de ce qui ne se passera ("None of his bare feet on the eyes of what will happen")
• L'expression a pris un coup de poing et de leur sort demain
• Y en a pas de nous préparer à tout bout de sa petite bouche
• J'en ai dit que si je vous en emparez et vous ne pouvez pas
• Entrez là et tu as de ma part de sa main dans le monde voit
• Bien que je ne veux pas que les yeux de ce que ça me plaira
• These sentences are not admissible: with 3-grams, the large majority of the generated sentences have problems of meaning and syntax.
Are MNREAD sentences generated?
• YES! With 5-grams and 443 books (FR), we generate thousands of sentences (7,028).
• YES! With 5-grams and 75 books (EN), we generate hundreds of sentences (204).
Performance analysis
MNREAD sentence generation:
• FR: 443 books (3 GB), 72 s, 7,028 sentences
• EN: 75 books (well under 1 GB), 3 s, 204 sentences
LLM scoring:
• Scoring takes roughly 1 hour for 7,000 sentences with GPT-2 (pylia).
• Scoring takes roughly 30 minutes for 7,000 sentences with GPT-3 (OpenAI cloud).
• Recent benchmark: 569.77 ms for 15 tokens (37.98 ms per token), roughly one sentence, with llama.cpp (comparable to GPT-3) (pylia).
Discussion
• Select sentences by using GPT-2 or a similar generative model.
• Examples (everybody can have a personal opinion about the scores!):
  • Very good: The two men looked at each other in a state of stupefaction (10)
  • Moderately good: The wolves had for the most part wholly ignorant of warfare (270)
  • Bad: The farmer sat down on the Museum steps except the nice one (930)
  • Poetic (medium): Il est tombé dans le vide avec une sorte de douceur absente (100) ("He fell into the void with a kind of absent gentleness")
  • Complex:
    • The aircraft will be as common as I can to hinder their way (380)
    • The depth was very great and it seemed to me to do as I did (97)
Discussion
• Scores are related to frequency of occurrence:
  • The wolves had for the most part wholly ignorant of warfare (272)
  • Replacing words with more frequent ones lowers the score:
  • The wolves had for the most part completely ignored the war (90)
English Ranking
• The two men looked at each other in a state of stupefaction: 10.47
• The wolves had for the most part wholly ignorant of warfare: 272
New Constraints
• Generating sentences of exactly ten words (or 11, 12, ...): no problem
• Changing the level of vocabulary: no problem
• Modifying the size: no problem
• Other constraints: be careful with the combinatorics; if the main constraints are relaxed, the number of solutions explodes!
Conclusion
• Promising method: better suited to handling constraints than generic methods (e.g., GPT, BERT) and more flexible than the ad hoc method of Mansfield et al. [3].
• Advantages: modularity (rules are easy to add and/or remove), constraints are taken into account at generation time, potentially applicable to other languages.
• Perspectives: a perplexity constraint.
Thanks.