Exploring Natural Language
Processing in Ruby
Kevin Dias
Tokyo Rubyist Meetup - April 9th, 2015
Let's explore the world of natural language processing with Ruby
Developer at
Twitter: @diasks2
GitHub: diasks2
Pragmatic Segmenter
Chat Correct
Word Count Analyzer
? ? ?
Pragmatic Segmenter
A rule-based sentence boundary
detection gem that works out-of-the-box
across many languages.
What is segmentation?
Segmentation is the process of splitting a text
into segments or sentences. In other words,
deciding where sentences begin and end.
text = "Hello Tokyo Rubyists. Let's try segmentation."
segment #1: Hello Tokyo Rubyists.
segment #2: Let's try segmentation.
Why care about segmentation?
Sentence segmentation is the foundation of many common NLP tasks:
• Translation
• Machine translation
• Bitext alignment
• Summarization
• Part-of-speech tagging
• Grammar parsing
Errors in segmentation compound into errors in these other NLP tasks.
Why reinvent the wheel?
• Most segmentation libraries are built to support only English (or English plus a few other languages)
• Current solutions do not handle ill-formatted content well
• Some libraries perform really well when trained with data in a specific language and a specific domain, but what happens when your data could come from any language and/or domain?
Sentence segmentation methods
• Machine learning
• Rule-based
• Tokenize-first, group-later (e.g. Stanford CoreNLP)
How can we achieve the following in Ruby, using only the core or standard library (no gems)?
string = "Hello world. Let's try segmentation."
Desired output: ["Hello world.", "Let's try segmentation."]
Time to check your solutions
Some potential answers
• string.scan(/[^.]+[.]/).map(&:strip)
• string.scan(/(?<=\s|\A)[^.]+[.]/)
• string.split(/(?<=\.)\s*/)
• string.split(/(?<=\.)/).map(&:strip)
• string.split('.').map { |segment| segment.strip.insert(-1, '.') }
• … your answer
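To check the candidates above side by side, here is a minimal sketch that runs each one against the example string. The lambda names are made up for this illustration, and the regexes are written with their backslashes (`\s`, `\A`, `\.`) spelled out.

```ruby
string = "Hello world. Let's try segmentation."
desired = ["Hello world.", "Let's try segmentation."]

# Each candidate is wrapped in a lambda so they can all be run the same way.
# The keys are illustrative labels, not names from the talk.
candidates = {
  scan_and_strip:   ->(s) { s.scan(/[^.]+[.]/).map(&:strip) },
  scan_lookbehind:  ->(s) { s.scan(/(?<=\s|\A)[^.]+[.]/) },
  split_lookbehind: ->(s) { s.split(/(?<=\.)\s*/) },
  split_then_strip: ->(s) { s.split(/(?<=\.)/).map(&:strip) },
  split_on_period:  ->(s) { s.split('.').map { |segment| segment.strip.insert(-1, '.') } }
}

results = candidates.transform_values { |segmenter| segmenter.call(string) }

results.each do |name, segments|
  status = segments == desired ? 'ok' : 'FAIL'
  puts "#{name}: #{segments.inspect} (#{status})"
end
```

All five produce the desired output for this simple string, which is exactly why a harder input is needed to tell them apart.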
Let’s change the original string
string = "Hello from Mt. Fuji. Let's try segmentation."
Desired output: ["Hello from Mt. Fuji.", "Let's try segmentation."]
Uh oh…
string = "Hello from Mt. Fuji. Let's try segmentation."
string.scan(/[^.]+[.]/).map(&:strip)
=> ["Hello from Mt.", "Fuji.", "Let's try segmentation."]
Let’s brainstorm other edge cases
that will make our first solution fail
• abbreviations
• …
• …
• …
• …
• …
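As one possible way to handle the abbreviation case, here is a minimal sketch that protects the period after a known abbreviation before splitting. It is not how the Pragmatic Segmenter gem is implemented; the abbreviation list, the placeholder character, and the method name `segment_with_abbreviations` are all assumptions made for this example.

```ruby
# Hypothetical, deliberately tiny abbreviation list; a real rule set is far larger.
ABBREVIATIONS = %w[Mt Mr Mrs Dr St].freeze

# ONE DOT LEADER (U+2024) temporarily stands in for protected periods.
# Assumes the input text does not already contain this character.
PLACEHOLDER = "\u2024"

def segment_with_abbreviations(text)
  # 1. Swap the period after a known abbreviation for the placeholder.
  protected_text = text.gsub(/\b(#{ABBREVIATIONS.join('|')})\./) do
    "#{Regexp.last_match(1)}#{PLACEHOLDER}"
  end

  # 2. Split with the naive "everything up to a period" approach.
  # 3. Restore the protected periods in each segment.
  protected_text.scan(/[^.]+[.]/).map { |segment| segment.strip.gsub(PLACEHOLDER, '.') }
end

segments = segment_with_abbreviations("Hello from Mt. Fuji. Let's try segmentation.")
puts segments.inspect
```

This fixes the Mt. Fuji string but still breaks as soon as an abbreviation ends a sentence, which is one of the other edge cases worth brainstorming.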
Golden Rules
Currently 52 English Golden Rules covering edge cases such as:
• abbreviations
• abbreviations at the end of a sentence
• numbers
• parentheticals
• email addresses
• web addresses
• quotations
• lists
• geo coordinates
• ellipses
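A Golden Rule can be thought of as an input string paired with its expected segmentation, and a library's score as the share of rules it passes. The sketch below assumes that reading; the `GOLDEN_RULES` structure and the `golden_rule_score` helper are hypothetical names, and the two rules are simply the example strings from the earlier slides, not entries from the real rule set.

```ruby
# Two illustrative rules built from the example strings in this talk.
GOLDEN_RULES = [
  { input: "Hello world. Let's try segmentation.",
    expected: ["Hello world.", "Let's try segmentation."] },
  { input: "Hello from Mt. Fuji. Let's try segmentation.",
    expected: ["Hello from Mt. Fuji.", "Let's try segmentation."] }
].freeze

# Percentage of rules for which the segmenter returns exactly the expected array.
def golden_rule_score(rules, segmenter)
  passed = rules.count { |rule| segmenter.call(rule[:input]) == rule[:expected] }
  (100.0 * passed / rules.size).round(2)
end

# The first naive answer from earlier, wrapped in a lambda.
naive = ->(text) { text.scan(/[^.]+[.]/).map(&:strip) }

score = golden_rule_score(GOLDEN_RULES, naive)
puts "naive segmenter: #{score}%"
```

With this formula, passing 51 of 52 rules works out to 98.08%.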
Rubyists like to keep it DRY
Most researchers use either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithm.

There are limits to using these corpora:
1. The corpora may be too expensive for some people ($1,700)
2. The majority of the sentences in the corpora end with a regular word followed by a period, thus testing the same thing over and over again

"In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% of sentences end with a regular word followed by a period."

Andrei Mikheev - Periods, Capitalized Words, etc.
A comparison of segmentation libraries
Name                 Language  License    Golden Rule Score (English)  Golden Rule Score (Other Languages)  Speed†
Pragmatic Segmenter  Ruby      MIT        98.08%                       100.00%                              3.84 s
TactfulTokenizer     Ruby      GNU GPLv3  65.38%                       48.57%                               46.32 s
OpenNLP              Java      APLv2      59.62%                       45.71%                               1.27 s
Stanford CoreNLP     Java      GNU GPLv3  59.62%                       31.43%                               0.92 s
Splitta              Python    APLv2      55.77%                       37.14%                               N/A
Punkt                Python    APLv2      46.15%                       48.57%                               1.79 s
SRX English          Ruby      GNU GPLv3  30.77%                       28.57%                               6.19 s
Scalpel              Ruby      GNU GPLv3  28.85%                       20.00%                               0.13 s
† The performance test takes the 50 English Golden Rules combined into one string and runs it 100 times through each library. The number is an average of 10 runs.
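The timing procedure in the footnote (100 passes over one combined string, averaged over 10 runs) could be reproduced with the standard library's Benchmark module along these lines. This is a sketch, not the script used for the table: `average_runtime` is a hypothetical helper, the naive lambda stands in for a real library, and the combined string is built from two example sentences rather than the actual rules.

```ruby
require 'benchmark'

# Time `iterations` passes over the text, repeat that `runs` times,
# and return the mean wall-clock time of one run in seconds.
def average_runtime(segmenter, text, iterations: 100, runs: 10)
  timings = Array.new(runs) do
    Benchmark.realtime { iterations.times { segmenter.call(text) } }
  end
  timings.sum / runs
end

# Stand-in segmenter and stand-in input (assumptions for this example).
naive = ->(text) { text.scan(/[^.]+[.]/).map(&:strip) }
combined = [
  "Hello world. Let's try segmentation.",
  "Hello from Mt. Fuji. Let's try segmentation."
].join(' ')

seconds = average_runtime(naive, combined)
puts format('%.6f s', seconds)
```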
The Holy Grail
Golden Rule #18: A.M. / P.M. as non sentence boundary and sentence boundary
At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.
=> ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
All tested segmentation libraries failed this spec
Chat Correct
A Ruby gem that shows the errors
and error types when a correct
English sentence is diffed with an
incorrect English sentence.
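To give a feel for what diffing a correct sentence against an incorrect one involves, here is a minimal word-level diff based on a longest common subsequence table. It is a simplified sketch and not Chat Correct's algorithm; the `word_diff` and `colorize` helpers and the `:ok` / `:unnecessary` / `:missing` tags are made up for this example.

```ruby
# Compare the words of an incorrect sentence with those of its correction.
# Returns [tag, word] pairs, where tag is :ok, :unnecessary (only in the
# incorrect sentence) or :missing (only in the correct sentence).
def word_diff(correct, incorrect)
  a = incorrect.split
  b = correct.split

  # lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  lcs = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      lcs[i][j] = a[i] == b[j] ? lcs[i + 1][j + 1] + 1 : [lcs[i + 1][j], lcs[i][j + 1]].max
    end
  end

  diff = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      diff << [:ok, a[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      diff << [:unnecessary, a[i]]
      i += 1
    else
      diff << [:missing, b[j]]
      j += 1
    end
  end
  diff.concat(a[i..].map { |word| [:unnecessary, word] })
  diff.concat(b[j..].map { |word| [:missing, word] })
end

# ANSI colors: red for words to drop, green for words to add.
COLORS = { unnecessary: 31, missing: 32 }.freeze

def colorize(diff)
  diff.map { |tag, word| COLORS[tag] ? "\e[#{COLORS[tag]}m#{word}\e[0m" : word }.join(' ')
end

diff = word_diff('I went to the store yesterday', 'I goed to store yesterday')
puts colorize(diff)
```

A plain word diff like this only says which words differ; classifying the error type (verb form, missing article, and so on) needs further rules on top.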
The problem
I was giving a weekly Skype English lesson, and the student was focusing on writing practice for the TOEFL test.
I would correct the student's sentence, but he often seemed to miss some of my corrections, even if I read them with a LOT OF STRESS!
The idea
A color-coded way to PoInT OuT a student's mistake(s)
The solution
Chat Correct
Word Count Analyzer
Analyzes a string for potential areas
of the text that might cause word
count discrepancies depending on
the tool used.
The problem
• Translation is typically billed on a per-word basis
• Different tools often report different word counts
I wanted to understand what was causing these differences in word count.
Word count gray areas
Common word count gray areas include:
• Ellipses
• Hyperlinks
• Contractions
• Hyphenated Words
• Dates
• Numbers
• Numbered Lists
• XML and HTML tags
• Forward slashes and backslashes
• Punctuation
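To see how these gray areas pull word counts apart, the sketch below counts the same string with three different definitions of a "word". The three counting rules and the sample text are assumptions chosen for illustration; they are not the rules of any particular tool or of the Word Count Analyzer gem.

```ruby
# One sentence containing a hyperlink, a date, an ellipsis,
# a contraction and a hyphenated word.
text = "Visit http://example.com/docs on 4/9/2015... it's well-known!"

counters = {
  # Anything separated by whitespace is a word.
  whitespace_tokens: ->(s) { s.split.size },
  # Every run of letters, digits or underscores is a word.
  word_character_runs: ->(s) { s.scan(/\w+/).size },
  # Only runs of letters count; inner apostrophes and hyphens keep a word together.
  letters_with_inner_punctuation: ->(s) { s.scan(/[[:alpha:]]+(?:['-][[:alpha:]]+)*/).size }
}

word_counts = counters.transform_values { |counter| counter.call(text) }
word_counts.each { |rule, count| puts "#{rule}: #{count}" }
```

The three rules report 6, 13 and 8 words for the same ten-or-so "words" a human would see, which is the kind of discrepancy that matters when work is billed per word.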
Visualize the gray areas
Word Count Analyzer
? ? ?
A bitext alignment (aka parallel text
alignment) tool with a focus on high
accuracy
What’s it used for?
• Translation memory
• Machine translation
Bitext alignment
Current commercial state-of-the-art
• Gale-Church sentence-length information plus a dictionary if available (e.g. hunalign)
Areas for improvement
• Early misalignment compounds into errors throughout
• Accuracy may suffer for non-Roman languages unless the algorithm is properly tuned
• Does not handle crossing alignments or uneven alignments
A method for higher accuracy
• Machine translate A → B and B → A
• Relative sentence length
• Order or position in the document
[Figure: alignment grid plotting segments 0-5 of one text against segments 0-5 of the other, with an X marking each aligned pair]
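One way to combine the three signals listed above is a weighted score for every source/target pair, keeping the best-scoring target for each source segment. The sketch below is only an interpretation of the slide, not the tool's actual method: the weights, the helper names, and the `translate` lambda (a lookup table standing in for a real machine translation call) are all hypothetical, and only the A → B direction is scored.

```ruby
# Jaccard overlap between the word sets of two sentences (0.0 to 1.0).
def word_overlap(a, b)
  words_a = a.downcase.scan(/[[:alnum:]]+/).uniq
  words_b = b.downcase.scan(/[[:alnum:]]+/).uniq
  return 0.0 if words_a.empty? || words_b.empty?

  (words_a & words_b).size.to_f / (words_a | words_b).size
end

# Compare each sentence's share of its own document's total length.
def relative_length_similarity(sentence, doc, candidate, candidate_doc)
  share_a = sentence.length.to_f / doc.sum(&:length)
  share_b = candidate.length.to_f / candidate_doc.sum(&:length)
  1.0 - (share_a - share_b).abs
end

# 1.0 when both segments sit at the same relative position in their documents.
def position_similarity(i, source_size, j, target_size)
  1.0 - (i.to_f / source_size - j.to_f / target_size).abs
end

# Returns [source_index, target_index] pairs. Weights are arbitrary assumptions.
def align(source, target, translate:, weights: { overlap: 0.6, length: 0.2, position: 0.2 })
  source.each_with_index.map do |sentence, i|
    translated = translate.call(sentence)
    scores = target.each_with_index.map do |candidate, j|
      weights[:overlap] * word_overlap(translated, candidate) +
        weights[:length] * relative_length_similarity(sentence, source, candidate, target) +
        weights[:position] * position_similarity(i, source.size, j, target.size)
    end
    [i, scores.each_with_index.max_by { |score, _j| score }.last]
  end
end

# Stand-in "machine translation": a fixed lookup table (assumption for this example).
FAKE_MT = {
  'こんにちは。' => 'Hello.',
  '東京は大きい。' => 'Tokyo is big.',
  'ありがとう。' => 'Thank you.'
}.freeze

source = ['こんにちは。', '東京は大きい。', 'ありがとう。']
target = ['Hello.', 'Thank you.', 'Tokyo is big.'] # last two segments are swapped

alignment = align(source, target, translate: ->(sentence) { FAKE_MT.fetch(sentence) })
puts alignment.inspect
```

Because every pair is scored independently, the swapped segments are still matched correctly; the price is one translation call per segment, which is where the speed and privacy trade-offs come from.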
The trade-offs
Pros
• better accuracy
• can handle crossing alignments
• can handle uneven segment matches (1 to 2, 2 to 1, 1 to 3, 3 to 1, 2 to 3, and 3 to 2)
Cons
• slower
• potential data privacy issues (depending on the method used to obtain the machine translation)
Small framework for thinking about new
problems
Step 1
Use your ignorance as a weapon to think about a problem from first principles (you aren't yet weighed down with any bias).
Step 2
Do your research.
Step 3
Diff your conceptual framework and your research. Look at where it diverges and try to understand why.
Has tech changed/advanced? Were you missing something?
Ruby NLP Resources
https://github.com/diasks2/ruby-nlp
