Practical NLP with Lisp

   Vsevolod Dyomkin
       Grammarly
Topics

*   Overview of NLP practice
*   Getting Data
*   Using Lisp: pros & cons
*   A couple of examples
A bit about Grammarly




        (c) xkcd
An example of what
   we deal with
NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy

D - development work:
implement the algorithm as an API with
sufficient performance and scaling
characteristics
Research
1. Set a goal
Business goal:

* Develop the best / a good enough / a better than
Word/etc. spellchecker

* Develop a set of grammar rules that will
catch errors according to MLA Style

* Develop a thesaurus that will produce
synonyms relevant to context
Translate it to a measurable goal
* On a test corpus of 10000 sentences with
common errors, achieve a smaller number of FNs
(and FPs) than other spellcheckers/Word
spellchecker/etc.

* On a corpus of examples of sentences with
each kind of error (and similar sentences
without this kind of error) find all
sentences with errors and do not find
errors in correct sentences

* On a test corpus of 1000 sentences
suggest synonyms for all meaningful words
that will be considered relevant by human
linguists in 90% of the cases
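
As a rough illustration of the second goal above, the sketch below scores an error detector on a small labelled corpus by counting false negatives and false positives. The `naive_detector` function and the three example sentences are made up for illustration; they are not taken from any real spellchecker.

```python
from typing import Callable, Dict, Iterable, Tuple

def count_errors(corpus: Iterable[Tuple[str, bool]],
                 detector: Callable[[str], bool]) -> Dict[str, int]:
    """Count TP/FP/FN/TN of a detector over (sentence, has_error) pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for sentence, has_error in corpus:
        flagged = detector(sentence)
        if flagged and has_error:
            counts["tp"] += 1      # error present and found
        elif flagged and not has_error:
            counts["fp"] += 1      # correct sentence wrongly flagged
        elif not flagged and has_error:
            counts["fn"] += 1      # error present but missed
        else:
            counts["tn"] += 1      # correct sentence left alone
    return counts

# Hypothetical toy detector: flags any sentence containing the token "teh".
def naive_detector(sentence: str) -> bool:
    return "teh" in sentence.lower().split()

corpus = [
    ("I saw teh cat.", True),
    ("I saw the cat.", False),
    ("Their going home.", True),   # an error this toy detector misses
]
counts = count_errors(corpus, naive_detector)
print(counts)
```

Pairing each erroneous sentence with a similar correct one, as the goal suggests, is what makes the FP count meaningful.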
A Note on
       Terminology
FN and FP rates instead of
precision (P) and recall (R)

FN = 1-R (miss rate)
FP = 1-P (strictly, the false discovery rate)
F1 = 2*P*R/(P+R) =
2*(1-FN-FP+FN*FP)/(2-(FN+FP))
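
To make the relation between the two vocabularies concrete, here is a small sketch that computes precision, recall and F1 from raw counts, using the standard definition F1 = 2*P*R/(P+R), and checks that the same value comes out when F1 is written in terms of FN = 1-R and FP = 1-P. The function names and the counts are illustrative assumptions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def f1_from_rates(fn_rate: float, fp_rate: float) -> float:
    """F1 expressed through FN = 1 - R and FP = 1 - P."""
    denominator = 2 - (fn_rate + fp_rate)
    if denominator == 0:
        return 0.0  # both rates are 1: nothing was found correctly
    return 2 * (1 - fn_rate) * (1 - fp_rate) / denominator

# Made-up counts: 80 errors found, 20 false alarms, 10 errors missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(p, r, f1)
print(f1_from_rates(fn_rate=1 - r, fp_rate=1 - p))
```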
Research contd.
2. Devise an algorithm
3. Train & improve the
algorithm

http://nlp-class.org
4. Test its performance
ML: one corpus, divided into
training, development, test

Often, different corpora are used:
* for training some part (not the
whole) of the algorithm
* for testing the whole
system
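
For the single-corpus case, a minimal sketch of a reproducible training/development/test split might look like this. The 80/10/10 proportions, the fixed seed and the `split_corpus` name are assumptions chosen for the example, not values from the talk.

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test parts."""
    items = list(sentences)
    random.Random(seed).shuffle(items)   # same seed -> same split every run
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test_set = items[:n_test]
    dev_set = items[n_test:n_test + n_dev]
    train_set = items[n_test + n_dev:]
    return train_set, dev_set, test_set

corpus = [f"sentence {i}" for i in range(100)]
train_set, dev_set, test_set = split_corpus(corpus)
print(len(train_set), len(dev_set), len(test_set))
```

Tune on the development part and touch the test part only for the final measurement, otherwise the reported accuracy is optimistic.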
Theoretical maxima
Theoretical maxima are rarely
achievable. Why?

* Because you need their
data. (And data is key)

* Domains might differ
Pre/post-processing
What ultimately matters is
not crude performance, but...

Acceptance by users (much
harder to measure & depends
on domain).

The real world is messier than
any lab set-up.
Examples of
    pre-processing
For spellcheck:

* some people write words
separated by slashes,
like: spell/grammar check

* handling of abbreviations
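
A minimal sketch of both pre-processing steps, assuming a tiny hand-written abbreviation list (`ABBREVIATIONS` is hypothetical; a real system would need a far larger one and smarter tokenization):

```python
# Hypothetical abbreviation list, lowercased for lookup.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr."}

def words_to_check(text: str) -> list:
    """Return the words a spellchecker should look at."""
    words = []
    for token in text.split():
        if token.lower() in ABBREVIATIONS:
            continue  # known abbreviation: do not spellcheck it
        # "spell/grammar" is two words joined by a slash, not one unknown word
        for part in token.split("/"):
            word = part.strip(".,;:!?()\"'")  # drop surrounding punctuation
            if word:
                words.append(word)
    return words

result = words_to_check("Run a spell/grammar check, e.g. on Dr. Smith's essay.")
print(result)
```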
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* Web1T Google N-gram Corpus
* Linguistic Data Consortium
  (http://www.ldc.upenn.edu/)
More data
Also well-known sources, but
with a twist:
* Wikipedia & Wiktionary,
DBpedia
* OpenWeb Common Crawl
(updated: 2010)
* Public APIs of some
services: Twitter, Wordnik
Obscure corpora
Academic resources:
* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...
Human-powered?


http://goo.gl/hs4qB
Beyond corpora?

* Bootstrapping
* Seeding
And remember...
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig, “The Unreasonable
Effectiveness of Data.”
http://youtu.be/yvDCzhbjYWs
Using Lisp for NLP




      (c) xkcd
Why Lisp?
Lisp is a carefully crafted
tool for:

*   Engineers
*   Practical researchers
*   Artists
*   Entrepreneurs
Some examples
*   Piano.aero
*   ITA Software
*   Secure Outcomes
*   Impromptu

* Land of Lisp
http://youtu.be/HM1Zb3xmvMc
Research
       requirements
*   Interactivity
*   Mathematical basis
*   Expressiveness
*   Agility / Malleability
*   Advanced tools
Specific NLP
     requirements
* Good support for statistics
& number-crunching (matrices)
– Statistical AI

* Good support for working
with trees & symbols
– Symbolic AI
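
To show why tree support matters, here is a sketch (in Python, for illustration) that reads a bracketed parse of the kind used in treebanks into nested lists and collects its words. In Lisp such a string is already a readable s-expression; the `parse_tree` and `leaves` helpers and the example sentence are assumptions made for this example.

```python
def parse_tree(text: str):
    """Parse "(S (NP ...) (VP ...))" into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing ")"
            return node
        return token  # a label or a word

    return read()

def leaves(tree) -> list:
    """Collect the words at the bottom of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    # tree[0] is the node label; the remaining elements are children
    return [word for child in tree[1:] for word in leaves(child)]

tree = parse_tree("(S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN cat))))")
print(tree)
print(leaves(tree))
```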
Production
       requirements
*   Scalability
*   Maintainability
*   Integrability
*   ...
...eventually

* Speed
* Speed
* Speed
Heterogeneous
        systems
You have to split the system
and communicate:

“Java” way vs. “Unix” way

* Sockets, Redis, ZeroMQ, etc
for communication
* JSON, SEXPs, etc for data
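
A minimal sketch of the "Unix" way: two components that share nothing but a socket and exchange newline-delimited JSON. Here both ends live in one process, connected by `socket.socketpair()`, so the example is self-contained; in a real system they would be separate processes, possibly written in different languages. The message fields (`text`, `tokens`) are made up for illustration.

```python
import json
import socket
import threading

def serve_one(conn: socket.socket) -> None:
    """Read one newline-terminated JSON request and answer with JSON."""
    with conn, conn.makefile("rw", encoding="utf-8") as stream:
        request = json.loads(stream.readline())
        # Hypothetical "analysis": count whitespace-separated tokens.
        reply = {"tokens": len(request["text"].split())}
        stream.write(json.dumps(reply) + "\n")
        stream.flush()

client, server = socket.socketpair()
worker = threading.Thread(target=serve_one, args=(server,))
worker.start()

with client, client.makefile("rw", encoding="utf-8") as stream:
    stream.write(json.dumps({"text": "Data is key"}) + "\n")
    stream.flush()
    response = json.loads(stream.readline())
worker.join()
print(response)
```

Swapping JSON for s-expressions only changes the two serialization calls; the framing (one message per line) stays the same.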
Lisp drawbacks
There's no equivalent of OpenNLP or SciPy, and
generally there are fewer
libraries.

But...
*   github: eslick/cl-langutils
*   github: mathematical-systems/clml
*   github: tpapp/lla
*   github: blindglobe/common-lisp-stat
*   … and http://quicklisp.org
But #2
Algorithms like the Porter stemmer:
http://tartarus.org/~martin/PorterStemmer
& http://www.cliki.net/PorterStemmer

or Soundex:
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html

become irrelevant once you have good data
More drawbacks

Lisp is a fringe language

   Not a special-purpose language
  (like R, J or Octave)
Example #1


API interaction
Example #2
Lisp FTW
* truly interactive
environment
* very flexible => DSLs
* native tree support
* fast and solid
Take-aways
* Take nlp-class

* Data is key: collect it, and build tools
to work with it easily and efficiently

* A good language for R&D should be
first of all interactive & malleable,
with as few barriers as possible

* ... it also helps if you don't need to
port your code for production

* Lisp is one of the good examples
Thanks!

Vsevolod Dyomkin
    @vseloved