Retrieval Augmented Generation in Practice
Scalable GenAI platforms with k8s, LangChain, HuggingFace and Vector
DBs
Mihai Criveti, Principal Architect, CKA, RHCA III
September 7, 2023
Large Language Models and their Limitations
Retrieval Augmented Generation or Conversation with your Documents
Introduction
Mihai Criveti, Principal Architect, Platform Engineering
• Responsible for large scale Cloud Native and AI Solutions
• Red Hat Certified Architect III, CKA/CKS/CKAD
• Driving the development of Retrieval Augmented Generation platforms and
solutions for Generative AI at IBM that leverage WatsonX, Vector databases,
LangChain, HuggingFace and open source AI models.
Abstract
• Large Language Models: use cases and limitations
• Scaling Large Language Models: Retrieval Augmented Generation
• LLAMA2, HuggingFace TGIS, SentenceTransformers, Python, LangChain, Weaviate,
ChromaDB vector databases, deployment to k8s
Large Language Models and
their Limitations
GenAI and Large Language Models Explained
Think of LLMs like mathematical functions, or your phone’s autocomplete
f(x) = x'
• Where the input (x) and the output (x') are strings
A more accurate representation
f(training_data, model_parameters, input_string) = output_string
• training_data represents the data the model was trained on.
• model_parameters represent things like “temperature”
• input_string is the combination of prompt and context you give to the model. Ex:
“What is Kubernetes” or “Summarize the following document: ${DOCUMENT}”
• the ‘prompt’ is usually an instruction like “summarize”, “extract”,
“translate”, or “classify”, but more complex prompts are common: “Be a helpful
assistant that responds to my question…”, etc.
• The function can process a maximum of TOKEN_LIMIT tokens (total input and output),
usually ~4096 tokens (~3000 words in English; fewer in, say, Japanese).
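The “LLM as a function” view above can be sketched in a few lines of Python. This is a toy model, not a real LLM: the TOKEN_LIMIT value and the 4-characters-per-token estimate are the rough figures from the slide, and the canned output is just a placeholder.

```python
# Toy illustration: string in, string out, with a shared input+output token budget.
TOKEN_LIMIT = 4096

def estimate_tokens(text: str) -> int:
    """Rough rule of thumb: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def llm(training_data: str, model_parameters: dict, input_string: str) -> str:
    """f(training_data, model_parameters, input_string) = output_string."""
    if estimate_tokens(input_string) >= TOKEN_LIMIT:
        raise ValueError("prompt alone exceeds the token budget")
    # A real model would generate text here; we just echo a canned answer.
    return f"[answer generated at temperature={model_parameters.get('temperature', 0.7)}]"

print(llm("", {"temperature": 0.2}, "What is Kubernetes?"))
```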
What Large Language Models DON’T DO
Learn
A model will not ‘learn’ from interactions (unless specifically trained/fine-tuned).
Remember
A model doesn’t remember previous prompts. In fact, it’s all done with prompt trickery:
previous prompts are injected. The API does a LOT of filtering and heavy lifting!
Reason
Like your phone’s autocomplete, an LLM doesn’t reason or do math.
Use your data
LLMs don’t provide responses based on YOUR data (databases or files), unless it’s
included in the training dataset, or the prompt (ex: RAG).
Use the Internet
• An LLM doesn’t have the capacity to ‘search the internet’, or make API calls.
• In fact, a model does not perform any activity other than converting one string of text
into another string of text.
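The “Remember” point above — the model is stateless, and “memory” is just previous turns re-injected by the calling code — can be shown with a minimal sketch. The prompt template here is a made-up example, not any particular API’s format.

```python
# "Conversational memory" is prompt trickery: every call re-sends
# the full history, because the model itself remembers nothing.
history: list[str] = []

def build_prompt(user_message: str) -> str:
    history.append(f"User: {user_message}")
    # Concatenate all previous turns into the next prompt.
    return "Be a helpful assistant.\n" + "\n".join(history) + "\nAssistant:"

p1 = build_prompt("What is Kubernetes?")
p2 = build_prompt("Does it scale?")
# The second prompt contains the first question verbatim.
```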
GenAI Use Cases
Figure 1: Use Cases
Think of adding this to your architecture
LLMs are really slow
In fact, even a 9600 baud modem is much faster. Think teletype!
• With WPM = ((BPS / 10) / 5) * 60, a 9600 baud modem will generate 11,520 words /
minute.
• At an average 30 tokens / second (≈20 words / second) for LLAMA-70B, you’re getting
1,200 words / minute!
• This is slower than a punch card reader :-)
LLMs are also expensive to run
• Running your own LLAMA2 70B might cost as much as $20K / month if you’re using a
dedicated GPU instance!
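The modem comparison above is simple arithmetic, easy to check:

```python
def modem_wpm(bps: int) -> float:
    # 10 bits per character (8 data bits + start/stop), 5 characters per word.
    return bps / 10 / 5 * 60

def llm_wpm(tokens_per_second: float, words_per_token: float = 20 / 30) -> float:
    # ~30 tokens/s corresponds to roughly 20 words/s (the slide's LLAMA-70B figure).
    return tokens_per_second * words_per_token * 60

print(modem_wpm(9600))   # 11520.0 words / minute
print(llm_wpm(30))       # ~1200 words / minute
```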
Model Limitations in Practice
Latency and Bandwidth: Tokens per second
• Large models (70B) such as LLAMA2 can be painfully slow
• Smaller models (20B, 13B, 7B) are faster, and can perform inference on a cheaper
GPU (less VRAM)
• Even so, models will perform at anywhere between 10 and 100 tokens / second.
Token Limits
• Models have a token limit as well
• Usually 4096 tokens (roughly ~3000 words) of total input and output
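Because input and output share one budget, the room left for generation is easy to compute. A one-liner, using the typical 4096-token limit from above:

```python
TOKEN_LIMIT = 4096

def max_new_tokens(prompt_tokens: int, limit: int = TOKEN_LIMIT) -> int:
    """Input and output share one budget: a long prompt leaves less room to generate."""
    return max(0, limit - prompt_tokens)

print(max_new_tokens(3000))  # 1096 tokens left for the answer
```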
Getting LLMs to work with our data
Training
• Very Expensive, takes a long time
Fine Tuning
• Expensive, takes considerable time as well, but achievable
Retrieval Augmented Generation
• Insert your data into prompts every time
• Cheap, and can work with vast amounts of data
• While LLMs are SLOW, Vector Databases are FAST!
• Can help overcome model limitations (such as token limits) - as you’re only feeding
‘top search results’ to the LLM, instead of whole documents.
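The “only feed top search results to the LLM” idea can be sketched end to end. The scoring here is naive keyword overlap so the example stays self-contained; a real system would use sentence embeddings and a vector database such as Weaviate or ChromaDB.

```python
# Minimal RAG sketch: retrieve the most relevant chunks, inject only those
# into the prompt, and keep the prompt under the token limit.
def score(query: str, chunk: str) -> int:
    # Toy relevance: count of shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Kubernetes schedules containers across a cluster.",
    "Weaviate is a vector database.",
    "Bananas are rich in potassium.",
]
prompt = build_rag_prompt("What does Kubernetes schedule?", docs)
```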
Retrieval Augmented
Generation or Conversation with
your Documents
RAG Explained
Figure 3: RAG Explained
Loading Documents
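Loading usually means splitting documents into overlapping chunks before embedding them. A plain-Python sketch — the chunk sizes are illustrative, and LangChain’s text splitters do this with more care (sentence boundaries, separators):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap, so content
    cut at one boundary still appears whole in a neighbouring chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
# 500 chars with step 150 -> windows starting at 0, 150, 300, 450
```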
Scaling factors for RAG
• Vector Database: consider sharding and High Availability
• Fine Tuning: collecting data to be used for fine tuning
• Governance and Model Benchmarking: how are you testing your model performance
over time, with different prompts, one-shot, and various parameters
• Chain of Reasoning and Agents
• Caching embeddings and responses
• Personalization and Conversational Memory Database
• Streaming Responses and optimizing performance. A fine-tuned 13B model may
perform better than a poor 70B one!
• Calling 3rd party functions or APIs for reasoning or other types of data (ex: LLMs are
terrible at reasoning and prediction, consider calling other models)
• Fallback techniques: fallback to a different model, or default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including
embeddings), generating your API using OpenAPI, etc.
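The “caching embeddings” point above is often just a few lines. A hedged sketch: the `embed` function here is a stand-in for a real model call (e.g. SentenceTransformers `encode`), replaced with a dummy vector so the example is self-contained.

```python
from functools import lru_cache

calls = 0  # counts how often the "model" is actually invoked

@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple[float, ...]:
    """Stand-in for a real embedding call. Caching matters because
    embedding the same chunk twice wastes GPU time."""
    global calls
    calls += 1
    return (float(len(text)), float(sum(map(ord, text))))

embed("kubernetes")
embed("kubernetes")  # served from the cache; no second model call
```

For a shared deployment you would back this with an external cache (e.g. a key-value store keyed by a hash of the chunk) rather than an in-process `lru_cache`.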
Contact
This talk can be found on GitHub
• https://github.com/crivetimihai/shipitcon-scaling-retrieval-augmented-generation
Social media
• https://twitter.com/CrivetiMihai - follow for more LLM content
• https://youtube.com/CrivetiMihai - more LLM videos to follow
• https://www.linkedin.com/in/crivetimihai/