Repository performance tuning
Jukka Zitting | Senior Developer

Agenda
  • Performance tuning steps
  • Repository internals
  • Basic content access
  • Batch processing
  • Clustering
  • Query performance
  • Full text indexing
  • Questions and answers

Performance tuning steps
  • Step 1: Identify the symptom
      • Create a test case that consistently measures current performance
      • Define the performance target if current level unacceptable
      • Make sure that the test case and the target performance are really relevant
  • Step 2: Identify the cause
      • Main suspects: Hardware, Repository, Application, Client
      • Revise the test case until the problem no longer occurs; for example: Selenium, JMeter, JUnit, Iometer
  • Step 3: Identify/implement possible solutions
      • Change content, configuration, code or upgrade hardware
  • Step 4: Verify results
      • If target not reached, iterate the process or revise the goal
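
Step 1 asks for a test case that measures current performance consistently and compares it against a target. A minimal sketch of such a harness follows. The class `PerfCheck`, its method names, and the placeholder workload are all hypothetical and are not part of Jackrabbit or any of the tools named above; a real test would replace the workload with the repository operation under investigation.

```java
import java.util.Arrays;

// Hypothetical helper (an assumption for illustration, not a Jackrabbit API):
// times a repeatable operation and compares the median against a target.
final class PerfCheck {

    /**
     * Runs the operation `warmup` times unmeasured, then `runs` times measured,
     * and returns the median duration in nanoseconds (upper median if `runs` is even).
     */
    static long medianNanos(Runnable operation, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            operation.run(); // let caches and the JIT settle before measuring
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            operation.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    /** True if the measured median is at or below the target. */
    static boolean meetsTarget(long medianNanos, long targetNanos) {
        return medianNanos <= targetNanos;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a page read, a query, etc.
        Runnable workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        long median = medianNanos(workload, 5, 21);
        System.out.println("median ns: " + median
                + ", meets 50 ms target: " + meetsTarget(median, 50_000_000L));
    }
}
```

Using the median rather than the mean keeps a single slow outlier (a garbage collection pause, for instance) from dominating the result, which helps with the "consistently measures" requirement.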

Repository internals
  • Data Store
  • Persistence Manager
  • Query Index
  • Cluster Journal

Data Store
  • Content-addressed storage for large binary properties
      • Arbitrarily sized binary streams
      • Addressed by MD5 hash
      • String properties not included, use UTF-8 to map to binary
  • Fast delivery of binary content
      • Read directly from disk
      • Can also be read in ranges
  • Improved write throughput
      • Multiple uploads can proceed concurrently (within hardware limits)
      • Cheap copies
  • Garbage collection used to reclaim disk space
  • Logically shared by the entire cluster
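
The idea of content-addressed storage can be sketched in a few lines: the identifier of a record is the hash of its bytes, so storing the same content twice yields the same identifier and only one stored record, which is what makes copies cheap. The class `ContentAddressedStoreSketch` below is a hypothetical in-memory illustration and not the Jackrabbit data store implementation; it uses MD5 only because the slide names it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory illustration of content addressing (not Jackrabbit code).
final class ContentAddressedStoreSketch {
    private final Map<String, byte[]> records = new HashMap<>();

    /** Stores the bytes under their MD5 hash and returns that hash as the identifier. */
    String put(byte[] content) {
        String id = md5Hex(content);
        records.putIfAbsent(id, content.clone()); // identical content is stored only once
        return id;
    }

    /** Returns a copy of the stored bytes, or null if the identifier is unknown. */
    byte[] get(String id) {
        byte[] stored = records.get(id);
        return stored == null ? null : stored.clone();
    }

    int recordCount() {
        return records.size();
    }

    /** Lower-case hexadecimal MD5 digest of the given bytes. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ContentAddressedStoreSketch store = new ContentAddressedStoreSketch();
        String first = store.put("hello".getBytes(StandardCharsets.UTF_8));
        String second = store.put("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(first.equals(second) + " " + store.recordCount());
    }
}
```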

Cluster Journal
  • Journal of all persisted changes in the repository
      • Content changes
      • Namespace, nodetype registrations, etc.
  • Used to keep all cluster nodes in sync
      • Observation events to all cluster nodes (see JackrabbitEvent.isExternal)
      • Search index updates
      • Internal cache invalidation
  • Old events need to be discarded eventually
      • No notable performance impact, just extra disk space
      • Keep events for the longest possible time a node can be offline without getting completely recreated
  • Logically shared by the entire cluster
      • Writes synchronized over the entire cluster

Persistence Manager
  • Identifier-addressed storage for nodes and properties
      • Each node has a UUID, even if not mix:referenceable
      • Essentially a key-value store, even when backed by an RDBMS
      • Also keeps track of node references
  • Bundles as units of content
      • Bundle = UUID, type, properties, child node references, etc.
      • Only large binaries stored elsewhere in the data store
      • Designed for balanced content hierarchies, avoid too many child nodes
  • Atomic updates
      • A save() call persists the entire transient space as a single atomic operation
  • One PM per workspace (and one for the shared version store)
  • Logically (often also physically) shared across a cluster

Query Index
  • Inverse index based on Apache Lucene
      • Flexible mapping from terms to node identifiers
      • Special handling for the path structure
  • Mostly synchronous index updates
      • Long full text extraction tasks handled in background
      • Other cluster nodes will update their indexes at next cluster sync
  • Everything indexed by default
      • Indexing configuration for tweaking functionality, performance and disk usage
  • One index per workspace (and one for the shared version store)
  • Not shared across a cluster, indexes are local to each cluster node
  • See http://wiki.apache.org/jackrabbit/Search#Search_Configuration
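
The "mapping from terms to node identifiers" can be illustrated with a toy index. The sketch below is a hypothetical, heavily simplified stand-in and not Lucene or Jackrabbit code: it lower-cases text, splits on non-word characters, and records which node identifiers contain each term. A lookup then costs roughly as much as the number of matching nodes, independent of how many nodes were indexed in total.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index (an assumption for illustration, not Lucene or Jackrabbit code).
final class InvertedIndexSketch {
    // term -> identifiers of the nodes whose text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Adds every term of the text to the postings for the given node identifier. */
    void index(String nodeId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(nodeId);
            }
        }
    }

    /** Returns the identifiers of all nodes containing the term (empty set if none). */
    Set<String> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.index("n1", "Repository performance tuning");
        index.index("n2", "Query performance");
        System.out.println(index.lookup("performance"));
    }
}
```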

Agenda
  • Performance tuning steps
  • Repository internals
  • Basic content access
  • Batch processing
  • Clustering
  • Query performance
  • Indexing configuration
  • Questions and answers

Basic content access
  • Very fast access by path and ID
      • Underlying storage addressed by ID, but path traversal is in any case needed for ACL checks
  • Relevant caches:
      • Path to ID map (internal structure, not configurable)
      • Item state caches (automatically balanced, configurable for special cases)
      • Bundle cache (default fairly low, increase for large deployments)
      • Also some PM-specific options (TarPM index, etc.)
  • Caches optimized for a reasonably sized active working set
      • Typical web access pattern: handful of key resources and a long tail of less frequently accessed content, few writes
  • Performance hit especially when updating nodes with lots of child nodes
  • FineGrainedISMLocking for concurrent, non-overlapping writes
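
Why a cache favours a small active working set can be shown with a bounded cache that evicts the least recently used entry. The slides do not describe how the repository's caches evict entries, so the LRU policy and the class `LruCacheSketch` below are assumptions made purely for illustration; the sketch only shows how frequently used keys survive while long-tail entries are dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache (eviction policy is an assumption, not taken from the slides).
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: entries are ordered from least to most recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // drop the least recently used entry once over capacity
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("/content/home", "bundle A");
        cache.put("/content/rare", "bundle B");
        cache.get("/content/home");              // the key resource is touched again
        cache.put("/content/other", "bundle C"); // evicts the least recently used entry
        System.out.println(cache.keySet());
    }
}
```

The same mechanism explains why a full tree traversal floods such a cache: every visited item counts as recently used and pushes the regular working set out.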

Example: Bundle cache configuration
    <!-- In …/repository/workspaces/${wsp.name}/workspace.xml -->
    <Workspace …>
      <PersistenceManager class="…">
        <param name="bundleCacheSize" value="8"/>
      </PersistenceManager>
    </Workspace>

Batch processing
  • Two issues: read and write
  • Reading lots of content
      • Tree traversal the best approach, but will flood caches
      • Schedule for off-peak times
      • Add explicit delay (used by the garbage collectors)
      • Use a dedicated cluster node for batch processing
  • Writing lots of content (including deleting large subtrees)
      • The entire transient space is kept in memory and committed atomically
      • Split the operation to smaller pieces
      • Save after every ~1k nodes
      • Leverage the data store if possible
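
The advice to split large writes and save after roughly every thousand nodes can be sketched as a simple loop. The interface `WriteSession` below is a hypothetical stand-in for the write side of a repository session (its `add` and `save` methods are assumptions, not the JCR API), so that the example runs without a repository. The point is that the number of unsaved changes held in memory never exceeds the batch size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batched writes (WriteSession is a stand-in, not the JCR Session).
final class BatchWriteSketch {

    /** Stand-in for the write side of a repository session. */
    interface WriteSession {
        void add(String item); // stages a change in the transient space
        void save();           // persists everything staged so far
    }

    /** Writes all items, saving after every batchSize items; returns the number of saves. */
    static int writeInBatches(WriteSession session, Iterable<String> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int saves = 0;
        for (String item : items) {
            session.add(item);
            if (++pending == batchSize) {
                session.save();
                saves++;
                pending = 0;
            }
        }
        if (pending > 0) { // persist the final, partially filled batch
            session.save();
            saves++;
        }
        return saves;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            items.add("node" + i);
        }
        WriteSession noop = new WriteSession() {
            @Override public void add(String item) { }
            @Override public void save() { }
        };
        System.out.println("saves: " + writeInBatches(noop, items, 1000));
    }
}
```

The trade-off is that the overall operation is no longer atomic: a failure midway leaves the earlier batches persisted, so the job should be written to be resumable.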

Clustering
  • Good for horizontally scaling reads
      • Practically zero overhead on read access
  • Not so good for heavy concurrent writes
      • Exclusive lock over the whole cluster
      • Direct all writes to a single master node
      • Leverage the data store
  • Note the cluster sync interval for query consistency, etc.
      • Session.refresh() can be used to force a cluster sync

Query performance
  • What’s really fast?
      • Constraints on properties, node types, full text
      • Typically O(n) where n is the number of results, vs. the total number of nodes
  • What’s pretty fast?
      • Path constraints
  • What needs some planning?
      • Constraints on the child axis
      • Sorting, limit/offset
      • Joins
  • What’s not yet available?
      • Aggregate queries (COUNT, SUM, DISTINCT, etc.)
      • Faceting

Join engine
    SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b

Indexing configuration
  • Default configuration
      • Index all non-binary properties
      • Index binary jcr:data properties (think nt:file/nt:resource)
      • Full text extraction support for all major document formats
      • Full text extraction from images, packages, etc. is explicitly disabled
      • CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc.
  • Why change the configuration?
      • Reduce the index size (by default almost as large as the PM)
      • Enable features like aggregate indexes
      • Assign boost values for selected properties to improve search result relevance

Indexing configuration
  • How to change the configuration?
      • indexing_configuration.xml file in the workspace directory
      • Referenced by the indexingConfiguration option in the workspace.xml file
      • See http://wiki.apache.org/jackrabbit/IndexingConfiguration
  • Example:
    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
                   xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include>jcr:content</include>
      </aggregate>
    </configuration>

Questions and answers