Optimizing computer vision problems on mobile platforms
Looksery.com
Fedor Polyakov
Software Engineer, CIO
Looksery, INC
fedor@looksery.com
+380 97 5900009 (mobile)
www.looksery.com
Optimize algorithm first
• If your algorithm is suboptimal, “technical” optimizations won’t be as effective
as fixing the algorithm itself
• Once you change the algorithm, you’ll probably have to redo your technical
optimizations as well
SIMD operations
• Single instruction, multiple data
• On NEON: 16 × 128-bit registers (each holds up to 4 int32_t’s/floats or 2 doubles)
• Each instruction takes a bit more cycles, but operates on far more data
• Can ideally give a performance boost of up to 4× (typically ~2-3× in my
practice)
• Can be used for many image processing algorithms
• Especially useful for various linear algebra problems
Using computer vision/algebra/DSP libraries
• The easiest way - you just use a library and it does everything for you
• Eigen - a great header-only library for linear algebra
• Ne10 - a NEON-optimized library for some image processing/DSP on Android
• Accelerate.framework - lots of image processing/DSP on iOS
• OpenCV, unfortunately, is rather weakly optimized for ARM SIMD (though
~40 low-level functions were optimized in OpenCV 3.0)
• There are also some commercial libraries
• Pro: everything is done with no effort on your part
• Con: you should still profile and inspect the ASM to verify that everything
is vectorized as you expect
GCC/clang vector extensions
using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES)));
v4si x, y;
• All common operations on x are now vectorized
• Written once, for all architectures
• Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons
• Load from memory like this: x = *((v4si*)ptr);
• Store back to memory like this: *((v4si*)ptr) = x;
• The subscript operator is supported for accessing individual elements
• Not all SIMD operations are supported
• May produce suboptimal code
SIMD intrinsics
• Provide custom data types and a set of C functions to vectorize code
• Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b);
• Generally similar to the previous approach, but give you better control and
the full instruction set
• Cons:
• You have to write separate code for each platform
• In all the above approaches, the compiler may inject instructions that
could be avoided in hand-crafted code
• The compiler might generate code that doesn’t use the pipeline efficiently
Handcrafted ASM code
• Gives you the most control - you know exactly what code will be generated
• If written carefully, can sometimes be up to 2× faster than the code the
compiler generates from the previous approaches (usually only 10-15%, though)
• You need to write separate code for each architecture :(
• Has a learning curve
• Harder to write and maintain
• To get the maximum possible performance, some additional steps may be
required
Some other tricks
• Shrink data types as much as possible
• If you can change double to int16_t, you’ll get more than a 4× performance boost
• Try the pld instruction - it “hints” the CPU to load data into the caches that
will be used in the near future (available as __builtin_prefetch)
• If you use intrinsics, watch out for extra loads/stores that you may be
able to eliminate
• Use loop unrolling
• Interleave load/store instructions with arithmetic operations
• Use proper memory alignment - misalignment can cause crashes or slow performance
Some benchmarks
• Sum of matrix rows
• Matrices are 128×128, the test is repeated 10^5 times
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
for (int j = 0; j < matSize; j++) {
rowSum[j] += testMat[i][j];
}
}
// Vectorized code
for (int i = 0; i < matSize; i++) {
for (int j = 0; j < matSize; j += vectorSize) {
VectorType x = *(VectorType*)(testMat[i] + j);
VectorType y = *(VectorType*)(rowSum + j);
y += x;
*(VectorType*)(rowSum + j) = y;
}
}
Some benchmarks
Tested on iPhone 5; results on other phones are much the same
[Bar chart: time in seconds (0-10), Simple vs Vectorized, for int, float, and short]
More than a 2× performance boost - mission accomplished?
Some benchmarks
[Bar chart: time in seconds (0-10), Simple vs Vectorized vs Loop unroll, for int, float, and short]
Got another ~15%
for (int i = 0; i < matSize; i++) {
auto ptr = testMat[i];
for (int j = 0; j < matSize; j += 4 * xSize) {
auto ptrStart = ptr + j;
VT x1 = *(VT*)(ptrStart + 0 * xSize);
VT y1 = *(VT*)(rowSum + j + 0 * xSize);
y1 += x1;
VT x2 = *(VT*)(ptrStart + 1 * xSize);
VT y2 = *(VT*)(rowSum + j + 1 * xSize);
y2 += x2;
VT x3 = *(VT*)(ptrStart + 2 * xSize);
VT y3 = *(VT*)(rowSum + j + 2 * xSize);
y3 += x3;
VT x4 = *(VT*)(ptrStart + 3 * xSize);
VT y4 = *(VT*)(rowSum + j + 3 * xSize);
y4 += x4;
*(VT*)(rowSum + j + 0 * xSize) = y1;
*(VT*)(rowSum + j + 1 * xSize) = y2;
*(VT*)(rowSum + j + 2 * xSize) = y3;
*(VT*)(rowSum + j + 3 * xSize) = y4;
}
}
Some benchmarks
Let’s take a look at the profiler
Some benchmarks
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
for (int j = 0; j < matSize; j++) {
rowSum[i] += testMat[j][i];
}
}
// Vectorized, loop-unrolled code
for (int i = 0; i < matSize; i+=4 * xSize) {
VT y1 = *(VT*)(rowSum + i);
VT y2 = *(VT*)(rowSum + i + xSize);
VT y3 = *(VT*)(rowSum + i + 2*xSize);
VT y4 = *(VT*)(rowSum + i + 3*xSize);
for (int j = 0; j < matSize; j ++) {
VT x1 = *(VT*)(testMat[j] + i);
VT x2 = *(VT*)(testMat[j] + i + xSize);
VT x3 = *(VT*)(testMat[j] + i + 2*xSize);
VT x4 = *(VT*)(testMat[j] + i + 3*xSize);
y1 += x1;
y2 += x2;
y3 += x3;
y4 += x4;
}
*(VT*)(rowSum + i) = y1;
*(VT*)(rowSum + i + xSize) = y2;
*(VT*)(rowSum + i + 2*xSize) = y3;
*(VT*)(rowSum + i + 3*xSize) = y4;
}
Some benchmarks
[Bar chart: time in seconds (0-10), Simple vs Vect + Loop, for int, float, and short]
Some benchmarks
[Bar chart: time in seconds (0-10), float only: Simple, Vectorized, Vect + Loop, Eigen, SumOrder, Asm]
Using GPGPU
• Around 1.5 orders of magnitude higher theoretical performance
• On iPhone 5, the CPU delivers ~800 MFLOPS, the GPU 28.8 GFLOPS
• On iPhone 5S, the CPU delivers ~1.5 GFLOPS, the GPU 76.4 GFLOPS!
• Can be very hard to utilize efficiently
• CUDA, obviously, isn’t available on mobile devices
• OpenCL isn’t available on iOS and is barely available on Android
• On iOS, Metal is available for GPGPU, but only starting with the iPhone 5S
• On Android, Google promotes RenderScript for GPGPU
• So the only cross-platform way is OpenGL ES (2.0)
Common usage of shaders for GPGPU
Image data → Shader 1 → texture containing processed data → Shader 2 → … → results
The results are then either displayed on screen or read back to the CPU
Common problems
• Textures were designed to hold RGBA8 data
• On almost all phones since 2012, half-float and float textures are supported as
input
• Bilinear filtering for float textures may be unsupported or inefficient
• On many devices, writing from a fragment shader to half-float (16-bit) textures
is supported
• Emulating fixed-point arithmetic is pretty straightforward
• Emulating floating-point is possible, but a bit tricky and requires more operations
• Changing OpenGL state may be expensive
• For-loops with a non-constant number of iterations are not supported on older devices
• Reading back from GPU to CPU is very expensive
• There are some platform-dependent ways to make it faster
Tasks that can be solved on OpenGL ES
• Image processing
• Image binarization
• Edge detection (Sobel, Canny)
• Hough transform (though, some parts can’t be implemented on GPU)
• Histogram equalization
• Gaussian blur/other convolutions
• Colorspace conversions
• Many more examples can be found in the GPUImage library for iOS
• For other tasks, it depends on many factors
• We tried to implement our tracking on GPU, but didn’t get the expected
performance boost
Questions?
Thanks for your attention!