Parallel Computing 2007: Bring your own parallel application February 26-March 1 2007 Geoffrey Fox Community Grids Laboratory Indiana University 505 N Morton Suite 224 Bloomington IN [email_address]
Intel’s Application Stack. [Figure: the applications discussed here are marked; the rest is mainly classic parallel computing.]
K-Means The diagrams come from Wikipedia. Take N data points x in some space (which can be relatively abstract, such as a space of chemical properties). We want to cluster into c components based on distance in this space. The algorithm assumes you have a guess c_k for the cluster centers, k = 1..c. Associate each of the N points with one and only one cluster by minimizing its distance to the c_k. Replace each c_k by the centroid of the points associated with it. Iterate the algorithm.
Problem used later in deterministic annealing version of K-Means
K-Means illustrated: a) Shows the initial randomized centers and a number of points. b) Centers have been associated with the points and have been moved to the respective centroids. c) Now, the association is shown in more detail, once the centroids have been moved. d) Again, the centers are moved to the centroids of the corresponding associated points.
Parallel K-Means This algorithm is data parallel over the N points x. Assign N/N_proc points to each of the N_proc processors; no ordering is needed in the simple algorithm. Broadcast the initial cluster centers c_k to each processor. Each processor independently calculates the nearest c_k for each data point it is responsible for; further, it calculates partial sums for the c centroids and error estimates (used for convergence). {Sums over all points} are {sums over processors of (sums over all points in a given processor)}. Apply MPI_Allreduce for the global sums, with the (same) c results placed in each processor. All processors calculate the new c_k and iterate.
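To make the data-parallel structure concrete, here is a minimal C/MPI sketch of one iteration (not from the original slides); the flat array layout, DIM, the cap c ≤ 64, and all names are illustrative assumptions.

```c
/* Sketch of one parallel K-Means iteration (illustrative only).
   Assumes DIM-dimensional points stored as flat double arrays, c <= 64,
   and that MPI has already been initialized. */
#include <mpi.h>
#include <float.h>

#define DIM 2   /* dimension of each point x (assumption) */

void kmeans_iteration(const double *x, int n_local,   /* N/N_proc local points */
                      double *centers, int c)         /* c cluster centers     */
{
    double sums[64][DIM] = {{0}};   /* partial sums of points per center */
    double counts[64] = {0};        /* partial counts per center         */

    for (int i = 0; i < n_local; i++) {
        int best = 0; double best_d = DBL_MAX;
        for (int k = 0; k < c; k++) {            /* nearest center for point i */
            double d = 0;
            for (int j = 0; j < DIM; j++) {
                double diff = x[i*DIM + j] - centers[k*DIM + j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = k; }
        }
        for (int j = 0; j < DIM; j++) sums[best][j] += x[i*DIM + j];
        counts[best] += 1.0;
    }

    /* Global sums: every processor receives the same totals */
    MPI_Allreduce(MPI_IN_PLACE, sums,   c * DIM, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, counts, c,       MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* All processors compute the new centers identically */
    for (int k = 0; k < c; k++)
        if (counts[k] > 0)
            for (int j = 0; j < DIM; j++)
                centers[k*DIM + j] = sums[k][j] / counts[k];
}
```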
MPI Parallel Divkmeans clustering of PubChem on the AVIDD Linux cluster; 5,273,852 structures (PubChem compound collection, Nov 2005). David Wild, Indiana.
Performance of Parallel K-Means There is an amount of distance calculation that is proportional to (n = N/N_proc)*c for c clusters and N points on N_proc processors. There is the global sum calculation proportional to c log₂ N_proc. So the overhead f_comm is log₂ N_proc t_comm / (n t_calc). The appearance of log₂ N_proc is quite common, as global sums are often used; that’s why MPI has MPI_Allreduce, with the hope that it can be optimized on whatever network is available. Notice these MPI collectives are often not optimized and rarely used except by the Marine Corps. Note this problem has information dimension 1.
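Written out as a formula (restating the slide, not adding new analysis):

```latex
% Communication overhead of parallel K-Means: ratio of the global-sum cost
% to the local distance-calculation cost, as stated on the slide.
\[
f_{\mathrm{comm}}
  \;=\; \frac{c \,\log_2 N_{\mathrm{proc}}\; t_{\mathrm{comm}}}
             {n\, c\; t_{\mathrm{calc}}}
  \;=\; \frac{\log_2 N_{\mathrm{proc}}\; t_{\mathrm{comm}}}{n\, t_{\mathrm{calc}}},
  \qquad n = N / N_{\mathrm{proc}} .
\]
```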
Find maximum of a distributed array TEST. ALLREDUCE can do many reductions, typically after the user has done a reduction internally within each processor.
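A minimal MPI sketch of this pattern (an illustration, assuming the array is already distributed and MPI is initialized):

```c
/* Sketch: find the global maximum of a distributed array.
   Each processor first reduces over its own n_local elements,
   then a single MPI_Allreduce combines the N_proc partial results. */
#include <mpi.h>

double distributed_max(const double *a, int n_local)
{
    double local_max = a[0];
    for (int i = 1; i < n_local; i++)          /* local reduction */
        if (a[i] > local_max) local_max = a[i];

    double global_max;
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_max;                          /* same value on every processor */
}
```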
ALLREDUCE on a multicore chip On a shared memory machine, one can use a different strategy: “transpose” the decomposition so that in the global reduction you parallelize over the c centers rather than over the geometric spatial decomposition. Each core sums the contributions to a given center. The computational complexity is Max(1, c/N_proc) * (dimension of vector x); the distributed version is c log₂ N_proc * (dimension of vector x).
Transposing Partial Sums
Let the result of the parallel computation be the partial sum C(i,k) for processor i calculating centroid k, with 1 ≤ i ≤ N_proc and 1 ≤ k ≤ c. Take the special case c = N_proc = 4.
Calculate partial sums locally:
Processor 1: C(1,1) C(1,2) C(1,3) C(1,4)
Processor 2: C(2,1) C(2,2) C(2,3) C(2,4)
Processor 3: C(3,1) C(3,2) C(3,3) C(3,4)
Processor 4: C(4,1) C(4,2) C(4,3) C(4,4)
Transpose and sum along rows in each processor to get 100% efficiency:
Processor 1: C(1,1)+C(2,1)+C(3,1)+C(4,1)
Processor 2: C(1,2)+C(2,2)+C(3,2)+C(4,2)
Processor 3: C(1,3)+C(2,3)+C(3,3)+C(4,3)
Processor 4: C(1,4)+C(2,4)+C(3,4)+C(4,4)
The MPI solution cannot transpose for free and so uses a tree in this direction.
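A shared-memory sketch of the transposed reduction (illustrative only; the array names and the OpenMP parallelization over centers are assumptions, not code from the talk):

```c
/* Sketch of the "transposed" shared-memory reduction: instead of a tree over
   processors, each core sums the contributions to one (or a few) centers.
   C[i][k] is the partial sum from core i for center k. Compile with -fopenmp. */
#include <omp.h>

void transposed_reduce(int n_cores, int c, double C[n_cores][c], double total[c])
{
    #pragma omp parallel for                 /* parallelize over centers k */
    for (int k = 0; k < c; k++) {
        double s = 0.0;
        for (int i = 0; i < n_cores; i++)    /* each core reads one column of C */
            s += C[i][k];
        total[k] = s;
    }
}
```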
Continuing the Intel Homework Set
Clustering by Deterministic Annealing One can refine this by using multi-scale methods and annealing the system in position resolution (Gurewitz and Rose).
Deterministically find cluster centers y_j using a “mean field approximation” – one could instead use slower Monte Carlo.
Annealing avoids local minima
Deterministic Annealing Method does not need to assume a number of clusters. See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998. Parallelization is similar to ordinary K-Means, as we are calculating global sums which are decomposed into local averages and then summed over components calculated in each processor. I found it interesting that clustering (and K-Means) is very important in Chemical Informatics for finding related compounds; the field does not seem to know about these multi-resolution methods.
Frequent Itemsets Mining We have a transaction database TDB whose records T_i are each a set of items {i_1, i_2, ..., i_m}. The i_k are items from a source vocabulary {s_1 ... s_N}, and we wish to find frequently occurring itemsets {s_A, s_B, ...} based on the number of times the itemset appears (in any order) in a transaction. I looked at two algorithms – Apriori and Frequent Pattern Growth. Apriori focuses on the itemsets, searching systematically from smallest to largest; it is natural for short transactions and small vocabularies. Frequent Pattern Growth focuses on the transactions after re-ordering them by item frequency; it is superior for finding long itemsets and effectively generates a new (compact) database with re-ordered items.
Parallel Frequent Itemsets Mining Parallelize by partitioning the transaction database and independently calculating frequent patterns from each partition. Use a global reduction to accumulate itemset counts from each partition. Now the global reduction sums counts over candidate patterns and goes together with a pruning step that keeps only patterns whose occurrence count exceeds some threshold. This pruning is not easy to do before the global sums (in spite of claims of at least one paper). The “transposed multicore” ALLREDUCE would be a good strategy.
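A hedged sketch of the reduction-plus-pruning step in C/MPI (candidate enumeration and local counting are assumed done elsewhere; the names are illustrative, not from the slides):

```c
/* Sketch: each processor counts candidate itemsets in its own partition of the
   transaction database; a global sum then gives total counts, and pruning keeps
   only itemsets whose support meets the threshold. */
#include <mpi.h>

int prune_candidates(long long *local_counts, int n_candidates,
                     long long min_support, int *keep)
{
    /* Global counts: sum the per-partition counts for every candidate */
    MPI_Allreduce(MPI_IN_PLACE, local_counts, n_candidates,
                  MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

    int n_kept = 0;
    for (int j = 0; j < n_candidates; j++) {
        keep[j] = (local_counts[j] >= min_support);   /* prune below threshold */
        n_kept += keep[j];
    }
    return n_kept;     /* identical on every processor */
}
```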
Transposing Partial Itemset Counts
Let the result of the parallel computation be the partial count C(i,k) for processor i counting occurrences of itemset k, with 1 ≤ i ≤ N_proc and 1 ≤ k ≤ c. Take the unrealistic special case c = N_proc = 4.
Calculate partial counts locally:
Processor 1: C(1,1) C(1,2) C(1,3) C(1,4)
Processor 2: C(2,1) C(2,2) C(2,3) C(2,4)
Processor 3: C(3,1) C(3,2) C(3,3) C(3,4)
Processor 4: C(4,1) C(4,2) C(4,3) C(4,4)
Multicore algorithm: transpose and sum along rows in each processor to get 100% efficiency:
Processor 1: C(1,1)+C(2,1)+C(3,1)+C(4,1)
Processor 2: C(1,2)+C(2,2)+C(3,2)+C(4,2)
Processor 3: C(1,3)+C(2,3)+C(3,3)+C(4,3)
Processor 4: C(1,4)+C(2,4)+C(3,4)+C(4,4)
Distributed MPI_ALLREDUCE: the MPI solution cannot transpose for free and so uses a tree in this direction.
(Mixed) Integer Programming We are solving an optimization problem such as: minimize f(x) = c^T x (for linear programming), subject to constraints (which are also linear for linear programming) such as A_1^T x = b_1 or A_2^T x ≥ 0, with the additional constraint that some (the mixed case) or all of the elements of x are integers (possibly restricted to 0 or 1). The continuous (non-integer) problem is soluble by the Simplex method or, in polynomial time, by interior point methods (Karmarkar). The integer programming problem is NP-complete.
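The same problem written out in one place (restating the slide):

```latex
% Mixed integer linear program as described on the slide.
\[
\begin{aligned}
\min_{x} \quad & f(x) = c^{T} x \\
\text{s.t.} \quad & A_1^{T} x = b_1, \qquad A_2^{T} x \ge 0, \\
& x_j \in \mathbb{Z} \ \text{(possibly } x_j \in \{0,1\}\text{)}
  \ \text{for some (mixed case) or all components } j .
\end{aligned}
\]
```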
Integer Programming Parallelization Typically one does not parallelize the linear program solver but rather runs it sequentially, and instead parallelizes a branch and bound (or cut) search over possible solutions in the NP-complete case, e.g. a search over the integer choices for x. The hard integer programming problem consists of: divide the space into subspaces; find upper and lower bounds on f(x) in each subspace; if the lower bound on f(x) in a subspace is greater than the current minimum of the upper bounds of f(x) over the other subspaces (i.e. than the upper bound of f(x) in some subspace), then one can prune this subspace; if a subspace is still active and its upper bound > lower bound, further divide it into subspaces and iterate the process. Parallelism comes from “data parallelism” over subspaces, which is suitable for thread-based systems. There is typically important shared knowledge, such as the current minimum upper bound and other information from one subspace that can be re-used by others, so keep a shared (in-memory) database for performance.
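As an illustration only, here is a tiny branch and bound in C on a toy 0/1 knapsack (a stand-in for the general integer program, not a method from the slides); the data are made up, and the global 'best' plays the role of the shared incumbent bound that a threaded version would protect with a lock or atomic.

```c
/* Branch-and-bound sketch: each node fixes one more binary choice, and an
   LP-style fractional bound prunes subspaces that cannot beat the best
   solution found so far. */
#include <stdio.h>

#define N 4
static const double value[N]  = {10, 5, 6, 3};   /* toy data, sorted by value/weight */
static const double weight[N] = { 4, 2, 3, 2};
static const double capacity  = 7;

static double best = 0;                           /* shared incumbent bound */

/* Optimistic bound: take remaining items fractionally.  If even this cannot
   beat 'best', the whole subspace can be pruned. */
static double bound(int i, double v, double w)
{
    for (; i < N && w + weight[i] <= capacity; i++) { v += value[i]; w += weight[i]; }
    if (i < N) v += value[i] * (capacity - w) / weight[i];
    return v;
}

static void branch(int i, double v, double w)
{
    if (i == N) { if (v > best) best = v; return; }
    if (bound(i, v, w) <= best) return;           /* prune: cannot improve */
    if (w + weight[i] <= capacity)                /* subspace with x_i = 1 */
        branch(i + 1, v + value[i], w + weight[i]);
    branch(i + 1, v, w);                          /* subspace with x_i = 0 */
}

int main(void)
{
    branch(0, 0, 0);
    printf("best value = %g\n", best);            /* expect 16 (values 10 and 6) */
    return 0;
}
```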
Computer Chess I Games like computer chess are a special case of the general branch and bound strategy. The space is the set of all moves, where N moves by White and Black is 2N plies; at each ply there are roughly 35 legal moves, so the complexity is 35^(2N). Evaluation of one set of moves to depth 2N is completed by evaluating the final position f(x), where x is the set of moves, by rules reflecting chess wisdom and summarized by a number (Queen = 10, Pawn = 1, etc.). Deep Blue parallelized the calculation of f(x), but here we explore subspace parallelization. We follow work done at Caltech using a 512-node nCUBE, which competed as WAYCOOL, with poor reliability and results, in the 1987 and 1988 ACM Computer Chess Championships.
Computer Chess II The upper-lower bound approach is replaced by a minimax principle. Assume that positive f(x) is good for White; then at each move White looks at each subspace spawned from a White move and chooses the one with the largest f(x). In evaluating a subspace we assume that at each stage the side on move makes the best choice: White always maximizes f(x) at her move and Black minimizes f(x) at his move. Of course, as N is finite and the evaluation function approximate, this is not precise, but it gets better and better the larger N is. Note human players tend to use more pattern recognition and less brute-force evaluation; computer games are unimaginative but have fewer errors.
Computer Chess III Pruning is illustrated below; as it is advantageous (if White is to move) to reach a large (good) value of f(x) as early as possible, one sorts the moves at each node and looks at the most plausible first. This reduces the effective branching ratio from 35 to about 6. [Tree diagram: White maximizes, Black minimizes; root value 4 with subtree values 4, -1, -7 over leaf evaluations 4, 29, 13, -1, 5, 2, -7, 3, 15, -11, -10, -17, 5.] The dotted lines show subspaces that never need to be searched; this requires that one has done a complete depth search of the first subspaces looked at.
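A small minimax with alpha-beta pruning in C on a toy tree (an illustration, not the Caltech code); the leaf values are taken from the first nine evaluations in the diagram, and the minimax value comes out to 4, matching the root of the figure.

```c
/* Minimax with alpha-beta pruning on a toy game tree.  Leaf values play the
   role of the evaluation function f(x); positive is good for White. */
#include <stdio.h>

#define BRANCH 3        /* toy branching factor (chess ~35, ~6 after sorting) */
#define DEPTH  2        /* plies searched */

static const int leaf[9] = {4, 29, 13, -1, 5, 2, -7, 3, 15};  /* BRANCH^DEPTH leaves */

static int alphabeta(int node, int depth, int alpha, int beta, int white_to_move)
{
    if (depth == 0) return leaf[node];

    if (white_to_move) {                       /* White maximizes f(x) */
        int v = -1000000;
        for (int m = 0; m < BRANCH; m++) {
            int s = alphabeta(node * BRANCH + m, depth - 1, alpha, beta, 0);
            if (s > v) v = s;
            if (v > alpha) alpha = v;
            if (alpha >= beta) break;          /* prune: Black will avoid this line */
        }
        return v;
    } else {                                   /* Black minimizes f(x) */
        int v = 1000000;
        for (int m = 0; m < BRANCH; m++) {
            int s = alphabeta(node * BRANCH + m, depth - 1, alpha, beta, 1);
            if (s < v) v = s;
            if (v < beta) beta = v;
            if (alpha >= beta) break;          /* prune: White will avoid this line */
        }
        return v;
    }
}

int main(void)
{
    printf("minimax value = %d\n",
           alphabeta(0, DEPTH, -1000000, 1000000, 1));   /* prints 4 */
    return 0;
}
```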
Computer Chess IV Threads were spawned in groups of 4 in the Caltech example at different depths of the tree, and the project achieved a speedup of over 100; the larger the number of plies N gets, the more parallelism there will be. [Figure: search tree with increasing search depth.]
Computer Chess V We have subsets of threads (4 in this example) synchronizing on the node minimax value. This is a global variable, and there are (as in other branch and bound) very important performance gains from a shared position database, which allows scores to be stored for positions and re-used. In chess there are many transpositions leading to identical positions: 1 e4 e5 2 Nf3 Nc6 is identical to the (less usual) 1 Nf3 Nc6 2 e4 e5. There was only a few percent overhead for a distributed database on the Caltech distributed-memory implementation; queuing of update requests ensured no errors from multiple threads accessing the same location. Multicore architectures should be excellent for this and other large branch and bound and related search algorithms, as they support shared databases and fast thread synchronization. Note that in Deep Fritz vs. Vladimir Kramnik (human world champion) in November 2006, the program ran on a personal computer containing two Intel Core 2 Duo CPUs, capable of evaluating 8 million positions per second and searching to an average depth of 17 to 18 ply in the middlegame. Deep Fritz won 4-2.
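A sketch of a mutex-protected shared position table (one simple way to realize the queued-update idea; an assumption for illustration, not the actual Caltech or Deep Fritz code):

```c
/* Shared position ("transposition") table: scores for positions already
   evaluated are stored under a hash key and re-used when the same position
   is reached by a different move order.  A mutex serializes updates so
   multiple threads cannot corrupt an entry. */
#include <pthread.h>
#include <stdint.h>

#define TABLE_SIZE (1u << 20)

typedef struct { uint64_t key; int depth; int score; } entry_t;

static entry_t table[TABLE_SIZE];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 and sets *score if a stored result at sufficient depth exists. */
int tt_probe(uint64_t key, int depth, int *score)
{
    int hit = 0;
    pthread_mutex_lock(&table_lock);
    entry_t *e = &table[key % TABLE_SIZE];
    if (e->key == key && e->depth >= depth) { *score = e->score; hit = 1; }
    pthread_mutex_unlock(&table_lock);
    return hit;
}

void tt_store(uint64_t key, int depth, int score)
{
    pthread_mutex_lock(&table_lock);
    entry_t *e = &table[key % TABLE_SIZE];
    if (depth >= e->depth) { e->key = key; e->depth = depth; e->score = score; }
    pthread_mutex_unlock(&table_lock);
}
```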
Wikipedia SVM Example We are finding the optimal hyperplane splitting two samples; the samples are the training set. The normal w to the splitting hyperplane is given by w = Σ_{i=1}^{n} y_i α_i x_i. The two samples are denoted by crosses (y_i = +1) or circles (y_i = -1).
Support Vector Machines SVM I These divide sets by (in the simplest case) hyperplanes into two in an optimal least-squares fashion. Minimize f(α) = 0.5 α^T G α − Σ_{i=1}^{n} α_i, subject to Σ_{i=1}^{n} y_i α_i = 0 and 0 ≤ α_i ≤ C, with G_ij = y_i y_j K(x_i, x_j) for kernel K. This is a training problem where we have a total of n data points from two populations, with y_i = +1 for the first and y_i = -1 for the second. K(x_i, x_j) = x_i · x_j is the simplest case, when the division is by a hyperplane in the space in which x is a vector, but Gaussian forms K = exp(−constant ‖x_i − x_j‖²) are often used. G is an n by n dense matrix (n is the number of data points). This is a quadratic programming (QP) problem.
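For reference, the training problem written out as the standard SVM dual quadratic program (restating the slide):

```latex
% SVM dual quadratic program as described on the slide.
\[
\begin{aligned}
\min_{\alpha}\quad & f(\alpha) \;=\; \tfrac{1}{2}\,\alpha^{T} G\,\alpha \;-\; \sum_{i=1}^{n} \alpha_i \\
\text{s.t.}\quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \\
& G_{ij} \;=\; y_i\, y_j\, K(\mathbf{x}_i, \mathbf{x}_j), \qquad
  K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j
  \ \text{ or }\ \exp\!\bigl(-\,\mathrm{const}\,\|\mathbf{x}_i - \mathbf{x}_j\|^{2}\bigr).
\end{aligned}
\]
```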
Support Vector Machines SVM II Differentiating with respect to α gives linear equations that must be solved iteratively to satisfy the inequality constraints. The matrix G is both large (10^6 by 10^6) and can be dense; this requires storage that often exceeds available memory. As in much quadratic programming, one can use conjugate gradient solution methods, as these systematically identify the important directions in space (roughly, the large eigenvalues of the positive definite symmetric matrix G). There are several papers on parallel SVM, but I did not see substantial use of parallel implementations. There were two approaches: either solve the matrix problems in parallel, or split up the dataset and solve multiple subproblems.
Support Vector Machines SVM III Solve the matrix problems in parallel: interestingly, one does not solve the full G but iterates up from smaller (~150 by 150) problems, so data parallelism does not exploit the size n. Need more reliable SVM solvers for large matrices? Split up the dataset and solve multiple subproblems – scalable! Here the difficulty is that essentially you have changed the algorithm, and it is not clear how best to combine the solutions of the subproblems; but the original SVM is full of heuristics (choice of K), so other heuristics may be allowed! Note that whereas multicore appears especially attractive for search problems, it is not so clear for SVM: multicore does not address the huge size of the matrix G, and high performance matrix solvers are available for distributed-memory machines. I suspect there are better “approximate” SVM solvers that will do well on multicore and reduce the dimension of G, but this is research.
Some Parallelization Results from “Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems” (this paper reviews much previous work). The superlinear speedup in (a) is due to extra memory.
