Easy R

Creating an Optimized Algorithm in R: Version 1
October 22, 2009

R: Background
Nobody owns R, yet R-related products have been created by:
* REvolution Computing (partnering with Microsoft/Intel): http://www.revolution-computing.com/industry/academic.php
* SAS (interface to SAS/IML): http://support.sas.com/rnd/app/studio/Rinterface2.html
* SPSS (interface to SPSS, including some use of Python): http://insideout.spss.com/2009/01/13/spss-statistics-and-r/
* Blue Reference Inc (plugin for MS Office): http://inferenceforr.com/default.aspx
* Information Builders (R GUI for data mining): http://www.informationbuilders.com/products/webfocus/predictivemodeling.html

R Packages
CRAN:
* 1783 packages in R 2.11
* 1977 packages in R 2.9
Cost: $0, but a lot of hours.
Question: how many people in the world know all 1977 R packages?

Some uses of R
Citation: http://blog.revolution-computing.com

    library(maps)
    map("state", interior = FALSE)
    map("state", boundary = FALSE, col = "gray", add = TRUE)

GADM is a spatial database of the location of the world's administrative boundaries. Using the spplot function (from the sp package), load the data for Switzerland and then plot each canton with a color denoting its primary language:

    library(sp)
    con <- url("http://gadm.org/data/rda/CHE_adm1.RData")
    print(load(con))
    close(con)
    language <- c("german", "german", "german", "german",
                  "german", "german", "french", "french",
                  "german", "german", "french", "french",
                  "german", "french", "german", "german",
                  "german", "german", "german", "german",
                  "german", "italian", "german", "french",
                  "french", "german", "german")
    gadm$language <- as.factor(language)
    col <- rainbow(length(levels(gadm$language)))
    spplot(gadm, "language", col.regions = col, main = "Swiss Language Regions")

AnthroSpace: Download Global Administrative Areas as RData files

Seven tips for "surviving" R
Citation: John Mount, Win-Vector LLC
* Keep extensive written notes.
* Find a way to search for R answers.
* Learn to convert complex objects to canonical forms with unclass().
* Learn how to find and inspect classes and methods for objects.
* Learn how to clear pesky attributes from objects.
* Swallow your pride and learn and use R's many one-line idioms, rather than reinventing the wheel.

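A minimal sketch of the tips on unclass(), classes and methods, and attributes. The linear model on the built-in cars data and the small named vector are illustrative choices, not part of the original tips.

```r
# Fit a small model on the built-in 'cars' data (illustrative choice).
fit <- lm(dist ~ speed, data = cars)

# Find and inspect the class of an object and the methods defined for it.
class(fit)                           # "lm"
lm_methods <- methods(class = "lm")  # S3 methods available for lm objects

# Convert a complex object to a canonical form with unclass():
# the result is a plain list, so no special print method hides its contents.
raw_fit <- unclass(fit)
is.list(raw_fit)
names(raw_fit)

# Clear pesky attributes from an object.
x <- c(a = 1, b = 2, c = 3)
attr(x, "note") <- "temporary"
attributes(x) <- NULL                # drops names and every other attribute
x
```
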
Writing a Function/Algorithm in R
Simply enough:

    newRalgorithm <- function(x) OldAlgorithm(x)

Example:

    library(MASS)  # provides the crabs data set

    do_something <- function(x, y) {
      # Function code goes here ...
    }

    # Subset my data
    orange_girls <- subset(crabs, sex == 'F' & sp == 'O')

    # Call my function
    do_something(orange_girls$CW, orange_girls$C

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/

Writing a new stats algorithm (in R or another language)
Steps (basic idea):
* Review the journals in the study area.
* Study existing algorithms for gap analysis, and add creativity.
* Test and iterate within the community.
* Publish.

Choosing Clustering as the area of interest
* Clustering works with Big Data.
* It can work with lots of incomplete column variables when other techniques may not be suitable.
* It works when data cannot be used for regression models.
* Groups of clusters can be merged and combined to make new clusters, which makes a case for parallel processing.
* It is useful for product marketing, business, medicine and finance.

K Means Clustering using R

    R> data("planets", package = "HSAUR")
    R> library("scatterplot3d")
    R> scatterplot3d(log(planets$mass), log(planets$period),
    +    log(planets$eccen), type = "h", angle = 55,
    +    pch = 16, y.ticklabs = seq(0, 10, by = 2),
    +    y.margin.add = 0.1, scale.y = 0.7)

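The slide plots the data but does not show the clustering call itself. A minimal sketch of k-means with the base kmeans() function follows. It assumes the built-in iris measurements as stand-in data, so that it runs without the HSAUR package, and the choice of three clusters is also an assumption.

```r
# Stand-in data: the four numeric columns of the built-in iris data set,
# scaled so that every variable contributes equally to the distances.
measurements <- scale(iris[, 1:4])

set.seed(42)                         # k-means starts from random centers
fit <- kmeans(measurements, centers = 3, nstart = 10)

fit$size                             # number of observations in each cluster
fit$centers                          # cluster centers on the scaled variables

# Cross-tabulate cluster membership against the known species labels.
table(cluster = fit$cluster, species = iris$Species)
```

nstart = 10 reruns the algorithm from ten random starts and keeps the best solution, which reduces the dependence on the initial centers.
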
Writing a Function/Algorithm in R 2
Adding loops and multiple functions. Example:

    # Arrays of values for each type of species and sex
    species <- unique(crabs$sp)
    sexes <- unique(crabs$sex)

    # Loop through species ...
    for (i in 1:length(species)) {
      # ... loop through sex ...
      for (j in 1:length(sexes)) {
        # ... and finally call a function on each subset
        something_else(subset(crabs, sp == species[i] & sex == sexes[j]))
      }
    }

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/

More ways to write functions

    each <- function(.column, .data, .lambda) {
      # Find the column index from its name
      column_index <- which(names(.data) == .column)
      # Find the unique values in the column
      column_levels <- unique(.data[, column_index])
      # Loop over these values
      for (i in 1:length(column_levels)) {
        # Subset the data and call the passed function on it
        .lambda(.data[.data[, column_index] == column_levels[i], ])
      }
    }

The last argument, .lambda, is an R function. Because R treats functions as objects, they can be passed as arguments to other functions.

    # Another function as the last argument to this function
    each("sp", crabs, something_else)

    # Or create a new anonymous function ...
    each("sp", crabs, function(x) {
      # ... and run multiple lines of code here
      something_else(x)
      with(x, lm(CW ~ CL))
    })

Additionally, to create new functions, use plyr
From http://had.co.nz/plyr/
plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It's already possible to do this with split and the apply functions, but plyr just makes it all a bit easier with:
* consistent names, arguments and outputs
* input from and output to data.frames, matrices and lists
* progress bars to keep track of long running operations
* built-in error recovery
* a consistent and useful set of tools for solving the split-apply-combine problem

    library(plyr)
    # Three arguments
    # 1. The dataframe
    # 2. The name of columns to subset by
    # 3. The function to call on each subset
    d_ply(crabs, .(sp, sex), something_else)

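For comparison, a minimal sketch of the same split-apply-combine pattern with only split() and the apply functions mentioned above. The helper mean_width() is a hypothetical stand-in for something_else(), which the slides leave undefined.

```r
library(MASS)  # provides the crabs data set used on the earlier slides

# Hypothetical stand-in for something_else(): mean carapace width of a subset.
mean_width <- function(d) mean(d$CW)

# Split: one data frame per species/sex combination.
pieces <- split(crabs, list(crabs$sp, crabs$sex))

# Apply: call the function on every piece.
# Combine: sapply() collects the results into a named numeric vector.
widths <- sapply(pieces, mean_width)
widths
```

The function is passed by name, just like .lambda in each(), so any other per-subset function can be swapped in without changing the split and combine steps.
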
Quick Recap
* We have an algorithm in mind, or create a new algorithm (the toughest part). Example, from http://en.scientificcommons.org/42572415: Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets.
* We created a function in R for it. We can also use this approach to rename package functions (like a SAS R package I created).
* We now need to create a package so that all 2 million R users may have a chance to use it.

Creating a New Package
Citation: http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt(), and a DESCRIPTION file is created. The function then prints out a list of things for you to do next. This creates the package within the current working directory.

    > package.skeleton(name="NAME_OF_PACKAGE", code_files="FILENAME.R")
    Creating directories ...
    Creating DESCRIPTION ...
    Creating Read-and-delete-me ...
    Copying code files ...
    Making help files ...
    Done.
    Further steps are described in './linmod/Read-and-delete-me'.

Q: Where is my package?
A: getwd()

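A minimal sketch of the step above that can be run without touching the working directory. The function double_it() and the package name demoPkg are hypothetical, and the skeleton is written to a temporary directory through the path argument instead of getwd().

```r
# Hypothetical function to ship in the package.
double_it <- function(x) 2 * x

# Build the skeleton in a temporary directory instead of the working directory.
pkg_root <- tempfile("pkgdemo")
dir.create(pkg_root)
package.skeleton(name = "demoPkg", list = "double_it",
                 environment = environment(), path = pkg_root)

# Inspect what was created: DESCRIPTION, R/ code, man/ help skeletons, ...
list.files(file.path(pkg_root, "demoPkg"), recursive = TRUE)
```
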
Creating a New Package
Citation: http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
Q: What is the best step in making software?
A: Documenting help.
Finally:
* Edit the help file skeletons in 'man', possibly combining help files for multiple functions.
* Put any C/C++/Fortran code in 'src'.
* If you have compiled code, add a .First.lib() function in 'R' to load the shared library.
* Run R CMD build to build the package tarball.
* Run R CMD check to check the package tarball.
Read "Writing R Extensions" for more information: http://cran.r-project.org/doc/manuals/R-exts.pdf
Also see the guidelines for CRAN submission.

Next Steps
We have new functions and a new package. We now need to optimize the R package for performance using:
1) Parallel computing
2) High performance computing
3) Code optimization

Optimizing Code
Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf
R already provides the basic tools for performance analysis:
* the system.time function for simple measurements
* the Rprof function for profiling R code
* the Rprofmem function for profiling R memory usage
In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data.
We use these tools to create visual images of how the algorithm loops, in case we don't know what the algorithm we created looks like visually, and to avoid repeated calls.

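A minimal sketch of system.time and Rprof in use. The workload slow_task() is a hypothetical example chosen only to run long enough for the sampling profiler to collect data, and the sampling interval is an arbitrary choice.

```r
# Hypothetical workload to profile: repeated sorting of random vectors.
slow_task <- function(n_rep = 200, n = 1e5) {
  total <- 0
  for (i in seq_len(n_rep)) {
    total <- total + sum(sort(runif(n)))
  }
  total
}

# Simple measurement: user, system and elapsed time in seconds.
timing <- system.time(slow_task())
timing[["elapsed"]]

# Sampling profiler: write samples to a temporary file, then summarise them.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
invisible(slow_task())
Rprof(NULL)                          # stop profiling
profile <- summaryRprof(prof_file)
head(profile$by.self)                # functions ranked by time in their own code
```
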
Optimizing Code: Example
Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

    > sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) }
    > system.time(print(sillysum(1e7)))
    [1] 5e+13
       user  system elapsed
     13.617   0.020  13.701
    > system.time(print(sum(as.numeric(seq(1, 1e7)))))
    [1] 5e+13
       user  system elapsed
      0.224   0.092   0.315

Replacing the loop yielded a gain of a factor of more than 40.

Running R Parallel
We need a cluster (like Newton, with 1500 processors, run on the 2nd floor of SMC).
Several R packages execute code in parallel:
* NWS
* Rmpi
* snow (using MPI, PVM, NWS or sockets)
* papply
* taskPR
* multicore

Running R Parallel
We need an HPC cluster, and also queue time, that is, an allowance for how long we can run our query on the shared resource.
Using snow, a simple example:

    cl <- makeCluster(4, "MPI")
    print(clusterCall(cl, function()
        Sys.info()[c("nodename", "machine")]))
    stopCluster(cl)

and

    params <- c("A", "B", "C", "D", "E", "F", "G", "H")
    cl <- makeCluster(8, "MPI")
    res <- parSapply(cl, params,
                     FUN = function(x) myNEWFunction(x))

will 'unroll' the parameters params, one each, over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once.

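Where no MPI cluster is at hand, the same pattern can be tried on a single machine. The sketch below swaps in the parallel package that ships with current R, which offers snow-style makeCluster() and parSapply() over local sockets. It is not the snow/MPI setup from the slide, and the body of myNEWFunction() is a hypothetical stand-in.

```r
library(parallel)  # ships with R; provides snow-style socket clusters

# Hypothetical stand-in for myNEWFunction(): tag each parameter with its length.
myNEWFunction <- function(x) paste0(x, ":", nchar(x))

params <- c("A", "B", "C", "D")

cl <- makeCluster(2)                 # two local worker processes over sockets
res <- parSapply(cl, params, myNEWFunction)
stopCluster(cl)                      # always release the workers

res
```

With two workers and four parameters, each worker handles two calls. The result has the same shape as a plain sapply(params, myNEWFunction).
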
Current Status
* We are writing the algorithm we have selected for optimized use on Newton.
* We will create a package and release it with a paper once the project is over.
