Data Analysis by Checking, Clustering and Componentizing in IMPL 
(IMPL-DataAnalysis) 
i n d u s t r IAL g o r i t h m s LLC. (IAL) 
www.industrialgorithms.com 
September 2014 
Introduction 
This short document describes three separate techniques for analyzing data by checking, clustering and componentizing it before it is used by other IMPL routines, especially in on-line/real-time decision-making applications. Other data consistency and analysis techniques, described in other IMPL documents, apply data reconciliation and regression with diagnostics; these require an explicit model (model-based), whereas the techniques below do not, i.e., they are data-based techniques. 
Data Checking 
IMPL’s two separate data checking routines use techniques that are well known in the process industries, especially where programmable logic controller (PLC) and distributed control system (DCS) applications are implemented. The first data checking routine has two parts: (1) check the “range” (or domain) of the data against its expected lower and upper absolute bounds, similar to bounds-checking in compilers, and (2) check the “rate-of-change” (ROC) of the data against its lower and upper relative bounds, i.e., the lower bound represents the minimum expected ROC from one data sample to the next and the upper bound represents the maximum expected ROC. 
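The two-part check can be sketched in a few lines of Python. This is only an illustration and not IMPL's routine: the function name and parameters are hypothetical, the ROC is assumed to be the signed difference between consecutive samples, and each sample is compared against the previous raw sample (not the last sample that passed).

```python
from typing import List, Sequence


def check_range_and_roc(
    samples: Sequence[float],
    lower: float,
    upper: float,
    roc_lower: float,
    roc_upper: float,
) -> List[bool]:
    """Return one pass/fail flag per sample (hypothetical helper).

    A sample passes when it lies within [lower, upper] and, for every
    sample after the first, when its signed change from the previous raw
    sample lies within [roc_lower, roc_upper].
    """
    flags = []
    previous = None
    for value in samples:
        ok = lower <= value <= upper          # part 1: absolute range check
        if ok and previous is not None:
            ok = roc_lower <= value - previous <= roc_upper  # part 2: ROC check
        flags.append(ok)
        previous = value                      # assumption: compare to raw predecessor
    return flags
```

Because the comparison uses the raw predecessor, the sample that follows a spike also fails the ROC check; comparing against the last good sample instead is an equally reasonable design choice.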
The second data checking routine is more sophisticated and is more relevant to continuous processes. This is the technique of steady-state detection (SSD); a new and accurate method can be found in Kelly and Hedengren (2013). It requires several key process or operating variables such as flows, holdups, temperatures, pressures, analyzers, etc. to be checked to see if they are statistically stationary or steady, typically over a one-hour time horizon. If a majority of the key variables are at steady-state, then the process can also be declared to be at steady-state, and steady-state empirical and/or engineering models can be used to monitor and optimize the process. 
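The majority-vote idea can be illustrated with a deliberately simplified test. The sketch below is not the Kelly and Hedengren (2013) algorithm: as a stand-in, it fits a straight line to each variable's window and calls the variable steady when the fitted drift is not significant. The function names and the threshold `t_crit = 3.0` are assumptions made for this example.

```python
import numpy as np


def is_steady(window: np.ndarray, t_crit: float = 3.0) -> bool:
    """Simplified stand-in test: steady if the fitted linear drift is insignificant."""
    n = window.size
    t = np.arange(n, dtype=float)
    t_c = t - t.mean()                        # centered time index
    x_c = window - window.mean()              # centered measurements
    stt = np.dot(t_c, t_c)
    slope = np.dot(t_c, x_c) / stt            # least-squares drift estimate
    residuals = x_c - slope * t_c
    sigma2 = np.dot(residuals, residuals) / (n - 2)   # noise variance estimate
    slope_se = np.sqrt(sigma2 / stt)          # standard error of the slope
    if slope_se == 0.0:                       # noise-free window
        return bool(slope == 0.0)
    return bool(abs(slope / slope_se) < t_crit)


def process_is_steady(windows: dict, t_crit: float = 3.0) -> bool:
    """Declare the process steady when a majority of key variables are steady."""
    votes = [is_steady(np.asarray(w, dtype=float), t_crit) for w in windows.values()]
    return sum(votes) > len(votes) / 2
```

Here `windows` maps a variable name to its samples over the detection horizon, for example one hour of one-minute readings per variable.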
Data Clustering 
IMPL’s data clustering routine implements the Fuzzy C-Means Clustering (FCMC) algorithm (Bezdek et al., 1984; Bezdek et al., 1999), a nonlinear iterative algorithm which usually requires multiple randomized re-starts to find the most accurate c-means or k-means clusters. Once the data has passed the above data checks, the operating or process data can be clustered (also using a set of key process variables) into several regions, groups or partitions. These usually correspond to distinct operating modes or process operations such as minimum throughput, maximum yield, high conversion, low grade, etc. The number of clusters can be estimated using the gap statistic of Tibshirani et al. (2001), but it is usually known a priori from production/operating orders and logs, since these modes/operations are planned/scheduled days to weeks in advance.
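A minimal NumPy sketch of textbook fuzzy c-means with randomized re-starts is given below. It is not IMPL's implementation; the function names, the fuzzifier `m = 2.0`, the initialization from randomly chosen data points and the stopping tolerance are all assumptions made for illustration.

```python
import numpy as np


def memberships(X, centers, m=2.0, eps=1e-12):
    """Fuzzy memberships (n_samples x n_clusters); each row sums to 1."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))            # u_ik proportional to d_ik^(-2/(m-1))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, n_clusters, m=2.0, n_restarts=10, max_iter=200,
                  tol=1e-8, seed=0):
    """Return (centers, memberships) of the restart with the lowest objective."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Randomized start: pick distinct data points as initial centers.
        centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
        for _ in range(max_iter):
            W = memberships(X, centers, m) ** m
            new_centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
            shift = np.abs(new_centers - centers).max()
            centers = new_centers
            if shift < tol:
                break
        U = memberships(X, centers, m)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        objective = float(((U ** m) * d2).sum())  # weighted within-cluster scatter
        if best is None or objective < best[0]:
            best = (objective, centers, U)
    return best[1], best[2]
```

`X` is expected to be a two-dimensional NumPy array with one row per sample and one column per key process variable. Scaling the variables to comparable units before clustering is usually advisable because the distances are Euclidean.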
The FCMC algorithm assigns real-valued probabilities (weights or memberships) to the expectation that the process is in a particular operating/production mode, region or regime. This information is useful because multiple local models, perhaps linear or simpler nonlinear ones, can then be employed to monitor and optimize the process accurately. This is similar to the approach of Aumi and Mhaskar (2011) and Aumi et al. (2011), who control a nonlinear batch process using multiple linear auto-regressive exogenous (ARX) dynamic models. Their approach uses the clustering routine to determine which ARX model to use for the control prediction and manipulation given the current state of the process, where transitions from one cluster, mode or region to the next result in probabilities lying between 0.0 and 1.0. 
Once the clusters have been determined from the training, calibration or development data in terms of their cluster targets or mean-centers/centroids, the same routine can be used with testing, control or deployment data by fixing the targets and computing the weights or membership probabilities in only one (1) iteration. These weights can then be used to weight or proportion the predictions from the multiple localized models. With regard to dimensionality, if the number of key process variables used in the clustering is large, it is possible and practical to use the data componentizing routine described below and cluster only the larger principal components (Aumi and Mhaskar, 2011). 
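The deployment step can be sketched as follows: with the centroids fixed, the memberships follow from a single evaluation of the standard fuzzy c-means membership formula, and they then blend the outputs of the local models. The function names and the fuzzifier `m = 2.0` are assumptions for this example, and the local model predictions are simply passed in as numbers.

```python
import numpy as np


def fixed_center_memberships(x_new, centers, m=2.0, eps=1e-12):
    """One-pass memberships of new samples to fixed centroids (rows sum to 1)."""
    d2 = ((x_new[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def blend_predictions(weights, local_predictions):
    """Membership-weighted combination of per-cluster model predictions.

    weights and local_predictions are both (n_samples x n_clusters);
    column j of local_predictions holds the output of the model for cluster j.
    """
    return (weights * local_predictions).sum(axis=1)
```

A sample that sits exactly between two centroids receives equal weights, so its blended prediction is the plain average of the two local models.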
Data Componentizing 
IMPL’s data componentizing routine implements the very well-known Principal Components Analysis and Regression (PCA/PCR), where the X-block of explanatory or regressor variables is componentized into one or more factors, latent variables or principal components. The resulting scores are mutually orthogonal (perpendicular), i.e., uncorrelated with each other. These scores are then used to regress one or more Y-block responses. The loadings that relate the scores to the X-block are computed in the PCA before the regression parameters relating the scores to the Y-block are computed, so the loadings take no account of the Y-block. This arguably makes PCR inferior to the related technique of Partial Least Squares or Projection to Latent Structures (PLS). The inferiority is primarily attributed to the fact that PCR requires more components than PLS for the same or similar R² or percent (%) Y explained, i.e., it is not as parsimonious as PLS. 
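The two sequential steps of PCR can be sketched with NumPy. This is a generic textbook version for a single response, using the singular value decomposition in place of NIPALS; it is not IMPL's routine, and the function names are hypothetical.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression for one response via the SVD."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # mean-center both blocks
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # step 1: loadings from X only
    T = Xc @ P                                # scores (mutually orthogonal columns)
    b = np.linalg.lstsq(T, yc, rcond=None)[0] # step 2: regress y on the scores
    return x_mean, y_mean, P, b


def pcr_predict(X, model):
    """Predict the response for new X rows from a fitted PCR model."""
    x_mean, y_mean, P, b = model
    return (X - x_mean) @ P @ b + y_mean
```

The loadings `P` are fixed before `y` is ever used, which is the sequential structure the paragraph above contrasts with PLS. When all components are retained, the predictions coincide with ordinary least squares.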
To address this issue, IMPL offers a unique and unpublished technique found only in IMPL: Principal Component Regression Optimization (PCRO). This algorithm computes the scores and regression coefficients simultaneously by minimizing a weighted sum of squares of residuals for both the X- and Y-blocks, with regularization similar to the Levenberg-Marquardt (trust-region) algorithm in nonlinear parameter estimation. PCR and PLS compute the latent variables or scores sequentially, one at a time, typically using NIPALS (nonlinear iterative partial least squares), whereas PCRO, as mentioned, computes the loadings and regression parameters together in the same nonlinear optimization problem, solved using an Equality-Constrained Successive Quadratic Programming (SQP) algorithm. The interesting feature of PCRO is that, for the same or similar R² or % Y explained, it requires fewer components than PLS. 
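Since PCRO is unpublished, its exact formulation is not known here. Purely as an illustration of what a joint objective of this kind could look like, and not as the IMPL formulation, one generic reading is written below. Every symbol is an assumption: P is the loading matrix, B the regression coefficients on the scores T = XP, w_X and w_Y the block weights, λ a damping factor that pulls the solution toward the previous iterate (P_k, B_k), and the orthonormality condition is one plausible equality constraint.

```latex
\min_{P,\,B}\;
  w_X \,\lVert X - X P P^{\top} \rVert_F^2
  + w_Y \,\lVert Y - X P B \rVert_F^2
  + \lambda \left( \lVert P - P_k \rVert_F^2 + \lVert B - B_k \rVert_F^2 \right)
\quad \text{subject to} \quad P^{\top} P = I
```

In this reading, setting w_Y to zero leaves a pure PCA fit of the X-block, while increasing w_Y lets the Y-block influence the loadings.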
References 
Bezdek, J.C., Ehrlich, R., Full, W., “FCM: the fuzzy c-means clustering algorithm”, Computers & Geosciences, 10, 191, (1984). 
Bezdek, J.C., Keller, J., Krishnapuram, R., Pal, N.R., “Fuzzy models and algorithms for pattern recognition and image processing”, Kluwer Academic Publishers, TA1650.F89, (1999).
Tibshirani, R., Walther, G., Hastie, T., “Estimating the number of clusters in a data set via the gap statistic”, J. R. Statist. Soc. B, 63, 411-423, (2001). 
Aumi, S., Mhaskar, P., “Integrating data-based modeling and nonlinear control tools for batch process control”, American Control Conference, San Francisco, June, (2011). 
Aumi, S., Corbett, C., Mhaskar, P., “Data-based modeling and control of Nylon 6,6 batch polymerization”, American Control Conference, San Francisco, June, (2011). 
Kelly, J.D., Hedengren, J.D., “A steady-state detection (SSD) algorithm to detect non-stationary drifts in processes”, Journal of Process Control, 23, 326, (2013).
