A Software Infrastructure for Multidimensional Data Analysis: A Data Modelling Aspect
Prarthana A. Deshkar
Assistant Professor
Yeshwantrao Chavan College of Engineering
Nagpur, India
prarthana.deshkar@gmail.com
Dr. Parag S. Deshpande
Supervisor, CSE dept, G. H. Raisoni College of Engineering, Nagpur, India
psdeshpande@cse.vnit.ac.in
Prof. A. Thomas
HoD, CSE Department
G. H. Raisoni College of Engineering, Nagpur, India
Abstract - Rapid changes in technology have led to an increased variety of data sources. These varied data sources generate data in large volumes and at extremely high speed. Accommodating and using this data in decision-making systems is a major challenge. To make the fullest use of the valuable data generated by different systems, the set of target users of analysis systems needs to grow. In general, the knowledge discovery process with the tools currently available requires considerable expertise in the domain as well as in the technology. The ITDA (Integrated Tool for Data Analysis) project aims to provide a complete platform for multidimensional data analysis that enhances decision making in every domain. The project provides all the techniques required to perform multidimensional data analysis while avoiding the overheads incurred by the traditional cube architecture followed by most analytics systems. Modelling the available data in multidimensional form is the basic and crucial step for multidimensional analysis. This work describes the multidimensional modelling aspect and its implementation in the ITDA project.
Keywords - Multidimensional data analysis, cube, data mining, machine learning, ETL, multidimensional modelling,
OLAP.
I. INTRODUCTION
Due to the increased frequency of data generation, the amount of data under consideration for analysis keeps growing tremendously. The large size of the data and the complexity of data analysis demand an easy platform so that researchers and domain experts can analyse their data without deep knowledge of information technology. Ad hoc querying and ad hoc reporting are the main needs of data analysis, and to serve a variety of domains a system must first model the data appropriately. Multidimensional data modeling is the way to provide the facility for such ad hoc analysis. Analysing multidimensional data is of growing importance for extracting knowledge and hence for enabling decision making in various domains. The data analysis process that leads to enhanced decision making combines various techniques, such as statistical methods, data mining algorithms and machine learning. Alongside these techniques, presentation of the analysis output with attractive visuals is a key part of popular analytics systems. Most current multidimensional systems rely on data cubes, which are very resource- and time-intensive. In this context, the ITDA architecture provides multidimensional analysis with reduced memory and time overheads compared to existing systems.
Absorption of a high volume of data from a variety of sources requires a robust and flexible system. In OLAP terminology, the data modelling and data absorption system is called the Extraction-Transformation-Loading
(ETL) process. The most important by-product of the ETL process is the metadata. The ITDA system uses an on-the-fly architecture for query generation, and hence the metadata of the multidimensional model is a crucial component of the system. In a typical analysis environment, ETL processes are performed in an ad hoc, in-house fashion or by using specialized ETL tools. The general functionality of all these tools covers identification of the relevant information present at the source, extraction of this information, customization and integration of the information coming from multiple sources into a common format, cleansing of the final data set on the basis of database and business rules, and propagation of the data to the relational database that will be used for analysis.
In the current scenario, organizations may have a number of sources contributing to data collection, all of which play an important role in the modelling process. The source data might reside at different places, and all data relevant and necessary for the analysis has to be extracted. After applying the transformations dictated by the business rules, the data is transferred into the target model.
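As a concrete illustration of these general ETL functions, the following minimal Python sketch extracts rows from a flat file, integrates them into a common format, cleanses them against a simple rule and loads them into a relational table; the file, table and column names and the cleansing rule are illustrative assumptions, not part of ITDA.

import csv
import sqlite3

def extract(csv_path):
    """Extract: read raw rows from the source flat file (assumed CSV)."""
    with open(csv_path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: integrate rows into a common format and cleanse them
    against a simple business rule (here: reject rows with no amount)."""
    for row in rows:
        if not row.get("amount"):
            continue                                   # cleansing rule
        yield {
            "region": row["region"].strip().upper(),   # common format
            "month": row["month"],
            "amount": float(row["amount"]),
        }

def load(rows, db_path="analysis.db"):
    """Load: propagate the cleansed data into a relational table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, month TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (region, month, amount) VALUES (:region, :month, :amount)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # A tiny stand-in for the exported source file; the empty amount in the
    # second row exercises the cleansing rule.
    with open("sales_source.csv", "w", newline="") as f:
        f.write("region,month,amount\nwest,2018-01,1200\neast,2018-01,\n")
    load(transform(extract("sales_source.csv")))

The same three stages appear in ITDA's own ETL path, with the metadata produced alongside the loaded data as described below.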
This paper focuses on this important aspect of any decision-making tool, i.e. modelling data that may reside at varied locations and in heterogeneous formats into an analysis-ready form. The organization of the paper is as follows. Section II discusses the related work in this area. Section III gives a brief introduction to the architecture of the ITDA project along with its basic characteristics. Section IV discusses the conceptual design of the ETL process for ITDA. Section V discusses the implementation of the process through a case study where the data is available in transformed format. Finally, we summarize the contents and discuss the future scope of the system.
II. RELATED WORK
A multidimensional data analysis system that enhances the efficiency and accuracy of decision support is a growing need today. Many big technology players, such as IBM and Microsoft, offer a good range of solutions for this purpose, and every solution has its own pros and cons. As discussed in [1], most multidimensional analysis tools have a steep learning curve. Many tools are domain specific, and the tools that offer a good range of analytical options generally provide a different component for each facility, which demotivates non-expert data analysts.
MicroStrategy is a leading name in the data analysis market. MicroStrategy provides a component called Integrity Manager, which takes care of the ETL process and replaces the traditional manual process of data integration. ETL is handled as a separate component in this tool, and a number of supporting ETL components are available in MicroStrategy, such as Enterprise Manager ETL, ETL Server and ETL Support. This can, however, become a complicated and costly affair for a research community that focuses more on analytics and less on technology [6].
IBM Cognos is another very powerful tool available in the market for multidimensional data analysis. IBM Cognos has a different component for each feature: Cognos for analytics, for business intelligence, for predictive analysis, and so on. Cognos Analytics has a separate data modelling component, which provides the interface for data extraction from various sources, for transformations and for data validation [3].
The ETL process of ITDA is an integral part of the system, which spares the user additional installation and usage overhead.
III. ITDA SYSTEM ARCHITECTURE
The ITDA system is designed to provide researchers and data analysts with a complete package of multidimensional reporting, statistical processing, data mining, machine learning and visualization. This is achieved through a web-based system offering a user-friendly and secure environment for the data analyst. The system is functionally independent; it does not require any additional external component or system to complete its tasks. The components of the system are integrated, and there is no need to install any of them separately, which is often required by other analytics tools.
The ITDA system architecture is mainly divided into two parts: a data modeling part and a data analysis part. The first part covers data absorption from different data sources, collection of metadata and formation of the multidimensional model; the second part covers multidimensional analysis on the modeled data, which further extends to statistical analysis and data mining.
The data modeling functionality mainly consists of the extraction, transformation and loading (ETL) process. Source data is given to the ETL process, which produces ready-to-analyze data. The ETL process is responsible for extracting data that resides on various sources and in a variety of formats. It also performs cleansing and customization of the data according to the analysis needs, and it generates the metadata of the ready-to-analyze data. The proposed system does not precompute and store cube aggregations, hence metadata plays a crucial role in this system: aggregations can be generated on the fly using the metadata.
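The role of the metadata in this on-the-fly approach can be sketched as follows: instead of reading a stored cube, an aggregation query is generated at request time from the model description. The metadata structure and names used below are assumptions for illustration only, not the actual ITDA format.

from dataclasses import dataclass

@dataclass
class EnvironmentMetadata:
    """Illustrative metadata for one environment (multidimensional model)."""
    fact_table: str
    measures: dict     # measure column -> SQL aggregate, e.g. {"amount": "SUM"}
    dimensions: dict   # dimension name -> list of levels, coarse to fine

def rollup_query(meta: EnvironmentMetadata, dimension: str, level: str) -> str:
    """Generate an aggregation query on the fly instead of reading a stored cube."""
    if level not in meta.dimensions[dimension]:
        raise ValueError(f"unknown level {level!r} for dimension {dimension!r}")
    measures = ", ".join(f"{agg}({col}) AS {col}_{agg.lower()}"
                         for col, agg in meta.measures.items())
    return f"SELECT {level}, {measures} FROM {meta.fact_table} GROUP BY {level}"

# Example with assumed names.
meta = EnvironmentMetadata(
    fact_table="sales",
    measures={"amount": "SUM"},
    dimensions={"time": ["year", "quarter", "month"], "geo": ["country", "city"]},
)
print(rollup_query(meta, "time", "quarter"))
# SELECT quarter, SUM(amount) AS amount_sum FROM sales GROUP BY quarter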
A. ITDA Characteristics
Customized modeling of the data
Multidimensional modelling of the data according to the business needs is the key to any efficient decision-making system. ITDA is a multiuser system: each user can model the data in their own way according to the business need. In ITDA terminology, the information describing a model is conceptualized as an 'environment'. A single user can have multiple environments for the same data, so that the user gets various views of the data for analysis without the complexity of handling a separate user for each business need.
Data absorption options
The ITDA system can accommodate pre-processed data present in flat files, where no transformation is required; in such cases it directly loads the data into the server and collects metadata for that environment. If the data is spread across multiple sites, the system performs ETL processing during environment creation.
Flexibility in data selection
A data analyst can restrict the analysis to a particular portion of the data by using the horizontal partitioning facility provided in the system. It allows the user to analyze a particular slice of the dataset and increases performance by reducing the number of rows used while running analytical queries or algorithms. The user can also extract a particular portion of the uploaded data using the row filter utility, which allows the user to build a row filter query without prior knowledge of SQL, as sketched below. Both facilities are integrated with the system and can be used after the environment has been created.
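As a sketch of how such a row filter can shield the analyst from SQL, a small builder may translate structured (column, operator, value) conditions chosen in the interface into a parameterised query; the supported operators and the '?' placeholder style are assumptions.

def build_row_filter(table, conditions):
    """Build a parameterised row-filter query from (column, operator, value)
    triples chosen through a form, so the analyst never writes SQL directly."""
    allowed = {"=", "!=", "<", "<=", ">", ">=", "LIKE"}
    clauses, params = [], []
    for column, op, value in conditions:
        if op not in allowed:
            raise ValueError(f"operator {op!r} is not supported")
        clauses.append(f"{column} {op} ?")   # '?' placeholder as in e.g. sqlite3
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params

# The analyst picks columns, operators and values in the interface.
sql, params = build_row_filter("sales", [("region", "=", "WEST"), ("amount", ">", 1000)])
# sql    -> "SELECT * FROM sales WHERE region = ? AND amount > ?"
# params -> ["WEST", 1000]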
IV. ITDA ETL PROCESS: CONCEPTUAL DESIGN
The ETL process starts with an understanding of the business requirements and objectives of the organization, followed by modelling and design of the environment for that organization. Modelling and design are defined as the representation of key business measurements around their dimensions using dimensional modelling. This process decides the level of complexity of the transformation based on the source of the data. If data is present at multiple sites, ITDA provides a technique that takes care of extraction of the data from the multiple sources, transformation and loading.
The last stage of the conceptual design is metadata generation. Metadata contents need to be formulated for a specific multidimensional model. The process records the relationships described by the dimensions, such as hierarchical or sequential relationships, and the level of relationship that exists in each component of the dimensional structure. At the end of the process, ITDA produces a flat file containing the complete metadata for the multidimensional model created by the user. It also stores the information about the temporal component needed to create run-time summaries.
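The exact layout of the metadata flat file is not specified here, so the sketch below shows one plausible way to record the fact table, measures, dimension hierarchies and the temporal component; all key names are assumptions for illustration.

import json

def write_model_metadata(path, model):
    """Persist the multidimensional model description as a flat file."""
    with open(path, "w") as f:
        json.dump(model, f, indent=2)

# Hypothetical model for a sales environment.
model = {
    "environment": "sales_2018",
    "fact_table": "sales",
    "measures": ["amount", "quantity"],
    "dimensions": {
        "time": {"levels": ["year", "quarter", "month"], "temporal": True},
        "geo": {"levels": ["country", "state", "city"], "temporal": False},
    },
}
write_model_metadata("sales_2018_metadata.json", model)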
A. ETL Algorithm
During the implementation of the ETL process in ITDA, every completed or missed step is recorded and made available to the user. The steps are:
1) Finalize the ETL processing path
2) Finalize the type of data source
3) For each data source, map the data source attributes to the dimensional attributes
4) Preparation of metadata
5) Preparation of configuration file for further processing of model
One of the basic motives behind the ITDA project is to provide a multidimensional analysis platform for the non-expert data analyst community alongside expert data scientists. The project therefore focuses on an interactive, user-friendly implementation of the ETL process.
The ETL processing path depends on whether the data sources are at the same site or at different sites. If the data sources are at different locations, the user needs to create a configuration file, and data is absorbed based on the instructions given in that file. If the data source is at a single location, the next step is to decide the type of data source, such as a flat file or a database. The data source attributes are then mapped to dimension and fact values, and the metadata is generated.
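A minimal driver for this decision flow, with the five steps of the algorithm logged so that every completed or skipped step can be shown to the user, might look like the following sketch; the function arguments and the log format are assumptions.

def run_etl(sources, log):
    """Illustrative driver for the five-step path; every completed step is
    appended to `log` so it can be made available to the user."""
    multi_site = len({s["site"] for s in sources}) > 1
    log.append(f"1. processing path: {'steps upload' if multi_site else 'simple upload'}")
    for source in sources:
        log.append(f"2. source type: {source['type']}")      # flat file or database
        mapping = dict(zip(source["columns"], source["model_attributes"]))
        log.append(f"3. attribute mapping: {mapping}")        # source -> dimension/fact
    log.append("4. metadata prepared")
    if multi_site:
        log.append("5. configuration file prepared for further processing")

log = []
run_etl([{"site": "hq", "type": "csv",
          "columns": ["reg", "amt"], "model_attributes": ["region", "amount"]}], log)
print("\n".join(log))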
V. USE OF ETL SERVICE: CASE STUDY FOR FLAT FILE AND DATABASE
ITDA implements the ETL process through a highly interactive and user-friendly interface that covers the complete ETL process without any programming. Fig. 1 shows the main interface of the ITDA system, which allows the user to initiate the creation of a new environment in the system.
Fig. 1 ITDA user interface – option to create new environment
Fig. 2 ITDA interface – selection of ETL processing path
Fig. 2 shows the interface that offers the two different paths for the ETL process. If the data is already available in transformed form according to the business rules, the user chooses the 'Simple Upload' option. If the data needs to be extracted from various sources and pre-processed according to the business rules, the 'Steps Upload' option is the choice.
A. Simple upload
This module assumes that the data is already in the required form and no transformation step is needed. For single-source data, the data can be in flat files or in a database server.
B. Flat files
Spreadsheets or text file formats are generally used to export data from a database server. If the source machine is not accessible from a remote location, the user can export the data to flat files and use those files to create a new environment (multidimensional model) in this web tool. ITDA accepts data in standard comma-separated files or any other flat file with any type of separator. The user can preview sample data, and a standard query is generated by the system so that the user can drop unwanted columns; a sketch of this step follows below. Successful creation of the table enables the metadata collection interface. Figure 3 shows the user interface for uploading CSV files to the server. Figure 4 shows the interface with the sample data from the selected file and the standard query generated by the system to extract the file; the analyst can customize the query further.
Fig. 3 ITDA interface – selecting file as the data source
Fig. 4 ITDA interface – sample data and editable extraction query
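A sketch of how the standard extraction query could be derived from an uploaded flat file is given below: the column list comes from the file header, the separator is configurable, and unwanted columns can be dropped before the staging table is created. The function and file names are illustrative, not ITDA's actual implementation.

import csv

def standard_extraction_query(csv_path, table, separator=",", drop=()):
    """Read the header of the uploaded flat file and derive a SELECT over
    the columns the analyst decides to keep."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f, delimiter=separator))
    kept = [col for col in header if col not in set(drop)]
    return f"SELECT {', '.join(kept)} FROM {table}"

# A tiny sample file stands in for the analyst's upload.
with open("sales.csv", "w", newline="") as f:
    f.write("internal_id,region,month,amount\n1,WEST,2018-01,1200\n")

print(standard_extraction_query("sales.csv", "sales_staging", drop={"internal_id"}))
# SELECT region, month, amount FROM sales_staging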
C. Database
When the data source is a database, connection details can be provided so that the system can access the data. In this module, if the source connection and the destination connection are the same, the data migration process is skipped, which avoids the extra overhead of unnecessarily copying the entire table. Because the architecture works on the fly, the source table itself can be used for analysis: the existing data is read from the OLTP server, and the same server can serve both OLTP and OLAP processing. This is the biggest advantage of the on-the-fly architecture. Figure 5 shows the database option available in the simple upload module.
Fig. 5 ITDA interface – options to map source database for data extraction
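The skip-migration behaviour described above can be sketched as follows; the connection descriptor fields are assumptions used only to show the decision.

def register_database_source(source, destination):
    """Decide whether the table must be copied or can be analysed in place,
    mirroring the on-the-fly idea of reading the OLTP data directly."""
    same_server = (source["host"], source["port"], source["database"]) == \
                  (destination["host"], destination["port"], destination["database"])
    if same_server:
        return f"use source table {source['table']} in place (no migration)"
    return f"copy {source['table']} from {source['host']} into the analysis server"

src = {"host": "oltp.example.org", "port": 5432, "database": "erp", "table": "sales"}
dst = {"host": "oltp.example.org", "port": 5432, "database": "erp"}
print(register_database_source(src, dst))   # -> use source table sales in place ...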
D. Steps Upload
If data has to be taken from multiple sources, the extraction and transformation logic must reside on the server side. In simple upload the data is already pre-processed, so it is easy to load it into the server and collect metadata. To support collection of data from multiple sites, the steps upload module takes care of extraction, transformation and loading of the data at the server. In this mode, a configuration file containing all details and transformation scripts needs to be uploaded to the server; its parameters are obtained at the time of conceptual design. The configuration file is in simple text format so that any database user can build it, which keeps the process of environment creation as easy as possible. Figure 6 shows the steps upload choice, and a sketch of a possible configuration format follows the figure.
Fig. 6 ITDA interface – option when data needs transformation steps
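Since the configuration file is described only as a simple text file carrying source details and transformation scripts, the format below is purely illustrative; the sketch parses hypothetical 'key = value' lines into a nested structure before the steps upload runs.

def parse_upload_config(text):
    """Parse simple 'key = value' lines into a nested dictionary."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value.strip()
    return config

# Hypothetical configuration for a two-source steps upload.
sample = """
source.1.type = csv
source.1.path = /data/branch_a/sales.csv
source.2.type = jdbc
source.2.url = jdbc:postgresql://branch-b/erp
transform.script = clean_sales.sql
"""
print(parse_upload_config(sample))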
E. Metadata Collection
To support a multiuser system, the context of every user has to be maintained separately. The ITDA ETL process uses a specific directory structure for maintaining all the environments created by a user. For every environment there is one flat file storing the customized operations built by the user for performing OLAP; this file is retained separately for each environment to avoid clashes. To hold all the information necessary to operate the models created by a user, a separate directory structure is provided to every user, and this complete directory structure is created when the user registers with the system for the first time.
To supply the metadata, the user fills in a simple HTML form with the dimension names, their hierarchies and the time dimension details. Once this data is entered, the system can proceed with environment creation.
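A minimal sketch of such a per-user, per-environment workspace is shown below; the directory and file names are assumptions.

from pathlib import Path

def create_user_workspace(root, user):
    """Create the per-user directory tree the first time the user registers."""
    base = Path(root) / user
    base.mkdir(parents=True, exist_ok=True)
    return base

def create_environment(base, env_name):
    """Each environment gets its own directory and its own OLAP-operations
    file, so customisations of different environments never clash."""
    env_dir = base / env_name
    env_dir.mkdir(exist_ok=True)
    (env_dir / "olap_operations.txt").touch()
    return env_dir

workspace = create_user_workspace("itda_users", "analyst01")
create_environment(workspace, "sales_2018")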
F. ETL for periodic updates
For any ETL system, updating the data in the server is a crucial part, since OLTP servers generate new data continuously. To analyse the updated data, the system either replaces the data available in the server or appends the new data while keeping the earlier data as it is. The important point is that the environment metadata does not change, so the metadata collection process can be skipped and the system can directly update the data in the required environment.
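The refresh path can be sketched as follows: the environment metadata is left untouched and the fact data is either appended or flushed and reloaded (the behaviour shown in Figure 8). The table layout and connection details are assumptions carried over from the earlier ETL sketch.

import sqlite3

def refresh_environment(db_path, table, new_rows, append=False):
    """Update an environment's data without touching its metadata:
    either append the new rows or flush the table and reload it."""
    con = sqlite3.connect(db_path)
    # Created here only so the sketch runs standalone (layout from the earlier ETL sketch).
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} (region TEXT, month TEXT, amount REAL)")
    if not append:
        con.execute(f"DELETE FROM {table}")            # flush the older data
    con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", new_rows)
    con.commit()
    con.close()

# Reload the sales table from a fresh extract while keeping the model metadata.
refresh_environment("analysis.db", "sales", [("WEST", "2018-02", 1350.0)])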
Fig. 7 ITDA interface – options for edit environment
Each user can create any number of environments based on the analysis needs, and the update process is invoked for each environment separately. Figure 7 shows the environment selection interface and the various operations the user can perform after selecting an environment.
This module loads new data into the same environment. Figure 8 shows the result after uploading a new dataset file to the server: the module flushes the older data from the table and inserts the new data.
Fig. 8 ITDA interface – edit environment option
CONCLUSION AND FUTURE WORK
The design of the ETL process in ITDA addresses the requirements of efficient extraction, transformation and loading of data from various sources. It meets the challenges of assimilating data from heterogeneous data sources and provides an easy-to-use tool for uploading an existing data set. It successfully collects all the metadata parameters required for multidimensional analysis. The designed ETL model can be extended with automatic multidimensional modelling, where metadata is extracted automatically at load time. It can also be extended with context-based data collection, which gathers and models data from the web; this data can in turn be fed into multidimensional analysis.
REFERENCES
[1] Prarthana A. Deshkar, Parag S. Deshpande, A. Thomas, “Multidimensional Data Analysis Facilities and Challenges: A Survey for Data Analysis Tools”, International Journal of Computer Applications (0975 – 8887), Volume 179, No. 13, January 2018
[2] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, Neoklis Polyzotis “SEEDB: Efficient Data-
Driven Visualization Recommendations to Support Visual Analytics”, Proceedings of the VLDB Endowment, Vol. 8, No.
13 Copyright 2015 VLDB Endowment 2150-8097/15/09.
[3] Data Modeling Guide, IBM Cognos Analytics Version 11.0.0, Copyright IBM Corporation 2015, 2017.
[4] Sandro Fiore, Alessandro D’Anca, Donatello Elia, Cosimo Palazzo, Ian Foster, Dean Williams, Giovanni Aloisio,
“Ophidia: a full software stack for scientific data analytics”, 978-1-4799-5313-4/14/$31.00 ©2014 IEEE
[5] S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: toward big data analytics for eScience”, 2013 International Conference on Computational Science, doi: 10.1016/j.procs.2013.05.409, 2013
[6] Architecture for Enterprise Business Intelligence: An Overview of the MicroStrategy Platform Architecture for Big Data, Cloud BI, and Mobile Applications
[7] Usman Ahmed, “Dynamic Cubing for Hierarchical Multidimensional Data Space”, PhD thesis, February 2013
[8] Muntazir Mehdi, Ratnesh Sahay, Wassim Derguech, Edward Curry, “On-The-Fly Generation of Multidimensional Data
Cubes for Web of Things”, IDEAS ’13 October 09 - 11 2013, Barcelona, Spain
[9] Yang Zhang, Simon Fong, Jinan Fiaidhi, Sabah Mohammed, “Real-Time Clinical Decision Support System with Data Stream Mining”, Hindawi Publishing Corporation, Journal of Biomedicine and Biotechnology, Volume 2012
[10] Sandra Geisler, Christoph Quix, Stefan Schiffer, Matthias Jarke, “An evaluation framework for traffic information systems
based on data streams”, 2011 Elsevier Ltd. All rights reserved.
[11] IBM Cognos Dynamic Cubes, October 2012
[12] Marta Zorrilla, Diego García-Saiz, “A service oriented architecture to provide data mining services for non-expert data
miners”,