A Software Infrastructure for Multidimensional Data Analysis: A Data Modelling Aspect
Prarthana A. Deshkar
Assistant Professor
Yeshwantrao Chavan College of Engineering
Nagpur, India
prarthana.deshkar@gmail.com
Dr. Parag S. Deshpande
Supervisor, CSE dept, G. H. Raisoni College of Engineering, Nagpur, India
psdeshpande@cse.vnit.ac.in
Prof. A. Thomas
HoD, CSE Department
G. H. Raisoni College of Engineering, Nagpur, India
Abstract - Rapid changes in technology have led to an increased variety of data sources. These varied data sources generate data in large volumes and at extremely high speed. Accommodating and using this data in decision-making systems is a major challenge. To make the fullest use of the valuable data generated by different systems, the set of target users of analysis systems needs to grow. In general, the knowledge discovery process with the tools currently available requires considerable expertise in the domain as well as in the technology. The ITDA (Integrated Tool for Data Analysis) project aims to provide a complete platform for multidimensional data analysis that enhances decision making in every domain. The project provides all the techniques required to perform multidimensional data analysis while avoiding the overheads incurred by the traditional cube architecture followed by most analytics systems. Modelling the available data in multidimensional form is the basic and crucial step for multidimensional analysis. This work describes the multidimensional modelling aspect and its implementation in the ITDA project.
Keywords - Multidimensional data analysis, cube, data mining, machine learning, ETL, multidimensional modelling,
OLAP.
I. INTRODUCTION
Due to the increased frequency of data generation, the amount of data under consideration for analysis keeps growing tremendously. The large size of the data and the complexity of data analysis demand an easy platform so that researchers and domain experts can analyse their data without deep knowledge of information technology. Ad hoc querying and ad hoc reporting are the main needs of data analysis, and to serve a variety of domains a system must first model the data appropriately. Multidimensional data modeling is the way to provide the facility for such ad hoc analysis. Analysing multidimensional data is of growing importance for extracting knowledge and hence for enabling decision making in various domains. The data analysis process that leads to enhanced decision making combines various techniques, such as statistical methods, data mining algorithms and machine learning. Alongside these techniques, presentation of the analysis output with attractive visuals is a key part of popular analytics systems. Most current multidimensional systems rely on data cubes, which are very resource- and time-intensive. In this context, the ITDA architecture provides multidimensional analysis with reduced memory and time overheads compared to existing systems.
Absorption of a high volume of data from a variety of sources requires a robust and flexible system. In OLAP terminology, the data modelling and data absorption system is called the Extraction-Transformation-Loading
(ETL) process. The most important by-product of the ETL process is the metadata. The ITDA system uses an on-the-fly architecture for query generation, and hence the metadata of the multidimensional model is a crucial component of the system. In a typical analysis environment, ETL processes are performed in an ad hoc, in-house fashion or by using specialized ETL tools. The general functionality of all these tools covers identification of the relevant information present at the source, extraction of this information, customization and integration of the information coming from multiple sources into a common format, cleansing of the final data set on the basis of database and business rules, and propagation of the data to the relational database that will be used for analysis.
In the current scenario, organizations may have a number of sources contributing to data collection, all of which play an important role in the modelling process. The source data might reside at different places, and all data relevant and necessary for the analysis has to be extracted. After applying the transformations dictated by the business rules, the data is transferred into the target model.
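As a concrete illustration of these general ETL functions, the following minimal Python sketch extracts rows from a flat file, integrates them into a common format, cleanses them against a simple rule and loads them into a relational table; the file, table and column names and the cleansing rule are illustrative assumptions, not part of ITDA.

import csv
import sqlite3

def extract(csv_path):
    """Extract: read raw rows from the source flat file (assumed CSV)."""
    with open(csv_path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: integrate rows into a common format and cleanse them
    against a simple business rule (here: reject rows with no amount)."""
    for row in rows:
        if not row.get("amount"):
            continue                                   # cleansing rule
        yield {
            "region": row["region"].strip().upper(),   # common format
            "month": row["month"],
            "amount": float(row["amount"]),
        }

def load(rows, db_path="analysis.db"):
    """Load: propagate the cleansed data into a relational table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, month TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (region, month, amount) VALUES (:region, :month, :amount)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # A tiny stand-in for the exported source file; the empty amount in the
    # second row exercises the cleansing rule.
    with open("sales_source.csv", "w", newline="") as f:
        f.write("region,month,amount\nwest,2018-01,1200\neast,2018-01,\n")
    load(transform(extract("sales_source.csv")))

The same three stages appear in ITDA's own ETL path, with the metadata produced alongside the loaded data as described below.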
This paper focuses on this important aspect of any decision-making tool, i.e. modelling data that may reside at varied locations and in heterogeneous formats into an analysis-ready form. The organization of the paper is as follows. Section II discusses the related work in this area. Section III gives a brief introduction to the architecture of the ITDA project along with its basic characteristics. Section IV discusses the conceptual design of the ETL process for ITDA. Section V discusses the implementation of the process through a case study where the data is available in transformed format. Finally, we summarize the contents and discuss the future scope of the system.
II. RELATED WORK
A multidimensional data analysis system that enhances the efficiency and accuracy of decision support is a growing need today. Many big technology players, such as IBM and Microsoft, offer a good range of solutions for this purpose, and every solution has its own pros and cons. As discussed in [1], most multidimensional analysis tools have a steep learning curve. Many tools are domain specific, and the tools that offer a good range of analytical options generally provide a different component for each facility, which demotivates non-expert data analysts.
MicroStrategy is a leading name in the data analysis market. MicroStrategy provides a component called Integrity Manager, which takes care of the ETL process and replaces the traditional manual process of data integration. ETL is handled as a separate component in this tool, and a number of supporting ETL components are available in MicroStrategy, such as Enterprise Manager ETL, ETL Server and ETL Support. This can, however, become a complicated and costly affair for a research community that focuses more on analytics and less on technology [6].
IBM Cognos is another very powerful tool available in the market for multidimensional data analysis. IBM Cognos has a different component for each feature: Cognos for analytics, for business intelligence, for predictive analysis, and so on. Cognos Analytics has a separate data modelling component, which provides the interface for data extraction from various sources, for transformations and for data validation [3].
The ETL process of ITDA is an integral part of the system, which spares the user additional installation and usage overhead.
III. ITDA SYSTEM ARCHITECTURE
The ITDA system is designed to provide researchers and data analysts with a complete package of multidimensional reporting, statistical processing, data mining, machine learning and visualization. This is achieved through a web-based system offering a user-friendly and secure environment for the data analyst. The system is functionally independent; it does not require any additional external component or system to complete its tasks. The components of the system are integrated, and there is no need to install any of them separately, which is often required by other analytics tools.
The ITDA system architecture is mainly divided into two parts: a data modeling part and a data analysis part. The first part covers data absorption from different data sources, collection of metadata and formation of the multidimensional model; the second part covers multidimensional analysis on the modeled data, which further extends to statistical analysis and data mining.
The data modeling functionality mainly consists of the extraction, transformation and loading (ETL) process. Source data is given to the ETL process, which produces ready-to-analyze data. The ETL process is responsible for extracting data that resides on various sources and in a variety of formats. It also performs cleansing and customization of the data according to the analysis needs, and it generates the metadata of the ready-to-analyze data. The proposed system does not precompute and store cube aggregations, hence metadata plays a crucial role in this system: aggregations can be generated on the fly using the metadata.
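The role of the metadata in this on-the-fly approach can be sketched as follows: instead of reading a stored cube, an aggregation query is generated at request time from the model description. The metadata structure and names used below are assumptions for illustration only, not the actual ITDA format.

from dataclasses import dataclass

@dataclass
class EnvironmentMetadata:
    """Illustrative metadata for one environment (multidimensional model)."""
    fact_table: str
    measures: dict     # measure column -> SQL aggregate, e.g. {"amount": "SUM"}
    dimensions: dict   # dimension name -> list of levels, coarse to fine

def rollup_query(meta: EnvironmentMetadata, dimension: str, level: str) -> str:
    """Generate an aggregation query on the fly instead of reading a stored cube."""
    if level not in meta.dimensions[dimension]:
        raise ValueError(f"unknown level {level!r} for dimension {dimension!r}")
    measures = ", ".join(f"{agg}({col}) AS {col}_{agg.lower()}"
                         for col, agg in meta.measures.items())
    return f"SELECT {level}, {measures} FROM {meta.fact_table} GROUP BY {level}"

# Example with assumed names.
meta = EnvironmentMetadata(
    fact_table="sales",
    measures={"amount": "SUM"},
    dimensions={"time": ["year", "quarter", "month"], "geo": ["country", "city"]},
)
print(rollup_query(meta, "time", "quarter"))
# SELECT quarter, SUM(amount) AS amount_sum FROM sales GROUP BY quarter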
A. ITDA Characteristics
Customized modeling of the data
Multidimensional modelling of the data according to the business needs is the key to any efficient decision-making system. ITDA is a multiuser system: each user can model the data in their own way according to the business need. In ITDA terminology, the information describing a model is conceptualized as an 'environment'. A single user can have multiple environments for the same data, so that the user gets various views of the data for analysis without the complexity of handling a separate user for each business need.
Data absorption options
The ITDA system can accommodate pre-processed data present in flat files, where no transformation is required; in such cases it directly loads the data into the server and collects metadata for that environment. If the data is spread across multiple sites, the system performs ETL processing during environment creation.
Flexibility in data selection
A data analyst can restrict the analysis to a particular portion of the data by using the horizontal partitioning facility provided in the system. It allows the user to analyze a particular slice of the dataset and increases performance by reducing the number of rows used while running analytical queries or algorithms. The user can also extract a particular portion of the uploaded data using the row filter utility, which allows the user to build a row filter query without prior knowledge of SQL, as sketched below. Both facilities are integrated with the system and can be used after the environment has been created.
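As a sketch of how such a row filter can shield the analyst from SQL, a small builder may translate structured (column, operator, value) conditions chosen in the interface into a parameterised query; the supported operators and the '?' placeholder style are assumptions.

def build_row_filter(table, conditions):
    """Build a parameterised row-filter query from (column, operator, value)
    triples chosen through a form, so the analyst never writes SQL directly."""
    allowed = {"=", "!=", "<", "<=", ">", ">=", "LIKE"}
    clauses, params = [], []
    for column, op, value in conditions:
        if op not in allowed:
            raise ValueError(f"operator {op!r} is not supported")
        clauses.append(f"{column} {op} ?")   # '?' placeholder as in e.g. sqlite3
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params

# The analyst picks columns, operators and values in the interface.
sql, params = build_row_filter("sales", [("region", "=", "WEST"), ("amount", ">", 1000)])
# sql    -> "SELECT * FROM sales WHERE region = ? AND amount > ?"
# params -> ["WEST", 1000]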
IV. ITDA ETL PROCESS: CONCEPTUAL DESIGN
The ETL process starts with an understanding of the business requirements and objectives of the organization, followed by modelling and design of the environment for that organization. Modelling and design are defined as the representation of key business measurements around their dimensions using dimensional modelling. This process decides the level of complexity of the transformation based on the source of the data. If data is present at multiple sites, ITDA provides a technique that takes care of extraction of the data from the multiple sources, transformation and loading.
The last stage of the conceptual design is metadata generation. Metadata contents need to be formulated for a specific multidimensional model. The process records the relationships described by the dimensions, such as hierarchical or sequential relationships, and the level of relationship that exists in each component of the dimensional structure. At the end of the process, ITDA produces a flat file containing the complete metadata for the multidimensional model created by the user. It also stores the information about the temporal component needed to create run-time summaries.
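The exact layout of the metadata flat file is not specified here, so the sketch below shows one plausible way to record the fact table, measures, dimension hierarchies and the temporal component; all key names are assumptions for illustration.

import json

def write_model_metadata(path, model):
    """Persist the multidimensional model description as a flat file."""
    with open(path, "w") as f:
        json.dump(model, f, indent=2)

# Hypothetical model for a sales environment.
model = {
    "environment": "sales_2018",
    "fact_table": "sales",
    "measures": ["amount", "quantity"],
    "dimensions": {
        "time": {"levels": ["year", "quarter", "month"], "temporal": True},
        "geo": {"levels": ["country", "state", "city"], "temporal": False},
    },
}
write_model_metadata("sales_2018_metadata.json", model)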
A. ETL Algorithm
During the implementation of the ETL process in ITDA, every completed or missed step is recorded and made available to the user. The steps are:
1) Finalize the ETL processing path
2) Finalize the type of data source
3) For each data source, map the data source attributes to the dimensional attributes
4) Preparation of metadata
5) Preparation of configuration file for further processing of model
One of the basic motives behind the ITDA project is to provide a multidimensional analysis platform for the non-expert data analyst community alongside expert data scientists. The project therefore focuses on an interactive, user-friendly implementation of the ETL process.
The ETL processing path depends on whether the data sources are at the same site or at different sites. If the data sources are at different locations, the user needs to create a configuration file, and data is absorbed based on the instructions given in that file. If the data source is at a single location, the next step is to decide the type of data source, such as a flat file or a database. The data source attributes are then mapped to dimension and fact values, and the metadata is generated.
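A minimal driver for this decision flow, with the five steps of the algorithm logged so that every completed or skipped step can be shown to the user, might look like the following sketch; the function arguments and the log format are assumptions.

def run_etl(sources, log):
    """Illustrative driver for the five-step path; every completed step is
    appended to `log` so it can be made available to the user."""
    multi_site = len({s["site"] for s in sources}) > 1
    log.append(f"1. processing path: {'steps upload' if multi_site else 'simple upload'}")
    for source in sources:
        log.append(f"2. source type: {source['type']}")      # flat file or database
        mapping = dict(zip(source["columns"], source["model_attributes"]))
        log.append(f"3. attribute mapping: {mapping}")        # source -> dimension/fact
    log.append("4. metadata prepared")
    if multi_site:
        log.append("5. configuration file prepared for further processing")

log = []
run_etl([{"site": "hq", "type": "csv",
          "columns": ["reg", "amt"], "model_attributes": ["region", "amount"]}], log)
print("\n".join(log))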
V. USE OF ETL SERVICE: CASE STUDY FOR FLAT FILE AND DATABASE
ITDA implements the ETL process through a highly interactive and user-friendly interface that covers the complete ETL process without any programming. Fig. 1 shows the main interface of the ITDA system, which allows the user to initiate the creation of a new environment in the system.
Fig. 1 ITDA user interface – option to create new environment
Fig. 2 ITDA interface – selection of ETL processing path
Fig. 2 shows the interface that offers the two different paths for the ETL process. If the data is already available in transformed form according to the business rules, the user chooses the 'Simple Upload' option. If the data needs to be extracted from various sources and pre-processed according to the business rules, the 'Steps Upload' option is the choice.
A. Simple upload
This module assumes that the data is already in the required form and no transformation step is needed. For single-source data, the data can be in flat files or in a database server.
B. Flat files
Spreadsheets or text file formats are generally used to export data from a database server. If the source machine is not accessible from a remote location, the user can export the data to flat files and use those files to create a new environment (multidimensional model) in this web tool. ITDA accepts data in standard comma-separated files or any other flat file with any type of separator. The user can preview sample data, and a standard query is generated by the system so that the user can drop unwanted columns; a sketch of this step follows below. Successful creation of the table enables the metadata collection interface. Figure 3 shows the user interface for uploading CSV files to the server. Figure 4 shows the interface with the sample data from the selected file and the standard query generated by the system to extract the file; the analyst can customize the query further.
Fig. 3 ITDA interface – selecting file as the data source
Fig. 4 ITDA interface – sample data and editable extraction query
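A sketch of how the standard extraction query could be derived from an uploaded flat file is given below: the column list comes from the file header, the separator is configurable, and unwanted columns can be dropped before the staging table is created. The function and file names are illustrative, not ITDA's actual implementation.

import csv

def standard_extraction_query(csv_path, table, separator=",", drop=()):
    """Read the header of the uploaded flat file and derive a SELECT over
    the columns the analyst decides to keep."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f, delimiter=separator))
    kept = [col for col in header if col not in set(drop)]
    return f"SELECT {', '.join(kept)} FROM {table}"

# A tiny sample file stands in for the analyst's upload.
with open("sales.csv", "w", newline="") as f:
    f.write("internal_id,region,month,amount\n1,WEST,2018-01,1200\n")

print(standard_extraction_query("sales.csv", "sales_staging", drop={"internal_id"}))
# SELECT region, month, amount FROM sales_staging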
C. Database
When the data source is a database, connection details can be provided so that the system can access the data. In this module, if the source connection and the destination connection are the same, the data migration process is skipped, which avoids the extra overhead of unnecessarily copying the entire table. Because the architecture works on the fly, the source table itself can be used for analysis: the existing data is read from the OLTP server, and the same server can serve both OLTP and OLAP processing. This is the biggest advantage of the on-the-fly architecture. Figure 5 shows the database option available in the simple upload module.
Fig. 5 ITDA interface – options to map source database for data extraction
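The skip-migration behaviour described above can be sketched as follows; the connection descriptor fields are assumptions used only to show the decision.

def register_database_source(source, destination):
    """Decide whether the table must be copied or can be analysed in place,
    mirroring the on-the-fly idea of reading the OLTP data directly."""
    same_server = (source["host"], source["port"], source["database"]) == \
                  (destination["host"], destination["port"], destination["database"])
    if same_server:
        return f"use source table {source['table']} in place (no migration)"
    return f"copy {source['table']} from {source['host']} into the analysis server"

src = {"host": "oltp.example.org", "port": 5432, "database": "erp", "table": "sales"}
dst = {"host": "oltp.example.org", "port": 5432, "database": "erp"}
print(register_database_source(src, dst))   # -> use source table sales in place ...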
D. Steps Upload
If data has to be taken from multiple sources, the extraction and transformation logic must reside on the server side. In simple upload the data is already pre-processed, so it is easy to load it into the server and collect metadata. To support collection of data from multiple sites, the steps upload module takes care of extraction, transformation and loading of the data at the server. In this mode, a configuration file containing all details and transformation scripts needs to be uploaded to the server; its parameters are obtained at the time of conceptual design. The configuration file is in simple text format so that any database user can build it, which keeps the process of environment creation as easy as possible. Figure 6 shows the steps upload choice, and a sketch of a possible configuration format follows the figure.
Fig. 6 ITDA interface – option when data needs transformation steps
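Since the configuration file is described only as a simple text file carrying source details and transformation scripts, the format below is purely illustrative; the sketch parses hypothetical 'key = value' lines into a nested structure before the steps upload runs.

def parse_upload_config(text):
    """Parse simple 'key = value' lines into a nested dictionary."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value.strip()
    return config

# Hypothetical configuration for a two-source steps upload.
sample = """
source.1.type = csv
source.1.path = /data/branch_a/sales.csv
source.2.type = jdbc
source.2.url = jdbc:postgresql://branch-b/erp
transform.script = clean_sales.sql
"""
print(parse_upload_config(sample))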
E. Metadata Collection
To support a multiuser system, the context of every user has to be maintained separately. The ITDA ETL process uses a specific directory structure for maintaining all the environments created by a user. For every environment there is one flat file storing the customized operations built by the user for performing OLAP; this file is retained separately for each environment to avoid clashes. To hold all the information necessary to operate the models created by a user, a separate directory structure is provided to every user, and this complete directory structure is created when the user registers with the system for the first time.
To supply the metadata, the user fills in a simple HTML form with the dimension names, their hierarchies and the time dimension details. Once this data is entered, the system can proceed with environment creation.
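A minimal sketch of such a per-user, per-environment workspace is shown below; the directory and file names are assumptions.

from pathlib import Path

def create_user_workspace(root, user):
    """Create the per-user directory tree the first time the user registers."""
    base = Path(root) / user
    base.mkdir(parents=True, exist_ok=True)
    return base

def create_environment(base, env_name):
    """Each environment gets its own directory and its own OLAP-operations
    file, so customisations of different environments never clash."""
    env_dir = base / env_name
    env_dir.mkdir(exist_ok=True)
    (env_dir / "olap_operations.txt").touch()
    return env_dir

workspace = create_user_workspace("itda_users", "analyst01")
create_environment(workspace, "sales_2018")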
F. ETL for periodic updates
For any ETL system, updating the data in the server is a crucial part, since OLTP servers generate new data continuously. To analyse the updated data, the system either replaces the data available in the server or appends the new data while keeping the earlier data as it is. The important point is that the environment metadata does not change, so the metadata collection process can be skipped and the system can directly update the data in the required environment.
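The refresh path can be sketched as follows: the environment metadata is left untouched and the fact data is either appended or flushed and reloaded (the behaviour shown in Figure 8). The table layout and connection details are assumptions carried over from the earlier ETL sketch.

import sqlite3

def refresh_environment(db_path, table, new_rows, append=False):
    """Update an environment's data without touching its metadata:
    either append the new rows or flush the table and reload it."""
    con = sqlite3.connect(db_path)
    # Created here only so the sketch runs standalone (layout from the earlier ETL sketch).
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} (region TEXT, month TEXT, amount REAL)")
    if not append:
        con.execute(f"DELETE FROM {table}")            # flush the older data
    con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", new_rows)
    con.commit()
    con.close()

# Reload the sales table from a fresh extract while keeping the model metadata.
refresh_environment("analysis.db", "sales", [("WEST", "2018-02", 1350.0)])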
Fig. 7 ITDA interface – options for edit environment
Each user can create any number of environments based on the analysis needs, and the update process is invoked for each environment separately. Figure 7 shows the environment selection interface and the various operations the user can perform after selecting an environment.
This module loads new data into the same environment. Figure 8 shows the result after uploading a new dataset file to the server: the module flushes the older data from the table and inserts the new data.
Fig. 8 ITDA interface – edit environment option
CONCLUSION AND FUTURE WORK
The design of the ETL process in ITDA addresses the requirements of efficient extraction, transformation and loading of data from various sources. It meets the challenges of assimilating data from heterogeneous data sources and provides an easy-to-use tool for uploading an existing data set. It successfully collects all the metadata parameters required for multidimensional analysis. The designed ETL model can be extended with automatic multidimensional modelling, where metadata is extracted automatically at load time. It can also be extended with context-based data collection, which gathers and models data from the web; this data can in turn be fed into multidimensional analysis.
REFERENCES
[1] Prarthana A. Deshkar, Parag S. Deshpande, A. Thomas, “Multidimensional Data Analysis Facilities and Challenges: A Survey for Data Analysis Tools”, International Journal of Computer Applications (0975 – 8887), Volume 179, No. 13, January 2018
[2] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, Neoklis Polyzotis “SEEDB: Efficient Data-
Driven Visualization Recommendations to Support Visual Analytics”, Proceedings of the VLDB Endowment, Vol. 8, No.
13 Copyright 2015 VLDB Endowment 2150-8097/15/09.
[3] Data Modeling Guide, IBM Cognos Analytics Version 11.0.0, Copyright IBM Corporation 2015, 2017.
[4] Sandro Fiore, Alessandro D’Anca, Donatello Elia, Cosimo Palazzo, Ian Foster, Dean Williams, Giovanni Aloisio,
“Ophidia: a full software stack for scientific data analytics”, 978-1-4799-5313-4/14/$31.00 ©2014 IEEE
[5] S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: toward big data analytics for eScience”, 2013 International Conference on Computational Science, doi: 10.1016/j.procs.2013.05.409, 2013
[6] Architecture for Enterprise Business Intelligence: An Overview of the MicroStrategy Platform Architecture for Big Data, Cloud BI, and Mobile Applications
[7] Usman Ahmed, “Dynamic Cubing for Hierarchical Multidimensional Data Space”, PhD thesis, February 2013
[8] Muntazir Mehdi, Ratnesh Sahay, Wassim Derguech, Edward Curry, “On-The-Fly Generation of Multidimensional Data
Cubes for Web of Things”, IDEAS ’13 October 09 - 11 2013, Barcelona, Spain
[9] Yang Zhang, Simon Fong, Jinan Fiaidhi, Sabah Mohammed, “Real-Time Clinical Decision Support System with Data Stream Mining”, Hindawi Publishing Corporation, Journal of Biomedicine and Biotechnology, Volume 2012
[10] Sandra Geisler, Christoph Quix, Stefan Schiffer, Matthias Jarke, “An evaluation framework for traffic information systems
based on data streams”, 2011 Elsevier Ltd. All rights reserved.
[11] IBM Cognos Dynamic Cubes, October 2012
[12] Marta Zorrilla, Diego García-Saiz, “A service oriented architecture to provide data mining services for non-expert data
miners”,