ETL & Basic OLAP Operations
CSE 590 Data Mining, Prof. Anita Wasilewska, SUNY Stony Brook
Presented by: Preeti Kudva (106887833), Kinjal Khandhar (106878039)
REFERENCES:
Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber.
Presentation Slides of Prof. Anita Wasilewska.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for  Extracting, Cleaning, Conforming and Delivering Data
Conceptual modeling for ETL processes by Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos.
http://en.wikipedia.org/wiki/Category:ETL_tools
http://www.1keydata.com/datawarehousing/tooletl.html
http://www.bi-bestpractices.com/view-articles/4738
http://www.computerworld.com/databasetopics/data/story/0,10801,80222,00.html

Overview
- What is ETL?
- ETL in the architecture
- General ETL issues: extract, transformations/cleansing, load
- ETL example
What is ETL?
Extract
- Extract relevant data.
Transform
- Transform data to DW format.
- Build DW keys, etc.
- Cleansing of data.
Load
- Load data into the DW.
- Build aggregates, etc.
https://eprints.kfupm.edu.sa/74341/1/74341.pdf
Data Warehouse Architecture (figure): data sources (operational DBs and other sources) are extracted, transformed, loaded and refreshed into the data storage layer (the data warehouse and data marts, with metadata), which serves front-end tools for analysis, query, reports and data mining.
http://infolab.stanford.edu/warehousing/
Extract
Goal: fast extraction of relevant data.
Extract data from different data source formats such as flat files, relational database systems, etc.
Convert data into a specific format for transformation processing.
Parse extracted data.
Parsing results in a check of whether the data meets the expected pattern/structure.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Types of Data Sources
Non-cooperative sources
- Snapshot sources – provide only a full copy of the source, e.g., files
- Specific sources – each one is different, e.g., legacy systems
- Logged sources – write a change log, e.g., DB log
- Queryable sources – provide a query interface, e.g., RDBMS
Cooperative sources
- Replicated sources – publish/subscribe mechanism
- Call-back sources – call external code (ETL) when changes occur
- Internal action sources – only internal actions when changes occur, e.g., DB triggers
The extract strategy depends on the source types.
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
Extract from the Operational System
Design time
– Create/import data source definitions.
– Define stage or work areas.
– Validate connectivity.
– Preview/analyze sources.
– Define extraction scheduling:
  - determine extract windows for the source system,
  - batch extracts (overnight, weekly, monthly),
  - continuous extracts (trigger on a source table).
Run time
– Connect to the predefined data sources as scheduled.
– Get the raw data and save it locally in the workspace DB.
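For illustration, a minimal sketch of a batch extract in generic SQL. The source table Orders, its last_updated column, the staging table stg_orders, and the :last_extract_time parameter are all assumptions, not from the slides:

-- Batch (e.g., overnight) extract: pull only rows changed since the last run.
-- :last_extract_time is a placeholder parameter the ETL job remembers from its previous run.
INSERT INTO stg_orders (order_id, customer_id, order_date, amount, extracted_at)
SELECT o.order_id, o.customer_id, o.order_date, o.amount, CURRENT_TIMESTAMP
FROM   Orders o
WHERE  o.last_updated > :last_extract_time;   -- the extract window for the source system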
Transform
Transformation – a series of rules/functions. Common transformations are (a minimal SQL sketch of a few of them follows this list):
- Convert data into a consistent, standardized form.
- Cleanse (automated): synonym substitutions, spelling corrections, encoding of free-form values (e.g., map "Male" to 1 and "Mr" to "M").
- Merge/purge (join data from multiple sources).
- Aggregate (e.g., rollup).
- Calculate (e.g., sale_amt = qty * price).
- Data type conversion.
- Data content audit.
- Null value handling (e.g., null = do not load).
- Customized transformations (based on the user).
http://www.bi-bestpractices.com/view-articles/4738
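A sketch of a few of these transformations in SQL, assuming a hypothetical staging table stg_sales with columns gender_raw, qty, price, order_date_str and comment:

-- Illustrative only: encode free-form values, derive a measure, convert types, handle NULLs.
SELECT
    CASE WHEN gender_raw IN ('Male', 'M', 'Mr')   THEN 1
         WHEN gender_raw IN ('Female', 'F', 'Ms') THEN 2
         ELSE 0 END                    AS gender_code,   -- encoding free-form values
    qty * price                        AS sale_amt,      -- calculate
    CAST(order_date_str AS DATE)       AS order_date,    -- data type conversion
    COALESCE(comment, 'n/a')           AS comment_clean  -- null value handling
FROM stg_sales
WHERE qty IS NOT NULL AND price IS NOT NULL;             -- "null = do not load"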
Common Transformations (contd.)
Data type conversions
- EBCDIC → ASCII/Unicode.
- String manipulations.
- Date/time format conversions, e.g., Unix time 1201928400 = what time?
Normalization/denormalization
- To the desired DW format.
- Depending on the source format.
Building keys
- A mapping table matches production keys to surrogate DW keys.
- Correct handling of history – especially for a total reload.
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
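A minimal sketch of the key-building step, assuming a hypothetical mapping table customer_key_map(production_key, dw_key) and the staging table stg_sales:

-- Replace the source system's production key with the surrogate DW key.
SELECT s.sale_id,
       k.dw_key   AS customer_key,   -- surrogate key used inside the DW
       s.sale_amt
FROM   stg_sales s
JOIN   customer_key_map k
       ON k.production_key = s.customer_id;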
Cleansing
Why cleansing? Garbage in, garbage out.
BI does not work on "raw" data
- Pre-processing is necessary for BI analysis.
Handle inconsistent data formats
- Spellings, codings, …
Remove unnecessary attributes
- Production keys, comments, …
Replace codes with text (for easy understanding)
- City name instead of ZIP code, e.g., Aalborg Centrum vs. DK-9000.
Combine data from multiple sources with a common key
- E.g., customer data from customer address, customer name, …
Aalborg University 2009 - DWML course
Cleansing (contd.)
Don't use "special" values (e.g., 0, -1) in your data
- They are hard to understand in query/analysis operations.
Mark facts with a Data Status dimension
- Normal, abnormal, outside bounds, impossible, …
- Facts can then be taken in/out of analyses.
Uniform treatment of NULL
- Use NULLs only for measure values (or use estimates instead?).
- Use a special dimension key (i.e., a surrogate key value) for NULL dimension values, e.g., for the time dimension use special key values to represent "Date not known" or "Soon to happen" instead of NULL.
- This avoids problems in joins, since NULL is not equal to NULL.
Aalborg University 2009 - DWML course
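A minimal sketch of this NULL-handling rule, assuming a date dimension date_dim that contains a reserved member with surrogate key 0 meaning "Date not known" (the key value and table names are assumptions):

-- Map NULL dimension values to the reserved "Date not known" member instead of NULL,
-- so joins between the fact table and the dimension never lose rows.
INSERT INTO sales_fact (date_key, product_key, amount)
SELECT COALESCE(d.date_key, 0),     -- 0 = reserved surrogate key for "Date not known"
       s.product_key,
       s.amount
FROM   stg_sales s
LEFT JOIN date_dim d ON d.calendar_date = s.sale_date;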
Data Quality – most important
Data almost never has decent quality.
Data in the DW must be:
1] Precise – DW data must match known numbers, or an explanation is needed.
2] Complete – the DW has all relevant data, and the users know it.
3] Consistent – no contradictory data: aggregates fit with detail data.
4] Unique – the same thing is called the same and has the same key (e.g., customers).
5] Timely – data is updated "frequently enough" and the users know when.
Improving Data Quality
Appoint a "data quality administrator"
- Responsible for data quality.
- Includes manual inspections and corrections!
Source-controlled improvements.
Construct programs that check data quality (a minimal SQL sketch follows):
- Are totals as expected?
- Do results agree with an alternative source?
- How many NULL values are there?
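A sketch of such checks, assuming hypothetical tables sales_fact (the DW fact table) and src_sales (an alternative source used for reconciliation):

-- Are totals as expected? Compare the DW total with the alternative source.
SELECT (SELECT SUM(amount) FROM sales_fact) AS dw_total,
       (SELECT SUM(amount) FROM src_sales)  AS source_total;

-- How many NULL values slipped through?
SELECT COUNT(*) AS null_customer_keys
FROM   sales_fact
WHERE  customer_key IS NULL;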
Transformation in the Operational System
Design time
– Specify criteria/filters for aggregation.
– Define operators (mostly set/SQL based).
– Map columns using operators/lookups.
– Define other transformation rules.
– Define mappings and/or add new fields.
Run time
– Transform (cleanse, consolidate, apply business rules, de-normalize/normalize) the extracted data by applying the operators mapped at design time.
– Aggregate (create & populate the raw table).
– Create & populate the staging table.
Load
Goal: fast loading into the end target (DW).
- Loading in chunks is much faster than a total load.
SQL-based update is slow
- Large overhead (optimization, locking, etc.) for every SQL call.
- DB load tools are much faster.
Indexes on tables slow the load a lot
- Drop indexes and rebuild them after the load.
- Can be done per index partition.
Parallelization
- Dimensions can be loaded concurrently.
- Fact tables can be loaded concurrently.
- Partitions can be loaded concurrently.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Relationships in the data
- Referential integrity and data consistency must be ensured before loading (why? because they won't be checked in the DW again).
- This can be done by the loader.
Aggregates
- Can be built and loaded at the same time as the detail data.
Load tuning
- Load without logging.
- Sort the load file first.
- Make only simple transformations in the loader.
- Use loader facilities for building aggregates.
http://en.wikipedia.org/wiki/Extract,_transform,_load
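A minimal sketch of the drop-index / load / rebuild pattern and of building an aggregate alongside the detail data; table and index names are assumptions, and exact DROP/CREATE INDEX syntax varies by DBMS:

-- 1. Drop indexes so the bulk load is not slowed down by index maintenance.
DROP INDEX idx_salesfact_date;

-- 2. Load the cleansed, key-mapped staging data into the fact table.
INSERT INTO sales_fact (date_key, product_key, store_key, amount)
SELECT date_key, product_key, store_key, amount
FROM   stg_sales_clean;

-- 3. Rebuild the index after the load.
CREATE INDEX idx_salesfact_date ON sales_fact (date_key);

-- 4. Build an aggregate together with the detail data.
INSERT INTO sales_by_month (month_key, product_key, total_amount)
SELECT d.month_key, f.product_key, SUM(f.amount)
FROM   sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
GROUP  BY d.month_key, f.product_key;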
Load in the Operational System
Design time
– Design the warehouse.
– Map staging data to fact or dimension table attributes.
Run time
– Publish staging data to the data mart (update dimension tables along with the fact tables).
ETL Tools
From big vendors:
- Oracle Warehouse Builder
- IBM DB2 Warehouse Manager
- Microsoft Integration Services
They offer much functionality at a reasonable price
- Data modeling
- ETL code generation
- Scheduling of DW jobs
The "best" tool does not exist
- Choose based on your own needs.
http://en.wikipedia.org/wiki/Category:ETL_tools
ETL Example
http://www.stylusstudio.com/etl/
Example of how ETL works. Consider the HR department database:
Extract step for the use case: take the data from the dBASE III file and convert it into a more usable format – XML.
Extraction can be done using XML converters: just select the table, choose the dBASE III converter, and it will transfer the data into XML.
The result of this extraction will be an XML file similar to this:
<?xml version="1.0" encoding="UTF-8"?>
<table date="20060731" rows="5">
    <row row="1">
        <NAME>Guiles, Makenzie</NAME>
        <STREET>145 Meadowview Road</STREET>
        <CITY>South Hadley</CITY>
        <STATE>MA</STATE>
        <ZIP>01075</ZIP>
        <DEAR_WHO>Macy</DEAR_WHO>
        <TEL_HOME>(413)555-6225</TEL_HOME>
        <BIRTH_DATE>19770201</BIRTH_DATE>
        <HIRE_DATE>20060703</HIRE_DATE>
        <INSIDE>yes</INSIDE>
    </row>
    ...
</table>

Extract – Part 2
Find out the target schema, which can be done using the DB to XML Data Source module. Here we use the Northwind database that ships with standard SQL Server (saved as etl-target.rdbxml).
Transforming Data into the Target Form
Use a series of XSLT transforms to modify this. In a production ETL operation, each step would likely be more complicated and/or would use different technologies or methods.
1] Convert the dates from CCYYMMDD into CCYY-MM-DD (the "ISO 8601" format) [etl-code-1.xsl]
2] Split the first and last names [etl-code-2.xsl]
3] Assign the manager based on inside or external sales [etl-code-3.xsl]
4] Map the data to the new schema [etl-code-4.xsl]
This mapping can be done using the XSLT mapper.
Output of above steps + etl-target.rdbxml gives:
Loading Our ETL Results into the Data Repository
Loading is just a matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier.
OLAP Operations
References:
Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber.
http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf
http://en.wikipedia.org/wiki/Online_Analytical_Processing
http://www.cs.sfu.ca/~han
http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
http://www.fmt.vein.hu/softcom/dw
Overview
- OLAP
- OLAP cube & multidimensional data
- OLAP operations: roll up (drill up), drill down (roll down), slice & dice, pivot, other operations
- Examples

OLAP
Online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical queries. [http://en.wikipedia.org/wiki/Online_analytical_processing]
The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.
The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing). [http://en.wikipedia.org/wiki/Online_analytical_processing]
OLAP Cube
Data warehouse & OLAP tools are based on a multidimensional data model which views data in the form of a data cube.
An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data.
The OLAP cube consists of numeric facts called measures which are categorized by dimensions.
- Dimensions: perspectives or entities with respect to which an organization wants to keep records.
- Facts: quantities by which we want to analyze relations between dimensions.
The cube metadata may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
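For illustration, a minimal sketch of how the cells of a cube (every combination of grouping, including subtotals and the grand total) can be computed in SQL. GROUP BY CUBE is standard SQL:1999 but support and syntax vary by engine, and the denormalized sales table here is an assumption:

-- One aggregate row per combination of item and location (plus subtotals and the
-- grand total), i.e., the cells of a two-dimensional data cube over these dimensions.
SELECT item, location, SUM(units_sold) AS units_sold
FROM   sales
GROUP  BY CUBE (item, location);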
Concept HierarchyA concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level,  more general concepts.Each of the elements of a dimension could be summarized using a hierarchy. The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent.Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
Example – Star Schema (figure)
Sales fact table: time_key, item_key, branch_key, location_key, units_sold (measure).
Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
Reference: http://www.cs.sfu.ca/~han
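A minimal query against this star schema (a sketch; the fact table is written sales_fact and the time dimension time_dim here to keep the names unambiguous), rolling units_sold up to brand and quarter:

-- Roll units_sold up the item hierarchy (item -> brand) and the time hierarchy
-- (day -> quarter) by joining the fact table to its dimension tables.
SELECT i.brand, t.quarter, SUM(f.units_sold) AS units_sold
FROM   sales_fact f
JOIN   item     i ON i.item_key = f.item_key
JOIN   time_dim t ON t.time_key = f.time_key
GROUP  BY i.brand, t.quarter;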
Example
Dimensions: Item, Location, Time
Hierarchical summarization paths (from the figure):
- Item: Type → Brand → Item
- Location: Region → Country → City → Street
- Time: Year → Quarter → Month / Week → Day
Reference: http://www.cs.sfu.ca/~han
Working Example (1) (figure)
Reference: http://www.fmt.vein.hu/softcom/dw
Roll up (Drill up)
- Performs aggregation on a data cube either by climbing up the concept hierarchy for a dimension or by dimension reduction. [http://www.cs.sfu.ca/~han]
- A specific grouping on one dimension where we go from a lower level of aggregation to a higher one. [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
- E.g., summing up per whole fiscal year; summarization over an aggregate hierarchy (total sales per region, state).

Drill down (Roll down)
- The reverse of roll-up. [http://www.cs.sfu.ca/~han]
- Navigates from less detailed data to more detailed data.
- Can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
- A finer-grained view on aggregated data, i.e., going from a higher to a lower level of aggregation. [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
- E.g., disaggregate the volume of product sales by region/city.

Roll Up & Drill Down on Working Example (1) (figure)
Roll up sales on time from month to quarter; drill down sales on location from city to plant.
Reference: http://www.fmt.vein.hu/softcom/dw
Slice and Dice
Slice
- Performs a selection on one dimension of the given cube, resulting in a subcube.
- E.g., slicing the volume of products in the product dimension for product_model = '1996'.
Dice
- Performs a selection operation on two or more dimensions.
- E.g., dicing the central cube based on the selection criteria (location = "Montreal" or "Vancouver") and (time = "Q1" or "Q2") and (item = "cell phone" or "pager").

Slice & Dice on Working Example (1) (figure)
Dicing the volume of products in the product & time dimensions; slicing the volume of products in the product dimension.
Reference: http://www.fmt.vein.hu/softcom/dw
Pivot
- A visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data.
- Select a different dimension (orientation) for analysis. [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
- E.g., a pivot operation where location & item in a 2D slice are rotated.
- Other examples: rotating the axes in a 3D cube; transforming a 3D cube into a series of 2D planes.
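A minimal sketch of a pivot in plain SQL, rotating the item dimension from rows into columns with conditional aggregation; the denormalized sales table and the item values are assumptions:

-- One output row per location, one column per item value: a row-to-column pivot
-- expressed portably with CASE inside the aggregates.
SELECT location,
       SUM(CASE WHEN item = 'cell phone' THEN units_sold ELSE 0 END) AS cell_phone,
       SUM(CASE WHEN item = 'pager'      THEN units_sold ELSE 0 END) AS pager
FROM   sales
GROUP  BY location;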
Working Example (2)
Dimension tables:
- Market(Market_ID, City, Region)
- Product(Product_ID, Name, Category)
- Time(Time_ID, Week, Month, Quarter)
Fact table:
- Sales(Market_ID, Product_ID, Time_ID, Amount)
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
Roll up & Drill down on Working Example (2)

City-level sales:
SELECT S.Product_ID, M.City, SUM(S.Amount) INTO City_Sales
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City

Roll up sales on Market from city to region:
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region

Drill down sales on Market from region to city: return from the region-level aggregate to the city-level City_Sales result.
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
Slice & Dice on Working Example (2)

Dicing sales in the time dimension (e.g., total sales for each product in each quarter):
SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID
  AND T.Week = 'Week12'
  AND (S.Product_ID = '1002' OR S.Product_ID = '1003')
GROUP BY T.Quarter, S.Product_ID

Slicing the data cube in the time dimension (e.g., choosing sales only in week 12):
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'

Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
Other Operations
- Drill across: executes queries involving (across) more than one fact table.
- Drill through: makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables.
Reference: [http://www.cs.sfu.ca/~han]
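A minimal sketch of a drill-across query, combining summaries of two hypothetical fact tables (sales_fact and shipping_fact) through a shared dimension key:

-- Drill across: join measures from two fact tables at a common grain (product).
SELECT s.product_key,
       s.total_sales,
       sh.total_shipping_cost
FROM  (SELECT product_key, SUM(amount) AS total_sales
       FROM sales_fact GROUP BY product_key) s
JOIN  (SELECT product_key, SUM(cost) AS total_shipping_cost
       FROM shipping_fact GROUP BY product_key) sh
      ON sh.product_key = s.product_key;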

10 Challenging Problems in Data Mining Research
Xindong Wu, Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, Vermont 05405, USA
Qiang Yang, Department of Computer Science, Hong Kong University of Science & Technology, Clearwater Bay, Kowloon, Hong Kong, China
Presented at ICDM '05, the Fifth IEEE International Conference on Data Mining
Contributors
Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, Gregory Piatetsky-Shapiro, Benjamin W. Wah

A New Feature at ICDM 2005
What are the 10 most challenging problems in data mining today? Different people have different views, and the answer is a function of time as well. What do the experts think?
- Experts consulted: previous organizers of IEEE ICDM and ACM KDD.
- They were asked to list their 10 problems (requests sent out in Oct 05, replies obtained in Nov 05).
- The replies were edited and presented in this paper; hopefully useful for young researchers; not in any particular order of importance.
1. Developing a Unifying Theory of Data Mining
The current state of the art of data mining research is too "ad hoc"
- Techniques are designed for individual problems (e.g., classification or clustering).
- There is no unifying theory.
A theoretical framework is required that unifies:
- data mining tasks – clustering, classification, association rules, etc.
- data mining approaches – statistics, machine learning, database systems, etc.
Long-standing problems in statistical research
- How to avoid spurious correlations? Sometimes related to the problem of mining for "deep knowledge".
- Example: a strong correlation was found between the timing of TV series by a particular star and the occurrences of small market crashes in Hong Kong. Can we conclude that there is a hidden cause behind the correlation?
2. Scaling Up for High Dimensional Data and High Speed Data Streams
Scaling up is needed because of the following challenges:
- Classifiers with hundreds of millions or billions of features have to be built for applications like text mining & drug safety analysis. Challenge: how to design classifiers that handle ultra-high-dimensional classification problems.
- Satellite and computer network data comprise extremely large databases (e.g., 100 TB), and data mining technology today is still slow. Challenge: how can data mining technology handle data of this scale?
- Data mining should be a continuous online process, rather than an occasional one-shot process, e.g., analysis of high-speed network traffic for identifying anomalous events. Challenge: how to compute models over streaming data which accommodate changing environments from which the data are drawn ("concept drift" or "environment drift"). Incremental mining and effective model updating are required to maintain accurate modeling of the current stream.

3. Sequential and Time Series Data
- How to efficiently and accurately cluster, classify and predict the trends in sequential and time series data?
- Time series data used for predictions are contaminated by noise. How to do accurate short-term and long-term predictions?
- Signal processing techniques introduce lags in the filtered data, which reduces accuracy.
- Example (figure): real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway.
4. Mining Complex Knowledge from Complex Data
- An important type of complex knowledge is in the form of graphs. Challenge: more research is required on discovering graphs and structured patterns from large data.
- Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type. Challenge: data mining systems are required that can soundly mine the rich structure of relations among objects, e.g., interlinked web pages, social networks, metabolic networks in the cell.
- Most of an organization's data is in text form and in complex data formats like image, multimedia and web data. Challenge: how to mine non-relational data.
- Integration of data mining and knowledge inference is required. Challenge (the biggest gap): systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
- More research on the interestingness of knowledge is needed.
5. Data Mining in a Network Setting
Community and social networks
- Linked data between emails, web pages, blogs, citations, sequences and people.
Problems:
- It is critical to have the right characterization of the "community" to be detected.
- Entities/nodes are distributed; hence, distributed means of identification are desired.
- A snapshot-based dataset may not be able to capture the real picture.
Challenge: to understand a network's static structures (e.g., topologies & structures) and dynamic behavior (e.g., growth factor, robustness, functional efficiency).

Mining in and for computer networks
- Network links are increasing in speed (1-10 Gigabit Ethernet).
- To be able to detect anomalies, fast capture of IP packets at high-speed links and analysis of massive amounts of data are required.
Challenge: highly scalable solutions are required, i.e., good algorithms to (a) detect DoS attacks, (b) trace back to find attackers, and (c) drop packets that belong to attack traffic.
6. Distributed Data Mining and Mining Multi-agent Data
- Important in network problems.
- In a distributed environment (sensor/IP network), distributed probes are placed at locations within the network.
Problems:
1] Need to correlate & discover data patterns at the various probes.
2] Communication overhead (the amount of data shipped between the various sites).
3] How to mine across multiple heterogeneous data sources.
Adversary data mining: the data are deliberately manipulated to sabotage the miners (produce false negatives), e.g., email spam, counter-terrorism, intrusion detection/computer security, click spam, search engine spam, fraud detection, shopbots, file sharing, etc.
Multi-agent data mining: agents are often distributed & have proactive and reactive features.
http://www-ai.cs.uni-dortmund.de/auto?self=$ejr31cyc
http://www.csc.liv.ac.uk/~ali/wp/MADM.pdf
7. Data Mining for Biological and Environmental Problems
Mining biological data is an extremely important problem, e.g., HIV vaccine design; molecular biology, e.g., DNA chemical properties, 3D structures, functional properties.
We must utilize our natural environment & resources in a proper way. But how can data mining be used to study and find the contributing factors for:
1] the number of hurricane occurrences,
2] global climate changes and potential "bird flu" epidemics,
3] human-centered systems (e.g., user-adapted human-computer interaction or P2P transactions)?
"Killer" applications (bioinformatics, CRM/personalization & security applications). Reported in Science Magazine.
8. Data-Mining-Process Related Problems
How to automate the mining process?
Issues:
1] 90% of the cost is in pre-processing.
2] Systematic documentation of data cleaning.
3] Combining visual interactive & automatic DM.
4] In exploratory data analysis, the DM goal is undefined.
Challenges:
- The composition of data mining operations.
- Data cleaning, with logging capabilities.
- Visualization and mining automation.
A methodology is needed to help users avoid many data mining mistakes.
- What are the approaches for multi-step mining queries?
- What is a canonical set of data mining operations?
9. Security, Privacy and Data Integrity
- How to ensure the users' privacy while their data are being mined?
- How to do data mining for the protection of security and privacy?
- Knowledge integrity assessment: data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security; development of measures to evaluate the knowledge integrity of a collection of data, knowledge and patterns.
Challenges:
1] Develop efficient algorithms for comparing the knowledge contents of the two (before and after) versions of the data.
2] Develop algorithms for estimating the impact that certain modifications of the data have on the statistical significance of individual patterns obtainable by broad classes of data mining algorithms.
Headlines (Nov 21, 2005): Senate Panel Approves Data Security Bill – the Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files.
http://www.cdt.org/privacy/
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
- Data is non-static and constantly changing, e.g., data collected in 2000, then 2001, 2002, …; the problem is to correct the bias.
- Dealing with unbalanced & cost-sensitive data: there is much information on costs and benefits, but no overall model of profit and loss.
- Data may evolve with a bias introduced by sampling.
- ICML 2003 Workshop on Learning from Imbalanced Data Sets.
Example (medical diagnosis): cardiogram? blood test? pressure? temperature 39 degrees? biopsy? Each test incurs a cost; the data are extremely unbalanced; the data change with time.
Conclusion
- There is still a lack of timely exchange of important topics in the community as a whole.
- These problems are sampled from a small, albeit important, segment of the community.
- The list should obviously be a function of time for this dynamic field.