
Monday, August 24, 2020

Python — Filtering data with Pandas Dataframe

If you are working as a Python developer, you have to accomplish a lot of data cleansing tasks. One of those tasks is to remove unwanted data from your dataframe. Pandas is one of the most important packages because it makes importing and analyzing data much easier with the help of its strong library.

For analyzing data, a programmer requires a lot of filtering operations. Pandas provides many methods to filter a dataframe, and DataFrame.query() is one of them.

To understand the filtering feature of Pandas, we create some sample data by using Python lists.

In this example, the dataframe is filtered on multiple conditions.

# Import pandas library
import pandas as pd

# Initialise data of lists
data = {'Name': ['Ryan Arjun', 'Kimmy Wang', 'Rose Gray', 'Will Smith'],
        'Age': [20, 21, 19, 18],
        'Country': ['India', 'Taiwan', 'Canada', 'Greenland'],
        'Sex': ['Male', 'Female', 'Female', 'Male']}

# Create DataFrame
df = pd.DataFrame(data)

# Show data in the dataframe
df

===========================================
     Age |   Country |       Name |    Sex
-------------------------------------------
0     20 |     India | Ryan Arjun |   Male
1     21 |    Taiwan | Kimmy Wang | Female
2     19 |    Canada |  Rose Gray | Female
3     18 | Greenland | Will Smith |   Male
===========================================

 

# Filtering with the query method
# where Sex must be Male,
# Country must be India,
# and Age must be greater than 15
df.query('Sex == "Male" and Country == "India" and Age > 15', inplace=True)

# Show data in the dataframe
df

 

===========================================
     Age |   Country |       Name |    Sex
-------------------------------------------
0     20 |     India | Ryan Arjun |   Male
===========================================

Using the query feature of pandas can save a lot of data processing time because we can apply multiple filter conditions in a single go. For comparison, the same filter written with a boolean mask is sketched below.
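The boolean-mask version below is only a rough equivalent for comparison; it recreates the same sample data so it can be run on the original, unfiltered dataframe (the query(..., inplace=True) call above has already reduced df to one row).

# Equivalent filter using a boolean mask instead of DataFrame.query()
# Recreate the original, unfiltered dataframe from the example above
import pandas as pd

data = {'Name': ['Ryan Arjun', 'Kimmy Wang', 'Rose Gray', 'Will Smith'],
        'Age': [20, 21, 19, 18],
        'Country': ['India', 'Taiwan', 'Canada', 'Greenland'],
        'Sex': ['Male', 'Female', 'Female', 'Male']}
df = pd.DataFrame(data)

# Combine the three conditions with & (each condition must be parenthesised)
mask = (df['Sex'] == 'Male') & (df['Country'] == 'India') & (df['Age'] > 15)
filtered_df = df[mask]
print(filtered_df)

Both approaches return the same single row; query() simply keeps all the conditions in one readable string.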

To learn more, please follow us -
http://www.sql-datatools.com
To learn more, please visit our YouTube channel at -
http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
https://twitter.com/macxima
To learn more, please visit our Medium account at -
https://medium.com/@macxima

Wednesday, January 20, 2016

DW - Full load & Incremental or Refresh load

Data is a business asset that is used to make the right business decisions at the right time, so it is very important to design our database correctly, up to whatever normal form we can bear.
While OLAP cubes are still common across the BI technology marketplace, data warehousing has definitely seen growth because it serves consumers in more modern and complete ways. In the current business world, every organization is spending a lot of money on BI technologies to get quick access to valuable data, because they believe that data visualisation is the key to success in making the right decisions at the right time.

So it has become very important, sometimes even treated as a do-or-die situation, to apply well-proven approaches that are based on real cases. We are trying to highlight the importance of using an incremental load and a change data capture technique when loading data into the dashboards for the end users.
ETL tools such as SQL Server Warehouse Builder can set table loading to insert/update, which supports both full and incremental loads with the same ETL routines. The key to supporting this is that staging tables are created in the staging layer and joined to the source tables based upon the update or create dates of the record.
What is Incremental Load?
As we know, data is a very important factor for an organisation in making the right business decision at the right time. Incremental load is therefore the best option and a widely used method to load data into data warehouses from the respective source systems. This technique is employed to perform a faster load in less time while utilizing fewer system resources. To understand the incremental load functionality, we can take the example of a daily product hierarchy sales chain where we put staging tables between the source and the destination.
In this scenario, the staging tables are always refreshed at the beginning of the ETL process and then loaded from the source files after proper data transformation. In the last step of the ETL process, the target tables are filled from the staging tables based on the following conditions -

  1. If data matches the target table, the matching data is updated.
  2. If data does not match the target tables, it is treated as new and appended to the target tables.

  
Incremental data loading is the most common approach to capturing ongoing changes and keeping historical data on a predefined schedule. In the incremental approach, the key is a table created in the staging layer that is joined to the source tables based upon the update or create dates of the record, as sketched below.
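As a minimal sketch only (the staging and target table names stg.ProductSales and dw.ProductSales and their columns are assumed here, not taken from any specific system), the matched/not-matched rules above map naturally onto a T-SQL MERGE from the staging layer into the target table:

---- Minimal sketch of an incremental (insert/update) load from staging to target
---- Table and column names are hypothetical
MERGE dw.ProductSales AS target
USING stg.ProductSales AS source
    ON target.SaleId = source.SaleId
WHEN MATCHED THEN               ---- condition 1: matching data is updated
    UPDATE SET Quantity    = source.Quantity,
               Amount      = source.Amount,
               UpdatedDate = source.UpdatedDate
WHEN NOT MATCHED BY TARGET THEN ---- condition 2: new data is appended
    INSERT (SaleId, Quantity, Amount, UpdatedDate)
    VALUES (source.SaleId, source.Quantity, source.Amount, source.UpdatedDate);

Because only the rows present in the staging table are touched, the same routine can run on every schedule without reloading the whole target.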

What is Full Load?
In this technique, there is no staging area; the target tables are always truncated before all the data is loaded again from the source raw data files. This technique is very time consuming and history can easily be lost, but it guarantees that the target matches the source. Full data loading is also known as full refresh data loading. A minimal sketch is shown below.
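For contrast, a full load is just truncate and reload; the sketch below reuses the hypothetical dw.ProductSales target from above and assumes a raw source table named src.ProductSales:

---- Minimal sketch of a full (refresh) load; table names are hypothetical
TRUNCATE TABLE dw.ProductSales;

INSERT INTO dw.ProductSales (SaleId, Quantity, Amount, UpdatedDate)
SELECT SaleId, Quantity, Amount, UpdatedDate
FROM src.ProductSales;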

To know more, click on 
  1. Data Warehouse Architecture and Multidimensional Model
  2. Data Warehouse - Dimension tables.
  3. Data Warehouse - Fact tables.
  4. Data Warehouse - Conceptual Modeling.
  5. Data Warehouse - Star schema.
  6. Data Warehouse - Snowflake schema.
  7. Data Warehouse - Fact constellations
  8. Collaboration of OLTP and OLAP systems
  9. Major differences between OLTP and OLAP
  10. Data Warehouse - Multidimensional Cube 

Sunday, October 25, 2015

SQL – LEAD function

The LEAD function was introduced in SQL Server 2012 as part of the group of analytic functions, which make analytics in T-SQL much easier and add real value from a BI perspective. The LEAD function is used for accessing subsequent rows (or columns from a subsequent row) along with the current row, which can make operations that previously had to be done in a multi-step fashion more efficient.
Fundamentally, the LEAD function accesses data from a subsequent row in the same result set without the use of a self-join. Use this analytic function in a SELECT statement to compare values in the current row with values in a following row.

How to use LEAD function
The LEAD function is non-deterministic: non-deterministic functions may return different results each time they are called with a specific set of input values, even if the database state that they access remains the same. The function is used with the following syntax:
LEAD (scalar_expression [,offset] [,default])
    OVER ( [ partition_by_clause ] order_by_clause )

Arguments
scalar_expression
Within this parameter, we can specify a scalar expression or column name whose value from the subsequent row is to be returned.
Scalar_expression cannot be an analytic function.

Offset, default
Within these parameters, we can specify an offset to access not only the next immediate row but any row after the current row: the default offset of 1 accesses the next immediate row, whereas a value of 3 accesses the third row after the current row. The default argument is the value to return when the offset reaches beyond the end of the partition; if it is not specified, NULL is returned (see the short illustration after the Return Types note).

OVER ( [ partition_by_clause ] order_by_clause)
Partition_by_clause divides the result set produced by the FROM clause into partitions to which the function is applied. If not specified, the function treats all rows of the query result set as a single group. Order_by_clause determines the order of the data before the function is applied.

Return Types
The data type of the specified scalar_expression. NULL is returned if scalar_expression is nullable or default is set to NULL.
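To make the arguments concrete, here is a short hedged illustration against a hypothetical dbo.Sales table (the table and the SaleDate and Amount columns are assumed): it looks two rows ahead and returns 0 instead of NULL when no such row exists.

---- Hypothetical example: offset = 2, default = 0
SELECT SaleDate,
       Amount,
       AmountTwoRowsAhead = LEAD(Amount, 2, 0) OVER (ORDER BY SaleDate)
FROM dbo.Sales;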

The LEAD function is flexible in that it lets us specify, with the offset parameter, the number of rows to move forward from the current row. To better understand this analytic function, we can take an example that calculates the daily performance of a stock, where we need the current day's value divided by the next day's value, as shown below:

Performance = (Current Day Value / Next Day's Value) - 1

Now we need to get the next day's value, and the LEAD function can fulfil this requirement; but first of all we should look at our data table and the position of the actual data in it before using the LEAD function.

USE TEMPDB
GO

---- Declare Table variable for Daily Stock Values
DECLARE @TableStock Table
(
StockId int,
StockName Varchar(20),
StockDate Date,
StockValue real
)
---- Insert Values in the stock Table variable
INSERT INTO @TableStock (StockId, StockName, StockDate, StockValue)
VALUES
(101, 'StockName S1', '2015-10-23', '546.56'),
(101, 'StockName S1', '2015-10-22', '544.22'),
(101, 'StockName S1', '2015-10-21', '543.50'),
(101, 'StockName S1', '2015-10-20', '544.75'),
(101, 'StockName S1', '2015-10-19', '543.78')

---- Values in the Table Variable
SELECT StockId, StockName, StockDate, StockValue FROM @TableStock
StockId | StockName    | StockDate  | StockValue
-------------------------------------------------
    101 | StockName S1 | 10/23/2015 |     546.56
    101 | StockName S1 | 10/22/2015 |     544.22
    101 | StockName S1 | 10/21/2015 |     543.50
    101 | StockName S1 | 10/20/2015 |     544.75
    101 | StockName S1 | 10/19/2015 |     543.78



Get the next day's value with the help of the LEAD function as given below:

---- USE CTE to get the Next Day's Stock Value
;WITH CTE AS
(
SELECT StockId, StockName, StockDate, StockValue,
---- LEAD Function to get Next Day's Value
NextDayValue=LEAD(StockValue,1,0) Over (ORDER BY StockDate)
FROM @TableStock
)

----- Values in CTE table
SELECT StockId,
StockName, StockDate,
CurrentValue=StockValue,
NextDayValue FROM CTE
ORDER BY StockDate DESC;

StockId | StockName    | StockDate  | CurrentValue | NextDayValue
------------------------------------------------------------------
    101 | StockName S1 | 10/23/2015 |       546.56 |         0.00
    101 | StockName S1 | 10/22/2015 |       544.22 |       546.56
    101 | StockName S1 | 10/21/2015 |       543.50 |       544.22
    101 | StockName S1 | 10/20/2015 |       544.75 |       543.50
    101 | StockName S1 | 10/19/2015 |       543.78 |       544.75


Get the Daily performance now as shown below:

---- USE CTE to get the Next Day's Stock Value
;WITH CTE AS
(
SELECT StockId, StockName, StockDate, StockValue,
---- LEAD Function to get Next Day's Value
NextDayValue=LEAD(StockValue,1,0) Over (ORDER BY StockDate)
FROM @TableStock
)

----- Performance Calculation in CTE table
SELECT StockId,
StockName, StockDate,
CurrentValue=StockValue,
NextDayValue,
---- If Next day value is 0 then set current value
Performance=(StockValue/ (Case when NextDayValue=0 then StockValue else NextDayValue end) )-1
FROM CTE
ORDER BY StockDate DESC;

StockId | StockName    | StockDate  | CurrentValue | NextDayValue | Performance
---------------------------------------------------------------------------------
    101 | StockName S1 | 10/23/2015 |       546.56 |         0.00 |    0.000000
    101 | StockName S1 | 10/22/2015 |       544.22 |       546.56 |   -0.004281
    101 | StockName S1 | 10/21/2015 |       543.50 |       544.22 |   -0.001323
    101 | StockName S1 | 10/20/2015 |       544.75 |       543.50 |    0.002300
    101 | StockName S1 | 10/19/2015 |       543.78 |       544.75 |   -0.001781

With the help of the LEAD function, we can get the next value for a day, month, year or anything else which you want to use in your analytics calculations, as illustrated in the sketch below.
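For instance, if the same table held several stocks, a hedged variation of the query above could add PARTITION BY StockId so that each stock only looks ahead within its own rows (this sketch assumes the @TableStock variable declared earlier in the same batch):

---- Sketch: next day's value per stock when several stocks share one table
SELECT StockId, StockName, StockDate, StockValue,
       NextDayValue = LEAD(StockValue, 1, 0) OVER (PARTITION BY StockId ORDER BY StockDate)
FROM @TableStock
ORDER BY StockId, StockDate DESC;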
Query at a Glance

USE TEMPDB
GO
---- Declare Table variable for Daily Stock Values
DECLARE @TableStock Table
(
StockId int,
StockName Varchar(20),
StockDate Date,
StockValue real
)

---- Insert Values in the stock Table variable
INSERT INTO @TableStock (StockId, StockName, StockDate, StockValue)
VALUES
(101, 'StockName S1', '2015-10-23', '546.56'),
(101, 'StockName S1', '2015-10-22', '544.22'),
(101, 'StockName S1', '2015-10-21', '543.50'),
(101, 'StockName S1', '2015-10-20', '544.75'),
(101, 'StockName S1', '2015-10-19', '543.78')

---- Values in the Table Variable
SELECT StockId, StockName, StockDate, StockValue FROM @TableStock

---- USE CTE to get the Next Day's Stock Value
;WITH CTE AS
(
SELECT StockId, StockName, StockDate, StockValue,
---- LEAD Function to get Next Day's Value
NextDayValue=LEAD(StockValue,1,0) Over (ORDER BY StockDate)
FROM @TableStock
)

----- Performance Calculation in CTE table
SELECT StockId,
StockName, StockDate,
CurrentValue=StockValue,
NextDayValue,
---- If Next day value is 0 then set current value
Performance=(StockValue/ (Case when NextDayValue=0 then StockValue else NextDayValue end) )-1
FROM CTE
ORDER BY StockDate DESC;



So, the LEAD function can really help in looking at following records to compute differences between records in a single T-SQL statement that is more readable and logical than previous methods for gathering this information. Learn more about other features of SQL: