PYTHON PANDAS
Introduction
 Pandas is a Python package providing fast, flexible,
and expressive data structures designed to make
working with relational or labeled data both easy
and intuitive.
 It aims to be the fundamental high-level building
block for doing practical, real world data analysis in
Python.
 It has the broader goal of becoming the most
powerful and flexible open source data analysis /
manipulation tool available in any language.
 The name Pandas is derived from the term Panel Data –
an econometrics term for multidimensional data.
 The Pandas library provides high-performance, easy-to-use
data structures and data analysis tools for the Python
programming language. Python with Pandas is used in a
wide range of academic and commercial domains,
including finance, economics, statistics, analytics, etc.
 Before Pandas, Python was mainly used for data munging
and preparation; it contributed little to data analysis
itself. Pandas filled this gap.
 Using Pandas, we can accomplish five typical steps in
the processing and analysis of data, regardless of the
origin of data:
 load
 prepare
 manipulate
 model, and
 analyze.
Pandas Features
 Fast and efficient DataFrame object with default and
customized indexing.
 Tools for loading data into in-memory data objects from
different file formats.
 Data alignment and integrated handling of missing data.
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and subsetting of large data sets.
 Columns from a data structure can be deleted or inserted.
 Group by data for aggregation and transformations.
 High performance merging and joining of data.
 Time Series functionality.
Installation of Pandas
 Python Anaconda is a free Python distribution with SciPy
stack and Spyder IDE for Windows OS.
 It is also available for Linux and Mac.
 Standard Python distribution doesn't come bundled with
Pandas module. A lightweight alternative is to install
Pandas using popular Python package installer, pip.
C:\Users\Sony>pip install pandas
Highlights of Pandas
 A fast and efficient DataFrame object for data manipulation with
integrated indexing;
 Tools for reading and writing data between in-memory data
structures and different formats: CSV and text files, Microsoft Excel,
SQL databases, and the fast HDF5 format;
 Intelligent data alignment and integrated handling of missing data:
gain automatic label-based alignment in computations and easily
manipulate messy data into an orderly form;
 Flexible reshaping and pivoting of data sets;
 Intelligent label-based slicing, fancy indexing, and subsetting of
large data sets;
 Columns can be inserted and deleted from data structures for size
mutability.
Dataset in Pandas
 Pandas deals with the following three data structures −
 Series
 DataFrame
 Panel
 These data structures are built on top of the NumPy array.
 All Pandas data structures are value mutable (their values
can be changed). All except Series are also size mutable;
a Series is size immutable.
 DataFrame is the most widely used and important of
these structures. Panel was used much less and has been
removed from recent versions of pandas.
Series
 Series is a one-dimensional array-like structure with
homogeneous data. For example, a series could be a
collection of the integers 10, 23, 56.
Panel
 Panel is a three-dimensional data structure with
heterogeneous data. A panel is hard to represent
graphically, but it can be illustrated as a container of
DataFrames.
DataFrame
 DataFrame is a two-dimensional array with heterogeneous
data. For example,
 The table represents the data of a sales team of an
organization with their overall performance rating. The data is
represented in rows and columns. Each column represents an
attribute and each row represents a person.
Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.6
Vin 45 Male 3.9
Katie 38 Female 2.78
Series
 Series is a one-dimensional labeled array capable of
holding data of any type (integers, strings, floats, Python
objects). The axis labels are collectively called the index.
 A series can be created using various inputs like −
 Array
 Dict
Create a Series by array
 If data is an ndarray, then the index passed must be of
the same length. If no index is passed, the default
index is range(n), where n is the array length,
i.e., [0, 1, 2, …, n-1].
 Ex: series_1.py
import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
Create a Series from dict
 A dict can be passed as input. If no index is specified,
the dictionary keys are used to construct the index
(older pandas versions sorted the keys; modern versions
preserve insertion order).
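A minimal sketch of the dict input (the keys and values here are illustrative):

```python
import pandas as pd

# Illustrative dict; in modern pandas the insertion order of the
# keys becomes the order of the index.
data = {"b": 2, "a": 1, "c": 3}
s = pd.Series(data)
print(s)
```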
DataFrames
 A DataFrame is a two-dimensional data structure, i.e., data
is aligned in a tabular fashion in rows and columns.
Features of DataFrame
 Potentially columns are of different types
 Size – Mutable
 Labeled axes (rows and columns)
 Can Perform Arithmetic operations on rows and columns
 A pandas DataFrame can be created using the following
constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
Create DataFrame
Pandas DataFrame can be created using various inputs like :
 Lists
 dict
 Series
 Syntax:
import pandas as pd
df = pd.DataFrame()
print (df)
The above syntax generates an empty DataFrame
with no columns and no index.
Dataframe using Lists
 The DataFrame can be created from a single list or a
list of lists.
 Syntax:
import pandas as pd
data = [['Alex', 10], ['Bob', 12]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
The above syntax generates a DataFrame with each
inner list as a row and the given column labels.
DataFrame using Dict of arrays & Lists
 All the arrays must be of the same length. If an index is
passed, its length should equal the length of the arrays.
 If no index is passed, the default index is range(n),
where n is the array length.
 List of Dictionaries can be passed as input data to
create a DataFrame. The dictionary keys are by
default taken as column names.
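A minimal sketch of a DataFrame built from a dict of equal-length lists (names and values are made up):

```python
import pandas as pd

# Each key becomes a column; each list supplies that column's values.
data = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]}
df = pd.DataFrame(data, index=["rank1", "rank2", "rank3"])
print(df)
```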
 We can also create a DataFrame from a list of
dictionaries while passing row indices and column
indices.
 Note: if the DataFrame is created with a column label
that is not among the dictionary keys (as with df2 in
the original example), that column is filled with NaN's;
if the column labels match the dictionary keys (as with
df1), no NaN's are appended.
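The df1/df2 note above can be sketched like this (hypothetical data; df2 asks for a column label "b1" that is not a dictionary key, so that column is all NaN):

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

# Column labels match the dictionary keys: no NaN column appears.
df1 = pd.DataFrame(data, index=["first", "second"], columns=["a", "b"])

# "b1" is not a key in any dictionary, so the column is filled with NaN.
df2 = pd.DataFrame(data, index=["first", "second"], columns=["a", "b1"])
print(df2)
```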
DataFrame from Dict of Series
 Dictionary of Series can be passed to form a
DataFrame.
 The resultant index is the union of all the series
indexes passed.
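A short sketch of the union-of-indexes behaviour (labels are illustrative):

```python
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
# 'one' has no value for label 'd', so that cell becomes NaN.
print(df)
```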
Dataset Manipulations
 Column wise manipulations in a dataframe
 We can perform Dataframe manipulations like:
Selecting required columns for display
Adding new columns
Deleting the columns
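These column-wise operations can be sketched as follows (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [10, 20, 30]})

print(df["one"])                     # select a column for display
df["three"] = df["one"] + df["two"]  # add a new derived column
del df["two"]                        # delete a column
print(df)
```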
 Row wise manipulations in a dataframe
 We can do the following like:-
Row Selection,
Selecting using label
Selecting using integer location
Selecting using slicing
Addition of row, and
Deletion of row
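A sketch of the row-wise operations listed above (index labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

print(df.loc["b"])   # selection by label
print(df.iloc[2])    # selection by integer location
print(df[1:3])       # selection by slicing (rows 'b' and 'c')

new_row = pd.DataFrame({"x": [5]}, index=["e"])
df = pd.concat([df, new_row])  # addition of a row
df = df.drop("a")              # deletion of a row
print(df)
```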
 Dataset concatenating
 Dataset Merging
 Dataset Joining
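Concatenating, merging, and joining can be sketched as follows (the tables are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["Alex", "Amy"]})
right = pd.DataFrame({"id": [1, 2], "score": [90, 85]})

stacked = pd.concat([left, left])                          # concatenating
merged = pd.merge(left, right, on="id")                    # merging on a key
joined = left.set_index("id").join(right.set_index("id"))  # joining on index
print(merged)
```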
Data Preprocessing
 In the real world, we usually encounter lots of raw
data that is not fit to be readily processed by
machine learning algorithms. Such raw data must be
preprocessed before it is fed into the algorithms.
Why preprocessing ?
 Real world data are generally
Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Tasks in data preprocessing
Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies.
Data integration: using multiple databases, data cubes, or
files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the
same or similar analytical results.
Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Data cleaning
 Fill in missing values (attribute or class value):
 Ignore the tuple: usually done when class label is missing.
 Use the attribute mean (or majority nominal value) to fill in the missing value.
 Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
 Predict the missing value by using a learning algorithm: consider the attribute with the
missing value as a dependent (class) variable and run a learning algorithm (usually
Bayes or decision tree) to predict the missing value.
 Identify outliers and smooth out noisy data:
 Binning
 Sort the attribute values and partition them into bins (see "Unsupervised discretization" below);
 Then smooth by bin means, bin median, or bin boundaries.
 Clustering: group values in clusters and then detect and remove outliers (automatic or
manual)
 Regression: smooth by fitting the data into regression functions.
 Correct inconsistent data: use domain knowledge or expert decision.
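For instance, filling a missing value with the attribute mean — one of the options above — might look like this (values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, 20.0])

# s.mean() skips NaN by default, so the mean of the known
# values (20.0) is used to fill the gap.
filled = s.fillna(s.mean())
print(filled.tolist())
```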
Data transformation
 Normalization:
 Scaling attribute values to fall within a specified range.
 Example: to transform V in [min, max] to V' in [0,1],
apply V'=(V-Min)/(Max-Min)
 Scaling by using mean and standard deviation (useful when min
and max are unknown or when there are
outliers): V'=(V-Mean)/StDev
 Aggregation: moving up in the concept hierarchy on numeric
attributes.
 Generalization: moving up in the concept hierarchy on nominal
attributes.
 Attribute construction: replacing or adding new attributes
inferred by existing attributes.
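The two normalization formulas above can be applied directly to a pandas Series (sample values are made up):

```python
import pandas as pd

v = pd.Series([2.0, 4.0, 6.0, 8.0])

minmax = (v - v.min()) / (v.max() - v.min())  # V' = (V - Min) / (Max - Min)
zscore = (v - v.mean()) / v.std()             # V' = (V - Mean) / StDev
print(minmax.tolist())
```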
Data reduction
 Reducing the number of attributes
 Data cube aggregation: applying roll-up, slice or dice operations.
 Removing irrelevant attributes: attribute selection (filter and
wrapper methods), searching the attribute space.
 Principal component analysis (numeric attributes only): searching for
a lower-dimensional space that can best represent the data.
 Reducing the number of attribute values
 Binning (histograms): reducing the number of attributes by grouping
them into intervals (bins).
 Clustering: grouping values in clusters.
 Aggregation or generalization
 Reducing the number of tuples
 Sampling
Discretization and generating concept hierarchies
 Unsupervised discretization - class variable is not used.
 Equal-interval (equiwidth) binning: split the whole range of numbers in
intervals with equal size.
 Equal-frequency (equidepth) binning: use intervals containing equal
number of values.
 Supervised discretization - uses the values of the class variable.
 Using class boundaries. Three steps:
 Sort values.
 Place breakpoints between values belonging to different classes.
 If too many intervals, merge intervals with equal or similar class distributions.
 Entropy (information)-based discretization.
Generating concept hierarchies: recursively applying partitioning or
discretization methods.
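pandas offers both unsupervised binning styles directly: pd.cut for equal-interval and pd.qcut for equal-frequency binning (the values below are illustrative):

```python
import pandas as pd

values = pd.Series([1, 7, 5, 4, 6, 3, 2, 8])

equal_width = pd.cut(values, bins=2)  # equal-interval (equiwidth) binning
equal_freq = pd.qcut(values, q=2)     # equal-frequency (equidepth) binning
print(equal_freq.value_counts())
```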
Missing Values in the array set or the Dataset
 Identifying the number of missing values in a dataset.
 Function: data.isna() or data.isnull()
 These functions return a boolean DataFrame with True
wherever a value is missing.
 We can also count the number of null values in a column.
 Function: data.isnull().sum() or data.isna().sum()
data.isnull().sum(axis=0) [column level] /
data.isnull().sum(axis=1) [row level]
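Putting the functions above together on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

print(data.isna())                # True wherever a value is missing
print(data.isnull().sum())        # missing values per column (axis=0)
print(data.isnull().sum(axis=1))  # missing values per row (axis=1)
```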