Python Programming
Pandas in Python
Sejal Kadam
Assistant Professor
Department of Electronics & Telecommunication
DJSCE, Mumbai
WHAT IS PANDAS?
• Pandas is an open-source library for data manipulation in Python.
• Pandas provides an easy way to create, manipulate, and wrangle data.
• Pandas is built on top of NumPy, so it needs NumPy to operate.
• Pandas is also well suited to time series data.
6/21/2024 DJSCE_EXTC_Sejal Kadam 2
WHY USE PANDAS?
• Pandas is a useful library in data analysis.
• It provides an efficient way to slice, merge, concatenate, or reshape data
• It handles missing data easily
• It includes powerful tools for working with time series
• It uses Series for one-dimensional data and DataFrame for two-dimensional data
HOW TO INSTALL PANDAS?
You can install Pandas using:
• Anaconda: conda install -c anaconda pandas
• In Jupyter Notebook:
import sys
!conda install --yes --prefix {sys.prefix} pandas
WHAT IS A DATA FRAME?
A data frame is a two-dimensional, tabular data structure with labeled axes (rows and columns).
A data frame is a standard way to store data.
Each column can hold a different data type, such as integer, float, or string.
Data: can be a list, a dictionary, or a scalar value
Pandas data frame:
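As a minimal sketch of the definition above, the snippet below builds a small data frame from a list of rows. The sample values and the names rows and people are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each inner list is one row.
rows = [["ABC", 30, 55.5], ["XYZ", 40, 72.0]]
people = pd.DataFrame(rows, columns=["Name", "Age", "Weight"])

print(people)
print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # each column has its own data type
```

Each column keeps its own data type, which is why a single data frame can mix strings, integers, and floats.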
WHAT IS A SERIES?
A series is a one-dimensional, labeled data structure.
It can hold any data type, such as integer, float, or string.
Data: can be a list, a dictionary, or a scalar value
A series, by definition, cannot have multiple columns.
import pandas as pd
pd.Series([1., 2., 3.])
0 1.0
1 2.0
2 3.0
dtype: float64
You can add an index with the index parameter.
It names the rows.
The index must have the same length as the data.
pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
Output
a 1.0
b 2.0
c 3.0
dtype: float64
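The Series slides say the data can be a list, a dictionary, or a scalar value. The list case is shown above; the sketch below covers the other two. The variable names from_dict and from_scalar are chosen only for this example.

```python
import pandas as pd

# Dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# Scalar: the value is repeated once for every index label.
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_dict)
print(from_scalar)
```

With a scalar, the index is required, because it tells pandas how many times to repeat the value.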
You can create a Pandas series with a missing value.
Note that pandas displays missing values as "NaN".
You can use NumPy's np.nan to insert a missing value explicitly.
import numpy as np
pd.Series([1, 2, np.nan])
Output
0 1.0
1 2.0
2 NaN
dtype: float64
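Once a series contains NaN, you usually want to detect, fill, or drop it. The following is a small sketch using isna(), fillna(), and dropna(); the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])

mask = s.isna()        # True where the value is missing
filled = s.fillna(0)   # replace missing values with 0
dropped = s.dropna()   # remove missing values
total = s.sum()        # sum() skips NaN by default

print(mask.tolist())   # [False, False, True]
print(total)           # 3.0
```

None of these calls modifies s; each one returns a new object.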
You can also use a dictionary to create a Pandas data frame.
dic = {'Name': ["ABC", "XYZ"], 'Age': [30, 40]}
pd.DataFrame(data=dic)
Name Age
0 ABC 30
1 XYZ 40
DATE RANGE
Pandas has a convenient API to create a range of dates
pd.date_range(start, periods=periods, freq=freq)
• start is the starting date
• periods is the number of dates to generate (optional if the end date is specified)
• freq is the frequency: day: 'D', month end: 'M', and year end: 'Y' (pandas 2.2 and later prefer 'ME' and 'YE')
## Create date Days
dates_d = pd.date_range('20240101', periods=6, freq='D')
print('Day:', dates_d)
Output
Day: DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06'], dtype='datetime64[ns]', freq='D')
# Months
dates_m = pd.date_range('20240131', periods=6, freq='M')
print('Month:', dates_m)
Output
Month: DatetimeIndex(['2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30', '2024-05-31', '2024-06-30'], dtype='datetime64[ns]', freq='M')
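The number of periods is optional when an end date is given. Here is a short sketch of that case, plus a daily range that crosses the end of February. The names week and feb are illustrative only.

```python
import pandas as pd

# Start and end are both given, so periods can be omitted.
week = pd.date_range(start="2024-01-01", end="2024-01-07", freq="D")

# 2024 is a leap year, so a daily range passes through February 29.
feb = pd.date_range(start="2024-02-27", periods=4, freq="D")

print(len(week))  # 7 (both endpoints are included)
print(feb)
```

Daily frequency is used here because 'D' is spelled the same way across pandas versions.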
INSPECTING DATA
You can check the head or tail of a dataset by calling head() or tail() on the pandas data frame.
Step 1) Create a random array with NumPy. The array has 6 rows and 4 columns.
import numpy as np
random = np.random.randn(6,4)
Step 2) Then create a data frame using pandas.
Use dates_m as the index for the data frame, so each row is labeled with a date.
Finally, name the 4 columns with the columns argument.
# Create data with date
df = pd.DataFrame(random,index=dates_m,columns=list('ABCD'))
Step 3) Using head function
df.head(3)
A B C D
2024-01-31 1.139433 1.318510 -0.181334 1.615822
2024-02-29 -0.081995 -0.063582 0.857751 -0.527374
2024-03-31 -0.519179 0.080984 -1.454334 1.314947
Step 4) Using tail function
df.tail(3)
A B C D
2024-04-30 -0.685448 -0.011736 0.622172 0.104993
2024-05-31 -0.935888 -0.731787 -0.558729 0.768774
2024-06-30 1.096981 0.949180 -0.196901 -0.471556
Step 5) A good way to get an overview of the data is to use describe(). It reports the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column.
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.002317 0.256928 -0.151896 0.467601
std 0.908145 0.746939 0.834664 0.908910
min -0.935888 -0.731787 -1.454334 -0.527374
25% -0.643880 -0.050621 -0.468272 -0.327419
50% -0.300587 0.034624 -0.189118 0.436883
75% 0.802237 0.732131 0.421296 1.178404
max 1.139433 1.318510 0.857751 1.615822
A few useful functions:
df.mean() Returns the mean of each column
df.corr() Returns the correlation between columns in a data frame
df.count() Returns the number of non-null values in each data frame column
df.max() Returns the highest value in each column
df.min() Returns the lowest value in each column
df.median() Returns the median of each column
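To see these functions on values you can check by hand, here is a small sketch with a fixed data frame instead of random numbers. The name scores and its values are invented for the example.

```python
import pandas as pd

# Column B is exactly twice column A, so the correlation is 1.
scores = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0]})

means = scores.mean()      # mean of each column
corr = scores.corr()       # correlation matrix between columns
counts = scores.count()    # non-null values per column
highest = scores.max()     # highest value per column
middle = scores.median()   # median of each column

print(means)
print(corr)
```

Each call returns one value per column, except corr(), which returns a column-by-column matrix.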
Accessing various data formats
Pandas can read many data formats, such as CSV, JSON, Excel, and Pickle.
It represents your data in a tabular, row-and-column layout, which makes the data readable and presentable.
You can read a CSV file with the read_csv() function.
For example:
df = pd.read_csv("data1.csv")
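The file data1.csv is not part of these slides, so the sketch below feeds read_csv() from an in-memory string instead. The column names and values are assumptions made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age\nABC,30\nXYZ,40\n"

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)  # age is parsed as an integer column
```

The first line of the CSV is used as the column header by default.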
SLICE DATA
You can use the column name to extract data in a particular column.
## Slice
### Using name
df['A']
Output:
2024-01-31 -0.168655
2024-02-29 0.689585
2024-03-31 0.767534
2024-04-30 0.557299
2024-05-31 -1.547836
2024-06-30 0.511551
Freq: M, Name: A, dtype: float64
To select multiple columns, use double brackets: [[..,..]]
The outer pair of brackets is the indexing operator; the inner pair builds the list of columns you want to return.
df[['A', 'B']]
A B
2024-01-31 -0.168655 0.587590
2024-02-29 0.689585 0.998266
2024-03-31 0.767534 -0.940617
2024-04-30 0.557299 0.507350
2024-05-31 -1.547836 1.276558
2024-06-30 0.511551 1.572085
You can also slice the rows.
The code below returns the first three rows.
### Using a slice for rows
df[0:3]
A B C D
2024-01-31 -0.168655 0.587590 0.572301 -0.031827
2024-02-29 0.689585 0.998266 1.164690 0.475975
2024-03-31 0.767534 -0.940617 0.227255 -0.341532
The loc indexer selects rows and columns by label.
As usual, the values before the comma refer to the rows and the values after it refer to the columns.
Use a list in brackets to select more than one column.
## Multi col
df.loc[:,['A','B']]
A B
2024-01-31 -0.168655 0.587590
2024-02-29 0.689585 0.998266
2024-03-31 0.767534 -0.940617
2024-04-30 0.557299 0.507350
2024-05-31 -1.547836 1.276558
2024-06-30 0.511551 1.572085
There is another way to select multiple rows and columns in Pandas: iloc[]. It uses integer positions instead of row and column labels. The code below returns the same data frame as above.
df.iloc[:, :2]
A B
2024-01-31 -0.168655 0.587590
2024-02-29 0.689585 0.998266
2024-03-31 0.767534 -0.940617
2024-04-30 0.557299 0.507350
2024-05-31 -1.547836 1.276558
2024-06-30 0.511551 1.572085
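The loc and iloc examples above select columns only. The sketch below selects rows and columns together and shows one difference worth remembering: a label slice with loc includes its end point, while a position slice with iloc excludes it. The frame here uses made-up integer values and a daily index so the result is easy to verify.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list("ABCD"))

# Label-based: both '2024-01-02' and '2024-01-04' are included.
by_label = frame.loc["2024-01-02":"2024-01-04", ["A", "B"]]

# Position-based: rows 1, 2, 3 and columns 0, 1 (end positions excluded).
by_position = frame.iloc[1:4, 0:2]

print(by_label)
print(by_label.equals(by_position))  # True
```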
DROP A COLUMN
You can drop columns using df.drop(), which returns a new data frame.
df.drop(columns=['A', 'C'])
B D
2024-01-31 0.587590 -0.031827
2024-02-29 0.998266 0.475975
2024-03-31 -0.940617 -0.341532
2024-04-30 0.507350 -0.296035
2024-05-31 1.276558 0.523017
2024-06-30 1.572085 -0.594772
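A point that is easy to miss: drop() does not change the original data frame unless you assign the result. A minimal sketch with a made-up frame called table:

```python
import pandas as pd

table = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

trimmed = table.drop(columns=["A", "C"])  # new frame without A and C
no_first_row = table.drop(index=0)        # drop() can also remove rows by label

print(list(trimmed.columns))  # ['B']
print(list(table.columns))    # ['A', 'B', 'C'] (original is unchanged)
```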
CONCATENATION
You can concatenate two DataFrames in Pandas with pd.concat().
First, create two DataFrames.
import pandas as pd
df1 = pd.DataFrame({'name': ['ABC', 'XYZ','PQR'],'Age': ['25', '30', '50']},
index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['LMN', 'XYZ' ],'Age': ['26', '11']},
index=[3, 4])
Finally, concatenate the two DataFrames:
df_concat = pd.concat([df1,df2])
df_concat
name Age
0 ABC 25
1 XYZ 30
2 PQR 50
3 LMN 26
4 XYZ 11
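The example above sets the index of each DataFrame by hand so the labels do not collide. If you leave the default index in place, pd.concat() keeps the original labels, which can repeat. The sketch below shows that case and the ignore_index option; the frames left and right are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"name": ["ABC", "XYZ"], "Age": [25, 30]})
right = pd.DataFrame({"name": ["LMN"], "Age": [26]})

stacked = pd.concat([left, right])                        # index labels: 0, 1, 0
renumbered = pd.concat([left, right], ignore_index=True)  # index labels: 0, 1, 2

print(stacked.index.tolist())
print(renumbered.index.tolist())
```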
DROP_DUPLICATES
If a dataset may contain duplicate information, `drop_duplicates` is an easy way to exclude duplicate rows. You can see that `df_concat` has a duplicate: `XYZ` appears twice in the column `name`. By default, the first occurrence is kept.
df_concat.drop_duplicates('name')
name Age
0 ABC 25
1 XYZ 30
2 PQR 50
3 LMN 26
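The first occurrence is kept by default, but the keep parameter changes that. A short sketch, rebuilding the same rows in a frame called people so it runs on its own:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["ABC", "XYZ", "PQR", "LMN", "XYZ"], "Age": [25, 30, 50, 26, 11]}
)

first_kept = people.drop_duplicates(subset="name")              # keeps row 1, drops row 4
last_kept = people.drop_duplicates(subset="name", keep="last")  # keeps row 4, drops row 1
flags = people.duplicated(subset="name")                        # marks repeats as True

print(first_kept.index.tolist())  # [0, 1, 2, 3]
print(last_kept.index.tolist())   # [0, 2, 3, 4]
```

duplicated() is useful when you want to inspect the repeated rows before removing them.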
SORT VALUES
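You can sort values with sort_values. Because Age holds strings here, the order is lexicographic; it matches numeric order only because every value has two digits.
df_concat.sort_values('Age')
name Age
4 XYZ 11
0 ABC 25
3 LMN 26
1 XYZ 30
2 PQR 50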
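When a column such as Age is stored as text, sorting compares the values as strings. The sketch below uses made-up ages of different lengths to show the effect, then converts the column to integers and sorts in descending order.

```python
import pandas as pd

ages = pd.DataFrame({"name": ["ABC", "XYZ", "PQR"], "Age": ["9", "30", "100"]})

# String comparison: "100" < "30" < "9".
as_text = ages.sort_values("Age")

# Convert to integers, then sort from highest to lowest.
ages["Age"] = ages["Age"].astype(int)
as_number = ages.sort_values("Age", ascending=False)

print(as_text["Age"].tolist())    # ['100', '30', '9']
print(as_number["Age"].tolist())  # [100, 30, 9]
```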
RENAME: CHANGE OF COLUMN NAMES
You can use rename to rename columns in Pandas. In the dictionary passed to columns, each key is the current column name and each value is the new column name.
df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
Surname Age_ppl
0 ABC 25
1 XYZ 30
2 PQR 50
3 LMN 26
4 XYZ 11
Operations on Series using pandas methods
You can perform binary operations on series, such as addition and subtraction.
Operators such as + and - work directly; methods such as .add() and .sub() additionally accept fill_value, which substitutes a value for labels missing from one of the series.
# adding two series data & data1 using
# .add
data.add(data1, fill_value=0)
# subtracting two series data & data1 using
# .sub
data.sub(data1, fill_value=0)
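The slide does not define data and data1, so the sketch below assumes two small series with partly different index labels. It compares the plain + operator with .add() and .sub() using fill_value=0.

```python
import pandas as pd

# Assumed example series; labels 'c' and 'd' each appear in only one of them.
data = pd.Series([5, 2, 3], index=["a", "b", "c"])
data1 = pd.Series([1, 6, 4], index=["a", "b", "d"])

plain = data + data1                        # 'c' and 'd' become NaN
added = data.add(data1, fill_value=0)       # missing side is treated as 0
subtracted = data.sub(data1, fill_value=0)

print(plain.tolist())       # [6.0, 8.0, nan, nan]
print(added.tolist())       # [6.0, 8.0, 3.0, 4.0]
print(subtracted.tolist())  # [4.0, -4.0, 3.0, -4.0]
```

Pandas aligns the two series by index label before operating, which is why the result has the four labels a, b, c, d.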
Binary operation and related methods on series:
FUNCTION DESCRIPTION
add() Adds a series or list-like object of the same length to the caller series
sub() Subtracts a series or list-like object of the same length from the caller series
mul() Multiplies the caller series by a series or list-like object of the same length
div() Divides the caller series by a series or list-like object of the same length
sum() Returns the sum of the values for the requested axis
prod() Returns the product of the values for the requested axis
mean() Returns the mean of the values for the requested axis
pow() Raises each element of the caller series to the power given by the corresponding element of the passed series
abs() Returns the absolute numeric value of each element in a Series/DataFrame
cov() Computes the covariance of two series
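A few of the methods in the table can be tried on values small enough to check by hand. This is only a sketch; the series x and y are invented for the example.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0])
y = pd.Series([2.0, 4.0, 6.0])

product = x.mul(y)      # element-wise x * y
quotient = y.div(x)     # element-wise y / x (the caller is the numerator)
squares = x.pow(2)      # each element of x raised to the power 2
covariance = x.cov(y)   # sample covariance of x and y

print(product.tolist())   # [2.0, 8.0, 18.0]
print(quotient.tolist())  # [2.0, 2.0, 2.0]
print(squares.tolist())   # [1.0, 4.0, 9.0]
print(covariance)         # 2.0
```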