Write custom aggregation function in Pandas
Last Updated: 20 Aug, 2020
Pandas is widely used in Python for data analysis and provides data structures such as DataFrame and Series. Among its many helpful functions is the aggregate function, agg(). An aggregation takes multiple values, grouped together on some criterion, and returns a single value. Common aggregate functions include average, count and maximum, among others.
Syntax: DataFrame.agg(func=None, axis=0, *args, **kwargs)
Parameters:
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0. With 0 or ‘index’ the function is applied to each column; with 1 or ‘columns’ it is applied to each row.
- func: function, str, list or dict. The function to use for aggregation. Accepted forms are: a function, a string function name, a list of functions and/or function names, or a dict mapping column labels to any of these.
- *args: It specifies the positional arguments to pass to the function.
- **kwargs: It specifies the keyword arguments to pass to the function.
Return: scalar, Series or DataFrame. The result is a scalar when Series.agg is called with a single function, a Series when DataFrame.agg is called with a single function, and a DataFrame when DataFrame.agg is called with several functions.
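The three return types can be checked with a short sketch. The `demo` DataFrame and its column names `x` and `y` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical demo data, not part of the article's example
demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Series.agg with a single function -> scalar
scalar_result = demo['x'].agg('sum')

# DataFrame.agg with a single function -> Series (one entry per column)
series_result = demo.agg('sum')

# DataFrame.agg with a list of functions -> DataFrame (one row per function)
frame_result = demo.agg(['sum', 'max'])

print(type(scalar_result).__name__, scalar_result)
print(type(series_result).__name__)
print(type(frame_result).__name__)
```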
Let's create a Dataframe:
Python3
# import pandas library
import pandas as pd

# create a Dataframe
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# show the dataframe
df
Output:
   Col_A  Col_B  Col_C
0     10     20     30
1     40     50     60
2     70     80     90
3    100    110    120
Now, let's perform some operations:
1. Performing aggregation over the rows: With the default axis=0, each aggregate function runs down the rows and reduces every column to a single value. Example 1 passes two function names, sum and min. sum adds up the values of Col_A (10, 40, 70, 100), Col_B (20, 50, 80, 110) and Col_C (30, 60, 90, 120) separately, giving 220, 260 and 300; min finds the smallest value in each column, giving 10, 20 and 30. Example 2 works the same way and adds max.
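To make the arithmetic concrete, here is a minimal sketch that rebuilds the same DataFrame and compares the agg result with sums and minimums computed one column at a time. The names `manual_sums` and `manual_mins` are introduced only for this illustration.

```python
import pandas as pd

# Same data as the article's DataFrame
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# Default axis=0: each function reduces every column to one value
result = df.agg(['sum', 'min'])

# The same numbers computed column by column
manual_sums = [df[col].sum() for col in df.columns]
manual_mins = [df[col].min() for col in df.columns]

print(result)
print(manual_sums)
print(manual_mins)
```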
Example 1:
Python3
df.agg(['sum', 'min'])
Output:
     Col_A  Col_B  Col_C
sum    220    260    300
min     10     20     30
Example 2:
Python3
df.agg(['sum', 'min', 'max'])
Output:
     Col_A  Col_B  Col_C
sum    220    260    300
min     10     20     30
max    100    110    120
2. Performing aggregation per column: Passing a dict applies different aggregate functions to specific columns. In the first example, two columns are selected, 'Col_A' and 'Col_B'. For Col_A the sum and the minimum are calculated, and for Col_B the minimum and the maximum. Example 2 extends this to Col_C, for which the sum and the mean are calculated.
Example 1:
Python3
df.agg({'Col_A': ['sum', 'min'],
        'Col_B': ['min', 'max']})
Output:
Example 2:
Python3
df.agg({'Col_A': ['sum', 'min'],
        'Col_B': ['min', 'max'],
        'Col_C': ['sum', 'mean']})
Output:
Note: The result contains NaN wherever a particular aggregation was not requested for a particular column.
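The NaN behaviour in the note can be inspected directly. The following is a small sketch on the same data; it looks cells up by label so that it does not depend on the order of the result rows, which may differ between pandas versions.

```python
import pandas as pd

# Same data as the article's DataFrame
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

result = df.agg({'Col_A': ['sum', 'min'],
                 'Col_B': ['min', 'max']})

# 'max' was not requested for Col_A and 'sum' was not requested
# for Col_B, so those two cells are NaN
print(result.isna())

# Requested cells hold values; because NaN is a float value,
# the columns are stored as float64
print(result.loc['min'])
print(result.dtypes)
```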
3. Performing aggregation over the columns: With axis="columns" (or axis=1), the aggregate function runs across the columns and reduces every row to a single value. In the example, the mean of the first (10, 20, 30), second (40, 50, 60), third (70, 80, 90) and fourth (100, 110, 120) row is calculated separately, giving 20, 50, 80 and 110.
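As a rough sketch of how the axis argument changes the shape of the result, the snippet below computes the row means and then passes a list of functions with the same axis. The list variant is an addition for illustration; the names `row_means` and `row_stats` are hypothetical.

```python
import pandas as pd

# Same data as the article's DataFrame
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# One function across the columns: a Series with one value per row
row_means = df.agg('mean', axis='columns')

# Several functions across the columns: a DataFrame with one row per
# original row and one column per function
row_stats = df.agg(['sum', 'min'], axis='columns')

print(row_means)
print(row_stats)
```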
Example:
Python3
df.agg("mean", axis = "columns")
Output:
0     20.0
1     50.0
2     80.0
3    110.0
dtype: float64
4. Custom aggregate function: Sometimes the built-in aggregations are not enough and we need to write our own aggregate function.
Example: Consider a DataFrame consisting of student id (stud_id), subject code (sub_code) and marks (marks).
Python3
# import pandas library
import pandas as pd

# Creating DataFrame
df = pd.DataFrame(
    {'stud_id': [101, 102, 103, 104,
                 101, 102, 103, 104],
     'sub_code': ['CSE6001', 'CSE6001', 'CSE6001',
                  'CSE6001', 'CSE6002', 'CSE6002',
                  'CSE6002', 'CSE6002'],
     'marks': [77, 86, 55, 90,
               65, 90, 80, 67]}
)

# Printing DataFrame
df
Output:
Now suppose you need to calculate the total marks (the marks of both subjects) of each student (unique stud_id). This can be done with a custom aggregate function. Here the custom aggregate function is named 'total'.
Python3
# Importing reduce for
# rolling computations
from functools import reduce

# define a Custom aggregation
# function for finding total
def total(series):
    return reduce(lambda x, y: x + y, series)

# Group the rows by student id and
# compute the total marks per student.
# The built-in sum is included as well
# to check whether the output of the
# custom function is correct.
df.groupby('stud_id').agg({'marks': ['sum', total]})
Output:
As you can see, both columns hold the same total marks (142, 176, 135 and 157 for students 101 to 104), so our aggregate function calculates the totals correctly in this case.
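A custom aggregate function can also take extra parameters. The syntax section lists *args and **kwargs for this purpose, and the agg method of a grouped column accepts extra keyword arguments in the same way. The sketch below assumes a hypothetical function `count_above` with a hypothetical `threshold` parameter; neither is part of pandas.

```python
import pandas as pd

# Same student data as in the example above
df = pd.DataFrame(
    {'stud_id': [101, 102, 103, 104,
                 101, 102, 103, 104],
     'sub_code': ['CSE6001', 'CSE6001', 'CSE6001',
                  'CSE6001', 'CSE6002', 'CSE6002',
                  'CSE6002', 'CSE6002'],
     'marks': [77, 86, 55, 90,
               65, 90, 80, 67]}
)

# Hypothetical custom aggregation: how many marks in the
# group are strictly above the given threshold
def count_above(series, threshold):
    return int((series > threshold).sum())

# The keyword argument threshold=70 is forwarded to
# count_above for every group of marks
result = df.groupby('stud_id')['marks'].agg(count_above, threshold=70)
print(result)
```

Keeping the threshold as a parameter, instead of hard-coding it in the function body, lets the same function be reused with different cut-offs.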