Recommender System using Pyspark - Python
Last Updated: 28 Apr, 2025
A recommender system is a type of information filtering system that provides personalized recommendations to users based on their preferences, interests, and past behavior. Recommender systems come in several forms: content-based, collaborative filtering, and hybrid. Content-based systems recommend items whose characteristics closely match those of items the user has previously shown interest in. Collaborative filtering systems recommend items based on the preferences of users whose interests are similar to those of the target user. Hybrid systems combine the content-based and collaborative filtering approaches.
This tutorial uses collaborative filtering. Collaborative filtering makes predictions (filtering) about a user's interests by collecting preference or taste data from many users (collaborating). The underlying assumption is that if users A and B share the same opinion on one item, A is more likely to share B's opinion on a different item x than the opinion of a randomly chosen user.
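The premise above can be sketched in plain Python (this is not Spark code). The example below is a minimal user-based illustration under stated assumptions: the `ratings` dictionary, the user and book names, and the helper functions `cosine_similarity` and `predict` are all hypothetical and exist only for this sketch.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "A": {"book1": 5, "book2": 4, "book3": 1},
    "B": {"book1": 5, "book2": 5, "book3": 1, "book4": 5},
    "C": {"book1": 1, "book2": 2, "book3": 5, "book4": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None

sim_ab = cosine_similarity(ratings["A"], ratings["B"])
sim_ac = cosine_similarity(ratings["A"], ratings["C"])
prediction = predict("A", "book4")
print(sim_ab, sim_ac, prediction)
```

Because A's past ratings agree with B's much more than with C's, the predicted rating for book4 is pulled toward B's opinion rather than C's.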
Recommender System using Pyspark
Spark MLlib implements collaborative filtering with the Alternating Least Squares (ALS) algorithm. The MLlib implementation has the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of ALS iterations to run.
- lambda is the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or the one adapted for implicit feedback data.
- alpha applies only to the implicit feedback variant and governs the baseline confidence in preference observations.
The DataFrame-based ALS class used in this tutorial (pyspark.ml.recommendation.ALS) exposes iterations as maxIter and lambda as regParam.
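To make the roles of rank, iterations, and lambda more concrete, here is a heavily simplified plain-Python sketch of alternating least squares. It is not Spark's implementation: the rank is fixed at 1 so that each update reduces to a scalar division, the regularization is applied unscaled, and the toy `ratings` triples and the `update` helper are assumptions made for this example.

```python
# Hypothetical toy data: (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0),
           (2, 1, 5.0), (2, 2, 3.0)]
num_users, num_items = 3, 3
iterations = 10   # number of alternating sweeps
lam = 0.1         # regularization parameter (lambda)

# rank is fixed at 1 here, so each latent factor is a single number
user_f = [1.0] * num_users
item_f = [1.0] * num_items

def squared_error():
    """Sum of squared differences between ratings and factor products."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)

def update(target, fixed, target_pos, fixed_pos):
    """Closed-form regularized least-squares update for one side,
    holding the other side's factors fixed."""
    for t in range(len(target)):
        rows = [row for row in ratings if row[target_pos] == t]
        num = sum(row[2] * fixed[row[fixed_pos]] for row in rows)
        den = sum(fixed[row[fixed_pos]] ** 2 for row in rows) + lam
        target[t] = num / den

error_before = squared_error()
for _ in range(iterations):
    update(user_f, item_f, 0, 1)   # fix item factors, solve for users
    update(item_f, user_f, 1, 0)   # fix user factors, solve for items
error_after = squared_error()
print(error_before, error_after)
```

Each sweep solves for one set of factors while the other is held constant, which is where the "alternating" in the name comes from. A larger rank would replace the scalar division with a small linear system per user or item.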
This tutorial uses a book ratings dataset.
Step 1: Import the necessary libraries and set up the Spark session
Python3
# Import the required PySpark classes
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
# Set up the Spark session
spark = SparkSession.builder.appName('Recommender').getOrCreate()
spark
Output:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.3.1
Master: local[*]
AppName: Recommender
Step 2: Read the data from the dataset
Python3
# Load book_ratings.csv (assumed to be in the working directory)
data = spark.read.csv('book_ratings.csv',
                      inferSchema=True, header=True)
data.show(5)
Output:
+-------+-------+------+
|book_id|user_id|rating|
+-------+-------+------+
|      1|    314|     5|
|      1|    439|     3|
|      1|    588|     5|
|      1|   1169|     4|
|      1|   1185|     4|
+-------+-------+------+
only showing top 5 rows
Describe the dataset
Python3
# Summary statistics for each column
data.describe().show()
Output:
+-------+-----------------+------------------+------------------+
|summary|          book_id|           user_id|            rating|
+-------+-----------------+------------------+------------------+
|  count|           981756|            981756|            981756|
|   mean|4943.275635697668|25616.759933221696|3.8565335989797873|
| stddev|2873.207414896143|15228.338825882149|0.9839408559619973|
|    min|                1|                 1|                 1|
|    max|            10000|             53424|                 5|
+-------+-----------------+------------------+------------------+
Step 3: Split the data into training and test sets
Python3
# Randomly split the data into train_data (80%) and test_data (20%)
train_data, test_data = data.randomSplit([0.8, 0.2])
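A random split has two side effects worth keeping in mind: it gives different results on each run unless a seed is fixed (PySpark's randomSplit accepts an optional seed argument), and it can leave some users only in the test set. The plain-Python sketch below imitates an 80/20 split on hypothetical toy rows to show how such "cold-start" rows can be detected. The toy data, the seed value, and the variable names are assumptions for illustration only.

```python
import random

rng = random.Random(42)  # fixed seed so the split is repeatable

# Hypothetical toy ratings as (book_id, user_id, rating) tuples;
# each of the 30 users rates two books
rows = [(book, user, rng.randint(1, 5))
        for user in range(1, 31) for book in range(1, 3)]

# Assign each row to train with probability 0.8, otherwise to test
train, test = [], []
for row in rows:
    (train if rng.random() < 0.8 else test).append(row)

# Test rows whose user never appears in the training set
train_users = {user for _, user, _ in train}
cold_start_rows = [row for row in test if row[1] not in train_users]
print(len(train), len(test), len(cold_start_rows))
```

A model trained only on `train` has learned nothing about the users in `cold_start_rows`, so it cannot produce a meaningful prediction for them.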
Step 4: Build and train the Alternating Least Squares (ALS) model
Python3
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5,
          regParam=0.01,
          userCol="user_id",
          itemCol="book_id",
          ratingCol="rating")
# Fit the model on train_data
model = als.fit(train_data)
Step 5: Predictions
Python3
# Generate predictions for the test data
predictions = model.transform(test_data)
# Display the predictions calculated by the model
predictions.show()
Output:
+-------+-------+------+----------+
|book_id|user_id|rating|prediction|
+-------+-------+------+----------+
|      2|   6342|     3| 4.8064413|
|      1|  17984|     5| 4.9681554|
|      1|  38475|     4| 4.4078903|
|      2|   6630|     5|  4.344222|
|      1|  32055|     4|  3.990228|
|      1|  33697|     4| 3.7945805|
|      1|  18313|     5|  4.533183|
|      1|   5461|     3| 3.8614116|
|      1|  47800|     5|  4.914357|
|      2|  10751|     3|  4.160536|
|      1|  16377|     4|  5.304298|
|      1|  45493|     5|  3.998557|
|      2|  10509|     2| 1.8626969|
|      1|  33890|     3| 3.6022692|
|      1|  37284|     5| 4.8147345|
|      1|   1185|     4| 3.7463336|
|      1|  44397|     5| 5.0251017|
|      1|  46977|     4| 4.0746284|
|      1|  10944|     5|  4.343548|
|      2|   8167|     2|  3.705464|
+-------+-------+------+----------+
only showing top 20 rows
Evaluations
Python3
# Compute and print the RMSE on the test predictions
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
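Output:
Root-mean-square error = nan
The RMSE is nan because the random split can place users or books in the test set that never appear in the training set. ALS has no latent factors for them, so it predicts NaN for those rows, and a single NaN makes the whole metric NaN. Creating the model with coldStartStrategy="drop" removes such rows from the predictions, which lets the evaluator return a finite RMSE.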
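As a plain-Python illustration (not Spark code) of how one missing prediction affects this metric, the sketch below computes the RMSE over hypothetical (rating, prediction) pairs, once with a NaN prediction included and once after dropping it. The `rmse` helper and the sample values are assumptions made for this example.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# Hypothetical (rating, prediction) pairs; the last prediction is missing
pairs = [(5, 4.5), (3, 3.5), (4, 4.0), (2, float("nan"))]

# NaN propagates through the sum, so the metric itself becomes NaN
rmse_all = rmse(pairs)

# Dropping rows with a NaN prediction gives a finite value again
kept = [(r, p) for r, p in pairs if not math.isnan(p)]
rmse_kept = rmse(kept)

print(rmse_all, rmse_kept)
```

Dropping the NaN rows means the metric only reflects users and books the model has actually seen during training.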
Step 6: Recommendations
Next, we use the trained model to score books for a single user, user1 (here, user_id 5461).
Python3
# Select the (book_id, user_id) pairs in the test set for user 5461
user1 = test_data.filter(test_data['user_id'] == 5461).select(['book_id', 'user_id'])
# Display the user1 data
user1.show()
Output:
+-------+-------+
|book_id|user_id|
+-------+-------+
|      1|   5461|
|     11|   5461|
|     19|   5461|
|     46|   5461|
|     60|   5461|
|     66|   5461|
|     93|   5461|
|    111|   5461|
|    121|   5461|
|    172|   5461|
|    194|   5461|
|    212|   5461|
|    222|   5461|
|    245|   5461|
|    264|   5461|
|    281|   5461|
|    301|   5461|
|    354|   5461|
|    388|   5461|
|    454|   5461|
+-------+-------+
only showing top 20 rows
Python3
# Predict ratings for user1 with the model trained on the training data
recommendations = model.transform(user1)
# Display the predicted ratings for user1, highest first
recommendations.orderBy('prediction', ascending=False).show()
Output:
+-------+-------+----------+
|book_id|user_id|prediction|
+-------+-------+----------+
|     19|   5461| 5.3429904|
|     11|   5461|  4.830688|
|     66|   5461|  4.804107|
|    245|   5461|  4.705879|
|    388|   5461| 4.6276107|
|   1161|   5461|  4.612251|
|     60|   5461| 4.5895457|
|   1402|   5461|    4.5184|
|   1088|   5461|  4.454755|
|   5152|   5461|  4.415825|
|    121|   5461| 4.3423634|
|     93|   5461| 4.3357944|
|   1796|   5461|   4.30891|
|    172|   5461| 4.2679276|
|    454|   5461|  4.245925|
|   1211|   5461| 4.2431927|
|    731|   5461| 4.1873074|
|   1094|   5461| 4.1829815|
|    222|   5461|  4.182873|
|    264|   5461| 4.1469045|
+-------+-------+----------+
only showing top 20 rows
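The output above lists the predicted ratings for the user with user_id 5461, sorted from highest to lowest. These are scores for books the user has already rated in the test set, and ALS predictions are not restricted to the 1 to 5 rating scale (the top prediction here is about 5.34).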
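In practice, a recommender usually suggests books the user has not rated yet. Spark's ALSModel offers recommendForAllUsers and recommendForUserSubset for that purpose. The plain-Python sketch below only illustrates the selection logic under stated assumptions: the `top_n` helper, the rounded sample scores, the `already_rated` set, and the optional clipping to the 1 to 5 range are hypothetical and not part of the tutorial's model.

```python
def top_n(scores, already_rated, n=3, lo=1.0, hi=5.0):
    """Return the n best-scoring unseen books, with scores clipped to [lo, hi]."""
    candidates = [(book, min(max(score, lo), hi))
                  for book, score in scores.items()
                  if book not in already_rated]
    # Sort by clipped score, highest first, and keep the first n entries
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical predicted scores for one user (book_id -> prediction)
scores = {19: 5.34, 11: 4.83, 66: 4.80, 245: 4.71, 388: 4.63}
already_rated = {11, 245}   # assumed set of books the user has rated

result = top_n(scores, already_rated, n=2)
print(result)
```

Clipping is a presentation choice for a bounded rating scale; the ranking itself depends only on the relative order of the scores.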
Step 7: Stop the Spark session
Python3
# Stop the Spark session and release its resources
spark.stop()