Data Science Interview Questions | Data Science Interview Questions And Answers | Simplilearn

Data Science Interview Questions

What do you understand from Measures and Dimensions?
Each field from the data source is automatically assigned a
datatype (such as string, integer) and a role (dimension or
measure)
Aggregation applied on measures is ‘Sum’ by default but you
can always change the default aggregation in the settings
Can you solve?
You have two buckets: one of 3 liters and the other of 5 liters. You are expected to measure exactly 4 liters. How will you complete the task?
Note: There is no third bucket.

Step 1: Fill the 5-liter bucket and pour from it into the 3-liter bucket until the 3-liter bucket is full. You are left with 2 liters in the 5-liter bucket.
Step 2: Empty the 3-liter bucket and pour the 2 liters from the 5-liter bucket into it. The 3-liter bucket now holds 2 liters.
Step 3: Fill the 5-liter bucket again and pour from it into the 3-liter bucket (which already holds 2 liters from step 2) until the 3-liter bucket is full. This takes 1 liter.
You now have 4 liters in the 5-liter bucket.


1 List the differences between supervised and unsupervised learning
Supervised Learning
• Uses known and labeled data as input
• Requires both an input and an output to be given to the model for it to be trained
• Has a feedback mechanism
• Most commonly used algorithms: decision tree, logistic regression, support vector machine

Unsupervised Learning
• Uses unlabeled data as input
• Has no feedback mechanism
• Most commonly used algorithms: k-means clustering, hierarchical clustering, apriori algorithm


2 How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label, what we
want to predict) and one or more independent variables (our features) by estimating probabilities
using its underlying logistic function (sigmoid).
[Figure: inputs X1 to X4 feed a linear model; a sigmoid function turns its output into probabilities (values close to 0 and 1); a threshold classifier then outputs 0 or 1]
Linear model: $y = m x + c$
Sigmoid function: $p = \frac{1}{1 + e^{-y}}$
Log-odds form: $\ln\left(\frac{p}{1-p}\right) = m x + c$
[Figure: two plots against the number of hours studied: marks (a straight line) and pass, 0 or 1 (a sigmoid curve)]
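The pipeline described here (linear model, sigmoid, threshold) can be sketched in a few lines of Python. This is only an illustration: the coefficients `m` and `c` and the 0.5 threshold are assumed values, not taken from the slides, and a real model would learn the coefficients from data.

```python
import math

def sigmoid(y: float) -> float:
    """Map a linear-model output y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def predict(x: float, m: float, c: float, threshold: float = 0.5) -> int:
    """Return 1 (pass) if the estimated probability reaches the threshold, else 0."""
    p = sigmoid(m * x + c)
    return 1 if p >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
m, c = 1.5, -6.0
for hours in (2, 4, 6):
    p = sigmoid(m * hours + c)
    print(hours, round(p, 3), predict(hours, m, c))
```

Taking the log-odds of the returned probability recovers the linear term `m * x + c`, which is the relationship the third equation states.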


3 Explain the steps in making a decision tree
1. Take the entire dataset as input
2. Calculate the entropy of the target variable as well as the predictor attributes
3. Calculate the information gain of all attributes
4. Choose the attribute with the highest information gain as the root node
5. Repeat the same process on every branch until the decision node of each branch is finalized

For example, suppose you want to build a decision tree to decide whether we should accept or decline a job offer.
[Figure: decision tree with the decision nodes "Salary > $50,000", "Commute > 1 hour" and "Offers Incentives", each with Yes/No branches, and the leaf nodes "Accept Offer" and "Decline Offer"]
Tip: You should know the formulae for entropy and information gain!
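Since the tip asks for the entropy and information gain formulae, here is a minimal Python sketch of both. The toy job-offer data and the names `accepted`, `high_salary` and `long_commute` are hypothetical and only serve to show an informative and an uninformative attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows by the value of one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: did the candidate accept the offer?
accepted = ["yes", "yes", "no", "no"]
high_salary = [True, True, False, False]   # separates the classes perfectly
long_commute = [True, False, True, False]  # carries no information here

print(information_gain(accepted, high_salary))
print(information_gain(accepted, long_commute))
```

The attribute with the larger gain would be chosen as the root node, and the same calculation would then be repeated on each branch.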


4 How do you build a random forest model?
1. Randomly select "k" features from the total "m" features, where k << m
2. Among the "k" features, calculate the node "d" using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps 2 and 3 until the leaf nodes are finalized
5. Build the forest by repeating steps 1 to 4 "n" times to create "n" trees


5 How can you avoid overfitting of your model?
There are three main methods to avoid overfitting:
• Keep the model simple: take into account fewer variables, thereby removing some of the noise in the training data
• Use cross-validation techniques such as k-fold cross-validation
• Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting
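To make the k-fold idea concrete, the sketch below splits row indices into k folds so that every row is used for testing exactly once. The helper name `k_fold_indices` is made up for this illustration, and the split is done without shuffling, which a real workflow would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` split and scored on the matching `test` split; averaging the k scores gives a less optimistic estimate than scoring on the training data.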

Can you solve?
There are 9 balls, of which one is heavier and the rest are all of the same weight. What is the minimum number of weighings needed to find the heavier ball?

You will need to perform 2 weighings:
Step 1: Place three balls on each side of the balance.
Scenario (a): The two sides balance
The heavier ball is among the remaining three balls. Step 2: take two of them and place one ball on each side. If they balance, the ball left out is the heavier one; otherwise the balance shows which one is heavier.
Scenario (b): The two sides do not balance
Take the three balls from the heavier side and repeat step 2 on them to find the heavier ball.


6 Differentiate between univariate, bivariate and multivariate analysis
Univariate
This type of data contains only one variable. The purpose of univariate analysis is to describe the data and find patterns that exist within it.
Example: height of students
The patterns can be studied by drawing conclusions using mean, median and mode, dispersion or range, minimum, maximum, etc.
Height (in cm): 164, 167.3, 170, 174.2, 178, 180

Bivariate
This type of data involves two different variables. The analysis deals with causes and relationships, and is done to find out the relationship between the two variables.
Example: temperature and ice cream sales in the summer season
Here, the table shows that sales increase as the temperature increases.
Temperature (in Celsius)   Sales
20                         2000
25                         2100
26                         2300
28                         2400
30                         2600
35                         3100

Multivariate
When the data involves three or more variables, it is categorized as multivariate.
It is similar to bivariate analysis, but the relationships are studied across more than two variables.
Example: data for house price prediction
The patterns can be studied by drawing conclusions using mean, median and mode, dispersion or range, minimum, maximum, etc.
No. of rooms   Floor   Sqft. Area   Price
2              0       900          40,00,000
3              2       1100         60,00,000
3.5            5       1500         90,00,000
4              3       2100         1,20,00,000


7 What are the feature selection methods to select the right variables?
There are two main methods for feature selection:
Filter Methods
• Linear Discriminant Analysis
• ANOVA
• Chi-Square
Wrapper Methods
• Forward Selection
• Backward Selection
• Recursive Feature Elimination

8 In your choice of language: Write a program that prints the numbers from 1 to 50. But for multiples of three print "Fizz" instead of the number, and for multiples of five print "Buzz". For numbers which are multiples of both three and five, print "FizzBuzz".
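One possible answer in Python is sketched below. The helper name `fizzbuzz` is just a convenient choice for this example; checking divisibility by 15 first handles the numbers that are multiples of both three and five.

```python
def fizzbuzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for number in range(1, 51):
    print(fizzbuzz(number))
```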


9 You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
Ways to handle missing data values:
• If the dataset is huge, we can simply remove the rows with missing data values. It is the quickest way, i.e. we use the rest of the data to predict the values.
• We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, i.e. compute mean = df.mean() and then call df.fillna(mean).


10 For the given points, how will you calculate the Euclidean distance in Python?
Given points:
plot1 = [1,3]
plot2 = [2,5]
Calculation:
from math import sqrt
euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )  # sqrt(5), about 2.236

Can you solve?
What is the angle between the hour and minute hands of a clock when the time is half past six?

• The minute hand has travelled for 30 minutes, so it has covered 30 × 6 = 180°
• The hour hand has travelled for 6.5 hours, so it has covered 6.5 × 30 = 195°
• The difference between the two gives the angle between the two hands. Thus, the required angle = 195° − 180° = 15°
Note: A clock is a complete circle of 360 degrees.
In 1 hour, the hour hand covers 360/12 = 30°
In 1 minute, the minute hand covers 360/60 = 6°
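The same arithmetic generalizes to any time. The sketch below is one way to write it; the function name `clock_angle` is an assumption of this example, and it returns the smaller of the two angles between the hands.

```python
def clock_angle(hour: int, minute: int) -> float:
    """Smaller angle in degrees between the hour and minute hands."""
    minute_angle = minute * 6.0                     # 360 / 60 degrees per minute
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # 30 degrees per hour plus drift
    difference = abs(hour_angle - minute_angle)
    return min(difference, 360.0 - difference)

print(clock_angle(6, 30))
```

The `minute * 0.5` term accounts for the hour hand moving 30° over 60 minutes, which is why the hands are not aligned at half past six.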


11 Explain dimensionality reduction, and list its benefits
Dimensionality reduction refers to the process of converting a data set with a large number of dimensions into data with fewer dimensions (fields) that conveys similar information concisely.
• It helps in compressing data and reducing storage space
• It reduces computation time, as fewer dimensions lead to less computing
• It removes redundant features. For example, there is no point in storing a value in two different units (meters and inches)


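12 How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix?
Consider the matrix
$$A = \begin{bmatrix} -2 & -4 & 2 \\ -2 & 1 & 2 \\ 4 & 2 & 5 \end{bmatrix}$$
Characteristic equation: $\det(A - \lambda I) = 0$
Expanding the determinant:
$$(-2-\lambda)\,[(1-\lambda)(5-\lambda) - 2 \times 2] + 4\,[(-2)(5-\lambda) - 4 \times 2] + 2\,[(-2) \times 2 - 4(1-\lambda)] = 0$$
$$-\lambda^3 + 4\lambda^2 + 27\lambda - 90 = 0$$
$$\lambda^3 - 4\lambda^2 - 27\lambda + 90 = 0$$
By hit and trial, for $\lambda = 3$:
$$3^3 - 4 \times 3^2 - 27 \times 3 + 90 = 0$$
Hence $(\lambda - 3)$ is a factor:
$$\lambda^3 - 4\lambda^2 - 27\lambda + 90 = (\lambda - 3)(\lambda^2 - \lambda - 30) = (\lambda - 3)(\lambda + 5)(\lambda - 6)$$
So, the eigenvalues are 3, -5, 6.
Calculate the eigenvector for $\lambda = 3$. Writing the eigenvector as $(X, Y, Z)$ and taking X = 1, the first two rows of $(A - 3I)v = 0$ give:
$$-5 - 4Y + 2Z = 0$$
$$-2 - 2Y + 2Z = 0$$
Subtracting the first equation from the second:
$$3 + 2Y = 0, \quad Y = -\tfrac{3}{2}$$
Substituting back into the second equation:
$$Z = -\tfrac{1}{2}$$
Similarly, we can calculate the eigenvectors for -5 and 6.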
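The hand calculation can be checked with a short Python sketch. It is only a verification aid under the values stated in the worked example (the matrix, the cubic, and the eigenvector with X = 1, Y = -3/2, Z = -1/2); the helper names are made up for this illustration, and `fractions.Fraction` is used so the comparison is exact.

```python
from fractions import Fraction

A = [[-2, -4, 2],
     [-2,  1, 2],
     [ 4,  2, 5]]

def characteristic_polynomial(lam):
    """Value of lambda^3 - 4*lambda^2 - 27*lambda + 90 from the text."""
    return lam**3 - 4 * lam**2 - 27 * lam + 90

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Each eigenvalue should be a root of the characteristic polynomial.
for lam in (3, -5, 6):
    print(lam, characteristic_polynomial(lam))

# The eigenvector found for lambda = 3, kept exact with fractions.
v = [Fraction(1), Fraction(-3, 2), Fraction(-1, 2)]
print(mat_vec(A, v) == [3 * x for x in v])
```

In practice a numerical library routine would be used for this, but the check above mirrors the definition A·v = λ·v directly.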


13 How should you maintain your deployed model?
• Monitor: Constant monitoring of all of the models is needed to determine their performance accuracy
• Evaluate: Evaluation metrics of the current model are calculated to determine if a new algorithm is needed
• Compare: The new models are compared against each other to determine which model performs best
• Rebuild: The best-performing model is rebuilt on the current state of the data


14 What are recommender systems?
A recommender system predicts the "rating" or "preference" a user would give to a product.
Collaborative Filtering
Example: Last.fm recommends tracks that are often played by other users with similar interests
Content-based Filtering
Example: Pandora uses the properties of a song to recommend music with similar properties


15 How to find RMSE and MSE in a linear regression model?
RMSE and MSE are two of the most common measures of accuracy for a linear regression model.
RMSE is the Root Mean Square Error
MSE is the Mean Square Error
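Both measures follow directly from their names: MSE averages the squared prediction errors and RMSE takes the square root of that average. A minimal Python sketch, with made-up `actual` and `predicted` values used purely for illustration:

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 7.0, 9.0]      # hypothetical observed values
predicted = [2.0, 5.0, 8.0, 11.0]  # hypothetical model outputs
print(mse(actual, predicted), rmse(actual, predicted))
```

RMSE is often easier to interpret because it is expressed in the same units as the variable being predicted.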

Can you solve?
If it rains on Saturday with probability 0.6, and it rains on Sunday with probability 0.2, what is the probability that it rains this weekend?

Assuming rain on the two days is independent:
P(rain this weekend) = Total probability − (Probability that it will not rain on Saturday) × (Probability that it will not rain on Sunday)
= 1 − (1 − 0.6)(1 − 0.2) = 0.68


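16 How can you select k for k-means?
We use the "Elbow Method" to select k for k-means
• The idea of the elbow method is to run k-means clustering on the dataset for a range of values of 'k', where 'k' is the number of clusters, and to compute the within sum of squares for each value
• Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid
• The value of 'k' at the elbow point of the WSS curve is selected
[Figure: WSS plotted against the number of clusters, with the elbow point marked]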
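The WSS definition can be turned into a few lines of Python. This sketch only computes WSS for clusterings that are already given; the helper names and the four sample points are hypothetical, and running k-means itself for each candidate k is left out.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def wss(clusters):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        for point in points:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

# Hypothetical data: the same four points grouped as one or as two clusters.
one_cluster = [[(0, 0), (0, 2), (10, 0), (10, 2)]]
two_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(wss(one_cluster), wss(two_clusters))
```

WSS always falls as k grows, so the elbow method looks for the k after which the drop becomes small rather than for the minimum.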


17 What is the significance of p-value?
• p-value typically ≤ 0.05: indicates strong evidence against the null hypothesis, so you reject the null hypothesis
• p-value typically > 0.05: indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
• p-value at the cutoff of 0.05: considered to be marginal (could go either way)


18 How can outlier values be treated?
1. You can drop outliers if they are garbage values.
Example: Height of adult = abc ft. This cannot be true, as height cannot be a string value. In this case, the outlier can be removed.
2. If the outliers have extreme values, they can be removed.
For example, if all the data points are clustered between 0 and 10 but one point lies at 100, then we can remove this point.
[Figure: scatter plot of predicted values against actual values]

If you cannot drop outliers, you can try the following:
1. Try a different model. Data detected as outliers by a linear model can be fit by a non-linear model, so be sure you are choosing the right model.
2. Try normalizing the data. This way the extreme data points are pulled into a similar range.
3. Use algorithms that are less affected by outliers, for example random forest.


19 How can you say that a time series data is stationary?
We can say that a time series is stationary when the variance and mean of the series are constant with time.
[Figure: four time series plots. Stationary: the mean is constant with time. Non-stationary: the mean is increasing with time. Stationary: the variance is constant with time. Non-stationary: the variance is changing with time.]


20 How can you calculate accuracy using confusion matrix?
Total = 650          Actual p               Actual n
Predicted P          262 (True Positive)    15 (False Positive)
Predicted N          26 (False Negative)    347 (True Negative)
Accuracy = (True Positive + True Negative) / Total Observations
= (262 + 347) / 650
= 609 / 650
≈ 0.94


21 Write the equation and calculate precision and recall rate
Using the confusion matrix with Total = 650, True Positive = 262, False Positive = 15, False Negative = 26 and True Negative = 347:
Precision = (True Positive) / (True Positive + False Positive)
Recall Rate = (True Positive) / (True Positive + False Negative)
Precision = 262/277 ≈ 0.95
Recall = 262/288 ≈ 0.91
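The three confusion-matrix formulas can be computed together. The sketch below plugs in the counts from the matrix (262, 15, 26, 347); the function name `classification_metrics` is just a label chosen for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

metrics = classification_metrics(tp=262, fp=15, fn=26, tn=347)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Precision divides by everything the model predicted as positive, while recall divides by everything that actually is positive, which is why the two denominators differ.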

Can you solve?
If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of having a matching pair?

The answer is 4.
An example:
First pick is white
Second pick is red
Third pick is blue, so no pairs yet
Fourth pick is 100% guaranteed to make a pair, because it is either white, blue or red.
So, four picks guarantee a pair.
If there were four colors, the answer would be 5, and so on.


22 ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?
The recommendation engine is built using Collaborative Filtering.
• Collaborative filtering exploits the behavior of other users and their purchase history in terms of ratings, selection, etc.
• It makes predictions on what might interest a person based on the preferences of many other users
• In this algorithm, features of the items are not known

For example, suppose x number of people buy a new phone and then also buy a tempered glass with it. Next time, when a person buys a phone, they will be recommended to buy a tempered glass along with it.


23 Write a SQL query to list all orders with customer information
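Order Table: OrderId, CustomerId, OrderNumber, TotalAmount
Customer Table: Id, FirstName, LastName, City, Country
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM "Order"
JOIN Customer
ON "Order".CustomerId = Customer.Id
Note: ORDER is a reserved word in SQL, so the table name is quoted here with standard double quotes; the exact quoting syntax depends on the database.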
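To try the join end to end, the sketch below runs it against an in-memory SQLite database from Python. The column types and the single sample customer and order are invented for the demonstration; only the table and column names come from the slide. The table name is quoted because ORDER is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, FirstName TEXT,
                           LastName TEXT, City TEXT, Country TEXT);
    CREATE TABLE "Order" (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER,
                          OrderNumber TEXT, TotalAmount REAL);
    INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace', 'London', 'UK');
    INSERT INTO "Order" VALUES (10, 1, 'A-100', 250.0);
""")

# Inner join: one result row per order, with its customer's details attached.
rows = conn.execute("""
    SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
    FROM "Order"
    JOIN Customer ON "Order".CustomerId = Customer.Id
""").fetchall()
print(rows)
conn.close()
```

An inner join drops orders whose CustomerId has no matching customer; a LEFT JOIN from the order table would keep them with empty customer fields.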

24 You are given a dataset on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

Cancer detection results in IMBALANCED DATA.
In an imbalanced dataset, accuracy should not be used as a measure of performance, because a model can reach high accuracy simply by predicting the majority class. It is important to focus on the remaining 4%, which are the people who were wrongly diagnosed.
Wrong diagnosis is of major concern because there can be people who have cancer but were not predicted to have it.
Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and F measure to determine the class-wise performance of the classifier.


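25 Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
• K-means clustering
• Linear regression
• K-NN
• Decision trees
Answer: K-NN, because a missing value can be filled in from the nearest neighbors whether the variable is categorical or continuous.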

Can you solve?
Given a box of matches and two ropes, not necessarily identical, measure a period of 45 minutes.
Note: The ropes are not uniform in nature, and each rope takes exactly 60 minutes to burn out completely.

We have two ropes A and B.
• Light A from both ends and B from one end.
• When A has finished burning, we know that 30 minutes have elapsed and B has 30 minutes remaining.
• Now light the other end of B as well, so that the remaining part of B takes 15 minutes to burn.
• Thus, we have 30 + 15 = 45 minutes.

26 Below are the 8 actual values of the target variable in the train file:
[0,0,0,1,1,1,1,1]
What is the entropy of the target variable?

Choose the right answer:
• -(5/8 log(5/8) + 3/8 log(3/8))
• 5/8 log(5/8) + 3/8 log(3/8)
• 3/8 log(5/8) + 5/8 log(3/8)
• 5/8 log(3/8) – 3/8 log(5/8)
Answer: -(5/8 log(5/8) + 3/8 log(3/8)), since five of the eight values are 1 and three are 0.
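For the target values [0,0,0,1,1,1,1,1], the entropy expression can also be evaluated numerically. The worked calculation below assumes base-2 logarithms, which the options do not state explicitly; with another base the value changes by a constant factor.

```latex
\begin{aligned}
H &= -\sum_{i} p_i \log_2 p_i \\
  &= -\left(\tfrac{5}{8}\log_2\tfrac{5}{8} + \tfrac{3}{8}\log_2\tfrac{3}{8}\right) \\
  &\approx 0.4238 + 0.5306 \\
  &\approx 0.954 \text{ bits}
\end{aligned}
```

The value is close to 1 bit because the two classes are nearly balanced; a perfectly even 4/4 split would give exactly 1.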

27 We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this use case?

Choose the right algorithm:
• Logistic regression
• Linear regression
• K-means clustering
• Apriori algorithm
Answer: Logistic regression, because the target is the probability of a binary outcome.

28 After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

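Choose the right algorithm:
• K-means clustering
• Linear regression
• Association rules
• Decision trees
Answer: K-means clustering, because the task is to group users around four individual types, which corresponds to clustering with k = 4.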

29 You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

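Choose the right answer:
• {banana, apple, grape, orange} must be a frequent itemset
• {banana, apple} => {orange} must be a relevant rule
• {grape} => {banana, apple} must be a relevant rule
• {grape, apple} must be a frequent itemset
Answer: {grape, apple} must be a frequent itemset, because every subset of a frequent itemset such as {banana, apple, grape} is itself frequent.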

30 Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use?

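Choose the right analysis method:
• One-way ANOVA
• K-means clustering
• Association rules
• Student T-test
Answer: One-way ANOVA, because three groups are being compared (each of the two coupons and no coupon).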
