www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop Market
 According to Forrester: a growth rate of about 13% over the next 5 years, more than twice the predicted growth of general IT
 U.S. and International Operations (29%) and
Enterprises (27%) lead the adoption of Big
Data globally
 Asia Pacific to be fastest growing Hadoop
market with a CAGR of 59.2 %
 Companies are focusing on improving customer relationships (55%) and making the business more data-focused (53%)
[Chart: Hadoop market size, 2013–2016, growing at a CAGR of 58.2%]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Hadoop Job Trends
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Agenda for Today
Hadoop Interview Questions
 Big Data & Hadoop
 HDFS
 MapReduce
 Apache Hive
 Apache Pig
 Apache HBase and Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop Interview Questions
“The harder I practice, the luckier I get.”
Gary Player
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. What are the five V’s associated with Big Data?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. What are the five V’s associated with Big Data?
[Diagram: the five V's — Volume, Velocity, Variety, Veracity and Value — surrounding "Big Data"]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. Differentiate between structured, semi-structured and unstructured data?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. Differentiate between structured, semi-structured and unstructured data?
 Structured: organized data format; the data schema is fixed; example: RDBMS data, etc.
 Semi-structured: partially organized data; lacks the formal structure of a data model; example: XML & JSON files, etc.
 Unstructured: unorganized data; unknown schema; example: multimedia files, etc.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. How does Hadoop differ from a traditional processing system using RDBMS?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. How does Hadoop differ from a traditional processing system using RDBMS?
RDBMS vs Hadoop:
 RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, semi-structured or unstructured.
 RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data in a distributed, parallel fashion.
 RDBMS is based on 'schema on write', where schema validation is done before loading the data. Hadoop, on the contrary, follows a 'schema on read' policy.
 In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
 RDBMS is suitable for OLTP (Online Transaction Processing). Hadoop is suitable for OLAP (Online Analytical Processing).
 RDBMS is licensed software. Hadoop is an open-source framework.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. Explain the components of Hadoop and their services.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. Explain the components of Hadoop and their services.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. What are the main Hadoop configuration files?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Big Data & Hadoop
Q. What are the main Hadoop configuration files?
hadoop-env.sh core-site.xml
hdfs-site.xml yarn-site.xml
mapred-site.xml masters
slaves
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS Interview Questions
“A person who never made a mistake never tried
anything new.”
Albert Einstein
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. HDFS stores data on commodity hardware, which has a higher chance of failure. So, how does HDFS ensure the fault tolerance of the system?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. HDFS stores data on commodity hardware, which has a higher chance of failure. So, how does HDFS ensure the fault tolerance of the system?
 HDFS replicates the blocks and
stores on different DataNodes
 Default Replication Factor is set
to 3
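A minimal sketch of how the replication factor can be inspected or changed per file through the Hadoop FileSystem API (assuming the Hadoop client libraries and a reachable HDFS; the path /data/important.txt is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // cluster-wide default is 3
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one (hypothetical) file to 5
        fs.setReplication(new Path("/data/important.txt"), (short) 5);
        fs.close();
    }
}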
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this
problem.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this
problem.
Problem:
 Too many small files = too many blocks
 Too many blocks = too much metadata (all of which the NameNode has to manage)
 Managing this huge amount of metadata is difficult
 Increased cost of seeks
Solution:
 Hadoop Archive (HAR)
 It clubs small HDFS files into a single archive:
> hadoop archive -archiveName edureka_archive.har /input/location /output/location
[Diagram: many small HDFS files packed into a single .HAR file]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will be the size of each block?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will be the size of each block?
 Default block size = 128 MB
 514 MB / 128 MB ≈ 4.02, so 5 blocks are needed: 4 blocks of 128 MB each and 1 block of 2 MB
 Replication factor = 3
 Total blocks = 5 * 3 = 15
 Total size = 514 MB * 3 = 1542 MB
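For completeness, the same arithmetic as a tiny Java sketch (the numbers come from the question itself; nothing is queried from a cluster):

public class BlockCountDemo {
    public static void main(String[] args) {
        long fileSize = 514L * 1024 * 1024;   // 514 MB
        long blockSize = 128L * 1024 * 1024;  // Hadoop 2.x default block size
        int replication = 3;                  // default replication factor

        long blocks = (fileSize + blockSize - 1) / blockSize;                        // ceiling division -> 5
        long lastBlockMb = (fileSize - (blocks - 1) * blockSize) / (1024 * 1024);    // final partial block -> 2 MB
        System.out.println(blocks + " blocks (last one " + lastBlockMb + " MB), "
                + blocks * replication + " block replicas, "
                + (514 * replication) + " MB stored in total");
    }
}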
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?
 Block size: 32 MB = 33554432 Bytes ( Default block size: 128 MB)
 Command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /local/test.txt /sample_hdfs
 Check the block size of test.txt
hadoop fs -stat %o /sample_hdfs/test.txt
[Diagram: test.txt is copied from the local file system into HDFS at /sample_hdfs with -Ddfs.blocksize=33554432, so it is stored as 32 MB blocks, while existing HDFS files keep their 128 MB blocks]
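The same thing can also be done programmatically; a minimal sketch (assuming the Hadoop client libraries and the paths from the slide, /local/test.txt and /sample_hdfs/test.txt) using the FileSystem.create overload that takes an explicit block size:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 32L * 1024 * 1024;  // 32 MB instead of the 128 MB default
        Path target = new Path("/sample_hdfs/test.txt");
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(target, true, 4096, (short) 3, blockSize)) {
            // For a sketch we simply read the whole local file into memory
            out.write(Files.readAllBytes(Paths.get("/local/test.txt")));
        }
        System.out.println("Block size: " + fs.getFileStatus(target).getBlockSize());
    }
}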
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What is a block scanner in HDFS?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What is a block scanner in HDFS?
 Block scanner maintains integrity of the data blocks
 It runs periodically on every DataNode to verify whether
the data blocks stored are correct or not
Steps taken when a corrupted block is detected:
1. The DataNode reports the corrupted block to the NameNode
2. The NameNode schedules the creation of new replicas using the good replicas
3. Once the replication factor (the number of uncorrupted replicas) reaches the required level, the corrupted blocks are deleted
Note: This question is generally asked for Hadoop Admin positions
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. Can multiple clients write into an HDFS file concurrently?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. Can multiple clients write into an HDFS file concurrently?
 HDFS follows Single Writer Multiple Reader Model
 The client which opens a file for writing is granted a lease
by the NameNode
 NameNode rejects write request of other clients for the
file which is currently being written by someone else
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What do you mean by the High Availability of a NameNode? How is it achieved?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HDFS
Q. What do you mean by the High Availability of a NameNode? How is it achieved?
 The NameNode used to be a single point of failure in Hadoop 1.x
 High Availability refers to keeping an active NameNode available to the cluster at all times
 The HDFS HA architecture in Hadoop 2.x allows us to have two NameNodes in an active/passive (standby) configuration
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce Interview Questions
“Never tell me the sky’s the limit when there are
footprints on the moon.”
–Author Unknown
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Explain the process of spilling in MapReduce?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Explain the process of spilling in MapReduce?
 The output of a map task is written into a circular memory buffer (RAM)
 The default buffer size is 100 MB, as specified by mapreduce.task.io.sort.mb
 Spilling is the process of copying the data from the memory buffer to the local disk once a certain threshold is reached
 The default spill threshold is 0.8 (80% of the buffer), as specified by mapreduce.map.sort.spill.percent
[Diagram: the buffer in the NodeManager's RAM fills up; at 80% full the data is spilled to the local disk]
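A minimal sketch (property names as above; the values are only illustrative, not recommendations) of how these thresholds could be tuned per job from the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // sort buffer in MB (default 100)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill threshold (default 0.80)
        Job job = Job.getInstance(conf, "spill-tuning-demo");
        System.out.println("Buffer: " + job.getConfiguration().getInt("mapreduce.task.io.sort.mb", 100) + " MB");
    }
}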
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the difference between blocks, input splits and records?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the difference between blocks, input splits and records?
[Diagram: blocks are a physical division of the data; input splits and records are logical divisions]
 Blocks: Data in HDFS is physically
stored as blocks
 Input Splits: Logical chunks of data to
be processed by an individual mapper
 Records: Each input split is comprised
of records e.g. in a text file each line is
a record
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the role of RecordReader in Hadoop MapReduce?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the role of RecordReader in Hadoop MapReduce?
 RecordReader converts the data present in a file into (key, value) pairs suitable for reading by the
Mapper task
 The RecordReader instance to use is defined by the InputFormat
Example: for a text file containing the lines "1 David", "2 Cassie", "3 Remo", "4 Ramesh", …, the RecordReader emits (key, value) pairs to the Mapper where the key is the byte offset at which the line starts and the value is the line itself: (0, "1 David"), (57, "2 Cassie"), (122, "3 Remo"), (171, "4 Ramesh"), …
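A minimal sketch of the mapper side of this contract (assuming TextInputFormat, whose LineRecordReader produces exactly the (offset, line) pairs shown above):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The input key/value types must match what the RecordReader produces:
// LongWritable (byte offset of the line) and Text (the line content).
public class OffsetAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(line.toString()), offset);
    }
}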
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the significance of counters in MapReduce?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the significance of counters in MapReduce?
 Used for gathering statistics about the job:
 for quality control
 for application-level statistics
 Counters are easier to retrieve than log messages for a large distributed job
 For example: counting the number of invalid records — in the input {1 David, 2%^&%d, 3 Jeff, 4 Shawn, 5$*&!#$} an "invalid records" counter would end up at 2
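A minimal sketch of such a quality-control counter inside a mapper (the counter group/name, the "id name" validity rule and the record format are illustrative assumptions, not part of the slide):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // A simple "id<space>name" pattern; anything else is counted as invalid
    private static final java.util.regex.Pattern VALID =
            java.util.regex.Pattern.compile("\\d+\\s+[A-Za-z]+");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (VALID.matcher(value.toString().trim()).matches()) {
            context.write(value, NullWritable.get());
        } else {
            // Counters are aggregated by the framework and reported with the job status
            context.getCounter("DataQuality", "INVALID_RECORDS").increment(1);
        }
    }
}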
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Why is the output of map tasks stored (spilled) on the local disk and not in HDFS?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Why is the output of map tasks stored (spilled) on the local disk and not in HDFS?
 The output of a map task consists of intermediate key-value pairs, which are then processed by the reducer
 This intermediate output is not required once the job has completed
 Storing the intermediate output in HDFS and replicating it would create unnecessary overhead
[Diagram: the Mapper writes to the local disk of its NodeManager; only the Reducer's final output is written to HDFS]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Define Speculative Execution
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Define Speculative Execution
 If a task is detected to be running slower than expected, an equivalent task is launched on another node so that the slow task does not hold up the critical path of the job
 The scheduler tracks the progress of all tasks (map and reduce) and launches speculative duplicates for the slower ones
 As soon as one copy of the task completes, all other running duplicates are killed
[Diagram: the scheduler sees a slow MR task on one NodeManager and launches a speculative duplicate on another]
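A minimal sketch (the property names are the standard Hadoop 2.x ones; the chosen values are only illustrative) showing how speculative execution can be toggled per job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);     // speculate on slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // but not on reduce tasks
        Job job = Job.getInstance(conf, "speculation-demo");
        System.out.println("Map speculation enabled: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
    }
}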
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. How will you prevent a file from splitting in case you want the whole file to be processed by the
same mapper?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. How will you prevent a file from splitting in case you want the whole file to be processed by the
same mapper?
Method 1: In the driver, increase the minimum split size to be larger than the largest file:
i. conf.set("mapred.min.split.size", "size_larger_than_file_size");
ii. Input split size is computed as: max(minimumSize, min(maximumSize, blockSize))

Method 2: Modify the InputFormat class that you want to use:
i. Subclass the concrete subclass of FileInputFormat and override the isSplitable() method to return false, as shown below:

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Is it legal to set the number of reducer tasks to zero? Where will the output be stored in this case?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. Is it legal to set the number of reducer tasks to zero? Where will the output be stored in this case?
 Yes, it is legal to set the number of reducer tasks to zero
 It is done when there is no need for a reducer, e.g. when the input only needs to be transformed into a particular format, or for a map-side join
 The map output is then stored directly in HDFS, at the output path specified by the client
[Diagram: with reducers set to zero, the reduce phase is skipped and the map output goes straight from the HDFS input to the HDFS output]
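A minimal driver sketch of a map-only job (input and output paths taken from the command line; with no mapper class set, the identity mapper is used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJobDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-demo");
        job.setJarByClass(MapOnlyJobDemo.class);
        job.setNumReduceTasks(0);  // skip shuffle/sort and reduce entirely
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // mappers write part-m-* files here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}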
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the role of Application Master in a MapReduce Job?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What is the role of Application Master in a MapReduce Job?
 Acts as a per-job helper process for the ResourceManager
 Initializes the job and keeps track of the job's progress
 Retrieves the input splits computed by the client
 Negotiates with the ResourceManager for the resources needed to run the job
 Creates a map task object for each split
[Sequence: the client submits the job to the RM; the RM launches the AM on a NodeManager; the AM asks the RM for resources, runs the tasks, reports status, and unregisters when the job finishes]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What do you mean by MapReduce task running in uber mode?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. What do you mean by MapReduce task running in uber mode?
 If a job is small, the ApplicationMaster may choose to run its tasks in its own JVM; such tasks are called uber tasks
 This reduces the overhead of allocating new containers for running the tasks
 A MapReduce job is run as an uber task if:
 it requires fewer than 10 mappers
 it requires only one reducer
 the input size is less than the HDFS block size
 Parameters that decide whether a job qualifies as an uber task:
 mapreduce.job.ubertask.maxmaps
 mapreduce.job.ubertask.maxreduces
 mapreduce.job.ubertask.maxbytes
 To enable uber tasks, set mapreduce.job.ubertask.enable to true
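A minimal sketch (the property names are the ones listed above; the values are only illustrative) of enabling and bounding uber mode from the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberModeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);       // allow uber mode
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);             // at most 9 mappers
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);          // at most 1 reducer
        conf.setLong("mapreduce.job.ubertask.maxbytes", 128L << 20);  // input below one block
        Job job = Job.getInstance(conf, "uber-demo");
        System.out.println("Uber enabled: " + job.getConfiguration()
                .getBoolean("mapreduce.job.ubertask.enable", false));
    }
}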
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
[Diagram: uber-mode job flow — the client JVM runs the MR code and submits the job, copying job resources to HDFS; the ResourceManager launches the ApplicationMaster on a NodeManager; the AM runs the MR task as an uber task inside its own JVM and writes the output. Criteria: fewer than 10 mappers, only one reducer, input size less than the HDFS block size]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. How will you enhance the performance of a MapReduce job when dealing with too many small files?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
MapReduce
Q. How will you enhance the performance of a MapReduce job when dealing with too many small files?
 CombineFileInputFormat can be used to solve this
problem
 CombineFileInputFormat packs all the small files
into input splits where each split is processed by a
single mapper
 Takes node and rack locality into account when
deciding which blocks to place in the same split
 Can process the input files efficiently in a typical
MapReduce job
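A minimal sketch of switching a job over to combined splits (using CombineTextInputFormat, the text-file flavour of CombineFileInputFormat; the 128 MB cap is only an illustrative value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJobDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        // Pack many small files into fewer splits so one mapper processes several files
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}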
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive Interview Questions
“Generally, the questions that seem complicated have simple answers.”
– Anonymous
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. Where does the data of a Hive table get stored?
Q. Why is HDFS not used by the Hive metastore for storage?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. Where does the data of a Hive table get stored?
 By default, the Hive table data is stored in an HDFS directory: /user/hive/warehouse
 This is specified in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml
Q. Why is HDFS not used by the Hive metastore for storage?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. Where does the data of a Hive table get stored?
 By default, the Hive table data is stored in an HDFS directory: /user/hive/warehouse
 This is specified in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml
Q. Why is HDFS not used by the Hive metastore for storage?
 Files in HDFS cannot be edited in place, while the metastore needs frequent updates
 The metastore stores metadata in an RDBMS to provide low query latency
 HDFS read/write operations are comparatively time-consuming
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Scenario:
Suppose, I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration.
Then, what will happen if we have multiple clients trying to access Hive at the same time?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Scenario:
Suppose, I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration.
Then, what will happen if we have multiple clients trying to access Hive at the same time?
 Multiple client access is not allowed in the default metastore configuration (embedded mode), so only one client session can connect at a time
 To allow multiple concurrent clients, one may use the following two metastore configurations:
1. Local Metastore Configuration 2. Remote Metastore Configuration
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. What is the difference between external table and managed table?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. What is the difference between external table and managed table?
Managed Table:
 Hive is responsible for managing the table data as well as its metadata
 On dropping the table, the metadata along with the table data is deleted from the Hive warehouse
External Table:
 Hive is responsible for managing only the table metadata, not the table data
 On dropping the table, Hive deletes just the metadata, leaving the table data untouched
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. When should we use SORT BY instead of ORDER BY ?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. When should we use SORT BY instead of ORDER BY ?
 SORT BY sorts the data within each of multiple reducers, so each reducer's output is sorted but the combined output is not totally ordered
 ORDER BY sorts all of the data together using a single reducer
 SORT BY should therefore be used to sort huge datasets, where pushing everything through a single reducer would be slow
[Diagram: with ORDER BY the whole dataset flows through one reducer; with SORT BY it is distributed across Reducer 1 … Reducer n]
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. What is the difference between partition and bucket in Hive?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Q. What is the difference between partition and bucket in Hive?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Hive
Scenario:
CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for the month -
January. But, Hive is taking too much time in processing this query. How will you solve this problem?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
 Create a partitioned table:
 CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
 Enable dynamic partitioning in Hive:
 SET hive.exec.dynamic.partition = true;
 SET hive.exec.dynamic.partition.mode = nonstrict;
 Transfer the data:
 INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
 Run the query:
 SELECT SUM(amount) FROM partitioned_transaction WHERE month = 'January';
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. What is dynamic partitioning and when is it used?
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. What is dynamic partitioning and when is it used?
 Values for partition columns are known during runtime
 One may use dynamic partition in following cases:
 Loading data from an existing non-partitioned table to improve the sampling (query latency)
 Values of the partitions are not known before hand and therefore, finding these unknown
partition values manually from huge data sets is a tedious task
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. How does Hive distribute the rows into buckets?
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. How does Hive distribute the rows into buckets?
 Bucket number is determined for a row by using the formula:
hash_function (bucketing_column) modulo (num_of_buckets)
 hash_function depends on the column data type i.e. for int type it is equal to value of column
 hash_function for other data types is complex to calculate
Example: bucketing the rows (1, John), (2, Mike), (3, Shawn) on the id column into 2 buckets:
 For an INT column, hash_function(id) = id, so the bucket is determined by id mod 2
 1 mod 2 = 1 and 3 mod 2 = 1, so (1, John) and (3, Shawn) land in the same bucket
 2 mod 2 = 0, so (2, Mike) goes to the other bucket
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Scenario:
Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following
entries:
id,first_name,last_name,e-mail,gender,ip
1,Hugh,Jackman,hugh32@sun.co,Male,136.90.241.52
2,David,Lawrence,dlawrence@gmail.co,Male,101.177.15.130
3,Andy,Hall,anyhall@yahoo.co,Female,114.123.153.64
4,Samuel,Jackson,samjackson@rediff.co,Male,91.121.145.67
5,Emily,Rose,rosemily@edureka.co,Female,117.123.108.98
How will you consume this CSV file into the Hive warehouse using built-in SerDe?
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
 A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive
 CREATE EXTERNAL TABLE sample (id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
 SELECT first_name FROM sample WHERE gender = 'Male';
Note:
 Hive provides several built-in SerDes, e.g. for JSON, TSV, etc.
 OpenCSVSerde is useful in cases where you have embedded commas inside delimited fields
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Scenario:
 I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive
table corresponding to these files.
 The data in these files are in the format: {id, name, e-mail, country}
Now, as we know, Hadoop performance degrades when we use lots of small files. So, how will you
solve this problem?
Apache Hive
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
 Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
 Load the data from the input directory into temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
 Create a table that will store the data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
 Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Apache Hive
 When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used
for a given record
 Sequence files are flat files consisting of binary key-value pairs
 Using sequence file, one can club two or more smaller files to make them one single sequence file
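For illustration, a minimal Java sketch of writing several small files' contents into one SequenceFile as (file name, content) records (assuming the Hadoop client libraries; the output path and the sample records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/input_seq/combined.seq");  // hypothetical output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // Each small file is appended as one (file name, file contents) record
            writer.append(new Text("part-0001.csv"), new Text("1,John,john@x.co,US"));
            writer.append(new Text("part-0002.csv"), new Text("2,Mike,mike@x.co,UK"));
        }
    }
}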
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig Interview Questions
“Whenever you are asked if you can do a job, tell them, 'Certainly I can!' Then get busy and find out how to do it.”
–Theodore Roosevelt
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What is the difference between logical and physical plans?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What is the difference between logical and physical plans?
Logical Plan:
 Created for each line in pig script if no syntax error is
found by interpreter
 No data processing happens during creation of logical
plan
Physical Plan:
 Physical plan is basically a series of map reduce jobs
 Describes the physical operators to execute the script,
without reference to how they will be executed in
MapReduce
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What is a bag in Pig Latin?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What is a bag in Pig Latin?
 Unordered collection of tuples
 Duplicate tuples are allowed
 Tuples with differing numbers of fields are allowed
 For example:
{ (Linkin Park, 7, California),
(Metallica, 8),
(Mega Death, Los Angeles) }
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?
 No data type is required: for a tuple {a, b, c}, positional notation such as $2 refers to c, the 3rd field
 A missing schema is tolerated: operators such as JOIN, COGROUP, etc. still work with the schema treated as NULL
 When the schema is NULL, fields default to bytearray, and their data types can be defined at runtime
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What are the different execution modes available in Pig?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What are the different execution modes available in Pig?
MapReduce Mode:
 Default mode
 Requires access to a Hadoop
cluster
 Input and output data are present
on HDFS
Local Mode:
 Requires access to a single machine
 ‘-x ’ flag is used to specify the local
mode environment (pig -x local)
 Input and output data are present on
local file system
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What does Flatten do in Pig?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Apache Pig
Q. What does Flatten do in Pig?
 Flatten un-nests bags and tuples.
 For tuples, the Flatten operator will substitute the fields of a tuple in place of the tuple
 For example: given the tuple (a, (b, c)), GENERATE $0, flatten($1) produces (a, b, c)
 Un-nesting bags is a little more complex, as it requires creating new tuples
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase & Sqoop Interview Questions
“Take risks: if you win, you will be happy; if you
lose, you will be wise.”
–Anonymous
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. What are the key components of HBase?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. What are the key components of HBase?
 HMaster manages the Region
Servers
 Region Server manages a group of
regions
 ZooKeeper acts as a coordinator inside the HBase environment
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. How do we back up a HBase cluster?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. How do we back up a HBase cluster?
1. Full Shutdown Backup
 Useful for cases where HBase cluster shutdown is
possible
 Steps:
• Stop HBase: Stop the HBase services first
• Distcp: Copy the contents of the HBase directory
into another HDFS directory in different or same
cluster
2. Live Cluster Backup
 Useful for live cluster that cannot afford downtime
 Steps:
• CopyTable: Copy data from one table to
another on the same or different cluster
• Export: Dumps the content of a table into
HDFS on the same cluster
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. What is a Bloom filter and how does it help in searching rows?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. What is a Bloom filter and how does it help in searching rows?
 Used to improve the overall read throughput of the cluster
 A space-efficient mechanism to test whether an HFile contains a specific row or row-column cell
 Saves the time spent scanning non-relevant blocks for a given row key
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
HBase
Q. What is the role of JDBC driver in a Sqoop set up?
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Sqoop
Q. What is the role of JDBC driver in a Sqoop set up?
 To connect to different relational databases, Sqoop needs a connector
 Almost every DB vendor makes this connector available as a JDBC driver specific to that DB
 Sqoop needs the JDBC driver of each database that it has to interact with
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. When to use --target-dir and when to use --warehouse-dir while importing data?
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. When to use --target-dir and when to use --warehouse-dir while importing data?
 --target-dir is used to specify a particular directory in HDFS
 --warehouse-dir is used to specify the parent directory for all Sqoop jobs
 In the latter case, Sqoop creates a directory with the same name as the table under that parent directory
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. What does the following query do:
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'"
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
It imports the employees who have joined after 9-Nov-2012
Sqoop
Q. What does the following query do:
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'"
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Scenario:
In a Sqoop import command you have mentioned to run 8 parallel MapReduce tasks but
Sqoop runs only 4
What can be the reason?
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Scenario:
In a Sqoop import command you have mentioned to run 8 parallel MapReduce tasks but
Sqoop runs only 4
What can be the reason?
In this case, the MapReduce cluster is configured to run only 4 parallel tasks. Therefore, the Sqoop command should request a number of parallel tasks less than or equal to what the MapReduce cluster can run.
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. Give a Sqoop command to show all the databases in a MySQL server.
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Q. Give a Sqoop command to show all the databases in a MySQL server.
 Issue the command given below:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
Sqoop
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Learning Resources
 Top 50 Hadoop Interview Questions:
www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016
 HDFS Interview Questions:
www.edureka.co/blog/interview-questions/hadoop-interview-questions-hdfs-2
 MapReduce Interview Questions:
www.edureka.co/blog/interview-questions/hadoop-interview-questions-mapreduce
 Apache Hive Interview Questions:
www.edureka.co/blog/interview-questions/hive-interview-questions
 Apache Pig Interview Questions:
www.edureka.co/blog/interview-questions/hadoop-interview-questions-pig
 Apache HBase Interview Questions:
www.edureka.co/blog/interview-questions/hbase-interview-questions
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
Thank You…
Questions/Queries/Feedback
More Related Content

PDF
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
PDF
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
PDF
Hadoop Developer
PDF
Understanding Big Data And Hadoop
PPTX
Introduction to Big Data and Hadoop
PPTX
Learn Hadoop
PDF
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
PPTX
Hadoop Adminstration with Latest Release (2.0)
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Developer
Understanding Big Data And Hadoop
Introduction to Big Data and Hadoop
Learn Hadoop
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Adminstration with Latest Release (2.0)

What's hot (20)

PDF
Introduction to Big Data and Hadoop
PDF
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
PDF
Hadoop Architecture and HDFS
PDF
Introduction to Big data & Hadoop -I
PPTX
Big Data and Hadoop Introduction
PPTX
Big Data & Hadoop Tutorial
PDF
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
ODT
Hadoop Interview Questions and Answers by rohit kapa
PDF
Introduction to Big Data & Hadoop
PDF
Hadoop Career Path and Interview Preparation
PPTX
Hadoop and Big Data
PDF
Hadoop MapReduce Framework
PPTX
Hadoop for Data Warehousing professionals
PDF
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
PDF
Webinar: Big Data & Hadoop - When not to use Hadoop
PPTX
Introduction to Hadoop Administration
PDF
Introduction to Hadoop
PDF
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
PPTX
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
PPTX
A day in the life of hadoop administrator!
Introduction to Big Data and Hadoop
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Hadoop Architecture and HDFS
Introduction to Big data & Hadoop -I
Big Data and Hadoop Introduction
Big Data & Hadoop Tutorial
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor...
Hadoop Interview Questions and Answers by rohit kapa
Introduction to Big Data & Hadoop
Hadoop Career Path and Interview Preparation
Hadoop and Big Data
Hadoop MapReduce Framework
Hadoop for Data Warehousing professionals
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
Webinar: Big Data & Hadoop - When not to use Hadoop
Introduction to Hadoop Administration
Introduction to Hadoop
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
A day in the life of hadoop administrator!
Ad

Viewers also liked (20)

PDF
Data Scientist/Engineer Job Demand Analysis
PDF
Energy to 2050
DOCX
Seo executive perfomance appraisal 2
DOCX
Principal engineer perfomance appraisal 2
DOCX
Purchasing executive perfomance appraisal 2
PDF
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
DOCX
Production executive perfomance appraisal 2
PDF
MA2017 | Hazmin Rahim | Future Cities and Startup Collaboration
PDF
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
PDF
Hadoop 31-frequently-asked-interview-questions
PDF
Trivadis TechEvent 2017 Data Science in the Silicon Valley by Stefano Brunelli
PDF
Leveraging Service Computing and Big Data Analytics for E-Commerce
PPTX
Top 10 database engineer interview questions and answers
PPTX
Productive data engineer speaker notes
PPTX
Top 10 data engineer interview questions and answers
DOCX
Logistic executive perfomance appraisal 2
PPTX
MA2017 | Danny Nou | The Science of Empathy
DOCX
Data engineer perfomance appraisal 2
PPTX
2017 Florida Data Science for Social Good Big Reveal
DOCX
Computer software engineer performance appraisal
Data Scientist/Engineer Job Demand Analysis
Energy to 2050
Seo executive perfomance appraisal 2
Principal engineer perfomance appraisal 2
Purchasing executive perfomance appraisal 2
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
Production executive perfomance appraisal 2
MA2017 | Hazmin Rahim | Future Cities and Startup Collaboration
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
Hadoop 31-frequently-asked-interview-questions
Trivadis TechEvent 2017 Data Science in the Silicon Valley by Stefano Brunelli
Leveraging Service Computing and Big Data Analytics for E-Commerce
Top 10 database engineer interview questions and answers
Productive data engineer speaker notes
Top 10 data engineer interview questions and answers
Logistic executive perfomance appraisal 2
MA2017 | Danny Nou | The Science of Empathy
Data engineer perfomance appraisal 2
2017 Florida Data Science for Social Good Big Reveal
Computer software engineer performance appraisal
Ad

Similar to Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka (20)

DOCX
Hadoop admin training
PDF
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
DOCX
PDF
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
PPT
Hadoop in action
PDF
Hadoop training kit from lcc infotech
PPTX
Hadoop Training in Delhi
PPT
Hadoop training by keylabs
PDF
Hadoop and Mapreduce Certification
PDF
field_guide_to_hadoop_pentaho
PPTX
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
PPTX
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
PPTX
Big Data Training in Amritsar
PPTX
Hybrid Data Warehouse Hadoop Implementations
PPTX
Big Data Training in Mohali
PPT
Hadoop presentation
PPTX
Big data overview
PPTX
Big Data Training in Ludhiana
PPTX
Hadoop introduction , Why and What is Hadoop ?
PPTX
Top Hadoop Big Data Interview Questions and Answers for Fresher
Hadoop admin training
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Hadoop in action
Hadoop training kit from lcc infotech
Hadoop Training in Delhi
Hadoop training by keylabs
Hadoop and Mapreduce Certification
field_guide_to_hadoop_pentaho
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Big Data Training in Amritsar
Hybrid Data Warehouse Hadoop Implementations
Big Data Training in Mohali
Hadoop presentation
Big data overview
Big Data Training in Ludhiana
Hadoop introduction , Why and What is Hadoop ?
Top Hadoop Big Data Interview Questions and Answers for Fresher

More from Edureka! (20)

PDF
What to learn during the 21 days Lockdown | Edureka
PDF
Top 10 Dying Programming Languages in 2020 | Edureka
PDF
Top 5 Trending Business Intelligence Tools | Edureka
PDF
Tableau Tutorial for Data Science | Edureka
PDF
Python Programming Tutorial | Edureka
PDF
Top 5 PMP Certifications | Edureka
PDF
Top Maven Interview Questions in 2020 | Edureka
PDF
Linux Mint Tutorial | Edureka
PDF
How to Deploy Java Web App in AWS| Edureka
PDF
Importance of Digital Marketing | Edureka
PDF
RPA in 2020 | Edureka
PDF
Email Notifications in Jenkins | Edureka
PDF
EA Algorithm in Machine Learning | Edureka
PDF
Cognitive AI Tutorial | Edureka
PDF
AWS Cloud Practitioner Tutorial | Edureka
PDF
Blue Prism Top Interview Questions | Edureka
PDF
Big Data on AWS Tutorial | Edureka
PDF
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
PDF
Kubernetes Installation on Ubuntu | Edureka
PDF
Introduction to DevOps | Edureka
What to learn during the 21 days Lockdown | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
Tableau Tutorial for Data Science | Edureka
Python Programming Tutorial | Edureka
Top 5 PMP Certifications | Edureka
Top Maven Interview Questions in 2020 | Edureka
Linux Mint Tutorial | Edureka
How to Deploy Java Web App in AWS| Edureka
Importance of Digital Marketing | Edureka
RPA in 2020 | Edureka
Email Notifications in Jenkins | Edureka
EA Algorithm in Machine Learning | Edureka
Cognitive AI Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
Blue Prism Top Interview Questions | Edureka
Big Data on AWS Tutorial | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
Kubernetes Installation on Ubuntu | Edureka
Introduction to DevOps | Edureka

Recently uploaded (20)

PDF
Modernizing your data center with Dell and AMD
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Big Data Technologies - Introduction.pptx
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
cuic standard and advanced reporting.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Approach and Philosophy of On baking technology
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Machine learning based COVID-19 study performance prediction
PDF
Advanced IT Governance
PDF
NewMind AI Weekly Chronicles - August'25 Week I
Modernizing your data center with Dell and AMD
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Big Data Technologies - Introduction.pptx
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Mobile App Security Testing_ A Comprehensive Guide.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
cuic standard and advanced reporting.pdf
20250228 LYD VKU AI Blended-Learning.pptx
NewMind AI Monthly Chronicles - July 2025
Approach and Philosophy of On baking technology
Dropbox Q2 2025 Financial Results & Investor Presentation
Advanced methodologies resolving dimensionality complications for autism neur...
“AI and Expert System Decision Support & Business Intelligence Systems”
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Machine learning based COVID-19 study performance prediction
Advanced IT Governance
NewMind AI Weekly Chronicles - August'25 Week I

Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka

  • 2. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Market  According to Forrester: growth rate of 13% for the next 5 years, which is more than twice w.r.t. predicted general IT growth  U.S. and International Operations (29%) and Enterprises (27%) lead the adoption of Big Data globally  Asia Pacific to be fastest growing Hadoop market with a CAGR of 59.2 %  Companies focusing on improving customer relationships (55%) and making the business more data-focused (53%) 2013 2014 2015 2016 Hadoop Market CAGR of 58.2 %
  • 4. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Agenda for Today Hadoop Interview Questions  Big Data & Hadoop  HDFS  MapReduce  Apache Hive  Apache Pig  Apache HBase and Sqoop
  • 5. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Interview Questions “The harder I practice, the luckier I get.” Gary Player
  • 6. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the five V’s associated with Big Data?
  • 7. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the five V’s associated with Big Data? Big Data
  • 8. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Differentiate between structured, semi-structured and unstructured data?
  • 9. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop  Structured  Semi - Structured  Unstructured  Organized data format  Data schema is fixed  Example: RDBMS data, etc.  Partial organized data  Lacks formal structure of a data model  Example: XML & JSON files, etc.  Un-organized data  Unknown schema  Example: multi - media files, etc. Q. Differentiate between structured, semi-structured and unstructured data?
  • 10. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. How Hadoop differs from Traditional Processing System using RDBMS?
  • 11. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. How Hadoop differs from Traditional Processing System using RDBMS? RDBMS Hadoop RDBMS relies on the structured data and the schema of the data is always known. Any kind of data can be stored into Hadoop i.e. Be it structured, unstructured or semi-structured. RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data in distributed parallel fashion. RDBMS is based on ‘schema on write’ where schema validation is done before loading the data. On the contrary, Hadoop follows the schema on read policy. In RDBMS, reads are fast because the schema of the data is already known. The writes are fast in HDFS because no schema validation happens during HDFS write. Suitable for OLTP (Online Transaction Processing) Suitable for OLAP (Online Analytical Processing) Licensed software Hadoop is an open source framework.
  • 12. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Explain the components of Hadoop and their services.
  • 13. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. Explain the components of Hadoop and their services.
  • 14. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the main Hadoop configuration files?
  • 15. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Big Data & Hadoop Q. What are the main Hadoop configuration files? hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml masters slaves
  • 16. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Interview Questions “A person who never made a mistake never tried anything new.” Albert Einstein
  • 17. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. HDFS stores data using commodity hardware which has higher chances of failures. So, How HDFS ensures the fault tolerance capability of the system?
  • 18. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. HDFS stores data using commodity hardware which has higher chances of failures. So, How HDFS ensures the fault tolerance capability of the system?  HDFS replicates the blocks and stores on different DataNodes  Default Replication Factor is set to 3
  • 19. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem.
  • 20. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem. > hadoop archive –archiveName edureka_archive.har /input/location /output/location Problem:  Too Many Small Files = Too Many Blocks  Too Many Blocks == Too Many Metadata  Managing this huge number of metadata is difficult  Increase in cost of seek Solution:  Hadoop Archive  It clubs small HDFS files into a single archive HDFS Files (small) .HAR file
  • 21. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Suppose there is file of size 514 MB stored in HDFS (Hadoop 2.x) using default block size configuration and default replication factor. Then, how many blocks will be created in total and what will be the size of each block?
  • 22. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Suppose there is file of size 514 MB stored in HDFS (Hadoop 2.x) using default block size configuration and default replication factor. Then, how many blocks will be created in total and what will be the size of each block?  Default Block Size = 128 MB  514 MB / 128 MB = 4.05 == 5 Blocks  Replication Factor = 3  Total Blocks = 5 * 3 = 15  Total size = 514 * 3 = 1542 MB
  • 23. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. How to copy a file into HDFS with a different block size to that of existing block size configuration?
  • 24. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. How to copy a file into HDFS with a different block size to that of existing block size configuration?  Block size: 32 MB = 33554432 Bytes ( Default block size: 128 MB)  Command: hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /local/test.txt /sample_hdfs  Check the block size of test.txt hadoop fs -stat %o /sample_hdfs/test.txt HDFS Files (existing) 128 MB 128 MB test.txt (local) -Ddfs.blocksize=33554432 test.txt (HDFS) 32 MB 32 MB move to HDFS: /sample_hdfs HDFS HDFS
  • 25. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is a block scanner in HDFS?
  • 26. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What is a block scanner in HDFS?  Block scanner maintains integrity of the data blocks  It runs periodically on every DataNode to verify whether the data blocks stored are correct or not Steps: 1. DataNode reports to NameNode 2. NameNode schedules the creation of new replicas using the good replicas 3. Once replication factor (uncorrupted replicas) reaches to the required level, deletion of corrupted blocks takes place Note: This question is generally asked for the position Hadoop Admin
  • 27. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Can multiple clients write into an HDFS file concurrently?
  • 28. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. Can multiple clients write into an HDFS file concurrently?  HDFS follows Single Writer Multiple Reader Model  The client which opens a file for writing is granted a lease by the NameNode  NameNode rejects write request of other clients for the file which is currently being written by someone else HDFS ReadWrite
  • 29. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?
  • 30. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?  NameNode used to be Single Point of Failure in Hadoop 1.x  High Availability refers to the condition where a NameNode must remain active throughout the cluster  HDFS HA Architecture in Hadoop 2.x allows us to have two NameNode in an Active/Passive configuration.
  • 31. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Interview Questions “Never tell me the sky’s the limit when there are footprints on the moon.” –Author Unknown
  • 32. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Explain the process of spilling in MapReduce?
  • 33. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Explain the process of spilling in MapReduce? Local Disc  The output of a map task is written into a circular memory buffer (RAM).  Default Buffer size is set to 100 MB as specified in mapreduce.task.io.sort.mb  Spilling is a process of copying the data from memory buffer to disc after a certain threshold is reached  Default spilling threshold is 0.8 as specified in mapreduce.map.sort.spill.percent 20 % 50 %80%80% Spill data Node Manager RAM
  • 34. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the difference between blocks, input splits and records?
  • 35. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the difference between blocks, input splits and records? Blocks Input Splits Records Physical Division Logical Division  Blocks: Data in HDFS is physically stored as blocks  Input Splits: Logical chunks of data to be processed by an individual mapper  Records: Each input split is comprised of records e.g. in a text file each line is a record
  • 36. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?
  • 37. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?  RecordReader converts the data present in a file into (key, value) pairs suitable for reading by the Mapper task  The RecordReader instance is defined by the Input Format 1 David 2 Cassie 3 Remo 4 Ramesh … RecordReader Key Value 0 1 David 57 2 Cassie 122 3 Remo 171 4 Ramesh … Mapper
  • 38. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the significance of counters in MapReduce?
  • 39. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING 1 David 2%^&%d 3 Jeff 4 Shawn 5$*&!#$ MapReduce Q. What is the significance of counters in MapReduce?  Used for gathering statistics about the job:  for quality control  for application-level statistics  Easier to retrieve counters as compared to log messages for large distributed job  For example: Counting the number of invalid records, etc. MapReduce Output Counter: 02 +1 1 invalid records
  • 40. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Why the output of map tasks are stored ( spilled ) into local disc and not in HDFS?
  • 41. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Why the output of map tasks are stored ( spilled ) into local disc and not in HDFS?  The outputs of map task are the intermediate key-value pairs which is then processed by reducer  Intermediate output is not required after completion of job  Storing these intermediate output into HDFS and replicating it will create unnecessary overhead. Local Disc Mapper Reducer NodeManager HDFS output
  • 42. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Define Speculative Execution
  • 43. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Define Speculative Execution  If a task is detected to be running slower, an equivalent task is launched so as to maintain the critical path of the job  Scheduler tracks the progress of all the tasks (map and reduce) and launches speculative duplicates for slower tasks  After completion of a task, all running duplicates task are killed MRTask (slow) Node Manager MRTask (duplicate) Node Manager Scheduler slow task progress launch speculative
  • 44. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper?
  • 45. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper? Method 1: In the driver, increase the minimum split size so that it is larger than the largest input file: i. conf.set("mapred.min.split.size", "size_larger_than_file_size"); ii. Input split computation formula: max(minimumSize, min(maximumSize, blockSize)) Method 2: Modify the InputFormat class that you want to use: subclass the concrete subclass of FileInputFormat and override the isSplitable() method to return false, as shown below (see also the driver fragment that follows): public class NonSplittableTextInputFormat extends TextInputFormat { @Override protected boolean isSplitable (JobContext context, Path file) { return false; } }
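  An illustrative driver fragment showing how either method is wired in (the 10 GB value is hypothetical; same imports as the spill-tuning sketch above):
    // Method 1 – raise the minimum split size above the largest file.
    // mapreduce.input.fileinputformat.split.minsize is the current name of the older mapred.min.split.size property.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 10L * 1024 * 1024 * 1024);
    // Method 2 – plug in the custom input format from the slide above
    job.setInputFormatClass(NonSplittableTextInputFormat.class);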
  • 46. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Is it legal to set the number of reduce tasks to zero? Where will the output be stored in this case?
  • 47. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. Is it legal to set the number of reduce tasks to zero? Where will the output be stored in this case?  Yes, it is legal to set the number of reduce tasks to zero  This is done when there is no need for a reducer, e.g. when the input only needs to be transformed into a particular format, or for a map-side join  In that case the map output is stored directly in HDFS, in the output path specified by the client (see the driver fragment below)
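  In the driver this is a single call; a sketch (the output path is illustrative; Path and FileOutputFormat come from org.apache.hadoop.fs and org.apache.hadoop.mapreduce.lib.output):
    job.setNumReduceTasks(0);                                         // map-only job: no shuffle, no reduce phase
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));  // map output is written straight to this HDFS path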
  • 48. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of Application Master in a MapReduce Job?
  • 49. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What is the role of Application Master in a MapReduce Job?  A per-application process launched in a container by the ResourceManager to coordinate the job  Initializes the job and keeps track of the job's progress  Retrieves the input splits computed by the client  Negotiates with the ResourceManager for the resources needed to run the job  Creates a map task object for each split (flow: client submits the job → RM launches the AM on a NodeManager → AM asks for resources → tasks run → AM reports status and unregisters)
  • 50. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What do you mean by MapReduce task running in uber mode?
  • 51. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. What do you mean by MapReduce task running in uber mode?  If a job is small enough, the ApplicationMaster chooses to run its tasks in its own JVM; such tasks are called uber tasks  This reduces the overhead of allocating new containers for running the tasks  A MapReduce job is run as an uber task if:  It requires fewer than 10 mappers  It requires only one reducer  The input size is less than the HDFS block size  Parameters that decide whether a job qualifies as an uber task:  mapreduce.job.ubertask.maxmaps  mapreduce.job.ubertask.maxreduces  mapreduce.job.ubertask.maxbytes  To enable uber tasks, set mapreduce.job.ubertask.enable to true (see the sketch after the next slide)
  • 52. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce [Diagram: the client submits the job and copies job resources to HDFS, the ResourceManager launches the ApplicationMaster on a NodeManager, and the uber task runs inside the ApplicationMaster's own JVM instead of separate containers] Criteria:  It requires fewer than 10 mappers  It requires only one reducer  The input size is less than the HDFS block size
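  A sketch of setting these thresholds per job from the driver, using the property names from the previous slide; the values shown are only examples and, in fact, the defaults already match the criteria listed above:
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);      // default is 9
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);   // default is 1 (only 0 or 1 is supported)
    // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size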
  • 53. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you enhance the performance of a MapReduce job when dealing with too many small files?
  • 54. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce Q. How will you enhance the performance of a MapReduce job when dealing with too many small files?  CombineFileInputFormat can be used to solve this problem  CombineFileInputFormat packs many small files into each input split, so that each mapper gets more data to process  It takes node and rack locality into account when deciding which blocks to place in the same split  As a result, a typical MapReduce job can process the input files efficiently (see the driver fragment below)
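  An illustrative driver fragment using CombineTextInputFormat, the concrete text subclass of CombineFileInputFormat (org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat); the 128 MB cap is just an example:
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // pack small files into splits of up to ~128 MB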
  • 55. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Interview Questions “Generally, the questions that seem complicated have simple answers.” – Anonymous
  • 56. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored? Q. Why is HDFS not used by the Hive metastore for storage?
  • 57. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored?  By default, Hive table data is stored in an HDFS directory: /user/hive/warehouse  This location is specified by the hive.metastore.warehouse.dir configuration parameter in hive-site.xml Q. Why is HDFS not used by the Hive metastore for storage?
  • 58. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. Where does the data of a Hive table get stored?  By default, Hive table data is stored in an HDFS directory: /user/hive/warehouse  This location is specified by the hive.metastore.warehouse.dir configuration parameter in hive-site.xml Q. Why is HDFS not used by the Hive metastore for storage?  Files in HDFS cannot be edited in place (no random updates)  The metastore keeps its metadata in an RDBMS to provide low query latency  HDFS read/write operations are comparatively time-consuming
  • 59. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: Suppose, I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?
  • 60. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: Suppose, I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?  Multiple client access is not allowed in the default (embedded) metastore configuration  One may use one of the following two metastore configurations instead: 1. Local Metastore Configuration 2. Remote Metastore Configuration
  • 61. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between external table and managed table?
  • 62. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between external table and managed table? Managed Table:  Hive is responsible for managing the table data  When the table is dropped, the metadata along with the table data is deleted from the Hive warehouse External Table:  Hive is responsible for managing only the table metadata, not the table data  When the table is dropped, Hive deletes just the metadata, leaving the table data untouched
  • 63. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. When should we use SORT BY instead of ORDER BY ?
  • 64. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. When should we use SORT BY instead of ORDER BY?  SORT BY sorts the data within each of several reducers, so the overall output is only partially ordered  ORDER BY sorts all of the data together through a single reducer  Therefore SORT BY should be used for huge datasets, where a single reducer would become a bottleneck
  • 65. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between partition and bucket in Hive?
  • 66. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Q. What is the difference between partition and bucket in Hive?  Partitioning splits a table into sub-directories based on the values of the partition column(s), so queries that filter on those columns scan only the relevant directories  Bucketing further divides the data into a fixed number of files (buckets) based on a hash of the bucketing column, which helps with sampling and map-side joins
  • 67. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Hive Scenario: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for the month of January. But Hive is taking too much time to process this query. How will you solve this problem?
  • 68. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  Create a partitioned table:  CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;  Enable dynamic partitioning in Hive:  SET hive.exec.dynamic.partition = true;  SET hive.exec.dynamic.partition.mode = nonstrict;  Transfer the data:  INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;  Run the query:  SELECT SUM(amount) FROM partitioned_transaction WHERE month = 'January'; Apache Hive
  • 69. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What is dynamic partitioning and when is it used? Apache Hive
  • 70. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What is dynamic partitioning and when is it used?  In dynamic partitioning, the values of the partition columns are determined at runtime while the data is loaded  One may use dynamic partitioning in the following cases:  Loading data from an existing non-partitioned table into a partitioned table to improve query latency  When the partition values are not known beforehand, since finding these unknown partition values manually in huge datasets is a tedious task Apache Hive
  • 71. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. How does Hive distribute rows into buckets? Apache Hive
  • 72. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. How does Hive distribute rows into buckets?  The bucket number for a row is determined by the formula: hash_function(bucketing_column) modulo (num_of_buckets)  hash_function depends on the column data type; for an int column it is simply the value of the column  For other data types the hash_function is more complex to compute  Worked example with hash_function(id) = id and 2 buckets: 1 mod 2 = 1, 2 mod 2 = 0, 3 mod 2 = 1, so rows (1, John) and (3, Shawn) land in one bucket and (2, Mike) in the other Apache Hive
  • 73. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries: id first_name last_name e-mail gender ip 1 Hugh Jackman [email protected] Male 136.90.241.52 2 David Lawrence [email protected] Male 101.177.15.130 3 Andy Hall [email protected] Female 114.123.153.64 4 Samuel Jackson [email protected] Male 91.121.145.67 5 Emily Rose [email protected] Female 117.123.108.98 How will you consume this CSV file into the Hive warehouse using built-in SerDe? Apache Hive
  • 74. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive.  CREATE EXTERNAL TABLE sample (id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/temp';  SELECT first_name FROM sample WHERE gender = 'Male'; Note:  Hive provides several built-in SerDes, e.g. for JSON, TSV, etc.  OpenCSVSerde is useful in cases where you have embedded commas in delimited fields Apache Hive
  • 75. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario:  I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files.  The data in these files are in the format: {id, name, e-mail, country} Now, as we know, Hadoop performance degrades when we use lots of small files. So, how will you solve this problem? Apache Hive
  • 76. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING  Create a temporary table: CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;  Load the data from the input directory into temp_table: LOAD DATA INPATH '/input' INTO TABLE temp_table;  Create a table that will store data in SequenceFile format: CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;  Transfer the data from the temporary table into the sample_seqfile table: INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table; Apache Hive  When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record  Sequence files are flat files consisting of binary key-value pairs  Using a sequence file, one can club two or more smaller files into one single sequence file
  • 77. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Interview Questions “Whenever you are asked if you can do a job, tell them, 'Certainly I can!' Then get busy and find out how to do it.” – Theodore Roosevelt
  • 78. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is the difference between logical and physical plans?
  • 79. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is the difference between logical and physical plans? Logical Plan:  Created for each statement in the Pig script, provided the interpreter finds no syntax error  No data processing happens while the logical plan is created Physical Plan:  Describes the physical operators needed to execute the script, without reference to how they will be executed in MapReduce  It is subsequently compiled into a series of MapReduce jobs
  • 80. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is a bag in Pig Latin?
  • 81. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What is a bag in Pig Latin?  An unordered collection of tuples  Duplicate tuples are allowed  Tuples with differing numbers of fields are allowed  For example: { (Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles) }
  • 82. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?
  • 83. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. How does Apache Pig handle unstructured data, which is difficult in the case of Apache Hive?  No declared data type: fields default to bytearray, and a data type can still be assigned at runtime  Missing schema: fields can be referenced by positional notation, e.g. $2 for the 3rd field (c in {a, b, c}) instead of by name  NULL schema: operators such as JOIN, COGROUP, etc. still work when the schema is NULL
  • 84. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What are the different execution modes available in Pig?
  • 85. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What are the different execution modes available in Pig? MapReduce Mode:  Default mode  Requires access to a Hadoop cluster  Input and output data are present on HDFS Local Mode:  Requires access to a single machine  ‘-x ’ flag is used to specify the local mode environment (pig -x local)  Input and output data are present on local file system
  • 86. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What does Flatten do in Pig?
  • 87. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Pig Q. What does Flatten do in Pig?  FLATTEN un-nests bags and tuples  For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple  For example: (a, (b, c)) with GENERATE $0, flatten($1) produces (a, b, c)  Un-nesting bags is a little more complex, as it requires creating new tuples
  • 88. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase & Sqoop Interview Questions “Take risks: if you win, you will be happy; if you lose, you will be wise.” –Anonymous
  • 89. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What are the key components of HBase?
  • 90. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What are the key components of HBase?  HMaster manages the Region Servers  A Region Server manages a group of regions  ZooKeeper acts as a coordinator inside the HBase environment
  • 91. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. How do we back up an HBase cluster?
  • 92. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. How do we back up an HBase cluster? 1. Full Shutdown Backup  Useful for cases where an HBase cluster shutdown is possible  Steps: • Stop HBase: stop the HBase services first • Distcp: copy the contents of the HBase directory to another HDFS directory in the same or a different cluster 2. Live Cluster Backup  Useful for a live cluster that cannot afford downtime  Steps: • CopyTable: copy data from one table to another on the same or a different cluster • Export: dump the contents of a table into HDFS on the same cluster
  • 93. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What is a Bloom filter and how does it help in searching rows?
  • 94. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HBase Q. What is a Bloom filter and how does it help in searching rows?  Used to improve the overall read throughput of the cluster  A space-efficient mechanism to test whether an HFile contains a specific row or row-column cell  Saves time by skipping the scan of non-relevant blocks for a given row key
  • 95. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Sqoop Q. What is the role of the JDBC driver in a Sqoop setup?
  • 96. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Sqoop Q. What is the role of the JDBC driver in a Sqoop setup?  To connect to different relational databases, Sqoop needs a connector  Almost every DB vendor makes this connector available as a JDBC driver specific to that DB  Sqoop needs the JDBC driver of each database it has to interact with
  • 97. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. When to use --target-dir and when to use --warehouse-dir while importing data? Sqoop
  • 98. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. When to use --target-dir and when to use --warehouse-dir while importing data?  --target-dir is used to specify a particular HDFS directory for the import  --warehouse-dir is used to specify the parent directory for all Sqoop imports  In the latter case, Sqoop creates a directory with the same name as the table under that parent directory Sqoop
  • 99. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. What does the following query do: $ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'" Sqoop
  • 100. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING It imports the employees who have joined after 9-Nov-2012 Sqoop Q. What does the following query do: $ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2012-11-09'"
  • 101. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: In a Sqoop import command you have asked for 8 parallel MapReduce tasks, but Sqoop runs only 4. What can be the reason? Sqoop
  • 102. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Scenario: In a Sqoop import command you have asked for 8 parallel MapReduce tasks, but Sqoop runs only 4. What can be the reason? In this case, the MapReduce cluster is configured to run only 4 parallel tasks. Therefore, the Sqoop command should request a number of parallel tasks less than or equal to what the MapReduce cluster can run Sqoop
  • 103. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. Give a Sqoop command to show all the databases in a MySQL server. Sqoop
  • 104. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Q. Give a Sqoop command to show all the databases in a MySQL server.  Issue the command given below: $ sqoop list-databases --connect jdbc:mysql://database.example.com/ Sqoop
  • 105. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Learning Resources  Top 50 Hadoop Interview Questions: www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016  HDFS Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-hdfs-2  MapReduce Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-mapreduce  Apache Hive Interview Questions: www.edureka.co/blog/interview-questions/hive-interview-questions  Apache Pig Interview Questions: www.edureka.co/blog/interview-questions/hadoop-interview-questions-pig  Apache HBase Interview Questions: www.edureka.co/blog/interview-questions/hbase-interview-questions
  • 106. www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Thank You… Questions/Queries/Feedback