Sqoop – Advanced Options
2015
Contents
1 What is Sqoop?
2 Import and Export data using Sqoop
3 Import and Export command in Sqoop
4 Saved Jobs in Sqoop
5 Option File
6 Important Sqoop Options
What is Sqoop?
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and
structured data stores such as relational databases.
Import and Export using Sqoop
The import command in Sqoop transfers the data from RDBMS to HDFS/Hive/HBase.
The export command in Sqoop transfers the data from HDFS/Hive/HBase back to
RDBMS.
Import command in Sqoop
The command to import data into Hive:
The command to import data into HDFS:
The command to import data into HBase:
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name --hive-import -m 1
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name -m 1
sqoop import --connect <connect-string>/dbname --username root -P
--table table_name --hbase-table table_name
--column-family col_fam_name --hbase-row-key row_key_name --hbase-create-table -m 1
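For illustration, a complete HDFS import along these lines might look like the sketch below; the MySQL host, database, table and target directory are hypothetical placeholders, not values from this deck:
# import the "orders" table into HDFS as comma-separated files, prompting for the password
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/cloudera/datasets/orders \
  --fields-terminated-by ',' \
  -m 1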
Export command in Sqoop
The command to export data from Hive to RDBMS:
The command to export data from HDFS to RDBMS:
(Both commands are identical because a Hive table's data is itself a set of files in an HDFS directory; --export-dir simply points at that directory.)
Limitations of the import and export commands:
- The import and export commands are convenient when data has to be moved between an RDBMS and HDFS/Hive/HBase only occasionally.
So what if the same import or export needs to run several times a day?
In such situations a saved Sqoop job can save your time.
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_export_dir>
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_export_dir>
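As a sketch under the same assumptions (hypothetical host, database, table and directory), exporting a directory of comma-separated files back into MySQL could look like this; --input-fields-terminated-by tells Sqoop how the HDFS files are delimited:
# push the files under /user/cloudera/output/orders_summary into the orders_summary table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoop_user -P \
  --table orders_summary \
  --export-dir /user/cloudera/output/orders_summary \
  --input-fields-terminated-by ',' \
  -m 1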
Saved Jobs in Sqoop
A saved Sqoop job remembers the parameters used by a job, so the job can be re-executed later simply by invoking it by name.
The following command creates a saved job:
The command above simply registers a job under the name you specify.
The job is then available in your saved-jobs list and can be executed whenever needed.
The following command executes a saved job:
sqoop job --create job_name -- import --connect <connect-string>/dbname --table table_name
sqoop job --exec job_name -- --username uname -P
Sample Saved Job
sqoop job --create JOB1
-- import --connect jdbc:mysql://192.168.56.1:3306/adventureworks
--username XXX
--password XXX
--table transactionhistory
--target-dir /user/cloudera/datasets/trans
-m 1
--columns "TransactionID,ProductId,TransactionDate"
--check-column TransactionDate
--incremental lastmodified
--last-value "2004-09-01 00:00:00";
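Because this is a saved incremental job, Sqoop records the latest TransactionDate after every run, so re-executing the job only fetches rows modified since the previous execution. The job can be inspected and re-run with:
sqoop job --show JOB1   # prints the stored parameters, including the updated last-value
sqoop job --exec JOB1   # runs the incremental import again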
Important Options in Saved Jobs in Sqoop
Sqoop option            Usage
--connect               Connection string for the source database
--table                 Source table name
--columns               Columns to be extracted
--username              User name for accessing the source table
--password              Password for accessing the source table
--check-column          Column examined to determine which rows to import
--incremental           How Sqoop determines which rows are new (mode: append or lastmodified)
--last-value            Maximum value of the check column from the previous import; only rows whose check-column value is greater than last-value are imported, and for a saved job Sqoop updates this value automatically after each run
--target-dir            Target HDFS directory
-m                      Number of mapper tasks
--compress              Apply compression while loading data into the target
--fields-terminated-by  Field separator in the output directory
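As a sketch of how several of these options combine (the connection details, table and directory are hypothetical), an incremental append import might look like:
# first run: import all rows with TransactionID > 0, compressed, tab-separated, with 4 mappers
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoop_user -P \
  --table transactionhistory \
  --target-dir /user/cloudera/datasets/trans \
  --check-column TransactionID \
  --incremental append \
  --last-value 0 \
  --fields-terminated-by '\t' \
  --compress \
  -m 4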
Sqoop Metastore
• A Sqoop metastore keeps track of all jobs.
• By default, the metastore is contained in your home directory under .sqoop and is
only used for your own jobs. If you want to share jobs, you would need to install a
JDBC-compliant database and use the --meta-connect argument to specify its
location when issuing job commands.
• Important Sqoop job commands:
• sqoop job --list - lists all jobs available in the metastore
• sqoop job --exec JOB1 - executes JOB1
• sqoop job --show JOB1 - displays the metadata of JOB1
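Assuming a shared metastore has been started (Sqoop's metastore service is HSQLDB-based and listens on port 16000 by default; the host name below is hypothetical), any client can point at it with --meta-connect:
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list        # list shared jobs
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec JOB1   # run a shared job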
Option File
Certain arguments to the import and export commands and to saved jobs have to be typed every time you run them.
What would be an alternative to this repetitive work?
For instance, the following arguments are used repeatedly in import and export commands as well as in saved jobs:
• Such arguments can be saved in a single text file, say option.txt.
• When executing a command, just pass this file with the --options-file argument.
• The following command shows the use of the --options-file argument:
import
--connect
jdbc:mysql://localhost/db_name
--username
uname
-P
option.txt
sqoop --options-file <path_to_option_file>/option.txt --table table_name
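For comparison, the invocation above is equivalent to typing the full command every time (db_name, uname and table_name are placeholders, as elsewhere in this deck):
sqoop import --connect jdbc:mysql://localhost/db_name --username uname -P --table table_name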
Option File
1. Each argument in the option file should be on its own line.
2. Options are written in the option file exactly as they appear on the command line; for example, --connect cannot be shortened to -connect.
3. The same holds for the other arguments.
4. An option file is generally used when a large number of Sqoop jobs share a common set of parameters (a sample file follows the list below), such as:
1. Source RDBMS ID, Password
2. Source database URL
3. Field Separator
4. Compression type
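A minimal shared options file along these lines might look as follows; the credentials, URL and separator are hypothetical, and each job then only adds its own --table and --target-dir:
# shared-options.txt : common parameters reused by several Sqoop jobs (lines starting with # are comments)
--connect
jdbc:mysql://dbhost:3306/salesdb
--username
sqoop_user
--password
XXXX
--fields-terminated-by
,
--compress
sqoop import --options-file shared-options.txt --table orders --target-dir /user/cloudera/datasets/orders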
Sqoop Design Guidelines for Performance
1. Sqoop imports data in parallel from database sources. You can specify the number of map tasks (parallel processes) used for the import with the -m argument. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what is available in your MapReduce cluster.
2. By default, the import process uses JDBC. Some databases can perform imports in a higher-performance fashion using database-specific data movement tools. For example, MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you tell Sqoop to attempt the direct import channel.
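A sketch combining these guidelines (host, database and split column are hypothetical): --split-by chooses the column on which the parallel import is partitioned, and --direct asks Sqoop to try MySQL's direct (mysqldump-based) path:
# parallel, direct-mode import with 8 mappers, split on the primary key
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoop_user -P \
  --table transactionhistory \
  --split-by TransactionID \
  -m 8 \
  --direct \
  --target-dir /user/cloudera/datasets/trans_direct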
Thank You