Apache Sqoop

A Data Transfer Tool for Hadoop




         Arvind Prabhakar, Cloudera Inc. Sept 21, 2011
What is Sqoop?

● Allows easy import and export of data from structured
  data stores:
   ○ Relational Database
   ○ Enterprise Data Warehouse
   ○ NoSQL Datastore

● Allows easy integration with Hadoop-based systems:
   ○ Hive
   ○ HBase
   ○ Oozie
Agenda

● Motivation

● Importing and exporting data using Sqoop

● Provisioning Hive Metastore

● Populating HBase tables

● Sqoop Connectors

● Current Status and Road Map
Motivation

● Structured data stored in databases and EDWs is not easily
  accessible for analysis in Hadoop

● Access to databases and EDWs from Hadoop clusters is
  problematic.

● Forcing MapReduce to access data from databases/EDWs is
  repetitive, error-prone, and non-trivial.

● Data preparation is often required for efficient consumption
  by Hadoop-based data pipelines.

● Current methods of transferring data are inefficient and
  ad hoc.
Enter: Sqoop

    A tool to automate data transfer between structured     
    datastores and Hadoop.

Highlights

 ● Uses datastore metadata to infer structure definitions
 ● Uses MapReduce framework to transfer data in parallel
 ● Allows structure definitions to be provisioned in Hive
   metastore
 ● Provides an extension mechanism to incorporate high-
   performance connectors for external systems
Importing Data

mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)
Importing Data
$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password ****
 ...

INFO mapred.JobClient: Counters: 12
INFO mapred.JobClient:   Job Counters 
INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12873
...
INFO mapred.JobClient:     Launched map tasks=4
INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
INFO mapred.JobClient:   FileSystemCounters
INFO mapred.JobClient:     HDFS_BYTES_READ=505
INFO mapred.JobClient:     FILE_BYTES_WRITTEN=222848
INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=35098
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=326
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     Map output records=326
INFO mapred.JobClient:     SPLIT_RAW_BYTES=505
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.2754 seconds (3.0398 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
Importing Data

$ hadoop fs -ls
Found 32 items
....
drwxr-xr-x - arvind staff 0 2011-09-13 19:12 /user/arvind/ORDERS
....

$ hadoop fs -ls /user/arvind/ORDERS
Found 6 items
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_SUCCESS
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_logs
... 8826 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00000
... 8760 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00001
... 8841 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00002
... 8671 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00003
Exporting Data

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS
...
INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$



  ● Default delimiters: ',' for fields, newlines for records
  ● Optionally specify an escape sequence
  ● Delimiters can be specified for both import and export
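As a sketch of the delimiter options (flag names per the Sqoop 1.x user guide; the tab/backslash/quote values here are illustrative choices, not from the slides):

```shell
# Import with explicit delimiters: tab-separated fields,
# backslash escaping, and optional double-quote enclosure.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --fields-terminated-by '\t' \
    --escaped-by '\\' \
    --optionally-enclosed-by '"'
```

The same delimiter flags are honored by `sqoop export` when parsing the files back out of HDFS.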
Exporting Data

Exports can optionally use Staging Tables

 ● Map tasks populate staging table

 ● Each map write is broken down into many transactions

 ● Staging table is then used to populate the target table in a
   single transaction

 ● In case of failure, staging table provides insulation from
   data corruption.
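A minimal sketch of a staged export (flags per the Sqoop 1.x user guide; `ORDERS_STAGE` is a hypothetical staging table you would pre-create with the same schema as the target):

```shell
# Map tasks write to ORDERS_STAGE; on success the rows are moved
# to ORDERS_CLEAN in a single transaction. --clear-staging-table
# empties the staging table before the export begins.
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test --password **** \
    --export-dir /user/arvind/ORDERS \
    --staging-table ORDERS_STAGE \
    --clear-staging-table
```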
Importing Data into Hive

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** --hive-import
 ...

INFO mapred.JobClient: Counters: 12
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.3995 seconds (3.0068 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
INFO hive.HiveImport: Removing temporary files from import process: ORDERS/_logs
INFO hive.HiveImport: Loading uploaded data into Hive
...
WARN hive.TableDefWriter: Column ORDER_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column REQUIRED_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column SHIP_DATE had to be cast to a less precise type in Hive
...
$
Importing Data into Hive

$ hive
hive> show tables;
OK
...
orders
...
hive> describe orders;
OK
order_number int
order_date string
required_date string
ship_date string
status string
comments string
customer_number int
Time taken: 0.236 seconds
hive>
Importing Data into HBase

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** 
  --hbase-create-table --hbase-table ORDERS --column-family mysql
...
INFO mapreduce.HBaseImportJob: Creating missing HBase table ORDERS
...
INFO mapreduce.ImportJobBase: Retrieved 326 records.
$


  ● Sqoop creates the missing table if instructed
  ● If no row key is specified, the primary key column is used
  ● Each output column is placed in the same column family
  ● Every record read results in an HBase put operation
  ● All values are converted to their string representation and
    inserted as UTF-8 bytes
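To pick the row key explicitly rather than defaulting to the primary key, the import above can be extended with `--hbase-row-key` (a documented Sqoop 1.x flag; shown here as a sketch):

```shell
# Same HBase import, but explicitly keying rows on ORDER_NUMBER.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-create-table --hbase-table ORDERS \
    --column-family mysql \
    --hbase-row-key ORDER_NUMBER
```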
Importing Data into HBase

hbase(main):001:0> list
TABLE 
ORDERS 
1 row(s) in 0.3650 seconds

hbase(main):002:0>  describe 'ORDERS'
DESCRIPTION                             ENABLED
{NAME => 'ORDERS', FAMILIES => [                true
 {NAME => 'mysql', BLOOMFILTER => 'NONE',
  REPLICATION_SCOPE => '0', COMPRESSION => 'NONE',
  VERSIONS => '3', TTL => '2147483647',
  BLOCKSIZE => '65536', IN_MEMORY => 'false',
  BLOCKCACHE => 'true'}]}
1 row(s) in 0.0310 seconds

hbase(main):003:0>
Importing Data into HBase

hbase(main):001:0> scan 'ORDERS', { LIMIT => 1 }
ROW COLUMN+CELL
10100 column=mysql:CUSTOMER_NUMBER,timestamp=1316036948264,
    value=363
10100 column=mysql:ORDER_DATE, timestamp=1316036948264,
    value=2003-01-06 00:00:00.0
10100 column=mysql:REQUIRED_DATE, timestamp=1316036948264,
    value=2003-01-13 00:00:00.0
10100 column=mysql:SHIP_DATE, timestamp=1316036948264,
    value=2003-01-10 00:00:00.0
10100 column=mysql:STATUS, timestamp=1316036948264,
    value=Shipped
1 row(s) in 0.0130 seconds

hbase(main):012:0>
Sqoop Connectors

● Connector Mechanism allows creation of new connectors
  that improve/augment Sqoop functionality.

● Bundled connectors include:
   ○ MySQL, PostgreSQL, Oracle, SQLServer, JDBC
   ○ Direct MySQL, Direct PostgreSQL

● Regular connectors are JDBC based.

● Direct Connectors use native tools for high-performance
  data transfer implementation.
Import using Direct MySQL Connector

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
   --table ORDERS --username test --password **** --direct
...
manager.DirectMySQLManager: Beginning mysqldump fast
path import
...

Direct import works as follows:
 ● Data is partitioned into splits using JDBC
 ● Map tasks use mysqldump to do the import with a conditional
   selection clause (-w 'ORDER_NUMBER' > ...)
 ● Header and footer information is stripped out

Direct export similarly uses the mysqlimport utility.
Third Party Connectors

● Oracle - Developed by Quest Software

● Couchbase - Developed by Couchbase

● Netezza - Developed by Cloudera

● Teradata - Developed by Cloudera

● Microsoft SQL Server - Developed by Microsoft

● Microsoft PDW - Developed by Microsoft

● Volt DB - Developed by VoltDB
Current Status

Sqoop is currently in the Apache Incubator

  ● Status Page
     http://incubator.apache.org/projects/sqoop.html

  ● Mailing Lists
     sqoop-user@incubator.apache.org
     sqoop-dev@incubator.apache.org

  ● Release
     Current shipping version is 1.3.0
Hadoop World 2011


A gathering of Hadoop practitioners, developers,
business executives, industry luminaries and
innovative companies in the Hadoop ecosystem.

November 8-9
Sheraton New York Hotel & Towers, NYC

    ● Network: 1400 attendees, 25+ sponsors
    ● Learn: 60 sessions across 5 tracks for
         ○ Developers
         ○ IT Operations
         ○ Enterprise Architects
         ○ Data Scientists
         ○ Business Decision Makers
    ● Train: Cloudera training and certification
       (November 7, 10, 11)

Learn more and register at www.hadoopworld.com
Sqoop Meetup



      Monday, November 7 - 2011, 8pm - 9pm

                       at

     Sheraton New York Hotel & Towers, NYC
Thank you!

   Q&A
