Big data ingestion using Apache Sqoop

By Prasad Khode

Apache Sqoop is a tool designed to efficiently transfer bulk data to and fro Apache Hadoop and structured datastores such as Relational databases.

We  can use Sqoop to import data from external structured data stores into Hadoop Distributed File System or related systems like Hive and HBase. Similarly,  Sqoop can be used to extract data from Hadoop and export it to external structured data stores such as Relational databases and Enterprise data warehouses. …

Sqoop Import:

Import of data to HDFS  happens in two steps.

  •  First Sqoop examines  the database to gather the metadata for the data being imported.
  • Second step is a map-only Hadoop job that submits data to the cluster .This job performs the actual data transfer using the metadata captured in the first step.


Sqoop Import Syntax:

Here  is the syntax to import all records from MySQL database table:

$ sqoop  import \

–connect jdbc:mysql://<IP_ADDRESS/HOST_NAME>:<PORT_NO>/<DB_NAME> \

–username <database_user_name> \

–password <database_password> \

–table <table_name>

import: This is the sub-command that instructs Sqoop to initiate import.

–connect <connection_string>, –username <user_name>, –password <password>: These are connection parameters used to connect with the database. These are very same connection parameters we use to connect to the database via JDBC connection.

 –table <table_name>: This parameter specifies the table from which  data will be imported.


Sqoop Export:

The export  happens in two steps.

  • The first step is to introspect the database for metadata.
  • The second step is to transfer the data.

Sqoop divides the input data set into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.


Sqoop Export Syntax:

Here is the syntax to export the data from Hadoop Eco System

$ sqoop export \

–connect jdbc:mysql<IP_ADDRESS/HOST_NAME>:<PORT_NO>/<DB_NAME> \

–username <database_user_name> \

–password <database_password> \

–table <table_name> \

export-dir <hdfs_path>

export: This is the sub-command that instructs Sqoop to initiate export.

–export-dir <hdfs_path>: This is the directory from which data will be exported.