Introducing the Tidal Automation Adapter for Hadoop Sqoop

The adapter provides easy import and export of data from structured data stores such as relational databases and enterprise data warehouses. Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. The adapter allows users to automate the tasks carried out by Sqoop.

Data Import

The import is performed in two steps:

  1. Sqoop introspects the database to gather the necessary metadata for the data being imported.

  2. Sqoop submits a map-only Hadoop job to the cluster. This job performs the actual data transfer, using the metadata captured in the previous step.

By default, the imported data is saved in an HDFS directory named after the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory in which the files should be written.
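As a rough illustration of the import described above, the following Python sketch assembles the argument list for a `sqoop import` command. The helper name `build_import_command`, the JDBC URL, the user name, and the file paths are hypothetical placeholders; only the Sqoop options themselves (`--connect`, `--username`, `--password-file`, `--table`, `--num-mappers`, `--target-dir`) are standard Sqoop 1 options. The sketch only builds and prints the command, it does not run it, and it does not represent how the adapter itself invokes Sqoop.

```python
import shlex


def build_import_command(jdbc_url, username, password_file, table,
                         target_dir=None, num_mappers=4):
    """Return the argument list for a `sqoop import` invocation.

    All values passed in are caller-supplied; nothing here is specific
    to the adapter (this helper is a hypothetical illustration).
    """
    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the database password
        "--table", table,                  # table to introspect and import
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]
    # Without --target-dir, Sqoop chooses an HDFS directory based on the table.
    if target_dir is not None:
        argv += ["--target-dir", target_dir]
    return argv


# Hypothetical example values.
command = build_import_command(
    "jdbc:mysql://db.example.com/sales",
    "etl_user",
    "/user/etl/.db_password",
    "orders",
    target_dir="/data/staging/orders",
)
print(shlex.join(command))
```

Omitting `target_dir` leaves the choice of output directory to Sqoop, which matches the default behavior described above.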

Data Export

Like the import, the export is performed in two steps: Sqoop first introspects the database for metadata and then transfers the data. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to the database. Each map task performs its transfer over many transactions to ensure optimal throughput and minimal resource utilization.
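The splitting and batching described above can be pictured with a small Python sketch. This is a conceptual illustration only, not Sqoop's actual implementation: the function `plan_export`, its parameters, and the even, contiguous split strategy are all assumptions made for the example. It divides a list of records into one split per map task and then groups each split into batches, where each batch stands for one database transaction.

```python
def plan_export(records, num_map_tasks, rows_per_transaction):
    """Assign records to map tasks, then batch each split into transactions.

    Returns a list with one entry per map task; each entry is a list of
    batches, and each batch stands for one database transaction.
    (Hypothetical helper; the splitting rule is an assumption.)
    """
    # Contiguous, near-equal splits: the first `extra` splits get one more record.
    base, extra = divmod(len(records), num_map_tasks)
    plan = []
    start = 0
    for task in range(num_map_tasks):
        size = base + (1 if task < extra else 0)
        split = records[start:start + size]
        start += size
        # Cut the split into fixed-size batches, one per transaction.
        batches = [
            split[i:i + rows_per_transaction]
            for i in range(0, len(split), rows_per_transaction)
        ]
        plan.append(batches)
    return plan


# Ten records, three map tasks, at most two rows per transaction.
plan = plan_export(list(range(10)), num_map_tasks=3, rows_per_transaction=2)
for task, batches in enumerate(plan):
    rows = sum(len(batch) for batch in batches)
    print(f"map task {task}: {rows} rows in {len(batches)} transactions")
```

Committing in several smaller transactions rather than one large one keeps the amount of uncommitted work per map task bounded, which is the trade-off the paragraph above refers to.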

Prerequisites

Before you can run the adapter, meet these prerequisites:

  • Linux is the only supported production platform for Apache Hadoop. However, the adapter can run on any platform supported by the TA Master. Refer to the TA Compatibility Matrix for current version support.

  • The JDK is installed on the TA Master machine, and the JAVA_HOME environment variable on that machine points to the JDK installation directory, not to a JRE directory.

  • Apache Sqoop is installed on the Hadoop cluster.

  • The Hadoop cluster and the database are accessible to each other and to the TA Master over the network.

  • The TA Master has the database drivers available on CLASSPATH.

  • The Hadoop cluster has the database drivers available on HADOOP_CLASSPATH. This can be configured in hadoop-env.sh.
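One of the prerequisites above, that JAVA_HOME points to a JDK rather than a JRE, can be checked with a short script. The Python sketch below is a heuristic, not an official TA check: the function name `java_home_is_jdk` is hypothetical, and the test relies on the assumption that a JDK ships the `javac` compiler in its `bin` directory while a JRE does not.

```python
import os
from pathlib import Path


def java_home_is_jdk(java_home):
    """Return True if the directory looks like a JDK installation.

    Heuristic (an assumption for this sketch): a JDK contains
    bin/javac (or bin/javac.exe on Windows); a JRE does not.
    """
    if not java_home:
        return False
    bin_dir = Path(java_home) / "bin"
    return any((bin_dir / name).is_file() for name in ("javac", "javac.exe"))


java_home = os.environ.get("JAVA_HOME")
if java_home_is_jdk(java_home):
    print(f"JAVA_HOME looks like a JDK: {java_home}")
else:
    print(f"JAVA_HOME is unset or does not look like a JDK: {java_home!r}")
```

Run the script on the TA Master machine under the same account and environment that the Master uses, since JAVA_HOME can differ between users.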

Software Requirements

The TA 6.5 Sqoop Adapter is installed with the TA 6.5 Master and client and cannot be used with an earlier TA version. Refer to the TA Compatibility Matrix for a complete list of hardware and software requirements.