Soubhik Barari

How to Set Up Hadoop for Big Data Processing

Published 26 Oct 2016

For anyone needing to analyze huge datasets, this is a public resource on how to set up your own instance of Apache Hadoop, the go-to environment for processing big data. If you can navigate basic UNIX commands, work with SQL databases, and are looking to speed up queries on large datasets (>1 terabyte), Hadoop is definitely the tool for you. It's what helped me clean and merge the world's most comprehensive database of applied tariffs; these notes largely come out of my experience on that project.

Before we get started, you’ll need an account on AWS.

Table of Contents:

1. What is Hadoop
2. How to Create a Cluster on Amazon EMR
3. And What is Hive, Exactly?
4. Some Hadoop workflow tips
5. Additional Resources

1. What is Hadoop

Hadoop is an all-encompassing Apache environment for storing and processing large-scale datasets (ideally terabyte-scale) across a computing cluster.

Within this environment there are all sorts of tools to assist in tasks like import/export, cleaning, processing, querying and analysis (more on these tools later). The best part of Hadoop is that it's open-source, which means there are a number of different 'flavors' to choose from and a huge community to tap into for resources and help.

This graphic is a good visual illustration of the Hadoop environment:

Illustration of the hadoop ecosystem from user layer to OS layer (safaribooksonline).

The core of the Hadoop way of computing is the MapReduce programming model.

MapReduce jobs performed in Hadoop first pass their input data to mapper nodes which then emit key-value pairs as intermediate output; these emitted intermediate data are then grouped by common keys and handed off to reducer nodes which then coordinate to produce the final output. Here’s a visual example for a basic word counting program:

Mapreduce word-count example (guru99).
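For intuition, the word-count job above can be mimicked locally with a plain UNIX pipeline (no Hadoop required): `tr` plays the mapper by emitting one word per line, `sort` plays the shuffle by grouping identical keys together, and `uniq -c` plays the reducer by summing each group.

```shell
# Mimic the word-count MapReduce job with UNIX tools:
#   tr      -> mapper  (emit one word per line)
#   sort    -> shuffle (group identical keys together)
#   uniq -c -> reducer (sum the occurrences of each key)
echo "deer bear river car car river deer car bear" \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```

On a real cluster, the same logic runs as distributed mapper and reducer tasks over HDFS blocks rather than a single pipe.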

You do not need to know the ins and outs of MapReduce to use Hadoop. In fact, MapReduce is just one of many frameworks. It can be easily swapped out for others - popular ones include Spark (optimized for iterative processing e.g. machine learning), and Tez (directed acyclic graph style processing).

In its most basic form, a Hadoop job is a Java program that specifies mapper and reducer classes and an input dataset. However, there are now a large number of accessible tools that abstract away the MapReduce layer, allowing us to work as data analysts (obsessing over the data) rather than data engineers (obsessing over the system).

Here are the tools I typically use when running end-to-end jobs in Hadoop: Sqoop for importing/exporting data between SQL databases and HDFS/Hive, Hive for SQL-style querying, Spark for more general-purpose processing, and Hue as a web interface that ties it all together (each is covered below).

If you need to do some Python or R processing (i.e. complicated logic, numerical computing, etc.) or would like more real-time data analysis, I recommend Spark over something like Hive or Pig.

Now that we have a broad overview of the tools available, let’s set up our very own Hadoop infrastructure!

2. How to Create a Cluster on Amazon EMR

Hadoop provides the framework (the 'how'); Amazon Elastic MapReduce (EMR) provides the platform (the 'where'). Amazon EMR is a flexible, highly scalable cloud platform where we can host our computing cluster fully bootstrapped with the Hadoop ecosystem. Super convenient!

Additionally, Amazon provides the S3 storage system, which is a compatible alternative to HDFS. The two may be used in conjunction, e.g. persistent input data may be stored in an S3 bucket while ephemeral output data is kept on HDFS.

Full details / troubleshooting can be found on the AWS guide to getting started with EMR.

Before starting a cluster, we need to configure a policy. Policy documents in AWS assign privileges for setting up, creating jobs on, and maintaining clusters to different users. We will need to create a custom policy document to guarantee the appropriate users full access to these functions:

    1.) On the AWS console, select the Identity & Access Management icon.

    2.) Select Policies. If this is the first time you’ve accessed this, go ahead with Get started. Click on Create Policy, and select Create Your Own Policy.

    3.) Name and describe the policy document appropriately. In the text field for the document, paste the following:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["*"],
                "Effect": "Allow",
                "Resource": "*"
            }
        ]
    }
    4.) Now that our policy has been created, filter by the policy name in the main Policies tab. Once selected, click on the Attached Entities tab. Click Attach and select all the users who will need access to the cluster. Also attach the policy to the EMR_DefaultRole and EMR_EC2_DefaultRole roles.

We now have the appropriate permissions to create our EMR cluster.

To start a cluster, we have the following two options.

Option A: Using the AWS console. The AWS console is the most hassle-free way of firing up a cluster.

  1. Log into AWS console.
  2. Under >Analytics, select EMR. Hit the Create Cluster button.
  3. Using the >Advanced Options tab, make sure that the following list of apps under the Software Configuration header are selected: Hadoop, Hive, Sqoop, HCatalog, HBase, Mahout, Oozie, Tez, Pig, Zookeeper, Hue, Spark. (Note: in order to use the Impala tool, you must select a version under AMI in Release … do so at your own risk). Click Next.
  4. Under >Hardware Configuration, select how many core and task nodes will be needed under Instance count. Note that this can easily be resized later.
  5. Under >General Options, select the S3 folder for logging (optional). Moreover, if any Bootstrap Actions need to be run, select Custom action in the dropdown for Add bootstrap action and choose the script that will bootstrap the cluster (see more about bootstrapping clusters later in the guide).
  6. Under >Security Options, choose the EC2 key pair typically used in your workflow to be associated with this cluster. Simply leave the default EMR_EC2_DefaultRole and EMR_DefaultRole as those roles already have the appropriate permissions to create/access the cluster.
  7. Create cluster.

You will now be brought to your cluster's main dashboard. After the cluster is fully initialized, its status will change to Waiting. Take note of the Master public DNS, which is the address of the master node in the cluster. Also take note of the cluster id (j-xxxxx).

Option B: Using the AWS client. The AWS client is ideal for cloning an existing set up or specifying cluster parameters from an external file. The AWS command line client can be installed via:

$ sudo pip install awscli --ignore-installed six

Following that, you’ll need to link your account to your profile:

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: text

Here, I chose text as the output format for ease of interpretation and so that aws output can easily be captured in the shell.

Additionally, you’ll have to make sure the default roles are already created:

$ aws emr create-default-roles

To create the same cluster described above using the command line, use the following command:

$ aws emr create-cluster \
--applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Sqoop Name=Spark \
--ec2-attributes KeyName=[EC2 KEY PAIR] \
--enable-debugging \
--release-label emr-5.0.0 \
--log-uri 's3://[ADDRESS FOR LOGS]' \
--name '[NAME OF CLUSTER]' \
--instance-groups '[{"InstanceCount":[# OF INSTANCES],"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core instance group - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master instance group - 1"}]' \
--region us-east-1

The above command will return the assigned cluster id j-xxxxx, which you can capture and use in a shell script.
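For instance, the id can be captured into a variable and reused later in the script. In the sketch below the `aws` call is replaced by a stub function echoing a made-up id, so the capture pattern itself runs anywhere; in a real script the function body would be the `aws emr create-cluster` command above.

```shell
# Stub standing in for `aws emr create-cluster ...` (returns a made-up id);
# swap in the real command from above when running against AWS.
create_cluster() {
  echo "j-2AXXXXXXXXXX1"
}

# Capture the cluster id printed on stdout (text output format).
CLUSTER_ID=$(create_cluster)
echo "Started cluster: ${CLUSTER_ID}"

# The captured id can then drive later commands, e.g.:
#   aws emr ssh --cluster-id "${CLUSTER_ID}" --key-pair-file [EC2 KEY FILE]
```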

2.1. Accessing and using the cluster

Now, in order to be able to remotely access your cluster after it is set up, the network your cluster is on must be set to accept incoming traffic via ssh. To do so:

  1. On the main AWS console, select EC2. Select Instances from the side bar.
  2. In the table of instances, find the instance corresponding to the master node as noted above. Select its Security Group.
  3. Select the 'master' group in the displayed table of groups. In the tabs below, select Inbound and click Edit. You are going to add a rule to allow SSH access. Click Add rule and set the type to SSH.

Your EMR cluster is now reachable over standard ssh, and you have several access points into it:

a.) Use the AWS terminal client. This will allow you to SSH into the master node:

$ aws emr ssh --cluster-id [CLUSTER ID OBTAINED FROM CONSOLE] --key-pair-file [EC2 KEY FILE]

b.) Use direct SSH.

$ ssh -i [EC2 KEY FILE] hadoop@[MASTER PUBLIC DNS]

c.) Use the AWS console. From the cluster's main console page, "steps" can be added by uploading files (JAR, .py, .sql) that will execute as batch jobs. The advantage of this entry point is that tasks like spinning up additional nodes and changing other configurations can be done quite easily.

d.) Use Hadoop's web interfaces (essentially an IDE). To access the web interfaces, you need to create a proxy server that forwards the cluster's web servers to your local machine. Follow the directions here. The Hue web editor provides a very user-friendly interface and IDE for working with Hadoop and all of the ecosystem's tools.

2.2. Importing SQL data into EMR

We will use Apache Sqoop to import/export data in HDFS (Hadoop filesystem). Here’s a great blog on that with way more details.

In order to import data from an external MySQL database, a Java Database Connector is first required. Log into the master node and execute:

$ sudo yum install mysql-connector-java

Make sure your MySQL database can be accessed from the master node:

$ mysql -h [ADDRESS OF MYSQL INSTANCE] -P 3306 -u [USN] -p[PSS] [DB]

Sqoop can then be used quite easily to import the needed tables directly into Hive:

$ sqoop import --connect jdbc:mysql://[ADDRESS]/[DB] --username [USN] --password [PSS] --table [INPUT TABLE NAME] --hive-import --hive-table=[OUTPUT TABLE NAME] --create-hive-table -m 1 --direct

One can also predefine a schema in Hive and import a table into that:

$ sqoop import --connect jdbc:mysql://[ADDRESS]/[DB] --username [USN] --password [PSS] --table [INPUT TABLE NAME] --hive-import --hive-table=[OUTPUT TABLE NAME] -m 1 --direct

If you would like to partition your output table, you must specify the partition column (key) and the partition value. Unfortunately, this must be done in a static manner for each partition in the output table. Additionally, you must define a query to select the subset of the input table being placed into the output partition:

$ sqoop import \
--connect jdbc:mysql://[ADDRESS]/[DB] \
--username [USN] \
--password [PSS] \
--table [INPUT TABLE NAME] \
--direct \
--hive-import --hive-table=[OUTPUT TABLE NAME] --hive-partition-key=[COLUMN] --hive-partition-value="[VALUE]" --hive-overwrite --target-dir /user/hive/warehouse/[LOCATION FOR PARTITION] -m 1
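Since each partition requires its own static import, a small loop can generate the per-partition Sqoop commands. A sketch, assuming a hypothetical `year` partition column with values 2014-2016; the commands are echoed for inspection rather than executed:

```shell
# Generate one static-partition sqoop import per value of a hypothetical
# `year` column. Remove the leading `echo` to actually run the imports.
for YEAR in 2014 2015 2016; do
  echo sqoop import \
    --connect "jdbc:mysql://[ADDRESS]/[DB]" \
    --username "[USN]" --password "[PSS]" \
    --table "[INPUT TABLE NAME]" \
    --direct -m 1 \
    --hive-import --hive-table="[OUTPUT TABLE NAME]" \
    --hive-partition-key=year \
    --hive-partition-value="$YEAR" \
    --hive-overwrite \
    --target-dir "/user/hive/warehouse/[OUTPUT TABLE NAME]/year=$YEAR"
done
```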

Note that Sqoop can either import data directly into Hive (the database layer) in a single step, or just import data into HDFS (the filesystem layer), leaving the user to manually load it into Hive as a second step. The first option is almost always recommended if Hive is the main interface to the data.

3. And What is Hive, Exactly?

Apache Hive is a database infrastructure that imposes structure on the raw data stored in HDFS. In particular, it allows us to query that data using a language (HiveQL) nearly identical to MySQL's SQL.

I could write an entire new blog post about Hive and its usefulness in the Hadoop ecosystem… instead, here are a few noteworthy points about Hive syntax:

- Hive traditionally has no row-level UPDATE or DELETE; tables are (re)written with INSERT INTO and INSERT OVERWRITE.
- ORDER BY funnels all rows through a single reducer to produce a total ordering; SORT BY only sorts within each reducer and scales much better.
- Tables can be PARTITIONED BY one or more columns, which then behave like ordinary columns in queries but prune the files that get scanned.

My recommendation is to use Hive first if you're attempting some sort of processing job. If it gets too ugly or you need juicier operations, try pairing Spark with Python or R, which gives you the ability to use libraries like dplyr or pandas on your data, rather than just Hive's native functions.

See a full list of differences.

3.1. Optimizing the performance of Hive jobs

In reality, no two jobs are alike, and blood, sweat, and tears are sometimes prerequisites for optimal performance (though you should always experiment with tunings on mini-jobs over smaller subsets of your data!).

Advanced concepts aside, here are some basic techniques to optimize reads/writes in Hive:

- Partition large tables on the columns you filter by most often, so queries only scan the relevant files.
- Store tables in a columnar format such as ORC or Parquet (with compression) rather than plain text.
- Set hive.exec.parallel=true so independent stages of a query can run concurrently.
- Run Hive on an engine like Tez or Spark rather than plain MapReduce to cut job-launch overhead.

Further optimizations exist for Hive queries.

4. Some Hadoop workflow tips

Hopefully this suffices as a bare-bones starter guide for firing up your own Hadoop cluster! Now that you're set up, below are some additional resources and examples of use cases for Hadoop. Feel free to drop a comment if you have specific questions, or shoot me an email.

5. Additional Resources