For anyone needing to analyze huge datasets, this is a public resource on how to set up your own instance of Apache Hadoop, the go-to environment for processing big data. If you can navigate basic UNIX commands, work with SQL databases, and are looking to speed up queries on large datasets (>1 Terabyte), Hadoop is definitely the tool for you. It’s what’s helped me clean and merge together the world’s most comprehensive database of applied tariffs - these notes largely come out of my experience working on this job.
Before we get started, you’ll need an account on AWS.
Table of Contents:
- 1. What is Hadoop
- 2. How to Create a Cluster on Amazon EMR
- 3. And What is Hive, Exactly?
- 4. Some Hadoop workflow tips
- 5. Additional Resources
1. What is Hadoop
Hadoop is an all-encompassing Apache environment for storing and processing large-scale datasets (ideally terabyte-scale) across a computing cluster.
Within this environment there are all sorts of tools to assist in tasks like import/export, cleaning, processing, querying and analysis (more on these tools later). The best part of Hadoop is that it’s open-source, which means there is a number of different ‘flavors’ to choose from and a huge community to tap into for resources and help.
This graphic is a good visual illustration of the Hadoop environment:
The core of the Hadoop way of computing is the MapReduce programming model.
MapReduce jobs performed in Hadoop first pass their input data to mapper nodes which then emit key-value pairs as intermediate output; these emitted intermediate data are then grouped by common keys and handed off to reducer nodes which then coordinate to produce the final output. Here’s a visual example for a basic word counting program:
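The same idea can also be sketched in code using Hadoop Streaming. This is purely illustrative (the streaming jar location and HDFS paths vary by installation): the mapper emits one word per line, Hadoop groups and sorts by word, and the reducer counts consecutive duplicates.

```bash
# Toy word count with Hadoop Streaming (jar and HDFS paths are illustrative)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/hadoop/books \
  -output /user/hadoop/wordcounts \
  -mapper 'xargs -n 1' \
  -reducer 'uniq -c'
```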
You do not need to know the ins and outs of MapReduce to use Hadoop. In fact, MapReduce is just one of many frameworks. It can be easily swapped out for others - popular ones include Spark (optimized for iterative processing e.g. machine learning), and Tez (directed acyclic graph style processing).
In its most basic form, a Hadoop job is a Java program that specifies mapper and reducer classes and an input dataset. However, there are now a large number of accessible tools that abstract away the MapReduce layer, allowing us to be data analysts (obsess over the data) rather than data engineers (obsess over the system).
Here are the tools I typically use when I’m running end-to-end jobs in Hadoop:
Sqoop - A tool that imports data from MySQL and other relational databases (RDBMSs) into the Hadoop filesystem (HDFS), and exports it back out.
Hive - The Hadoop equivalent of SQL.
Pig - A lower-level but SQL-esque language for accessing raw HDFS files.
Spark - The most popular current processing tool. Uses a more memory-intensive take on MapReduce, with a lower-level (but easy to use) API in Python for specifying both the mapper and the reducer. Fast, particularly for machine learning applications.
Tez - A directed acyclic graph processing framework (again, in memory) and a direct competitor to Spark. This graphic illustrates the differences in execution between Tez and standard MapReduce.
If you need to do some Python or R processing (i.e. complicated logic, numerical computing, etc.) or would like to do more real-time data analysis, I would recommend Spark over something like Hive or Pig.
Now that we have a broad overview of the tools available, let’s set up our very own Hadoop infrastructure!
2. How to Create a Cluster on Amazon EMR
Hadoop provides the framework - the ‘how’; Amazon Elastic MapReduce (EMR) provides the platform - the ‘where’. Amazon EMR is a flexible, highly scalable cloud platform where we can host our computing cluster fully bootstrapped with the Hadoop ecosystem. Super convenient!
Additionally, Amazon provides the S3 storage system, which is a compatible alternative to HDFS. Both may be used in conjunction, e.g. persistent input data may be stored in an S3 bucket while ephemeral output data is kept on HDFS.
Full details / troubleshooting can be found on the AWS guide to getting started with EMR.
Before starting a cluster, we need to configure a policy. Policy documents in AWS assign certain privileges regarding the setup, job creation, and maintenance of clusters to different users. We will need to create a custom policy document in order to grant the appropriate users full access to the functionality they need:
1.) On the AWS console, select the Identity & Access Management icon.
2.) Select Policies. If this is the first time you’ve accessed this, go ahead with Get started. Click on Create Policy, and select Create Your Own Policy.
3.) Name and describe the policy document appropriately. In the text field for the document, paste your policy JSON. A rough sketch of such a policy (granting broad EMR, EC2, and S3 access; trim the actions down to what your users actually need) might look like the following:
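```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:*",
        "ec2:*",
        "s3:*",
        "iam:ListRoles",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```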
4.) Now that our policy has been created, search by filtering for the policy name in the main Policies tab. Once selected, click on the Attached Entities tab. Click on Attach and select all the users who will need access to the cluster. Additionally, allow the EMR_EC2_DefaultRole role to also be attached to the policy.
We now have the appropriate permissions to create our EMR cluster.
To start a cluster, we have the following two options.
Option A: Using the AWS console. The AWS console is the most hassle-free way of firing up a cluster.
- Log into AWS console.
- Under Analytics, select EMR. Hit the Create Cluster button.
- Using the Advanced Options tab, make sure that the following apps under the Software Configuration header are selected: Hadoop, Hive, Sqoop, HCatalog, HBase, Mahout, Oozie, Tez, Pig, Zookeeper, Hue, Spark. (Note: in order to use the Impala tool, you must select a version under AMI in Release … do so at your own risk). Click Next.
- Under Hardware Configuration, select how many core and task nodes will be needed under Instance count. Note that this can easily be resized later.
- Under General Options, select the S3 folder for logging (optional). Moreover, if there are any Bootstrap Actions that need to be run, select Custom action in the dropdown for Add bootstrap action and select the script to bootstrap the cluster (see more about bootstrapping clusters later in the guide).
- Under Security Options, choose the EC2 key pair typically used in your workflow to be associated with this cluster. Simply leave the default roles (EMR_DefaultRole and EMR_EC2_DefaultRole), as those roles already have the appropriate permissions to create/access the cluster.
- Create cluster.
You will now be brought to your cluster’s main dashboard. After it is fully initialized, its status will change to Waiting. Take note of the Master public DNS, which will be an address like ec2-xx-xx-xx-xxx.compute-1.amazonaws.com for the master node in the cluster. Also take note of the cluster id (j-xxxxx).
Option B: Using the AWS client. The AWS client is ideal for cloning an existing set up or specifying cluster parameters from an external file. The AWS command line client can be installed via:
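```bash
# One common route is pip; installers for your OS work just as well
pip install awscli
```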
Following that, you’ll need to link your account to your profile:
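```bash
# You'll be prompted for your access key, secret key, default region, and output format
aws configure
```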
Here, I chose the output format to be text for ease of interpretability and for capturing aws output in the shell.
Additionally, you’ll have to make sure the default roles are already created:
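```bash
# Creates EMR_DefaultRole and EMR_EC2_DefaultRole if they don't already exist
aws emr create-default-roles
```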
To create a cluster like the one described above from the command line, use something along the lines of the following (the release label, instance count/type, key pair, and S3 paths are placeholders to adapt):
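```bash
aws emr create-cluster \
  --name "my-hadoop-cluster" \
  --release-label emr-5.3.1 \
  --applications Name=Hadoop Name=Hive Name=Sqoop Name=HCatalog Name=HBase \
                 Name=Mahout Name=Oozie Name=Tez Name=Pig Name=ZooKeeper \
                 Name=Hue Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-bucket/emr-logs/
```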
The above command will return the assigned cluster id j-xxxxx, which you can capture and use in a shell script.
2.1. Accessing and using the cluster
Now, in order to be able to remotely access your cluster after it is set up, the network your cluster is on must be set to accept incoming traffic via ssh. To do so:
- On the main AWS console, select EC2. Select Instances from the side bar.
- In the table of instances, find the instance corresponding to the master node as noted above. Select its Security Group.
- Select the ‘master’ group in the displayed table of groups. In the tabs below, select Inbound and click Edit. You are going to add a rule to allow SSH access. Click Add rule and set the type to be SSH.
You now have access to your EMR cluster using standard ssh:
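```bash
# The master's public DNS comes from the cluster dashboard; EMR's login user is 'hadoop'
# (the key file name and hostname below are placeholders)
ssh -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
```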
You have several access points into the cluster now:
a.) Use the AWS terminal client. This will allow you to SSH into the master node (see the sketch after this list).
b.) Use direct SSH.
c.) Use the AWS console. From the cluster’s main console, “steps” can be added by uploading files (JAR, .py, .sql) that will execute as batch jobs. The advantage of this entry point is that spinning up additional nodes and making other configuration changes can be done quite easily.
d.) Use Hadoop’s web interfaces (essentially an IDE). To access the web interfaces, you need to create a proxy server that forwards the cluster’s web servers to your local machine. Follow the directions here. The Hue web editor provides a very user-friendly interface and IDE for working with Hadoop and all of the ecosystem’s tools.
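For reference, here is roughly what options (a) and (d) look like from your local machine (the cluster id, key file, hostname, and port are placeholders):

```bash
# (a) SSH into the master node via the AWS client
aws emr ssh --cluster-id j-xxxxx --key-pair-file ~/mykey.pem

# (d) Open a SOCKS proxy to the master node so the web interfaces (e.g. Hue) can be
# reached through a browser configured to use localhost:8157
ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
```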
2.2. Importing SQL data into EMR
We will use Apache Sqoop to import/export data in HDFS (Hadoop filesystem). Here’s a great blog on that with way more details.
In order to import data from an external MySQL database, a Java Database Connector is first required. Log into the master node and execute:
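```bash
# Grab the MySQL JDBC connector and drop it where Sqoop can see it
# (the connector version and Sqoop library path are placeholders; check your installation)
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.40.tar.gz
tar -xzf mysql-connector-java-5.1.40.tar.gz
sudo cp mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar /usr/lib/sqoop/lib/
```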
Make sure your MySQL database can be accessed from the master node.
Sqoop can then be used quite easily to import the needed tables directly into Hive:
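```bash
# One-step import straight into Hive
# (the connection string, credentials, and table names are placeholders)
sqoop import \
  --connect jdbc:mysql://mydb.example.com/tariffs \
  --username dbuser -P \
  --table applied_tariffs \
  --hive-import \
  --create-hive-table \
  --hive-table applied_tariffs
# If the source table has no primary key, add --split-by <column> or -m 1
```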
One can also predefine a schema in Hive and import a table into that:
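```bash
# Define the target schema in Hive first (the columns here are purely illustrative)...
hive -e "CREATE TABLE applied_tariffs (reporter STRING, partner STRING, year INT, rate DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"

# ...then have Sqoop load into that existing table, matching the delimiter
sqoop import \
  --connect jdbc:mysql://mydb.example.com/tariffs \
  --username dbuser -P \
  --table applied_tariffs \
  --fields-terminated-by '\t' \
  --hive-import \
  --hive-table applied_tariffs
```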
If you would like to partition your output table, you must specify the partition column (key) and the partition value. Unfortunately, this must be done in a static manner for each partition in the output table. Additionally, you must define a query to select the subset of the input table being placed into the output partition:
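```bash
# Static, per-partition import: select one year's rows and land them in the year=2015 partition
# (column names, table names, and paths are placeholders; repeat for each partition)
sqoop import \
  --connect jdbc:mysql://mydb.example.com/tariffs \
  --username dbuser -P \
  --query 'SELECT reporter, partner, rate FROM applied_tariffs WHERE year = 2015 AND $CONDITIONS' \
  -m 1 \
  --target-dir /user/hadoop/staging/tariffs_2015 \
  --hive-import \
  --hive-table applied_tariffs_partitioned \
  --hive-partition-key year \
  --hive-partition-value 2015
```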
Note that Sqoop can either import data directly into Hive (database layer) in a one-step hop or just import data into HDFS (filesystem layer) allowing the user to manually load data into Hive as a two-step hop. The first option is almost always recommended if Hive is the main interface with the data.
3. And What is Hive, Exactly?
Apache Hive is a database infrastructure built to impose structure on the raw data stored in HDFS. In particular, it allows us to query that data using a language nearly identical to MySQL’s.
I could write an entire new blog post about Hive and its usefulness in the Hadoop ecosystem… instead, here are a few noteworthy points about Hive syntax:
- Hive doesn’t accept /* */ block comments (use -- instead).
- Hive exclusively uses CREATE INDEX [index_name] ON TABLE [name] to create indexes (you can’t use ALTER TABLE ...).
- Hive cannot perform range partitioning (you must define partition values, either dynamically or statically, for all partition keys).
- Hive can create external tables (CREATE EXTERNAL TABLE) which point directly to data on HDFS and are thus “linked” to it (i.e. you can’t delete specific rows, as in the case of SQL views).
- Hive has an option to load previously un-‘tabled’ data from HDFS into a table: LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;. Note that this is what allows Sqoop to perform a two-step import into Hive.
- Hive has a large built-in library of user-defined functions (UDFs) that perform tasks from basic statistics to natural language processing on selected data.
- Hive doesn’t allow parentheses when creating a table/view from a query, i.e. you must use the syntax CREATE [TABLE|VIEW] [name] AS SELECT * FROM (...), and cannot use CREATE [TABLE|VIEW] [name] AS (SELECT * FROM (...)).
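To make a few of those points concrete, here is a small illustrative HiveQL snippet (table and column names are made up):

```sql
-- Hive only accepts double-dash comments like this one.

-- An external table "linked" to files already sitting in HDFS:
CREATE EXTERNAL TABLE raw_tariffs (reporter STRING, partner STRING, year INT, rate DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/raw_tariffs';

-- Creating a table from a query, with no parentheses around the SELECT:
CREATE TABLE avg_rates AS
SELECT reporter, year, AVG(rate) AS avg_rate
FROM raw_tariffs
GROUP BY reporter, year;

-- Loading previously un-'tabled' HDFS data into an existing managed table with a matching layout:
LOAD DATA INPATH '/user/hadoop/staging/new_tariffs.tsv' INTO TABLE tariffs;
```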
My recommendation is to use Hive first if you’re attempting some sort of processing job. If it gets too ugly or you need juicier operations, try pairing Spark with Python or R, which gives you the ability to use libraries like pandas on your data, rather than just Hive’s native functions.
See a full list of differences.
3.1. Optimizing the performance of Hive jobs
In reality, no two jobs are alike, and blood, sweat, and tears are sometimes prerequisites for optimal performance (although you should always experiment with tunings on mini-jobs over smaller subsets of your data!).
Advanced concepts aside, here are some basic techniques to optimize reads/writes in Hive:
Indexing. Same theory as in MySQL - create a dictionary of values to look up for a particular column key. Hive offers two particular types of indexes:
COMPACT - stores the (column value, list of rows) pairs along with the corresponding block ids (where they are located); this makes sense to specify when there is a large number of values for a particular key.
BITMAP - stores the (column value, list of rows) pairs in a compressed bitmap; this makes sense to specify if there is a smaller number of potential results to be returned per index column value.
Partitioning. Same theory as in MySQL - spread the underlying data horizontally into different locations on the filesystem according to particular column values that are likely to be queried often. Note that when inserting data into a table schema with a partition, rows can either be loaded statically into corresponding partitions or dynamically via the line SET hive.exec.dynamic.partition=true;. Use partitioning if specific subsets of the data are likely to be queried often (a sketch appears after this list).
Clustering. Partitioning typically “dislikes” having a huge number of partitions/subpartitions; if there is a huge number of partition values, there are diminishing returns to splitting a table up across the filesystem. When there are many “buckets” - or “subbuckets” after partitioning - into which rows should be organized together (e.g. by employee id, by alphabetic order), it is wise to create the table schema as CREATE TABLE (...) CLUSTERED BY [column] INTO [x] BUCKETS. An excellent resource here explains the difference between partitioning and clustering. Use clustering especially if you anticipate many joins on a moderately high number of keys with large value spaces.
Map-joins. A hint (SELECT /*+ MAPJOIN(a,b) */ * FROM (...)) to specify before the selected columns when performing a join between a small and a large table. Newer versions of Hive often do not require this hinting. See more about Hive join types.
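To make the partitioning and clustering ideas concrete, here is a rough sketch in HiveQL (table and column names are made up):

```sql
-- Partition by year (a column queried constantly) and bucket rows by reporter within each partition
CREATE TABLE tariffs_part (reporter STRING, partner STRING, product STRING, rate DOUBLE)
PARTITIONED BY (year INT)
CLUSTERED BY (reporter) INTO 32 BUCKETS;

-- Let Hive route rows to partitions (and buckets) on insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

-- The dynamic partition column (year) goes last in the SELECT
INSERT INTO TABLE tariffs_part PARTITION (year)
SELECT reporter, partner, product, rate, year FROM raw_tariffs;
```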
4. Some Hadoop workflow tips
Resizing a cluster is usually quite simple. Be cognizant of costs (which can really add up for a 10+ node cluster), and make sure you scale up/down appropriately according to your needs. To best optimize your cluster hardware for a particular computational need, I would recommend crunching the numbers on the AWS monthly calculator … so your bill at least doesn’t come as a surprise!
Make sure you back up all the essential materials on your cluster offline. This advice pertains specifically to AWS, or any sort of ‘off-premise’ cluster you might find yourself working with. Keep everything you actually use on your cluster backed up to S3, or to your offline MySQL instance using Sqoop. Clusters on AWS are ideally ephemeral instances that you spin up and spin down for ad hoc computations - this is especially true if you’re using this primarily for your own solo projects.
Hopefully this suffices as a bare bones starter guide for firing up your own Hadoop cluster! Now that you’re set up, below are some additional resources and examples on use cases for Hadoop. Feel free to drop a comment if you have specific questions or shoot me an email note.