Gears of Big Data

Amazon EMR Tutorial: Apache Zeppelin with Phoenix and HBase Interpreters on Amazon EMR

We use Amazon Elastic MapReduce (EMR) heavily for both customer projects and internal use cases when we need to crunch huge datasets in the cloud. Amazon EMR simplifies the process of spinning up and maintaining Hadoop and Spark clusters in the cloud, drastically lowering the barrier to entry for data teams getting started.

We have seen increased interest from internal and external data teams in collaborating using notebooks. Apache Zeppelin is an open-source project providing web-based interactive notebooks for data analytics. Typical use cases include data ingestion, discovery, analytics, visualization, and collaboration. Luckily for us, Amazon EMR comes with Apache Zeppelin support baked right in. At the time of writing, Zeppelin version 0.6.1 is supported on Amazon EMR 5.0.0 (the latest release).

In addition, Amazon EMR supports Apache HBase (1.2.2) and Apache Phoenix (4.7.0) natively. HBase is a highly reliable NoSQL store built upon the Hadoop Distributed File System, and Phoenix is a JDBC “front-end” to the HBase engine, which converts standard SQL into native HBase scans and queries. This enables a powerful point-lookup use case, capable of returning small results from billions of rows in milliseconds, or larger queries in seconds, using standard SQL. Finally, the HBase and Phoenix combo supports full ACID transactions, enabling OLTP workloads to run in a highly available, scale-out database architecture in AWS.
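
As a sketch of what such a point lookup looks like (using a hypothetical events table, not one from this tutorial): because the WHERE clause below pins down the entire primary key, Phoenix can resolve the query as a single HBase row get rather than a scan.

```sql
-- Hypothetical table and row key, for illustration only.
-- The full primary key (device_id, event_time) is constrained,
-- so Phoenix resolves this as a point lookup in HBase.
SELECT event_type, payload
FROM events
WHERE device_id = 'sensor-042'
  AND event_time = TO_TIMESTAMP('2016-09-01 12:00:00');
```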

Of course, we wanted to see the power of HBase and Phoenix for ourselves on EMR using Zeppelin notebooks, but we found a few extra steps are needed to get everything running. We hope we can “spark” (pun intended) your interest in exploring big data sets in the cloud using EMR and Zeppelin.

Amazon EMR Tutorial:

A summary of the steps that we’ll follow in order to experiment with Zeppelin, Phoenix, and HBase on Amazon EMR is provided below:

1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured
2. SSH and web proxy into the EMR Master Node
3. Install and configure Zeppelin interpreters for HBase and Phoenix
4. Load data into HBase
5. Query the data using Phoenix in Zeppelin to create charts and graphs
6. Terminate the cluster

You will need:

1. Familiarity with foundational AWS concepts
2. An AWS account
3. A VPC with a public subnet in the US East (N. Virginia) region
4. Basic knowledge of bash, the *nix command line, and SQL (helpful but not required)

Let’s Get Started:

1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured

Click the following link to start an EMR cluster in your own AWS account using a JSON CloudFormation template that we provide. Make sure you are logged in to your AWS account in your web browser; otherwise the link may not work.

https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=Zeppelin-Phoenix-Blog&templateURL=https:%2F%2Fs3-us-west-2.amazonaws.com%2Fcis-samples%2Fzeppelin-demo%2Fcfn%2Fzeppelin-demo.json

The link pre-fills the CloudFormation console so you can provision the EMR cluster quickly. Follow the prompts in the CloudFormation page to build the cluster. Once creation has completed, make a note of the table in the Outputs tab, as you will need its URLs in later steps.

A. Create the Stack:

B. Specify the Parameters using your own VPC, subnet, and EC2 key:

C. Optionally, specify some tags:

D. Click Create:

E. Take a note of the Output Tab:

2. SSH and Web Proxy into the Master Node

You’ll need to set yourself up to be able to connect to the master EMR node via SSH.

A. Go to the EMR page and select your cluster:

B. Scroll down until you find Security groups for Master and click the hyperlink “sg-********” in the Console:

C. Select the Security Group with the Group Name “ElasticMapReduce-master”, click Inbound, then click Edit:

D. Add your own IP by clicking Add Rule, selecting SSH from the Type drop-down, selecting My IP in the Source drop-down, then clicking Save:

E. Now you can SSH into the master node from your local workstation! Instructions for connecting can be found by going back to your EMR cluster page (Step 2.A) and clicking the SSH hyperlink:

F. Finally, you will also need to set up a web proxy connection to the EMR master node. Steps can be found by going back to your EMR cluster page (Step 2.A) and clicking the Enable Web Connections hyperlink:

3. Install and Configure Zeppelin Interpreters for HBase and Phoenix

A. Start an SSH Terminal session (Step 2.E)

B. Run the following commands in sequence in the Terminal command line interface:

cd /usr/lib/zeppelin
sudo bash bin/install-interpreter.sh -a
sudo bash bin/zeppelin-daemon.sh restart

C. These commands change into the Zeppelin installation directory, install all available interpreters, and restart the Zeppelin service so the changes take effect.

4. Connect to the Zeppelin UI and set up the interpreters

A. Open a new Terminal window and start the SSH tunnel (Step 2.F)

B. Open your browser and enable your web proxy, e.g. FoxyProxy (Step 2.F)

C. Open your browser and point it to the URL in the “zeppelin” output from CloudFormation. You should see the following page:

D. Click the drop-down that says “anonymous” and select Interpreter

E. Click the Create button in the top left corner

F. Give the interpreter the name jdbc, and select jdbc from the Interpreter group drop-down

G. Scroll down in the Properties section until you see the “phoenix.” settings. Set phoenix.url to: jdbc:phoenix:localhost:8765/hbase

H. Scroll down to the Dependencies section, add the artifact org.apache.phoenix:phoenix-core:4.7.0-HBase-1.1, then click the Save button

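
Before importing the full notebook, you can sanity-check the new interpreter in a scratch note. With Zeppelin’s generic JDBC interpreter, starting a paragraph with %jdbc(phoenix) selects the “phoenix.” properties configured above. A minimal check along these lines (our own sketch, not part of the original tutorial) queries Phoenix’s system catalog; even an empty result set proves connectivity:

```sql
%jdbc(phoenix)
-- List the tables Phoenix knows about via its system catalog.
SELECT DISTINCT TABLE_NAME FROM SYSTEM.CATALOG;
```
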
5. Query the data using Zeppelin

You are now ready to create a new Zeppelin notebook and start loading and querying data!

A. Return to the Zeppelin homepage, click Import Note, click Add from URL, and paste the URL below into the URL field. Then click Import Note in the pop-up screen:

https://gist.githubusercontent.com/laithalsaadoon/566407d2c0700f785eed87d5b73bdbf8/raw/9755ead50ab05e32e6c8af7bd170a09d49e90eb7/zeppelin-blog.json

B. Click the notebook that you just added

C. You should see a few lines of code and scripts in a notebook similar to the following:

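The note’s paragraphs follow the usual Phoenix pattern: create a table, upsert rows, then query. As a rough, hypothetical sketch of one such paragraph (the imported note defines its own schema and data):

```sql
%jdbc(phoenix)
-- Hypothetical example; Phoenix creates the backing HBase table automatically.
-- Note Phoenix's grammar: no comma before the CONSTRAINT clause.
CREATE TABLE IF NOT EXISTS demo_metrics (
    host        VARCHAR NOT NULL,
    metric_time TIMESTAMP NOT NULL,
    value       DOUBLE
    CONSTRAINT pk PRIMARY KEY (host, metric_time)
);
```
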
D. You can now hit the small “play” button in the top right corner of each white box (these are called “paragraphs” in Zeppelin vernacular)

E. Click the “play” button on the first paragraph (the one that includes “### Create an empty table in HBase”). Wait for it to say FINISHED in the top right corner before proceeding.

F. In the following paragraph, update the EMR endpoint URL on line 9 with your cluster’s corresponding URL.

G. Repeat Step D above, running each paragraph in sequence and waiting for each to say “FINISHED” before proceeding.

H. At the end, you should have a pie chart visualization similar to the following:

6. Terminate the cluster

A. Return to the CloudFormation console

B. Select the stack that you created as part of this demo

C. From the Actions drop-down, select Delete Stack

D. Click Yes, Delete

Conclusion:

We hope you enjoyed our Amazon EMR tutorial on Apache Zeppelin and that it has truly sparked your interest in exploring big data sets in the cloud using EMR and Zeppelin.

Related posts:
Learn more about our big data and analytics services by downloading our AWS Data Pipeline Whitepaper or watching our latest Big Data video.

About Onica

Onica is a global technology consulting company at the forefront of cloud computing. Through collaboration with Amazon Web Services, we help customers embrace a broad spectrum of innovative solutions. From migration strategy to operational excellence, cloud native development, and immersive transformation, Onica is a full spectrum integrator.