Running Jobs with DevStack and OpenStack – Tutorial

UPDATED: I added Spark jobs and Storm jobs to this tutorial, I hope it helps!

This is a very practical tutorial on how to run MapReduce, Spark and Storm jobs with multiple types of data sources (Manila, HDFS and Swift) using DevStack or OpenStack.

If you have an OpenStack cloud, just skip the DevStack-specific parts (like setting up DevStack and enabling the connection to the instances).

This post assumes some familiarity with OpenStack, DevStack and Sahara. If you don’t have that familiarity, don’t worry: this should still help you, and you can check the references at the end of the post :)!

For this tutorial I used a VM with Ubuntu 14.04, 12 GB of RAM, an 80 GB disk, 8 vCPUs and the DevStack master branch as of 12/16/2016. The plugins used were Hadoop (Vanilla) 2.7.1 (this tutorial can be easily adapted to earlier Hadoop versions), Spark 1.6.0 and Storm 0.9.2.

Main sections of this post:

  • SETUP DEVSTACK
  • ENABLE COMMUNICATION BETWEEN DEVSTACK AND INSTANCES
  • RUN A MAPREDUCE JOB
  • RUN A SPARK JOB
  • RUN A STORM JOB

Setup DevStack

First of all, be sure to start DevStack in a VM instead of a real machine.

  1. SSH to the VM
  2. clone DevStack
    1. git clone https://git.openstack.org/openstack-dev/devstack
    2. cd devstack
  3. create a local.conf file in the DevStack folder:
    a minimal example is shown right after this list; it enables Manila, Sahara and Heat, which are the projects we’re going to need in this tutorial. There are plenty of templates for how this file should look; one of them, a local.conf with Sahara enabled, is presented in [4].
  4. ./stack.sh
  5. Go get some lunch or something, it’s going to take some time
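
For reference, here is a minimal local.conf sketch along the lines of what I used. The passwords, the Swift hash and the plugin list are illustrative assumptions, so adjust them (and check the plugin URLs against your release) before stacking:

[[local|localrc]]
ADMIN_PASSWORD=nova
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD

# Swift, needed later for Swift data sources and job binaries
SWIFT_HASH=66a3d6b56c1f479c8b4e70ab5c2000f5
enable_service s-proxy s-object s-container s-account

# Projects used in this tutorial
enable_plugin sahara https://git.openstack.org/openstack/sahara
enable_plugin heat https://git.openstack.org/openstack/heat
enable_plugin manila https://git.openstack.org/openstack/manila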

If you have an error like this:
“E: Unable to locate package liberasurecode-dev
./stack.sh: line 501: generate-subunit: command not found
https://bugs.launchpad.net/devstack/+bug/1547379”

The solution is:
“This is not actually a bug; it happens when DevStack is unable to install some packages. In my case it failed to install liberasurecode-dev, which led to that issue. You need to edit /etc/apt/sources.list to enable installing from trusty-backports (on Ubuntu server) or the corresponding backports repository on another release; that fixed the issue for me.”
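
On Ubuntu 14.04 the backports entry looks roughly like the line below (assuming the default archive mirror); add it to /etc/apt/sources.list and run sudo apt-get update before re-running stack.sh:

deb http://archive.ubuntu.com/ubuntu trusty-backports main restricted universe multiverse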

You probably won’t have any problems here; just be sure that the local.conf file is right and the stack should work just fine. When stack.sh finishes it will show a web address where you can access Horizon [5], which you can log in to using the user and password that you specified in local.conf.
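
If you want to run the CLI commands used later in this post from the DevStack VM, you can load credentials from the openrc file that DevStack generates (the user and password are the ones from your local.conf), for example:

cd devstack
source openrc admin admin
openstack service list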

Enable communication between DevStack and instances

Okay, now we need to create a cluster so we can run a job on it, right? But before that we have to make sure our DevStack VM can communicate with the cluster’s instances.

We can do this with the following commands in the DevStack VM:

sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo route add -net [private_network] gw [openstack_route]

You can get the [private_network] and [openstack_route] through Horizon in the Network menu or through the API as well.
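
For example, with the CLI (the network and router names below are the DevStack defaults, and the addresses are just an illustration; use whatever your deployment actually shows):

openstack network list
openstack subnet show private-subnet
openstack router show router1

# e.g., if the private subnet is 10.0.0.0/24 and the router's gateway address is 172.24.4.2:
sudo route add -net 10.0.0.0/24 gw 172.24.4.2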

Security groups

Also, an important thing to do is to add SSH, TCP and ICMP rules to the default security group and make sure that all the instances in the cluster belong to this group. If you want to, you can create another security group; that’s fine, as long as you add the needed rules and make sure that the instances belong to it.

You can do this easily through Horizon or the API [6].
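
With the CLI, a minimal set of rules could look like this (opening all TCP ports is convenient for a throwaway tutorial environment, but on anything serious open only the ports you actually need):

openstack security group rule create --protocol tcp --dst-port 22 default
openstack security group rule create --protocol icmp default
openstack security group rule create --protocol tcp --dst-port 1:65535 default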

Run a MapReduce Job

Create a Hadoop cluster

In order to create a Hadoop cluster you have to follow a few steps first; these are listed and described below. You can find more info on how to do this at [7].

Register Image

In this tutorial, we’re going to use Hadoop 2.7.1 and this image.

  1. Download Image
    1. wget http://sahara-files.mirantis.com/images/upstream/newton/sahara-newton-vanilla-2.7.1-ubuntu.qcow2
  2. Create Image with glance
    1. openstack image create sahara-newton-vanilla-2.7.1-ubuntu \
      --disk-format qcow2 \
      --container-format bare \
      --file sahara-newton-vanilla-2.7.1-ubuntu.qcow2
  3. Register Image
    1. openstack dataprocessing image register sahara-newton-vanilla-2.7.1-ubuntu \
      --username ubuntu
      Important: the username must be ubuntu
  4. Add hadoop (vanilla) 2.7.1 tag
    1. openstack dataprocessing image tags add sahara-newton-vanilla-2.7.1-ubuntu \
      --tags vanilla 2.7.1

Create node groups

I usually do this part through Horizon. What I do is basically create two node group templates: a master and a worker. Be sure to put both of them in the default security group instead of Sahara’s default option, which is to create a new security group for the instances.

Master: plugin: Vanilla 2.7.1, Flavor m1.medium (feel free to change this, and please do if you don’t have enough memory or disk), availability zone nova, no floating IP

  • namenode
  • secondarynamenode
  • resourcemanager
  • historyserver
  • oozie
  • hiveserver

Worker: plugin: Vanilla 2.7.1, Flavor m1.small (feel free to change this, and please do if you don’t have enough memory or disk), availability zone nova, no floating IP

  • datanode
  • nodemanager

Create a cluster template

Very straightforward: just add a master and at least one worker to it and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, take a look at the logs and make sure that the instances can communicate with each other through SSH.

Running a Hadoop job manually

If you want to “test” the cluster you can run a Hadoop job manually; the process is described below and is totally optional. Below we’re running a wordcount job (there’s a concrete example right after the list).

  1. SSH to master instance
  2. login as hadoop
    1. sudo su hadoop
  3. Create a hdfs input
    1. bin/hdfs dfs -mkdir -p [input_path]
  4. Add some file to it
    1. bin/hdfs dfs -copyFromLocal [path_to_some_file] [input_path]
  5.  Run job:
    1. bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount [input_path] [output_path]
      Important: [output_path] should not exist in HDFS!!!
  6. Get output
    1. bin/hdfs dfs -get [output_path] output
    2. cat output/*
  7. If you don’t need the HDFS output anymore delete it!
    1. bin/hdfs dfs -rm -r [output_path]
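
As a concrete sketch of the steps above, run from the Hadoop home directory (the paths and the input file are just examples I picked; any text file will do):

bin/hdfs dfs -mkdir -p /user/hadoop/wc-input
bin/hdfs dfs -copyFromLocal /etc/hostname /user/hadoop/wc-input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hadoop/wc-input /user/hadoop/wc-output
bin/hdfs dfs -get /user/hadoop/wc-output output
cat output/*
bin/hdfs dfs -rm -r /user/hadoop/wc-output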

Run MapReduce job

FINALLY! So, we’ll use this file as an example and we’ll run the WordCount job. In order to keep things simpler we’ll keep the job binary as an internal job binary and only vary the data sources. If you want to change the job binary type, the process is pretty much the same: just create it somewhere (Swift, HDFS or Manila) and define its URL.

Sahara gives you the addresses below. One thing that may be needed is to add some rules to the security group in order to allow access to these pages, which can be extremely helpful for debugging and for checking logs:

Web UI: http://[master_ip]:50070
Oozie: http://[master_ip]:11000
Job Hist: http://[master_ip]:19888
YARN: http://[master_ip]:8088
Resource Manager: http://[master_ip]:8032

The API commands and details that I don’t show in this section can be seen at [3].

Create a job binary

  1. Just download the file in the link above
  2. Create an internal job binary using the file; also make sure the job binary name keeps the file extension (.jar, .pig, etc.), in this case: hadoop-mapreduce-examples-2.7.1.jar

If you want to use Swift, Manila or HDFS for the job binary there’s no problem.

Create a job template

Just choose a name for the job template, use the type MapReduce, and use the job binary that we created.

Create a data source

HDFS

  1. SSH to master instance
  2. login as hadoop
    1. sudo su hadoop
  3. Create a hdfs input
    1. bin/hdfs dfs -mkdir -p [input_path]
  4. Add some file to it
    1. bin/hdfs dfs -copyFromLocal [some_file] [input_path]
  5. Create data source for input
    1. select HDFS as type
    2. URL: [input_path]
  6. Create data source for output
    1. select HDFS as type
    2. URL: [output_path_that_does_not_exist]

The URL assumes that the path is under /user/hadoop; if it isn’t, please provide the whole path. Also, if you’re using an external HDFS, provide the URL as hdfs://[master_ip]:8020/[path] and make sure you can access it.

SWIFT

  1. Create a container
    1. this can be done through Horizon or the API (there’s a CLI sketch right after this list)
  2. Add an input file in the container
    1. this can be done through Horizon or API
  3. Create data source for input
    1. select Swift as type
    2. URL: [container]/[input_path]
    3. user: [your user], in this case “admin”
    4. password: [your password], in this case “nova”
  4. Create data source for output
    1. select Swift as type
    2. URL: [container]/[output_path_that_does_not_exist]
    3. user:[your user], in this case “admin”
    4. password:[your password], in this case “nova”
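
For reference, steps 1 and 2 can be done with the CLI like this (the container and file names are just examples):

openstack container create wordcount
openstack object create wordcount input.txt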

MANILA

  1. If this is the first time you’re creating a share
    1. create the default type
      1. manila type-create default_share_type True
    2. create a share network
      1. manila share-network-create \
        --name test_share_network \
        --neutron-net-id %id_of_neutron_network% \
        --neutron-subnet-id %id_of_network_subnet%
        I used the private net for the shares and the cluster :).
  2. Create a share
    1. manila create NFS 1 --name testshare --share-network [name_of_share_network]
  3. Make this share accessible
    1. manila access-allow testshare ip 0.0.0.0/0 --access-level rw
      Important: you can (and it’s actually recommended to) restrict the IP to the master’s and workers’ IPs instead of 0.0.0.0/0
  4. SSH to some instance that has access to the share
    1. sudo apt-get install nfs-common
    2. Mount the share
      1. sudo mount -t nfs [share_export_location] /mnt (you can find the export location as shown in the sketch after this list)
    3. Add an input to it
      1. cd /mnt
      2. mkdir [input_path]
      3. cp [some_file] [input_path]
    4. Unmount the share
      1. sudo umount -f /mnt
  5. Create data source for input
    1. select Manila as type
    2. URL: /[input_path]
  6. Create data source for output
    1. select Manila as type
    2. URL: /[output_path_that_does_not_exist]
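
A concrete sketch of step 4, assuming the share is called testshare (the export address below is just an illustration; use whatever manila show reports in export_locations):

manila show testshare
sudo apt-get install -y nfs-common
sudo mount -t nfs 10.254.0.3:/shares/share-12345678 /mnt
sudo mkdir /mnt/wc-input
sudo cp /etc/hostname /mnt/wc-input/
sudo umount -f /mnt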

Run job

  1. Choose the job template we’ve created as a job template
  2. Choose an input data source
  3. Choose an output data source
  4. Configure job correctly
    1. For the job we’re running the minimum conf. needed is:
      1. mapreduce.reduce.class = org.apache.hadoop.examples.WordCount$IntSumReducer
      2. mapreduce.map.output.value.class = org.apache.hadoop.io.IntWritable
      3. mapreduce.map.class = org.apache.hadoop.examples.WordCount$TokenizerMapper
      4. mapreduce.map.output.key.class = org.apache.hadoop.io.Text
      5. mapred.reducer.new-api = true
      6. mapred.mapper.new-api = true
        Important: we need the last two configurations (5, 6) because we’re using the new Hadoop API; the other configs are self-explanatory. (There’s a CLI sketch of the whole submission right after this list.)
    2. If you’re running a Hadoop 1.2.1 job, you would need to configure
      1. mapred.mapoutput.key.class
      2. mapred.mapoutput.value.class
      3. mapred.reducer.class
      4. mapred.mapper.class
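
If you prefer the CLI over Horizon, the submission could look roughly like the sketch below. The template, cluster and data source names are just the ones assumed in this tutorial, and the exact flags may differ between python-saharaclient releases, so check openstack dataprocessing job execute --help first:

openstack dataprocessing job execute \
  --job-template wordcount-template \
  --cluster hadoop-cluster \
  --input wc-input-ds --output wc-output-ds \
  --configs mapred.mapper.new-api:true mapred.reducer.new-api:true \
            'mapreduce.map.class:org.apache.hadoop.examples.WordCount$TokenizerMapper' \
            'mapreduce.reduce.class:org.apache.hadoop.examples.WordCount$IntSumReducer' \
            mapreduce.map.output.key.class:org.apache.hadoop.io.Text \
            mapreduce.map.output.value.class:org.apache.hadoop.io.IntWritable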

Problems? Check the web UI logs; they can be very helpful!

Run a Spark job

IMPORTANT: from this point on I won’t detail how to create data sources, because it’s exactly the same procedure shown for Hadoop.

Create a Spark cluster

Register Image

In this tutorial, we’re going to use Spark 1.6.0 and this image.

  1. Download Image
    1. wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-spark-1.6.0-ubuntu.qcow2
  2. Create Image with glance
    1. openstack image create sahara-mitaka-spark-1.6.0-ubuntu \
      --disk-format qcow2 \
      --container-format bare \
      --file sahara-mitaka-spark-1.6.0-ubuntu.qcow2
  3. Register Image
    1. openstack dataprocessing image register sahara-mitaka-spark-1.6.0-ubuntu \
      --username ubuntu
      Important: the username must be ubuntu
  4. Add spark 1.6.0 tag
    1. openstack dataprocessing image tags add sahara-mitaka-spark-1.6.0-ubuntu \
      --tags spark 1.6.0

Create node groups

In this tutorial I’ll just make one node group that has both the master and the worker processes; I’ll call it All-in-one.

All-in-one: plugin: Spark 1.6.0, Flavor m1.medium (feel free to change this, and please do if you don’t have enough memory or disk), availability zone nova, no floating IP

  • namenode
  • datanode
  • master
  • slave

Create a cluster template

Very straightforward: just add the node group All-in-one and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, take a look at the logs and make sure that the instances can communicate with each other through SSH.

Run a Spark job

For this part we’ll use this job binary, which is a wordcount job. Create the job binary exactly as you created the Hadoop job binary, and choose it as the main binary for the job template.

At this point you’re basically good to go! You’ll need input and output data sources; you can create them exactly as we did for Hadoop.

Run job

  1. Choose the job template we’ve created as a job template
  2. Configure job correctly
    1. main class
      For this job the main class is: sahara.edp.spark.SparkWordCount
    2. configs
      For the Swift data source type you may need to pass the credentials as configs, for example:
      • fs.swift.service.sahara.username = admin
      • fs.swift.service.sahara.password = nova
    3. args
      Now, the biggest difference between Spark and Hadoop is that the data sources are passed as args. So to run with a generic data source you should pass as args:

      • datasource://[name of the input data source]
      • datasource://[name of the output data source]

    4. Run!

Run a Storm job

Create a Storm cluster

Register Image

In this tutorial we’re going to use Storm 0.9.2 and unfortunately there’s no pre-built image available, but don’t be sad! sahara-image-elements makes it easy to create an image for Storm and other plugins: just follow the instructions, generate your image and then come back here! (There’s a rough sketch of this right below.)
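
Assuming the diskimage-create.sh interface of the release I used (the flags can change between releases, so check ./diskimage-create.sh -h), the generation step looks roughly like this:

git clone https://git.openstack.org/openstack/sahara-image-elements
cd sahara-image-elements
./diskimage-create.sh -p storm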

  1. Generate image with sahara-image-element
    1. it will probably generate something like: ubuntu_sahara_storm_latest_0.9.2
  2. Create Image with glance
    1. openstack image create ubuntu_sahara_storm_latest_0.9.2 \
      --disk-format qcow2 \
      --container-format bare \
      --file ubuntu_sahara_storm_latest_0.9.2
  3. Register Image
    1. openstack dataprocessing image register ubuntu_sahara_storm_latest_0.9.2 \
      --username ubuntu
      Important: the username must be ubuntu
  4. Add storm 0.9.2 tag
    1. openstack dataprocessing image tags add ubuntu_sahara_storm_latest_0.9.2 \
      --tags storm 0.9.2

Create node groups

In this tutorial I’ll create a master and a worker; for some reason Storm fails when both components are placed on the same node.

Master: plugin: Storm 0.9.2, Flavor m1.medium (feel free to change this, and please do if you don’t have enough memory or disk), availability zone nova, no floating IP

  • zookeeper
  • nimbus

Worker: plugin: Storm 0.9.2, Flavor m1.small (feel free to change this, and please do if you don’t have enough memory or disk), availability zone nova, no floating IP

  • supervisor

Create a cluster template

Very straightforward: just add a master and at least one worker to it and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, take a look at the logs and make sure that the instances can communicate with each other through SSH.

Run Storm job

For this part we’ll use this job binary, which is one of the examples from storm-examples called ExclamationTopology; it doesn’t have any real use, but it’s a good test job binary! Create the job binary exactly as you created the Hadoop job binary, and choose it as the main binary for the job template.

At this point you’re basically good to go! Storm doesn’t need data sources!

Run job

  1. Choose the job template we’ve created as a job template
  2. Configure job correctly
      1. main class
        For this job the main class is: storm.starter.ExclamationTopology
  3. That’s it! Run!
    1. It will literally run forever (if you want to stop it, just kill it using Storm or Sahara)

 

References
