About Outreachy

Hi everyone,

I’ll keep this post short and just talk a little about what has changed since the last post about Outreachy.

So:

  1. I finished the implementation ( \o/ yeeey), but the code hasn’t been merged yet ( :/ not so yeeey). This was mainly because of the new stable version (Ocata), which started being cut at the end of January, meaning no new features could be merged. But the freeze has ended, the winter is not coming, so the code should be merged soon.
  2. I’ve blogged a little about Mocking, which was a new concept I had to learn in order to make good unit tests for my patch.
  3. I tested my code manually:
    1. ran Hadoop, Spark and Storm jobs in multiple scenarios, fixing some bugs along the way
  4. And currently, I’m reviewing code and making other contributions to the community (adding code, tests, fixing bugs)!

Running Jobs with DevStack and OpenStack – Tutorial

UPDATED: I added spark jobs and storm jobs in this tutorial, hope it helps!

This is a very practical tutorial about how to run MapReduce jobs, Spark jobs and Storm jobs with multiple types of data sources (Manila, HDFS and Swift) using Devstack or OpenStack.

If you have an OpenStack cloud, just ignore all the DevStack related worries (like enabling connection and setting up DevStack).

This post assumes that you have some familiarity with OpenStack, DevStack and Sahara. If you don’t have that familiarity, don’t worry: this should still help you, and you can check the references at the end of this post :)!

For this tutorial I used a VM with Ubuntu 14.04, 12 GB of RAM, 80 GB of disk, 8 vCPUs and the DevStack master branch from 12/16/2016. The plugins used were Hadoop 2.7.1 (this tutorial can easily be adapted to earlier Hadoop versions), Spark 1.6.0 and Storm 0.9.2.

Main sections of this post:

  • SETUP DEVSTACK
  • ENABLE COMMUNICATION BETWEEN DEVSTACK AND INSTANCES
  • RUN A MAPREDUCE JOB
  • RUN A SPARK JOB
  • RUN A STORM JOB

Setup DevStack

First of all, be sure to start DevStack in a VM instead of a real machine.

  1. SSH to the VM
  2. clone DevStack
    1. git clone https://git.openstack.org/openstack-dev/devstack
    2. cd devstack
  3. create a local.conf file in the DevStack folder:
    here is a local.conf example; it enables Manila, Sahara and Heat, which are the projects we’re going to need in this tutorial. There are many templates for this file; one of them, a local.conf with Sahara, is in [4].
  4. ./stack.sh
  5. Go get some lunch or something, it’s going to take some time

If you have an error like this:
“E: Unable to locate package liberasurecode-dev
./stack.sh: line 501: generate-subunit: command not found
https://bugs.launchpad.net/devstack/+bug/1547379”

The solution is:
“This is not a bug, actually; it happens when DevStack is unable to install some packages. In my case it failed to get liberasurecode-dev, which led to that issue. Editing /etc/apt/sources.list to enable installs from trusty-backports (on Ubuntu server), or the equivalent backports on another distribution, fixed the issue for me.”

You probably won’t have any problems here; just be sure the local.conf file is right and the stack should work just fine. When stack.sh finishes, it will show a web address where you can reach Horizon [5], which you can log into using the user and password you specified in local.conf.

Enable communication between DevStack and instances

Okay, now we need to create a cluster so we can run a job on it, right? But before that, we have to make sure our DevStack instance can communicate with the cluster’s instances.

We can do this with the following commands in the DevStack VM:

sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo route add -net [private_network] gw [openstack_route]

You can get [private_network] and [openstack_route] through Horizon in the Network menu, or through the API as well.
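
As a quick sanity check that the network you grabbed from Horizon really is the tenant’s private one, here is a small Python sketch (the CIDR below is a hypothetical example; use the one shown in the Network menu):

```python
import ipaddress

# Hypothetical value; replace with the CIDR shown in Horizon's Network menu.
private_network = "10.0.0.0/24"

net = ipaddress.ip_network(private_network)
# RFC 1918 ranges (10/8, 172.16/12, 192.168/16) are what you expect here.
print(net.is_private)  # True for a tenant private network
```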

Security groups

Also, an important thing to do is to add SSH, TCP and ICMP rules to the default security group and make sure that all the instances in the cluster belong to this group. If you want to, you can create another security group; that’s fine, as long as you add the needed rules and make sure the instances belong to it.

You can do this easily through Horizon or API [6].

Run a MapReduce Job

Create a Hadoop cluster

In order to create a Hadoop cluster you have to follow a few preliminary steps, listed and described below. You can find more info about this at [7].

Register Image

In this tutorial, we’re going to use Hadoop 2.7.1 and this image.

  1. Download Image
    1. wget http://sahara-files.mirantis.com/images/upstream/newton/sahara-newton-vanilla-2.7.1-ubuntu.qcow2
  2. Create Image with glance
    1. openstack image create sahara-newton-vanilla-2.7.1-ubuntu \
      --disk-format qcow2 \
      --container-format bare \
      --file sahara-newton-vanilla-2.7.1-ubuntu.qcow2
  3. Register Image
    1. openstack dataprocessing image register sahara-newton-vanilla-2.7.1-ubuntu --username ubuntu
      Important: the username must be ubuntu
  4. Add hadoop (vanilla) 2.7.1 tag
    1. openstack dataprocessing image tags add sahara-newton-vanilla-2.7.1-ubuntu --tags vanilla 2.7.1

Create node groups

I usually do this part through Horizon, so what I do is basically create two node groups: a master and a worker. Be sure to create both of them in the default security group instead of Sahara’s default option, which is creating a new security group for the instance.

Master:  plugin: Vanilla 2.7.1, Flavor m1.medium (feel free to change this and please do if you don’t have enough memory or HD), available on nova, no floating IP

  • namenode
  • secondarynamenode
  • resourcemanager
  • historyserver
  • oozie
  • hiveserver

Worker:  plugin: Vanilla 2.7.1, Flavor m1.small (feel free to change this and please do if you don’t have enough memory or HD), available on nova, no floating IP

  • datanode
  • nodemanager

Create a cluster template

Very straightforward: just add a master and at least one worker to it and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, just take a look at the logs and make sure the instances can communicate with each other through SSH.

Running a Hadoop job manually

If you want to “test” the cluster you can run a Hadoop job manually; the process is described below and is totally optional. We’ll run a wordcount job.

  1. SSH to master instance
  2. login as hadoop
    1. sudo su hadoop
  3. Create a hdfs input
    1. bin/hdfs dfs -mkdir -p [input_path]
  4. Add some file to it
    1. bin/hdfs dfs -copyFromLocal [path_to_some_file] [input_path]
  5.  Run job:
    1. bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount [input_path] [output_path]
      Important: [output_path] should not exist in HDFS!
  6. Get output
    1. bin/hdfs dfs -get [output_path] output
    2. cat output/*
  7. If you don’t need the HDFS output anymore delete it!
    1. bin/hdfs dfs -rm -r [output_path]
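
If you want to double-check the result, here is what wordcount computes, sketched in plain Python (not the Hadoop implementation, just the same counting logic):

```python
from collections import Counter

def wordcount(text):
    # Hadoop's WordCount example tokenizes on whitespace
    # and counts occurrences of each word.
    return Counter(text.split())

counts = wordcount("to be or not to be")
print(counts["to"])   # 2
print(counts["be"])   # 2
print(counts["not"])  # 1
```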

Run MapReduce job

FINALLY! We’ll use this file as an example and run the WordCount job. To keep things simple we’ll leave the job binary type as a job binary internal and only change the data source. If you want to change the job binary type, the process is pretty much the same: just create it somewhere (Swift, HDFS or Manila) and define its URL.

Sahara gives you the addresses below. One thing that may be needed is to add some rules to the security group to allow access to these pages, which can be extremely helpful for debugging and for accessing logs:

Web UI: http://[master_ip]:50070
Oozie: http://[master_ip]:11000
Job History: http://[master_ip]:19888
YARN: http://[master_ip]:8088
Resource Manager: http://[master_ip]:8032
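
The same addresses as a small Python dict, in case you want to script the security-group rules or reachability checks (ports copied from the list above):

```python
# Web endpoints exposed on the Hadoop master node.
MASTER_PORTS = {
    "web_ui": 50070,
    "oozie": 11000,
    "job_history": 19888,
    "yarn": 8088,
    "resource_manager": 8032,
}

def url(master_ip, service):
    # Build the address for one of the master's web services.
    return "http://%s:%d" % (master_ip, MASTER_PORTS[service])

print(url("192.168.0.10", "oozie"))  # http://192.168.0.10:11000
```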

The API commands and details that I don’t show in this section can be seen at [3].

Create a job binary

  1. Just download the file in the link above
  2. Create a job binary internal using the file; also make sure the name of the job binary keeps the file extension (.jar, .pig, etc.), in this case: hadoop-mapreduce-examples-2.7.1.jar

If you want to use Swift, Manila or HDFS for the job binary there’s no problem.

Create a job template

Just choose a name for the job template, use the type MapReduce, and use the job binary that we created.

Create a data source

HDFS

  1. SSH to master instance
  2. login as hadoop
    1. sudo su hadoop
  3. Create a hdfs input
    1. bin/hdfs dfs -mkdir -p [input_path]
  4. Add some file to it
    1. bin/hdfs dfs -copyFromLocal [some_file] [input_path]
  5. Create data source for input
    1. select HDFS as type
    2. URL: [input_path]
  6. Create data source for output
    1. select HDFS as type
    2. URL: [output_path_that_does_not_exist]

The URL assumes the path is under /user/hadoop; if it isn’t, please provide the whole path. Also, if you’re using an external HDFS, provide the URL as hdfs://[master_ip]:8020/[path] and make sure you can access it.
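
That rule can be sketched as a tiny helper (a sketch only; 8020 is the usual HDFS namenode port, and the IP below is hypothetical):

```python
def hdfs_url(path, master_ip=None, port=8020):
    """Build a data source URL following the convention above.

    Relative paths are resolved against /user/hadoop by the cluster;
    pass master_ip only when pointing at an external HDFS.
    """
    if master_ip is None:
        return path
    return "hdfs://%s:%d/%s" % (master_ip, port, path.lstrip("/"))

print(hdfs_url("input"))                    # input
print(hdfs_url("/data/input", "10.0.0.5"))  # hdfs://10.0.0.5:8020/data/input
```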

SWIFT

  1. Create a container
    1. this can be done through Horizon or API
  2. Add an input file in the container
    1. this can be done through Horizon or API
  3. Create data source for input
    1. select Swift as type
    2. URL: [container]/[input_path]
    3. user: [your user], in this case “admin”
    4. password: [your password], in this case “nova”
  4. Create data source for output
    1. select Swift as type
    2. URL: [container]/[output_path_that_does_not_exist]
    3. user:[your user], in this case “admin”
    4. password:[your password], in this case “nova”

MANILA

  1. If it is the first time you’re creating a share
    1. create the default type
      1. manila type-create default_share_type True
    2. create a share network
      1. manila share-network-create \
        --name test_share_network \
        --neutron-net-id [id_of_neutron_network] \
        --neutron-subnet-id [id_of_network_subnet]
        I used the private net for the shares and the cluster :).
  2. Create a share
    1. manila create NFS 1 --name testshare --share-network [name_of_network]
  3. Make this share accessible
    1. manila access-allow testshare ip 0.0.0.0/0 --access-level rw
      Important: you can (and it’s actually recommended to) restrict the IP to the master’s and workers’ IPs instead of 0.0.0.0/0
  4. SSH to some instance that has access to the share
    1. sudo apt-get install nfs-common
    2. Mount the share
      1. sudo mount -t nfs [share_export_location] /mnt (you can get the export location with manila show testshare)
    3. Add an input to it
      1. cd /mnt
      2. mkdir [input_path]
      3. cp [some_file] [input_path]
    4. Unmount the share
      1. sudo umount -f /mnt
  5. Create data source for input
    1. select Manila as type
    2. URL: /[input_path]
  6. Create data source for output
    1. select Manila as type
    2. URL: /[output_path_that_does_not_exist]

Run job

  1. Choose the job template we’ve created as a job template
  2. Choose an input data source
  3. Choose an output data source
  4. Configure job correctly
    1. For the job we’re running the minimum conf. needed is:
      1. mapreduce.reduce.class = org.apache.hadoop.examples.WordCount$IntSumReducer
      2. mapreduce.map.output.value.class = org.apache.hadoop.io.IntWritable
      3. mapreduce.map.class = org.apache.hadoop.examples.WordCount$TokenizerMapper
      4. mapreduce.map.output.key.class = org.apache.hadoop.io.Text
      5. mapred.reducer.new-api = true
      6. mapred.mapper.new-api = true
        Important: we need these last two configurations (5 and 6) because we’re using the new Hadoop API; the other configs are self-explanatory.
    2. If you were running a Hadoop 1.2.1 job, you would instead need to configure:
      1. mapred.mapoutput.key.class
      2. mapred.mapoutput.value.class
      3. mapred.reducer.class
      4. mapred.mapper.class
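
Through the API, these settings end up in the job execution’s job_configs payload. Here is a sketch of what that dict looks like for our job (the keys are copied from the list above; the configs/args structure follows Sahara’s EDP convention):

```python
# Sketch of the job_configs payload for the WordCount MapReduce job.
# Keys come from the configuration list above.
job_configs = {
    "configs": {
        "mapreduce.map.class":
            "org.apache.hadoop.examples.WordCount$TokenizerMapper",
        "mapreduce.reduce.class":
            "org.apache.hadoop.examples.WordCount$IntSumReducer",
        "mapreduce.map.output.key.class": "org.apache.hadoop.io.Text",
        "mapreduce.map.output.value.class": "org.apache.hadoop.io.IntWritable",
        # Needed because the example uses the new Hadoop API:
        "mapred.mapper.new-api": "true",
        "mapred.reducer.new-api": "true",
    },
    "args": [],  # MapReduce jobs take data sources directly, not as args
}

print(job_configs["configs"]["mapred.mapper.new-api"])  # true
```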

Problems? Check the web UI logs; they can be very helpful!

Run a Spark job

IMPORTANT: from this point on I won’t detail how to create data sources, because it’s exactly the same procedure shown for Hadoop.

Create a Spark cluster

Register Image

In this tutorial, we’re going to use Spark 1.6.0 and this image.

  1. Download Image
    1. wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-spark-1.6.0-ubuntu.qcow2
  2. Create Image with glance
    1. openstack image create sahara-mitaka-spark-1.6.0-ubuntu \
      --disk-format qcow2 \
      --container-format bare \
      --file sahara-mitaka-spark-1.6.0-ubuntu.qcow2
  3. Register Image
    1. openstack dataprocessing image register sahara-mitaka-spark-1.6.0-ubuntu --username ubuntu
      Important: the username must be ubuntu
  4. Add spark 1.6.0 tag
    1. openstack dataprocessing image tags add sahara-mitaka-spark-1.6.0-ubuntu --tags spark 1.6.0

Create node groups

In this tutorial, I’ll just make one node group that contains both the master and the worker; I’ll call it All-in-one.

All-in-one:  plugin: Spark 1.6.0, Flavor m1.medium (feel free to change this and please do if you don’t have enough memory or HD), available on nova, no floating IP

  • namenode
  • datanode
  • master
  • slave

Create a cluster template

Very straightforward: just add the node group All-in-one and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, just take a look at the logs and make sure the instances can communicate with each other through SSH.

Run a Spark job

For this part we’ll use this job binary, a wordcount job. Create the job binary exactly as you created the Hadoop job binary, and choose it as the main binary for the job template.

At this point you’re basically good to go! You’ll need input and output data sources; you can create them exactly as we did with Hadoop.

Run job

  1. Choose the job template we’ve created as a job template
  2. Configure job correctly
    1. main class
      For this job the main class is: sahara.edp.spark.SparkWordCount
    2. configs
      For the Swift type you may need to pass the credentials as configs, for example:
      • fs.swift.service.sahara.username = admin
      • fs.swift.service.sahara.password = nova
    3. args
      Now, the biggest difference between Spark and Hadoop is that the data sources are passed as args. So, to run with a generic data source, you should pass as args:

      • datasource://[name of the input data source]
      • datasource://[name of the output data source]

    4. Run!
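
Here is a sketch of the substitution Sahara performs on those args before the job runs (a simplified illustration, not Sahara’s actual code; the data source names and URLs are hypothetical):

```python
# Simplified illustration of how a "datasource://name" arg is resolved
# to the registered data source's URL before the job runs.
data_sources = {
    # hypothetical data sources registered in Sahara
    "my_input": "swift://wordcount/input.txt",
    "my_output": "swift://wordcount/output",
}

def resolve_args(args):
    prefix = "datasource://"
    resolved = []
    for arg in args:
        if arg.startswith(prefix):
            # Look up the data source by name and substitute its URL.
            resolved.append(data_sources[arg[len(prefix):]])
        else:
            resolved.append(arg)
    return resolved

print(resolve_args(["datasource://my_input", "datasource://my_output"]))
```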

Run a Storm job

Create a Storm cluster

Register Image

In this tutorial, we’re going to use Storm 0.9.2 and unfortunately there’s no image available, but don’t be sad! Sahara-image-elements makes it easy to create an image for Storm and other plugins; just follow the instructions, generate your image and then come back here!

  1. Generate image with sahara-image-element
    1. it will probably generate something like: ubuntu_sahara_storm_latest_0.9.2
  2. Create Image with glance
    1. openstack image create ubuntu_sahara_storm_latest_0.9.2 \
      --disk-format qcow2 \
      --container-format bare \
      --file ubuntu_sahara_storm_latest_0.9.2
  3. Register Image
    1. openstack dataprocessing image register ubuntu_sahara_storm_latest_0.9.2 --username ubuntu
      Important: the username must be ubuntu
  4. Add storm 0.9.2 tag
    1. openstack dataprocessing image tags add ubuntu_sahara_storm_latest_0.9.2 --tags storm 0.9.2

Create node groups

In this tutorial, I’ll create a master and a worker; for some reason Storm fails when both components are on the same node.

master:  plugin: Storm 0.9.2, Flavor m1.medium (feel free to change this and please do if you don’t have enough memory or HD), available on nova, no floating IP

  • zookeeper
  • nimbus

worker:  plugin: Storm 0.9.2, Flavor m1.small (feel free to change this and please do if you don’t have enough memory or HD), available on nova, no floating IP

  • supervisor

Create a cluster template

Very straightforward: just add a master and at least one worker to it and mark the Auto-configure option.

Launch cluster

Now you can launch the cluster and it should work just fine :). If any problem happens, just take a look at the logs and make sure the instances can communicate with each other through SSH.

Run Storm job

For this part we’ll use this job binary, one of the examples from storm-examples called ExclamationTopology; it doesn’t have any real use, but it’s a good test job binary! Create the job binary exactly as you created the Hadoop job binary, and choose it as the main binary for the job template.

At this point you’re basically good to go! Storm doesn’t need data sources!

Run job

  1. Choose the job template we’ve created as a job template
  2. Configure job correctly
      1. main class
        For this job the main class is: storm.starter.ExclamationTopology
  3. That’s it! Run!
    1. It will literally run forever (if you want to stop it, just kill it, e.g. with storm kill [topology_name] on the master, or through Sahara)

References

what is stevedore? and how is it related to Sahara?

Before we start…

This post assumes that:

  1. You know Sahara :)!
  2. You don’t know stevedore :(, or you do know it but feel you should know more about it or about how it relates to Sahara

So if you don’t know Sahara, I can introduce the two of you: here is Sahara. I’ll let you get to know each other.

First, what is stevedore?

“Python makes loading code dynamically easy, allowing you to configure and extend your application by discovering and loading extensions (“plugins”) at runtime. Many applications implement their own library for doing this, using __import__ or importlib. stevedore avoids creating yet another extension mechanism by building on top of setuptools entry points. The code for managing entry points tends to be repetitive, though, so stevedore provides manager classes for implementing common patterns for using dynamically loaded extensions.”

If you’re not familiar with some concepts like setuptools, entry points, and how these concepts could be used to make the process of loading code dynamically easier, this link can be very helpful.

So now we kind of know what stevedore is: “stevedore manages dynamic plugins for Python applications”. Note that stevedore is “independent” of OpenStack: it’s used by a lot of OpenStack projects, but any Python code can use it :)!

Okay… but what exactly are Plugins and why should I use them?

Plugins are software components that add features to an existing computer program (the core code). “When a program supports plug-ins, it enables customization.”

In other words, a plugin is a piece of code that is not part of the core code, and because of this it can be added or removed easily. And it provides some features, services and operations in a more specific way.

And we should use plugins because:

  • We already talked about customization, which means that with plugins you can have “different versions” of the code more easily. If I don’t need all the plugins, I can install just the plugins and dependencies I need.
  • We will have an improved design. “Keeping a separation between core and extension code encourages you to think more about abstractions in your design”.
  • “Plugins are a good way to implement device drivers and other versions of the Strategy pattern. The application can maintain generic core logic, and the plugin can handle the details for interfacing with an outside system or device.”
  • “Plugins also provide a convenient way to extend the feature set of an application by hooking new code into well-defined extension points. And having such an extensible system makes it easier for other developers to contribute to your project indirectly by providing add-on packages that are released separately.” 

And how Sahara uses stevedore?

Let’s open Sahara source code and have a look, here is the link: https://github.com/openstack/sahara

As we discussed, plugins are loaded through entry points. The configuration can be seen in a file called setup.cfg; search for [entry_points]. The syntax is:

[entry_points]
namespace =
    name = module.path:importable_from_module

From the names of the namespaces we can guess that there is a driver responsible for SSH, that console scripts are implemented with stevedore, plus the cluster plugins, a Heat engine and some other things.
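
To make the mechanism concrete, here is a toy, stdlib-only version of what an entry-point loader does (an illustration of the idea, not stevedore’s actual code; the registry below is hypothetical, mimicking the name = module.path:attribute syntax):

```python
import importlib

# Hypothetical "entry points": name -> "module.path:attribute",
# mimicking what setup.cfg declares under [entry_points].
registry = {
    "dumper": "json:dumps",
    "pretty": "pprint:pformat",
}

def load_plugin(name):
    """Resolve a name the way entry points do: import the module,
    then fetch the attribute after the colon."""
    module_path, attribute = registry[name].split(":")
    module = importlib.import_module(module_path)
    return getattr(module, attribute)

dumps = load_plugin("dumper")
print(dumps({"plugin": "loaded"}))  # {"plugin": "loaded"}
```

stevedore wraps exactly this pattern in manager classes (DriverManager, ExtensionManager and friends), plus discovery of entry points declared by any installed package.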

Feel free to explore these namespaces on your own and see how they work! I may add more details about this over time :)!

References

http://docs.openstack.org/developer/stevedore/
http://docs.pylonsproject.org/projects/pylons-webframework/en/latest/advanced_pylons/entry_points_and_plugins.html
https://www.youtube.com/watch?v=U53ND5NucYY
