Installing Apache Spark Cluster on Odroid C2 and Raspberry Pi

Contents

Summary

Network configurations

Odroid C2 Ubuntu 16.04 Mate

Static IP

Setting the Hostname

Raspberry Pi

Static IP

Setting the Hostname

Common configurations

Hosts file editing

Disabling IPv6

Firewall

Creating an admin user for Spark

Enable SSH communication between nodes

Swap

Installing other packages for Spark

Java Installation

Installing Scala

Scientific Python installation

Installing Apache Spark

Spark Configurations

Slaves

log4j.properties

spark-defaults.conf

spark-env.sh

Bash configurations

Starting and Stopping Spark

Submitting work to the cluster

Summary

For this Spark cluster installation, I used the following packages and tools:

· Oracle Java SDK 8

· Scala 2.12.1

· Apache Spark 2.1.0

If you come across URLs or file names below that point to versions other than these, ignore the version numbers. I was just too lazy to update my own notes, which I gathered from several sources. Change the file names and URLs to match the versions you want and need.

Before I go on describing the process, I just want to mention that while it has been a bit difficult doing this installation, this was fun once I got it to work.

My biggest problem was that I come from the Microsoft ecosystem, where most things are pretty much behind a UI and you don’t necessarily have to understand or know that much.

Is this good or bad? Well it depends, sometimes having a button that does things is nice but on the other hand, it takes away from actually knowing what you are doing. For example, managing users, privileges, file system etc is a totally different thing on Linux and you actually have to know what you are doing.

I found the experience with Linux very fun and enjoyable. What I struggled with the most was Spark and Hadoop (not the topic of this post). It was difficult to understand which configurations are needed. I had problems getting Hadoop to do anything, and the errors were, as usual with any software, obscure, or in other words a pain in the ass.

I really had to focus and want to make it work. I felt like giving up at times.

Anyway, learning to use Linux was the most fun, so much fun that I ended up installing a Linux distro dual booting with Windows.

Network configurations

Odroid C2 Ubuntu 16.04 Mate

In my situation, I did the configurations through the UI but you can do it through the terminal.

Static IP

To configure a static IP go to: System > Preferences > Internet and Network > Network Proxy

There, go to the IPv4 Settings and enter the network settings you want for the Odroid C2.

Setting the Hostname

Go to: System > Administration > Network

Raspberry Pi

Static IP

Right-click the network indicator in the top right corner (the two arrows pointing up and down) and select “Wireless & Wired Network Settings”. Then select the eth0 interface and enter the network settings you want.

Setting the Hostname

Go to: “Start” icon > Preferences > Raspberry Pi Configuration > System tab > Hostname

You might need to restart your system.

Common configurations

Hosts file editing

This configuration must be done on every machine. You will need hostnames for your machines so that Spark and Hadoop can properly communicate between the machines (nodes) in your cluster.

Type in terminal:

sudo nano /etc/hosts

Add your machine IPs and desired hostnames. My hosts file looks like this:

Notice that I have removed everything else and left only the localhost definition. This is a strange thing: with Spark I could not get the workers to communicate properly without the localhost definition, but with Hadoop it was the other way around. I am not sure why this is.
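As a rough sketch of a hosts file along those lines, here is a small shell function that writes one to a scratch location so you can review it before touching /etc/hosts. The address 192.168.10.65 is the master address that shows up in the Bash configurations later in this post; the two Raspberry Pi addresses and the function name write_hosts_file are made up for illustration, so substitute your own.

```shell
#!/usr/bin/env bash
# Sketch: write a cluster hosts file to the path given as the first argument.
# 192.168.10.65 is the master address used later in this post; the two
# Raspberry Pi addresses are placeholders, not values from my cluster.
write_hosts_file() {
    local target="$1"
    cat > "$target" <<'EOF'
127.0.0.1       localhost
192.168.10.65   odroid64
192.168.10.66   raspberrypi01
192.168.10.67   raspberrypi02
EOF
}

# Usage: write_hosts_file /tmp/hosts.new
# Review the result, then: sudo cp /tmp/hosts.new /etc/hosts
```

Generating the file somewhere else first means a typo cannot lock you out of name resolution on the node you are working on.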

Disabling IPv6

After looking at several tutorials on installing Hadoop, I noticed they all mentioned that it is good practice to disable IPv6 support. Apparently Hadoop does not support IPv6 properly, or at all. Since Apache Spark works on top of Hadoop, I applied the same method to Spark as well.

Open the /etc/sysctl.conf file for editing and add the following at the end of the file:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
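If you prefer not to edit the file by hand, the same three lines can be appended with a small function. This is only a sketch: the function name disable_ipv6_in is my own invention, and it takes the file path as an argument so you can try it on a copy first.

```shell
#!/usr/bin/env bash
# Sketch: append the IPv6-disabling keys to a sysctl-style file, skipping
# any line that is already present so the function can be re-run safely.
disable_ipv6_in() {
    local conf="$1" key line
    for key in all default lo; do
        line="net.ipv6.conf.${key}.disable_ipv6 = 1"
        grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
    done
}

# Usage (as root): disable_ipv6_in /etc/sysctl.conf && sysctl -p
# Afterwards /proc/sys/net/ipv6/conf/all/disable_ipv6 should contain 1.
```

Running sysctl -p reloads /etc/sysctl.conf, so the change takes effect without a reboot.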

Firewall

In case you have communication problems between the nodes in the cluster, either disable the firewall or allow the nodes to reach each other on the needed ports.
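If you would rather keep the firewall on, one option on Ubuntu is ufw. The sketch below assumes ufw is installed (I did not verify this on Raspbian) and that you pass in your own node addresses; the ones in the usage line are placeholders. It allows all traffic from each listed node instead of opening individual ports, because Spark picks random ports for several of its services unless you configure them.

```shell
#!/usr/bin/env bash
# Sketch: allow all incoming traffic from each cluster node given as an argument.
# The UFW variable can be overridden (for example UFW=echo) to preview the
# rules without changing anything.
allow_cluster_nodes() {
    local ip
    for ip in "$@"; do
        ${UFW:-sudo ufw} allow from "$ip" || return
    done
}

# Usage (placeholder addresses):
#   allow_cluster_nodes 192.168.10.66 192.168.10.67
```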

Creating an admin user for Spark

In the terminal, create the user, add it to a group and give it admin privileges:

sudo addgroup spark
sudo adduser --ingroup spark spuser
sudo adduser spuser sudo

Log in as this user and do everything related to Spark with it:

su spuser

Enable SSH communication between nodes

This is to avoid password prompts when using Spark. If you do not do this you might run into problems, and you will constantly be asked for the password of the account you run Spark under.

Next create the SSH key and add it to the authorized_keys file.

$ cd ~
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Verify that SSH login works:

ssh localhost

Copy the public key to the slave nodes in your cluster

$ ssh-copy-id spuser@raspberrypi01

$ ssh-copy-id spuser@raspberrypi02

Then test the connection to the slaves:

$ ssh raspberrypi01

Swap

This is something I had to do for the Odroid C2. Even though the Odroid has double the RAM of the Raspberry Pi, it also has Ubuntu installed on it, which takes up nearly half the memory when booting. Raspbian takes a little over 150 MB, so about 15 % of the Raspberry Pi’s total RAM.

I ran into problems when I wanted to use the Odroid as the master and also as a slave node for calculation. Because I wanted to use as much memory as possible on the actual Raspberry Pi nodes, I allocated 768 MB there, which is OK for the Pis. On the Odroid I could not allocate less than 512 MB, and allocating 512 MB caused the Odroid to swap. Since no swap had been created, the OS crashed or became unresponsive.

To combat this problem I created a swap file for the Odroid. Its size was double the RAM, 4 GB:

The Swap creation guide below is from:

http://www.tutorialspoint.com/articles/how-to-enable-or-add-swap-space-on-ubuntu-16-04

Checking for the Swap Information

Before we begin, we will first check the swap space available on the system.

We can use the command below to see whether the system has any swap:

$ free -h

We can also run the command below, but if no swap exists it prints nothing:

$ sudo swapon -s

If the output shows that swap is not enabled or configured, we can set it up on this machine. First, check the free disk space with the command below:

$ df -h

Creating a Swap File

Now that we know how much disk space is available, we can go ahead and create a swap file on the filesystem. To create it we can use ‘fallocate’, a utility that instantly preallocates a file of a given size. As we have little space on the server, we will create a 512 MB swap file with the command below.

$ sudo fallocate -l 512M /swapfile

And to check the swap file we will use the command below:

$ ls -lh /swapfile
-rw-r--r-- 1 root root 512M Sep 6 14:22 /swapfile

Enabling the Swap to use the Swap File

Before we enable the swap, we need to fix the file permissions so that nobody other than root can read or write the file. The command below changes the permissions:

$ sudo chmod 600 /swapfile

Once the permissions are changed, run the command below to check them:

$ ls -lh /swapfile
-rw------- 1 root root 512M Sep  6 14:22 /swapfile

We will now set this file up as swap space using the command below:

$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 524284 KiB
no label, UUID=d02e2bbb-5fcc-4c7b-9f85-4ae75c9c55f9

Now we will enable the swap and then list it with the commands below:

$ sudo swapon /swapfile
$ sudo swapon -s
Filename                                Type            Size    Used    Priority
/swapfile                               file            524284  0       -1

We can also check with the free -h command to see the swap space:

$ free -h

Making the Swap File Permanent

With the steps above we have created the swap file and can use it, but the setting is lost once the machine is rebooted. To keep using this swap file, we need to make it permanent.

We will edit /etc/fstab and add an entry so that the swap file is enabled even after a reboot:

$ sudo  vi /etc/fstab

Add the below line to the existing file.

/swapfile            none     swap     sw         0            0
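Since I actually used 4 GB on the Odroid rather than the 512 MB from the guide, here is a sketch that collects the guide's commands into one function with the size as a parameter. The names run and make_swap are my own, and the DRY_RUN switch is only there so you can print the commands before running them for real. The /etc/fstab entry above still has to be added by hand.

```shell
#!/usr/bin/env bash
# Sketch: create and enable a swap file of the given size.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

make_swap() {
    local size="$1" file="${2:-/swapfile}"
    run sudo fallocate -l "$size" "$file" || return
    run sudo chmod 600 "$file" || return
    run sudo mkswap "$file" || return
    run sudo swapon "$file"
}

# Usage: DRY_RUN=1 make_swap 4G    (preview)
#        make_swap 4G              (create and enable /swapfile)
```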

For better performance for using the swap memory, we can do some tweaks.

 

Installing other packages for Spark

For my installation, I needed a few other packages to make things work:

  • Oracle Java version 8
  • Scala

Java Installation

 

For a more automatic installation, type the following commands:

$ sudo apt-get update && sudo apt-get install oracle-java8-jdk

$ sudo update-alternatives --config java

 

For a more manual installation, go to the Oracle website and download the Linux ARM 64 Hard Float ABI package: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Notice that my Raspberry Pi with Raspbian from December had the 32-bit Java 7 version installed. I used the 64-bit Java 8 for the Odroid C2 and the 32-bit Java 8 for the Raspberry Pi.

When you have the package, do the following:

Enter the command to extract jdk-8-linux-arm-vfp-hflt.tar.gz to /opt directory.

$ sudo tar zxvf jdk-8-linux-arm-vfp-hflt.tar.gz -C /opt

 

Set the default java and javac to the newly installed JDK 8.

$ sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0/bin/javac 1

$ sudo update-alternatives --install /usr/bin/java java /opt/jdk1.8.0/bin/java 1

$ sudo update-alternatives --config javac

$ sudo update-alternatives --config java

Finally, verify both with the -version option.

$ java -version

$ javac -version

I also made the spark user the owner:

$ sudo chown -R spuser:spark /opt/jdk1.8.0/

Notice: To make life easier you should add environmental variables to your bashrc file. More on this in the Spark installation portion.

Installing Scala

Navigate to the following URL: http://www.scala-lang.org/download/

I downloaded the tar package, extracted it to a location and added proper privileges to the spark user:

$ sudo tar zxvf scala-2.12.1.tgz -C /opt

$ sudo chown -R spuser:spark /opt/scala-2.12.1/

 

Notice: To make life easier you should add environmental variables to your bashrc file. More on this in the Spark installation portion.

Scientific Python installation

 

This is not a requirement but I used these scripts to install Jupyter and Python 3.5 on my cluster nodes:

https://github.com/kleinee/jns

Installing Apache Spark

 

Start by downloading your desired package from the Apache Spark site: http://spark.apache.org/downloads.html

Or use wget: wget http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.tar.gz

Then extract Spark package

$ sudo tar -xvzf spark.2.1.0.tar.gz -C /opt/

Add the Spark user as owner

$ cd /opt
$ sudo chown -R spuser:spark spark.2.1.0/

If you have problems accessing the Spark folder, or if you are installing Hadoop and have configured the namenode file system locations etc., use the following command to add more privileges to the desired locations:

$ sudo chmod 750 /opt/hadoop/hadoop_data/hdfs

Spark Configurations

Go to the Spark conf folder, depending on where you installed Spark:

$ cd /media/microSD/spark/conf

There are four files I had to configure for the cluster to work:

  • slaves
  • log4j.properties
  • spark-defaults.conf
  • spark-env.sh

For more info on these files check out Spark documentation:

http://spark.apache.org/docs/latest/configuration.html

http://spark.apache.org/docs/latest/spark-standalone.html

The first step is to rename some of the files mentioned above. Some of the files have the “.template” file extension on them, remove it:

$ mv log4j.properties.template log4j.properties

If the slaves file does not exist, nano will create it automatically:

$ sudo nano slaves

The above assumes you are in the conf folder.

When you are done with the configurations on your master, just copy them with scp to all slave nodes. Make sure you change the node-specific values in these files (more on this below).
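The copying can be scripted. The sketch below reads the worker hostnames from the slaves file and pushes the three files that are identical on every node; spark-env.sh is left out on purpose because it contains node-specific values. The function names list_workers and push_conf are made up, and the script assumes the spuser account and the same conf path exist on every node.

```shell
#!/usr/bin/env bash
# Sketch: print the hostnames from a slaves file, skipping blank lines
# and anything after a '#'.
list_workers() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Copy the node-independent config files to every worker except this machine.
push_conf() {
    local conf_dir="$1" self host
    self="$(hostname)"
    for host in $(list_workers "$conf_dir/slaves"); do
        if [ "$host" = "$self" ]; then
            continue    # the master is also listed as a worker
        fi
        scp "$conf_dir/slaves" "$conf_dir/log4j.properties" \
            "$conf_dir/spark-defaults.conf" "spuser@${host}:${conf_dir}/" || return
    done
}

# Usage: push_conf /media/microSD/spark/conf
```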

Slaves

Here you add the hostnames of the slave machines, that is, the machines you want to do the work (the calculations) for you:

odroid64

raspberrypi01

raspberrypi02

log4j.properties

With this file all we want is to minimize the amount of logging on screen. It is much easier to spot what is going on when you are not flooded with basic operational messages.

What you need to do is change this:

log4j.rootCategory=INFO, console

to this:

log4j.rootCategory=WARN, console

spark-defaults.conf

Here we just want to specify the master URL so that we do not always have to specify it when submitting work to the Spark cluster:

spark.master                     spark://odroid64:7077

spark-env.sh

Here you specify the parameters that your cluster will use to communicate with all the nodes within it:

SPARK_MASTER_IP=odroid64

SPARK_WORKER_MEMORY=512m

SPARK_MASTER_HOST=odroid64

SPARK_LOCAL_IP=odroid64

SPARK_WORKER_CORES=2

SPARK_DAEMON_MEMORY=512m

SPARK_EXECUTOR_INSTANCES=1

SPARK_EXECUTOR_CORES=2

SPARK_EXECUTOR_MEMORY=512m

SPARK_DRIVER_MEMORY=512m

There are many variables you can tweak. I used, and had to use, the ones above to get things to work.

SPARK_MASTER_IP and SPARK_MASTER_HOST HAVE to be the same on all nodes. The rest have to correspond to the actual physical node where the configuration file resides.
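To avoid editing the node-specific values by hand on each machine, the file can be generated. This is a sketch only: render_spark_env is a made-up name, it varies just the local hostname and the worker memory, and it keeps every other value exactly as listed above. The 768m in the usage line is the amount I mentioned allocating on the Raspberry Pi nodes.

```shell
#!/usr/bin/env bash
# Sketch: print a spark-env.sh for one node.
# Arguments: master hostname, this node's hostname, worker memory (default 512m).
render_spark_env() {
    local master="$1" node="$2" worker_mem="${3:-512m}"
    cat <<EOF
SPARK_MASTER_IP=${master}
SPARK_MASTER_HOST=${master}
SPARK_LOCAL_IP=${node}
SPARK_WORKER_MEMORY=${worker_mem}
SPARK_WORKER_CORES=2
SPARK_DAEMON_MEMORY=512m
SPARK_EXECUTOR_INSTANCES=1
SPARK_EXECUTOR_CORES=2
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
EOF
}

# Usage, run on raspberrypi01 inside the conf folder:
#   render_spark_env odroid64 raspberrypi01 768m > spark-env.sh
```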

 

 

 

Bash configurations

 

For my cluster I used the following configurations (disregard the Hadoop ones, they are not necessary for Spark):

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121

export HADOOP_HOME=/media/microSD/hadoop-2.7.3

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

#export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/media/microSD/hadoop-2.7.3/lib/native/"

#export PATH=$PATH:$JAVA_HOME/bin:/media/microSD/spark/sbin:/media/microSD/spark/sbi

 

export SBT_HOME=/media/microSD/sbt

export SPARK_HOME=/media/microSD/spark

export SCALA_HOME=/media/microSD/scala-2.12.1

export PATH=$PATH:$JAVA_HOME/bin

export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

export SPARK_MASTER_URL=http://192.168.10.65:7077

 

The important ones for Spark are:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121

export SPARK_HOME=/media/microSD/spark

export SCALA_HOME=/media/microSD/scala-2.12.1

export PATH=$PATH:$JAVA_HOME/bin

export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

export SPARK_MASTER_URL=http://192.168.10.65:7077

You want to add Spark, Scala and Java to the PATH environmental variable to be able to access commands easily from Terminal.

You can then copy these configurations to the slave nodes using the scp command:

http://www.hypexr.org/linux_scp_help.php

$ scp ~/.bashrc spuser@raspberrypi01:~

To force a refresh of the environment variables:

$ source ~/.bashrc

Starting and Stopping Spark

This is simple.

To start Spark type:

$ start-all.sh

To stop Spark:

$ stop-all.sh

To access Spark web UI:

http://odroid64:8080/

 

Submitting work to the cluster

 

Use the Spark specific command:

spark-submit

More info:

http://spark.apache.org/docs/latest/submitting-applications.html
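As a quick smoke test you can submit the SparkPi example that ships with Spark. The sketch below is an assumption-heavy helper rather than something from my notes: submit_pi is a made-up name, the jar location under examples/jars is where I would expect the prebuilt Spark 2.x packages to keep it, and the SPARK_SUBMIT override exists only so the command can be previewed.

```shell
#!/usr/bin/env bash
# Sketch: run the bundled SparkPi example against the standalone master.
# Arguments: master URL (default spark://odroid64:7077), number of slices (default 10).
submit_pi() {
    local master="${1:-spark://odroid64:7077}" slices="${2:-10}"
    local home="${SPARK_HOME:-/media/microSD/spark}"
    local jar
    # Assumed jar location; adjust it to match your download.
    jar="$(ls "$home"/examples/jars/spark-examples_*.jar 2>/dev/null | head -n 1)"
    if [ -z "$jar" ]; then
        echo "no examples jar found under $home/examples/jars" >&2
        return 1
    fi
    "${SPARK_SUBMIT:-spark-submit}" --class org.apache.spark.examples.SparkPi \
        --master "$master" "$jar" "$slices"
}

# Usage: submit_pi                          (defaults from this post)
#        SPARK_SUBMIT=echo submit_pi        (print the arguments only)
```

If the job shows up in the web UI and finishes, the master and the workers can talk to each other.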
