Category Archives: Linux

Lessons learned from building microservices – Part 1: Logging

This is part of a series of posts discussing things I learned while working with microservices. What I write here is not absolute truth; these were the best solutions at the time my team and I used them. You might choose to do things differently, and I highly recommend finding out for yourself the practices and approaches that work best for you and your project.

I also assume that you have a broad range of pre-existing knowledge of building microservices, APIs, programming languages, cloud providers etc.

Notice: In the examples below I will omit “boilerplate” code to save space.

Base requirements for logging

Service instances

In a microservice architecture the most important thing is being able to see what each microservice instance is doing. In the case of Kubernetes this means each pod, with Docker each container, and so on.

So if you have a service named Customer and three instances of it running, you want to know what each instance is doing when it logs. Here is a checklist of things to consider (a small sketch follows the list):

  • You need to know what each service instance is doing, because each instance will process logic and each instance will have its own output based on what it is doing or is requested to do
  • Each log entry should identify which service instance performed the log entry, by providing a unique service instance id
  • Each log entry should identify which application version the service instance is running
  • Each log entry should tell which environment the service instance is operating in, for example: development, test, qa, prod
  • If possible, each log entry should tell where the service instance is running, such as its IP address or hostname
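Below is a minimal sketch of what capturing this per-instance information could look like. The class and field names are illustrative only (they are not from the original post), the imports (java.util.UUID, java.net.InetAddress) are omitted like the rest of the boilerplate, and in a real service you would most likely wire the version and environment values in from your build and deployment pipeline.

public class ServiceInstanceInfo {

    // Assumption: a random UUID generated once per started instance is enough as a unique instance id
    private final String instanceId = UUID.randomUUID().toString();
    private final String appVersion;
    private final String environmentId;
    private final String hostName;

    public ServiceInstanceInfo(String appVersion, String environmentId) {
        this.appVersion = appVersion;
        this.environmentId = environmentId;
        String resolvedHostName;
        try {
            // Best effort resolution of the host/pod name to attach to log entries
            resolvedHostName = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            resolvedHostName = "unknown";
        }
        this.hostName = resolvedHostName;
    }

    // Getters omitted, as with the other examples in this post
}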

Monitoring

Next you need a way to push logs to a central location, aggregate them, parse and index them, analyze them, and finally be able to easily search logs, build graphs, set up alerts etc.

A common stack, and the one I used, is Elasticsearch, Logstash and Kibana (ELK). You can mix and match different services and solutions to get the same results.

Log types

Next I’ll cover the different logging types you might need and that will make your life easier.

General logging details

Before we cover the different types of logs you might need, we first need some common data with each log entry. This data will help in different ways depending on the solution you are building. In my example this data is related to an API backend, but you might find it useful in other types of solutions.

So consider adding these logging fields to other logs as metadata.

public class LogData {
    private String requestId;
    private String userId;
    private String environmentId;
    private String appName;
    private String appVersion;
    private Instant createdAt;
}
Field | Sample | Description
requestId | 6f88dcd0-f628-44f1-850e-962a4ba086e3 | A value that represents a single request to your API. The request id should be applied to all log entries so that you can group all log entries belonging to one request.
userId | 9ff4016d-d4e6-429f-bca8-6503b9d629e1 | Same idea as the request id, but a user id representing the possible user that made the API request.
environmentId | DEV, TEST, PROD | Tells a person looking at a log entry which environment the entry came from. This is important when all log entries are pushed into one location and not separated physically.
appName | Your Cool API | Same as with the environment id but concerns the app name.
appVersion | 2.1.7 | Same as with the environment id but concerns the app version.
createdAt | 02/08/2019 12:37:59 | When the log entry was created. This helps a lot in tracking the progress of the application logic across environments when troubleshooting.
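As an illustration, here is one common way (not necessarily the one used at the time) to generate the request id at the edge of the API and carry it along with SLF4J's MDC, so that every log entry written while handling the request can include it:

String requestId = UUID.randomUUID().toString();
MDC.put("requestId", requestId);   // org.slf4j.MDC
try {
    // handle the request; every log entry written here can read requestId from the MDC
} finally {
    MDC.remove("requestId");       // avoid leaking the id to the next request handled on this thread
}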

Access log

Access logs are a great way to keep track of your API requests and their responses to a client. I won’t go deeper into them here; there are plenty of detailed descriptions available which I recommend going through, for example:

https://httpd.apache.org/docs/2.4/logs.html#accesslog

https://en.wikipedia.org/wiki/Server_log

Here is some sample code:

public class AccessLog {
    private String clientIP;
    private String userId;
    private String timestamp;
    private String method;
    private String requestURL;
    private String protocol;
    private int statusCode;
    private int payloadSize;
    private String browserAgent;
    private String requestId;
}
Field | Sample | Description
clientIP | 127.0.0.1 | The IP address of the client that made the request to your API.
userId | aa10318a-a9b7-4452-9616-0856a206da75 | Preferably the same user id that was used in the LogData class above.
timestamp | 02/08/2019 12:37:59 | A date time format of your choice for when the request occurred.
method | GET, POST, PUT etc. | HTTP method of the request.
requestURL | https://localhost:9000/api/customer/info | The URL of the request.
protocol | HTTP/1.1 | The protocol used to communicate with the API.
statusCode | 200, 201, 401, 500 etc. | HTTP status code of the response.
payloadSize | 2345 | The size of the payload returned to the client.
browserAgent | Mozilla/4.08 [en] (Win98; I ;Nav) | “The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent.” – https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
requestId | | This should be the same request id used in the LogData class earlier.
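As a hedged sketch (my own illustration, not code from the original post), the AccessLog could be filled in a servlet filter or interceptor roughly like this, assuming the usual setters that the post leaves out as boilerplate and the servlet imports (javax.servlet.http.*, java.time.Instant); resolveUserId() is a hypothetical helper:

void logAccess(HttpServletRequest request, HttpServletResponse response, String requestId) {
    AccessLog accessLog = new AccessLog();
    accessLog.setClientIP(request.getRemoteAddr());
    accessLog.setUserId(resolveUserId(request));          // hypothetical helper, e.g. reads the auth token
    accessLog.setTimestamp(Instant.now().toString());
    accessLog.setMethod(request.getMethod());
    accessLog.setRequestURL(request.getRequestURL().toString());
    accessLog.setProtocol(request.getProtocol());
    accessLog.setStatusCode(response.getStatus());        // requires Servlet 3.0 or newer
    accessLog.setBrowserAgent(request.getHeader("User-Agent"));
    accessLog.setRequestId(requestId);
    // write accessLog to your logger / log shipper here
}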

Message Queue Log

This is a sample log which you could use with events/message queues. Depending on which message queue you use and how it is configured, you will most likely have only minimal information about the message pushed to a queue.

From a troubleshooting and traceability point of view, I would recommend passing additional metadata with the message, related to the situation in which the message originated.

Let’s take an API request as an example. What I did was add an additional property field to my message which held a JSON version of the class below. Looking at it you can see that it mostly contains the same fields as the LogData class earlier, with added metadata related to the message itself, which can also be used to control the message logic at the receiving end.

public class MessageQueueLog {
    private String sourceHostname;
    private String sourceAppName;
    private String sourceAppVersion;
    private String sourceEnvironmentId;
    private String sourceRequestId;
    private String sourceUserId;
    private String message;
    private String messageType;
}
Field | Sample | Description
sourceHostname | | See the LogData example earlier.
sourceAppName | | See the LogData example earlier.
sourceAppVersion | | See the LogData example earlier.
sourceEnvironmentId | | See the LogData example earlier.
sourceRequestId | | See the LogData example earlier.
sourceUserId | | See the LogData example earlier.
message | JSON data | JSON data representing a serialized object that holds important data to be used at the receiving end.
messageType | UPDATE_USER, DELETE_USER | A simple unique static ID for the message. This ID tells the receiving end what it needs to do with the data in the message field.
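To make this concrete, here is a rough sketch (my own illustration, not code from the original post) of building the metadata object and serializing it with Jackson; the resulting JSON string would then be attached as a message property/header using whatever client library your broker provides. The requestId and updatedUser values are hypothetical and would come from the surrounding request handling, and the setters are assumed to exist as omitted boilerplate:

String buildMessageMetadata(String requestId, Object updatedUser) throws JsonProcessingException {
    ObjectMapper mapper = new ObjectMapper();   // com.fasterxml.jackson.databind.ObjectMapper

    MessageQueueLog log = new MessageQueueLog();
    log.setSourceAppName("Your Cool API");
    log.setSourceEnvironmentId("DEV");
    log.setSourceRequestId(requestId);
    log.setMessageType("UPDATE_USER");
    log.setMessage(mapper.writeValueAsString(updatedUser));   // the payload object, serialized

    // Attach the returned JSON string as a message property before publishing
    return mapper.writeValueAsString(log);
}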

Metrics log

With metrics logs the idea is to be able to track the things you care about in your application. A common thing to track is how an external request made from your own code is performing. This allows you to set up alerts and troubleshoot problems with external sources; combined with the access log, you can also see how long your request took to finish in total.

So you could track the following metrics:

  • An external source like a database, API, service etc.
  • Your request's total processing time from start to end, until a response is returned
  • Some important section of your code
public class MetricsLog {

    private String title;
    private String body;
    private String additional;
    private String url;
    private int statusCode;
    private Double payloadSize;
    private Long receivedResponseAtMillis = 0L;
    private Long sentRequestAtMillis = 0L;
    private MetricsLogTypes logType;
    private double elapsedTimeInSeconds = 0;
    private double elapsedTimeInMS = 0;
    private String category;
}
Field | Sample | Description
title | User Database |
body | Update user |
additional | Some additional data |
url | http://localhost:9200/api/car/types | If this is an API request to an external service you should log the request URL.
statusCode | 200, 401, 500 etc. | The HTTP status code returned by the external source.
payloadSize | 234567 | The size of the returned data.
receivedResponseAtMillis | 1575364455 | When the response was received, for example in UNIX epoch time.
sentRequestAtMillis | 1575363455 | When the request was sent, for example in UNIX epoch time.
logType | API, DATABASE, CODE etc. | Identifies what kind of a metric this is.
elapsedTimeInSeconds | 1 | How long it took for the response to be received, in seconds.
elapsedTimeInMS | 1000 | How long it took for the response to be received, in milliseconds.
category | Category1/2/3 etc. | Can be used to group different metrics together.
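As a minimal sketch (again assuming the setters omitted as boilerplate, a hypothetical callExternalApi() helper, and that MetricsLogTypes has an API value as the table suggests), a metrics entry for an external call could be filled like this:

long sentAt = System.currentTimeMillis();
int statusCode = callExternalApi();                 // hypothetical external request
long receivedAt = System.currentTimeMillis();

MetricsLog metric = new MetricsLog();
metric.setTitle("Car types API");
metric.setUrl("http://localhost:9200/api/car/types");
metric.setStatusCode(statusCode);
metric.setSentRequestAtMillis(sentAt);
metric.setReceivedResponseAtMillis(receivedAt);
metric.setElapsedTimeInMS(receivedAt - sentAt);
metric.setElapsedTimeInSeconds((receivedAt - sentAt) / 1000.0);
metric.setLogType(MetricsLogTypes.API);             // assuming an enum value like this exists
metric.setCategory("ExternalAPIs");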

Security Logs

I would also consider creating a separate security log that is indexed by the logging indexer into its own pattern or category.

This is to speed up troubleshooting related to security issues like when someone signs in, signs out, registers etc.
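The original post does not define a class for this, but a minimal security log entry could look something like the sketch below, so that the indexer can route it to its own index pattern (field names are illustrative):

public class SecurityLog {
    private String requestId;            // same request id as in LogData
    private String userId;               // the user the event concerns
    private String eventType;            // e.g. SIGN_IN, SIGN_OUT, REGISTER
    private String clientIP;
    private Instant createdAt;
}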

Aggregated log entry

This is an example where you have a main log class that contains the desired log entry data and details for a system.

A possible use case is streaming logs to CloudWatch or perhaps Elasticsearch.

public class CloudLog {
    private LocalDateTime timeStamp;
    private String logger;
    private Map<String, Object> metadata;
    private String message;
    private String level;
}
Field | Description
timeStamp | A timestamp for when the log entry was created.
logger | The logger entity name.
metadata | A map of key value pairs, full of data which can be serialized into JSON for indexing.
message | The main message of the log entry.
level | Severity level of the log entry: DEBUG, INFO, ERROR, etc.
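For example (an illustration only, assuming the setters the post omits as boilerplate and the usual java.util and java.time imports), a CloudLog entry could be built like this before being serialized to JSON and shipped to CloudWatch or Elasticsearch:

Map<String, Object> metadata = new HashMap<>();
metadata.put("requestId", "6f88dcd0-f628-44f1-850e-962a4ba086e3");
metadata.put("environmentId", "DEV");
metadata.put("appName", "Your Cool API");
metadata.put("appVersion", "2.1.7");

CloudLog entry = new CloudLog();
entry.setTimeStamp(LocalDateTime.now());
entry.setLogger("CustomerService");
entry.setLevel("INFO");
entry.setMessage("Customer info request processed");
entry.setMetadata(metadata);
// serialize entry to JSON and ship it with the log forwarder of your choice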

Spring Boot: Bean management and speeding development

Intro

In this blog post I’ll show a way to use Spring Boot functionality to create a more automated way of using beans that are grouped as components or features.

The idea is that we may have functionalities or features which we want easy and clear access to through code, so the following things should be true:

  • If I want I can use a set of beans easily
  • If I want I can use a specific bean or beans within the previous set of beans
  • It should be easy to tell Spring which beans to load, preferably with a one liner
  • Configuration of beans should not be hidden from a developer; the developer should be notified if a configuration is missing for a required bean (by configuration I mean application properties)
  • A bean or set of beans should be usable from a common library in such a way that merely referencing the library in a project does not automatically create the beans, which would create mandatory dependencies that could break the other project’s code and/or add functionality that is not required

All of the above will happen if the following three things are created and used properly within a code base:

  1. Custom annotations to represent features or functionalities by tagging wanted code
  2. Usage of component scan to load up the wanted features or functionalities based on the set annotations
  3. Usage of properties classes which extend from a properties base class handling application properties dependencies and configuration logic and logging

Notice: I assume that you are familiar with Java and Spring Boot, so I’ll skip some of the minor details regarding the implementation.

Implementation

Custom annotation

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface MyFeature {
   
}

To use this annotation, apply it to a bean creation method which you want the component scan to pick up.

@Bean(name = "MY_FEATURE_BEAN")
@Autowired
@Profile({"primary"})
@MyFeature
public MyFeatureClass createMyFeatureBean(MyFeatureProperties myfeatureProperties) {
    MyFeatureClass myFeature = new MyFeatureClass(myfeatureProperties);
    // Do something else with the class

    return myFeature; // Return the class to be used as a bean
}

You can also apply it directly to a class, in which case the class itself is used to create the bean.

Component Scanning

You can use the Spring Boot component scanning in many different ways (I recommend looking at what the component scan can do).

In this example it is enough to tell the component scan which annotation to include; notice that you have to create a configuration class for this to work:


@Configuration
@ComponentScan(basePackages = "com.my.library.common",
        includeFilters = @ComponentScan.Filter(MyFeature.class))
public class MyFeaturesConfiguration {
}

Extended properties configuration

For this example we need two things to happen for the custom properties configuration and handling/logging to work:

  1. Create a properties class that represents a set of properties for a feature or a set of features and/or functionalities
  2. Extend it from a base properties class that will examine each field in the class and determine if a property has been set, not set or if it is optional.

What we want to achieve here is to show a developer which properties of a feature or functionality are missing and which are not. We don’t show the values, since they may contain sensitive data; we only list ALL of the properties in a properties class, whether they have values set or not. This shows a developer all the needed fields and which of them are invalid, including optional properties.

This approach will significantly decrease a developer’s or a system admin’s daily workload: you won’t have to guess what is missing. Combined with good documentation at the property level of a configuration class, you should easily figure out what needs to be set.

BaseProperties class

Extend this class in all classes where you want to define properties.

import com.sato.library.common.general.exceptions.SettingsException;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.util.StringUtils;

import javax.annotation.PostConstruct;
import java.lang.reflect.Field;
import java.util.Optional;

public class BaseProperties {
    @PostConstruct
    private void init() throws Exception {
        boolean failedSettingsCheck = false;
        StringBuilder sb = new StringBuilder();

        // Go through every field in the class and log its status if it has problems (missing property value). NOTICE: A report of the settings properties is only logged IF a required field is not set
        for (Field f : getClass().getDeclaredFields()) {
            f.setAccessible(true);
            String optionalFieldPostFixText = " ";
            boolean isOptionalSetting = false;
            String classConfigurationPropertyFieldPrefixText = "";

            // Check to see if the class has a @ConfigurationProperties annotation; if so, add the defined property path prefix to the logging
            if (getClass().getDeclaredAnnotation(ConfigurationProperties.class) != null) {
                final ConfigurationProperties configurationPropertiesAnnotation = getClass().getDeclaredAnnotation(ConfigurationProperties.class);
                if (!StringUtils.isEmpty(configurationPropertiesAnnotation.value()))
                    classConfigurationPropertyFieldPrefixText = configurationPropertiesAnnotation.value() + ".";

                if (StringUtils.isEmpty(classConfigurationPropertyFieldPrefixText) && !StringUtils.isEmpty(configurationPropertiesAnnotation.prefix()))
                    classConfigurationPropertyFieldPrefixText = configurationPropertiesAnnotation.prefix() + ".";
            }

            // Check to see if this field is optional
            if (f.getDeclaredAnnotation(OptionalProperty.class) != null) {
                optionalFieldPostFixText = " - Optional";
                isOptionalSetting = true;
            }

            // Check to see if a settings field is empty; if so, mark the check as failed and log the situation
            if (f.get(this) == null || (f.getType() == String.class && StringUtils.isEmpty(f.get(this)))) {
                // Empty fields are allowed if they are marked as optional
                if (!isOptionalSetting) {
                    failedSettingsCheck = true;
                }
                sb.append(classConfigurationPropertyFieldPrefixText + f.getName() + ": Missing" + optionalFieldPostFixText + System.lineSeparator());
            } else {
                // If the field is OK then mark that in the logging to give a better overview of the properties
                sb.append(classConfigurationPropertyFieldPrefixText + f.getName() + ": OK" + optionalFieldPostFixText + System.lineSeparator());
            }
        }

        // If even one required setting property is empty then stop the application execution and log the findings
        if(failedSettingsCheck) {
            throw new SettingsException(Optional.of(System.lineSeparator() + "SETTINGS FAILURE: You can't use these settings values of " + this.getClass() + " without setting all of the properties: " + System.lineSeparator() + sb.toString()));
        }
    }
}

Optional Annotation for optional properties

Use the following annotation to mark optional properties in properties classes. A missing optional property is then not treated as a fatal error that stops the execution of the application; it is only reported.

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
public @interface OptionalProperty {
}

Using all of the above

@ConfigurationProperties(prefix = "myfeature")
@MyFeature
public class MyFeatureProperties extends BaseProperties {
    @OptionalProperty
    private String secretKey;
    private String region;

    public String getSecretKey() {
        return secretKey;
    }

    public void setSecretKey(String secretKey) {
        this.secretKey = secretKey;
    }


    public String getRegion() {
        return region;
    }

    public void setRegion(String region) {
        this.region = region;
    }
}

Notice: In the usage example code above I do not put a @Configuration annotation on the class. This is because the component scan will pick up this class and automatically treat it as a configuration class because of the @ConfigurationProperties annotation. Yes, this is a bit of a trick, but it works nicely.
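To tie the pieces together, here is a hedged sketch of how a consuming project could enable and use the feature. The service class below is hypothetical; MyFeaturesConfiguration, MyFeatureClass and the bean name come from the examples above, and the "primary" profile from the bean definition is assumed to be active:

// The "one liner": importing the feature configuration enables the component scan for @MyFeature beans
@Configuration
@Import(MyFeaturesConfiguration.class)
public class AppConfiguration {
}

@Service
public class CustomerService {

    private final MyFeatureClass myFeature;

    // The bean created by createMyFeatureBean() is injected here by its name
    public CustomerService(@Qualifier("MY_FEATURE_BEAN") MyFeatureClass myFeature) {
        this.myFeature = myFeature;
    }
}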

My Kubernetes Cheat Sheet, things I find useful everyday

Hi,

Here is a list of my personal most used and useful commands with Kubernetes.

kubectl config current-context # Get the Kubernetes context you are operating in

kubectl get services # List all services in the namespace
kubectl get pods # Get all pods
kubectl get pods --all-namespaces # List all pods in all namespaces
kubectl get pods -o wide # List all pods in the namespace, with more details
kubectl get deployment my-dep # List a particular deployment
kubectl get pods --include-uninitialized # List all pods in the namespace, including uninitialized ones

kubectl describe nodes my-node
kubectl describe pods my-pod
kubectl describe deployment my-dep

kubectl scale --replicas=0 deployment/my-dep # Scale a deployment down to zero instances
kubectl scale --replicas=1 deployment/my-dep # Scale a deployment up to the desired number of instances

kubectl set image deployment/my-dep my-container=my-image --record # Update the image of the given deployment's containers

kubectl apply -f my-file.yaml # Apply a Kubernetes configuration file (secrets, deployments, etc.)

kubectl logs -f --tail=1 my-pod # Follow the pod's output, starting from the last line

kubectl exec my-pod -- printenv | sort # Print all environment variables from a pod and sort them

kubectl get deployment my-dep --output=yaml # Print the YAML configuration the deployment is using

kubectl get pod my-pod --output=yaml # Print the pod related configuration it is using

kubectl logs -p my-pod # Print the logs of the previous container instance, you can use this if there was a crash

kubectl run -i --tty busybox --image=busybox --restart=Never -- sh # Run a busybox pod for troubleshooting

More useful commands: https://kubernetes.io/docs/reference/kubectl/cheatsheet/

Helper Scripts for Docker, git and Java developers

Hi,

Here are some of my own scripts that I use when developing to ease my life:

Building a Java Gradle project, then building a docker image and pushing it

./gradlew test
if [ $? -eq 0 ]; then
    echo Tests OK
    gradle clean
    gradle generateGitProperties
    gradle bootRepackage
    ./cleandocker.sh
    docker rmi {your image name + tag}
    docker build -t {your image name + tag} .
    ./dockerregistrylogin.sh
    docker push {your image name + tag}
else
    echo Tests Failed
    exit 1
fi

Clean docker from all running containers and stopped ones

echo "Stoping all containers"
docker stop $(docker ps -a -q)
echo "Removing all containers"
docker rm $(docker ps -a -q)
echo "Starting dev environment"

Commit your code to git after Gradle tests are successful

./gradlew test
if [ $? -eq 0 ]; then
    echo Tests OK
    git add .
    git commit -m "$1"
    git push
else
    echo Tests Failed
    exit 1
fi

Merge your branch with your master

git checkout master
git pull origin master
git merge dev -m "$1"
git push origin master
git checkout dev

This one is for AWS Developers to run and get the AWS ECR docker login

#Notice: To use a certain profile for login define additional profiles like this: aws configure --profile awscli

function doAwsDockerRegistryLogin()
{
    local  myresult=$(aws ecr get-login --no-include-email --region eu-central-1 --profile awscli)
    echo "$myresult"
}

result=$(doAwsDockerRegistryLogin)   # or result=`doAwsDockerRegistryLogin`
eval $result

 

Ethical Hacking: Terminology – Part 1

I’ve started a new course on ethical hacking to get a better understanding of the internet, software security, personal security etc.

I’ll post a series of posts where I will write down my notes on what I’ve learned.

I’ll start today with some basic terminology:

Term | Description
White Hat Hacker | People that hack to help others; legal and ethical.
Black Hat Hacker | Unethical and illegal activities.
Grey Hat Hacker | Somewhere between white hat and black hat.
Footprinting | Information gathering on your target or task: figuring out network related information, software related details, or getting information from real world things or people. General information gathering in regard to your chosen target.
DoS (just you) | Denial of Service. One person sends more requests than the server can handle, trying to make the service crash. Servers can handle only a certain number of requests, so requests that do not fit into the request pool limit are dropped. If the attack comes from only one location/machine, this should normally not be enough to take a service down.
DDoS (multiple people) | Distributed Denial of Service. When multiple computers/machines perform the attack, it is much harder for the software to know which requests to drop.

 

The attack itself is not hard to do, but the preparation is. You need multiple machines, and to get them you usually have to infect other computers to create a bot farm.

RAT | Remote Administration Tool. For DDoS attacks you need software that can be distributed to other computers. It gives you control of a computer and allows you to hide your identity. The operations are not visible to a normal user; you can even hide them so that they do not show up in normal operating system diagnostic tools.
FUD (anti-virus cannot detect it) | Fully Undetectable. Also needed for DDoS attacks; not labeled as malicious by anti-virus programs.
Phishing | Placing a bait and waiting for someone to act on it. Example: you get an email and click on a link in it; either it uploads something malicious or you do something that compromises your data or security.

 

Usually these are done so that the links look authentic but once you click on them you are redirected to some other server, which is not the one you would expect.

 

An easy way to spot these kinds of addresses is to look at the address itself. If it is not an HTTPS address then you may be dealing with a false one; HTTPS addresses are much harder to fake.

SQL Injection | Passing SQL queries in HTTP requests, allowing SQL commands to run on a server to read or alter data that is not otherwise intended to be seen or used.
VPN | Virtual Private Network. Routes and encrypts traffic between you and the VPN server/provider; a way of anonymizing yourself.

 

There is no real easy way to identify you unless the VPN Provider gives up your identity.

Proxy | A less reliable way of staying anonymous. You can route your traffic through many proxies, but the more proxies you chain the harder it gets to add new ones, mostly because of internet speed limitations and lack of available bandwidth. It will slow down your actions.

 

You can use free or paid proxies, but paid ones leave a trace of who you are.

Tor | Open source. Another way to hide your identity; faster than proxies but slower than a VPN. Routes traffic through different routes, routers and places to hide your trail.

 

There is a very high chance of staying hidden (99.99%); there are tools and ways to find you, but it is highly unlikely.

VPS | Virtual Private Server. A “security layer”, for example a virtual machine inside a physical machine that serves as the database server for your web server. This is done so that the database is not directly accessible from the outside.

 

This way you can be specific about who can access that virtual machine and from where.

Key Loggers | Tools used to extract information from a machine. They need to be deployed to the target machine, where the tool gathers keystrokes and sends that information to a location for analysis.

 

Key loggers can extract existing information as well; you can modify the settings of a key logger (what, where and how to act), take screenshots, use a device’s camera, microphone etc.

Terminal | An interface for controlling your operating system. GUI tools are not nearly as powerful as terminal tools.

 

Most hacking tools are designed for the terminal. Once you know how to do it in the terminal, you’ll know how to do it in the GUI.

Firewall | On Linux, a firewall is configured through iptables commands.

 

The Linux firewall is open source and has a HUGE amount of options. On Windows you have some of these options by default, but you will need to buy a package or application to get more.

Rootkit | A collection of computer software, typically malicious, designed to enable access to a computer or areas of its software that would not otherwise be allowed (for example, to an unauthorized user), and which often masks its existence or the existence of other software.
Reverse shells | There are thousands of reverse shells. A program infects another device and opens a reverse connection from that device back to you, so you can keep controlling the external device.

 

Usually you need to break through a router first and reconfigure it to give you more access to a network and machines.

Installing Apache Spark Cluster on Odroid C2 and Raspberry Pi

Contents

Summary
Network configurations
Odroid C2 Ubuntu 16.04 Mate
Static IP
Setting the Hostname
Raspberry Pi
Static IP
Setting the Hostname
Common configurations
Hosts file editing
Disabling IPv6
Firewall
Creating an admin user for Spark
Enable SSH communication between nodes
Swap
Installing other packages for Spark
Java Installation
Installing Scala
Scientific Python installation
Installing Apache Spark
Spark Configurations
Slaves
log4j.properties
spark-defaults.conf
spark-env.sh
Bash configurations
Starting and Stopping Spark
Submitting work to the cluster

Summary

For this Spark cluster installation, I used the following packages and tools:

· Oracle Java SDK 8

· Scala 2.12.1

· Apache Spark 2.1.0

If you encounter URLs pointing to other versions than those listed above, ignore the version numbers. I was just lazy about changing my own notes, which I have gathered from several sources, so change the file names and URLs to correspond to the versions you want and need.

Before I go on describing the process, I just want to mention that while it has been a bit difficult doing this installation, this was fun once I got it to work.

My biggest problem was that I come from the Microsoft ecosystem, where most things are pretty much behind a UI and you don’t necessarily have to understand or know that much.

Is this good or bad? Well it depends, sometimes having a button that does things is nice but on the other hand, it takes away from actually knowing what you are doing. For example, managing users, privileges, file system etc is a totally different thing on Linux and you actually have to know what you are doing.

I found the experience with Linux very fun and enjoyable. What I struggled with most was Spark and Hadoop (not the topic of this post). It was difficult to understand which configurations are needed. I had problems getting Hadoop to do anything, and the errors were, as usual with any software, obscure, or in other words a pain in the ass.

I really had to focus and want to make it to work. I felt like giving up at times.

Anyway, learning to use Linux was the most fun; so much fun that I ended up installing a Linux distro dual booting with Windows.

Network configurations

Odroid C2 Ubuntu 16.04 Mate

In my situation, I did the configurations through the UI but you can do it through the terminal.

Static IP

To configure a static IP go to: System > Preferences > Internet and Network > Network Proxy

There, go to the IPv4 Settings and add the desired network information for the Odroid C2.

Setting the Hostname

Go to: System > Administration > Network

Raspberry Pi

Static IP

In the top right corner, right-click the network indicator (the two arrows pointing up and down). Then select “Wireless & Wired Network Settings”. Select Interface and eth0, and configure the network settings you desire.

Setting the Hostname

Go to: “Start” icon > Preferences > Raspberry Pi Configuration > System tab > Hostname

You might need to restart your system.

Common configurations

Hosts file editing

This configuration must be done on every machine. You will need hostnames for your machines so that Spark and Hadoop can properly communicate between the machines (nodes) in your cluster.

Type in terminal:

sudo nano /etc/hosts

Add your machine IPs and desired hostnames. My hosts file looks like this:
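(The original screenshot is not included here; the snippet below is a reconstructed example using the hostnames from later in this post. The Raspberry Pi addresses are placeholders for your own network.)

127.0.0.1       localhost
192.168.10.65   odroid64
192.168.10.66   raspberrypi01
192.168.10.67   raspberrypi02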

Notice that I have removed everything else and left the localhost definition. This is a strange thing: with Spark I could not get the workers to communicate properly without the localhost definition, but with Hadoop it was the other way around; I am still not sure why that is.

Disabling IPv6

After looking at several tutorials on installing Hadoop, all of them mentioned that it is good practice to disable IPv6 support. Apparently Hadoop does not support IPv6 properly or at all. Since Apache Spark can work on top of Hadoop, I applied the same method for Spark as well.

Open the /etc/sysctl.conf file for editing and add the following at the end of the file:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
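To apply the change without rebooting, you can reload the sysctl settings (or simply reboot the machine):

$ sudo sysctl -p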

Firewall

In case you have communication problems between the nodes in the cluster, disable the firewall or allow the nodes to communicate with each other on the needed ports.

Creating an admin user for Spark

In terminal create the user, add it to a group and give it admin privileges:

sudo addgroup spark
sudo adduser --ingroup spark spuser
sudo adduser spuser sudo

Log in as this user and do everything related to Spark with it:

su spuser

Enable SSH communication between nodes

This is to avoid password authentication when using Spark. If you do not do this you might run into problems, and you will also constantly be asked to type in the password of the account you want to run Spark as.

Next create the SSH key and add it to the authorized_keys file.

$ cd ~
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Verify that the SSH tunnel is working:

ssh localhost

Copy the public key to the slave nodes in your cluster:

$ ssh-copy-id spuser@raspberrypi01

$ ssh-copy-id spuser@raspberrypi02

Then test the connection to the slaves:

$ ssh raspberrypi01

Swap

This is something I had to do for the Odroid C2. Even though the Odroid has double the RAM of the Raspberry Pi, it also has Ubuntu installed, which takes up nearly half the memory at boot. Raspbian takes a little over 150 MB, about 15 % of the Raspberry Pi’s total RAM.

I ran into problems when I wanted to use the Odroid as the master and also as a slave node for calculation. Because I wanted to use as much memory as possible on the actual Raspberry Pi nodes, I allocated 768 MB per worker, which is OK for the Pis, but I could not allocate less than 512 MB for the Odroid, and allocating 512 MB caused the Odroid to need swap; since no swap existed, the OS crashed or became unresponsive.

To combat this problem I created a swap file for the Odroid; the size of the swap was double the RAM, 4 GB:

The Swap creation guide below is from:

http://www.tutorialspoint.com/articles/how-to-enable-or-add-swap-space-on-ubuntu-16-04

Checking for the Swap Information

Before we begin, we will first check the swap space available on the server or system.

We can use the command below to see whether the system already has a swap partition or not:

$ free -h

We can also run the command below, but if no swap partition exists we will not see any information:

$ sudo swapon -s

If the above commands show no output, swap is not enabled or configured on this server. To configure swap on this machine, we first check the free disk space available with the command below:

$ df -h

Creating a Swap File

Now that we know the available disk space, we can go ahead and create a swap file on the filesystem. To create the swap file we can use ‘fallocate’, a utility which can instantly create a file of a preallocated size. This example server has little space, so it creates a swap file of 512 MB; adjust the size to your needs (4 GB in my Odroid case). Below is the command.

$ sudo fallocate -l 512M /swapfile

To check the swap file we use the command below:

$ ls -lh /swapfile
-rw-r--r-- 1 root root 512M Sep 6 14:22 /swapfile

Enabling the Swap to use the Swap File

Before we enable the swap, we need to fix the file permissions so that no one other than root can read/write the file. Below is the command to change the file permissions.

$ sudo chmod 600 /swapfile

Once we have changed the permissions, we execute the command below to check the swap file permissions.

$  ls -lh /swapfile
-rw------- 1 root root 512M Sep  6 14:22 /swapfile

We now make this file into swap space using the command below:

$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 524284 KiB
no label, UUID=d02e2bbb-5fcc-4c7b-9f85-4ae75c9c55f9

Now we enable the swap and verify it with the commands below:

$ sudo swapon /swapfile
$ swapon -s
Filename                                Type            Size    Used    Priority
/swapfile                               file            524284  0       -1

We can also check the swap partition with the free -h command:

$ free -h

Making the Swap Partition/File Permanent

In the steps above we created the swap file and can use it as temporary memory, but once the machine is rebooted the swap setting will be lost. To use this swap file permanently, we make it permanent in fstab.

We edit /etc/fstab and add the information to mount the swap file even after the machine is rebooted:

$ sudo  vi /etc/fstab

Add the below line to the existing file.

/swapfile            none     swap     sw         0            0

For better performance for using the swap memory, we can do some tweaks.

 

Installing other packages for Spark

For my installation, I needed a few other packages to make things work:

  • Oracle Java version 8
  • Scala

Java Installation

 

For a more automatic installation, type the following commands:

$ sudo apt-get install oracle-java8-jdk

 

$ sudo apt-get update && sudo apt-get install oracle-java8-jdk

 

$ sudo update-alternatives --config java

 

For a more manual installation, go to the Oracle website and download the Linux ARM 64 Hard Float ABI package: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

 

Notice that my Raspberry Pi with Raspbian from December had the 32 bit Java 7 version installed. I used the 64 bit Java 8 for the Odroid C2 and the 32 bit Java 8 for the Raspberry Pi.

When you have the package, do the following:

Enter the command to extract jdk-8-linux-arm-vfp-hflt.tar.gz to /opt directory.

$ sudo tar zxvf jdk-8-linux-arm-vfp-hflt.tar.gz -C /opt

 

Set the default java and javac to the newly installed JDK 8:

$ sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0/bin/javac 1

$ sudo update-alternatives --install /usr/bin/java java /opt/jdk1.8.0/bin/java 1

$ sudo update-alternatives --config javac

$ sudo update-alternatives --config java

 

Finally, verify the installation with the -version option:

$ java -version

$ javac -version

I also made the spark user the owner:

$ sudo chown -R spuser:spark jdk1.8.0/

Notice: To make life easier you should add environmental variables to your bashrc file. More on this in the Spark installation portion.

Installing Scala

Navigate to the following URL: http://www.scala-lang.org/download/

I downloaded the tar package, extracted it to a location and added proper privileges to the spark user:

$ sudo tar zxvf scala-2.12.1.tgz -C /opt

$ sudo chown -R spuser:spark scala-2.12.1/

 

Notice: To make life easier you should add environmental variables to your bashrc file. More on this in the Spark installation portion.

Scientific Python installation

 

This is not a requirement but I used these scripts to install Jupyter and Python 3.5 on my cluster nodes:

https://github.com/kleinee/jns

Installing Apache Spark

 

Start by downloading your desired package from the Apache Spark site: http://spark.apache.org/downloads.html

Or use wget: wget http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.tar.gz

Then extract the Spark package:

$ sudo tar -xvzf spark.2.1.0.tar.gz -C /opt/

Add the Spark user as the owner:

$ cd /opt
$ sudo chown -R spuser:spark spark.2.1.0/

If you are having problems with access to the Spark folder, or if you are installing Hadoop and have configured namenode file system locations etc., use the following command to add more privileges to the desired locations:

$ sudo chmod 750 /opt/hadoop/hadoop_data/hdfs

Spark Configurations

Go to the Spark conf folder, depending on where you installed Spark:

$ cd /media/microSD/spark/conf

There are four files I had to configure for the cluster to work:

  • slaves
  • log4j.properties
  • spark-defaults.conf
  • spark-env.sh

For more info on these files check out Spark documentation:

http://spark.apache.org/docs/latest/configuration.html

http://spark.apache.org/docs/latest/spark-standalone.html

The first step is to rename some of the files mentioned above. Some of the files have the “.template” file extension on them, remove it:

$ mv log4j.properties.template log4j.properties

If the slaves file does not exist, nano will create it automatically:

$ sudo nano slaves

The above assumes you are in the conf folder.

When you are done with the configurations on your master, just copy them with scp to all slave nodes. Make sure you change the node specific values in these files (more on this below).

Slaves

Here you add the hostnames of the slave machines, i.e. the machines you want to do the work (the calculations) for you:

odroid64

raspberrypi01

raspberrypi02

log4j.properties

With this file all we want to do is minimize the amount of logging on screen; it is much easier to spot what is going on when you are not flooded with basic operational messages.

What you need to do is change this:

log4j.rootCategory=INFO, console

to this:

log4j.rootCategory=WARN, console

spark-defaults.conf

Here we just want to specify the master URL so that we do not always have to specify it when submitting work to the Spark cluster:

spark.master                     spark://odroid64:7077

spark-env.sh

Here you specify the parameters that your cluster will use to communicate with all the nodes within it:

SPARK_MASTER_IP=odroid64

SPARK_WORKER_MEMORY=512m

SPARK_MASTER_HOST=odroid64

SPARK_LOCAL_IP=odroid64

SPARK_WORKER_CORES=2

SPARK_DAEMON_MEMORY=512m

SPARK_EXECUTOR_INSTANCES=1

SPARK_EXECUTOR_CORES=2

SPARK_EXECUTOR_MEMORY=512m

SPARK_DRIVER_MEMORY=512m

There are many variables you can tweak; I used, and had to use, the ones above to get things to work.

The SPARK_MASTER_IP and SPARK_MASTER_HOST HAVE to be the same on all nodes. The rest have to correspond to the actual physical node where the configuration file resides.


Bash configurations

 

For my cluster I used the following configurations (disregard the Hadoop ones, they are not necessary for Spark):

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121

export HADOOP_HOME=/media/microSD/hadoop-2.7.3

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

#export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/media/microSD/hadoop-2.7.3/lib/native/"

#export PATH=$PATH:$JAVA_HOME/bin:/media/microSD/spark/sbin:/media/microSD/spark/sbi

 

export SBT_HOME=/media/microSD/sbt

export SPARK_HOME=/media/microSD/spark

export SCALA_HOME=/media/microSD/scala-2.12.1

export PATH=$PATH:$JAVA_HOME/bin

export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

export SPARK_MASTER_URL=http://192.168.10.65:7077

 

The important ones for Spark are:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121

export SPARK_HOME=/media/microSD/spark

export SCALA_HOME=/media/microSD/scala-2.12.1

export PATH=$PATH:$JAVA_HOME/bin

export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

export SPARK_MASTER_URL=http://192.168.10.65:7077

You want to add Spark, Scala and Java to the PATH environmental variable to be able to access commands easily from Terminal.

You can then copy these configurations to the slave nodes using scp command:

http://www.hypexr.org/linux_scp_help.php

$ scp ~/.bashrc spuser@raspberrypi01:~

 

To force a refresh of the environmental variables

$ source ~/.bashrc

Starting and Stopping Spark

This is simple.

To start Spark type:

$ start-all.sh

To stop Spark:

$ stop-all.sh

To access Spark web UI:

http://odroid64:8080/

 

Submitting work to the cluster

 

Use the Spark specific command:

spark-submit
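For example, to run the SparkPi example that ships with the Spark distribution against this cluster (the jar path below matches a Spark 2.1.0 package built with Scala 2.11; adjust the path and version to the package you actually installed):

$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://odroid64:7077 $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100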

More info:

http://spark.apache.org/docs/latest/submitting-applications.html