On anything that has to do with software development

Bash scripting


In this post I am going to talk about writing bash scripts. This is an important skill to have because it saves time, especially with repetitive tasks. I am going to cover what a bash script is and some use cases.

What is bash scripting?

In order to understand what a bash script is, one needs to understand what a script is.

A script is a written text of a play, film or broadcast.

In the context of a play, a script tells an actor what to say and do, and the actor does what the script says. This is why we have related terms such as manuscript and prescription. Taking prescription as an example, it is a set of instructions from a doctor that authorizes a patient to be issued with medicine or treatment. The patient only gets what is on the prescription. From these explanations we can see that there is no deviation from the script: what is on the script is what is done.

Now, bearing the explanation above in mind, a bash script is a text file intended for the bash shell, and its text is a series of commands. In other words, these commands tell the bash shell what to do. The command mkdir, for example, makes a directory and, like any other command, it can be used in a bash script.

If the bash file has the following, as an example,

mkdir tutorial

it is telling the bash shell to make a directory called tutorial. Therefore, the day-to-day commands you use on the terminal can be used to write a bash script, and the commands you find in a bash script can be used on the terminal as well.

How to write and use bash scripts

Now that we have an understanding of what a bash script is, the next thing we want to know is how to write and use one. In the context of Linux, the conventional extension for a shell script is .sh. If you see a file with a .sh extension, it is most likely a shell script, although the extension is only a convention and does not by itself make the file executable. When writing a bash script, the first line to write is:

#!/bin/bash

This indicates that you want to use the bash shell to execute the commands.

A little background
The character sequence consisting of the hash and exclamation mark (#!) is known as a shebang. In the Unix/Linux world, the shebang is an interpreter directive: it tells the system which program should be used to execute the file. Some examples of shebangs are as follows:
#!/bin/sh – Execute the file using the Bourne shell, or a compatible shell, with path /bin/sh
#!/bin/bash – Execute the file using the Bash shell.
#!/bin/csh -f – Execute the file using csh, the C shell, or a compatible shell, and suppress the execution of the user’s .cshrc file on startup
#!/usr/bin/perl -T – Execute using Perl with the option for taint checks
#!/usr/bin/env python – Execute using Python by looking up the path to the Python interpreter automatically via env
#!/bin/false – Do nothing, but return a non-zero exit status, indicating failure. Used to prevent stand-alone execution of a script file intended for execution in a specific context, such as by the . command from sh/bash, source from csh/tcsh, or as a .profile, .cshrc, or .login file.
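To see which interpreter a script asks for, it is enough to read its first line. Here is a minimal sketch of that idea. The function name interpreter_of and the sample file are made up for this example and are not part of any standard tool.

```shell
#!/bin/bash
# Print the interpreter named on a file's shebang line, or nothing if
# the file does not start with "#!".
interpreter_of() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!'*) printf '%s\n' "${first_line#??}" ;;  # drop the leading "#!"
    esac
}

# Hypothetical sample script, written to a temporary file.
sample=$(mktemp)
printf '%s\n' '#!/bin/bash' 'mkdir tutorial' > "$sample"

interpreter_of "$sample"   # prints /bin/bash
```

For a file without a shebang the function prints nothing, which is a reminder that the directive is optional and only matters when the file is executed directly.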

Whatever follows the shebang is the intent of the file, that is, what you want to accomplish when the file is executed. Using the above example again:

mkdir tutorial

Save the file as make_directory.sh. After saving the file, make it executable by changing its mode with chmod 755 make_directory.sh or chmod +x make_directory.sh. You can then run it with ./make_directory.sh. The reason for the ./ prefix is that typing just make_directory.sh will not execute the script: the shell looks for commands in the directories listed in the PATH variable, and the current directory is normally not one of them. Bear in mind that in Linux “.” refers to the current directory, so by running ./make_directory.sh we are saying “in this directory there is an executable file called make_directory.sh, run it”.
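Putting the steps together, here is a small sketch that writes the script, makes it executable and runs it. It works inside a temporary directory, which is my own addition so that the example does not leave files lying around.

```shell
#!/bin/bash
# Work in a throwaway directory so the example leaves nothing behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the two-line script described above.
cat > make_directory.sh <<'EOF'
#!/bin/bash
mkdir tutorial
EOF

# Make it executable, then run it from the current directory.
chmod +x make_directory.sh
./make_directory.sh

# The script should have created the tutorial directory.
ls -d tutorial
```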

Use cases

Making a directory is not all that you can do with bash scripts. One use case is automating repetitive tasks, such as submitting a Spark job, connecting to a Kubernetes pod, getting pods and so forth. I am going to add some of the scripts that I use to my GitHub repository. Feel free to look through them; you are welcome to add to them or change them.
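As a rough idea of what such a script can look like, here is a sketch that builds the command for listing pods in a few namespaces. The function name get_pods_command and the namespaces dev and staging are made up for this example, and the script only prints the commands so that it works on a machine without kubectl.

```shell
#!/bin/bash
# Build the kubectl command for listing pods in a namespace.
# The namespace names used below are made up for this example.
get_pods_command() {
    printf 'kubectl get pods --namespace %s\n' "$1"
}

# Print the command for each namespace instead of running it, so the
# sketch works on a machine without kubectl or a cluster.
for namespace in dev staging; do
    get_pods_command "$namespace"
done
```

To actually list the pods, the body of the loop would call kubectl get pods --namespace "$namespace" directly instead of printing it.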


This was a high-level overview of what a bash script is, along with a few example use cases. Bash scripts are becoming more and more important, especially with the rise of containerization. They assist in automating tasks and they can also be used to execute commands when, for example, a Docker container is started. In upcoming posts I will look at variables, functions, loops and so forth.




Cannot cast exception while reading data from Hive in Spark


In this post I am going to talk about an exception I got when trying to read data from Hive using Spark, how I debugged the issue and how I resolved it. I will also explain how to reproduce the issue, since knowing how it arises makes it easier to avoid. The exception was:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritable

Spark code

Here is the snippet of the code that I had in Spark:

From the code snippet above, all I am doing is a simple select from a table called geonames_data which is in a database called tendai_test and I just want to read only 5 records. The exception is thrown when trying to retrieve the data using the show function.


Now, my first instinct was to check the schema of the table. Using the DataFrame’s printSchema function, I can see the fields and their datatypes as shown below:

From the schema above I can see all the fields and their datatypes, and everything is what I expect. However, there is no place in the query I am using to retrieve the data where I am doing a cast. Therefore, there has to be another place to check for a difference. The next thing I did was to read the schema from ORC, as shown below:

From the schema above, I noticed that there is a difference in datatypes for a field called modificationdate: in ORC it is a string, whereas in Hive it is a date.


A quick solution is to back up the table in Hive as follows:

create table tendai_test.geonames_data_backup as select * from tendai_test.geonames_data;

After backing up the data, drop the original table as follows:

drop table tendai_test.geonames_data purge;

Once the table is dropped, recreate the geonames_data table using the original script, which is as follows:

create table tendai_test.geonames_data(
geonameid int,
name string,
asciiname string,
alternatenames string,
latitude int,
longitude int,
featureclass string,
featurecode string,
countrycode string,
cc2 string,
admin1code int,
admin2code int,
admin3code int,
admin4code int,
population int,
elevation string,
dem int,
timezone string,
modificationdate date);

Once that is done, populate the table using the backed up data from geonames_data_backup as follows:

insert into tendai_test.geonames_data select * from tendai_test.geonames_data_backup;

Back to the Spark code

If I run the Spark code again, I no longer get the exception, as shown below:

I just made a minor modification to the code by selecting only a few fields.

Reproducing the issue

Using the backup table from the above, we can recreate the issue as follows:


The lesson here is that when writing data in Spark, make sure that the datatypes are not changed. When doing a cast, check first that it is to the right datatype.

Doing this without checking might slow down production, especially when working in a big team. Say someone is reading data from geonames_data and expects the datatype of modificationdate to be date, then all of a sudden there is a cast exception. Getting to the bottom of the issue and resolving it will slow down progress.


This is what I experienced and I just wanted to share it. In future posts, I will talk about Hive, since there are some concepts in this post that I didn’t explain, for instance ORC and DataFrames.

Apache Spark 2


In the article, Apache Spark 1, I gave a high-level overview of what Apache Spark is and its architecture. I touched on the fact that at the core of Apache Spark there is a very advanced Directed Acyclic Graph (DAG) data processing engine. In this article I will look into the DAG and how it works in Spark.

What is a Directed Acyclic Graph (DAG)?

Spark is efficient due to its advanced Directed Acyclic Graph (DAG) data processing engine. To understand how DAG works, we have to break it down and define each word.

Graph, what does it mean? It is a diagram representing a system of connections or interrelations among two or more things by a number of distinctive dots, lines, bars and so forth. In our case, which is the DAG, the graph is a representation of nodes that are connected by what are known as edges.

Acyclic, what does it mean? It is an adjective that describes a graph in which there is no cycle or closed path.

Directed, what does it mean? In the context of a DAG, it means that each edge has a direction, which is represented by an arrow.

From the diagram above, we have a graph with a collection of nodes denoted by letters and connected by edges which have specific directions, hence the arrows. If we follow the movement, say from A, we will never get back to A, hence the graph is acyclic. There is no circular movement, and that is what enables dependency tracking: for instance, B depends on A, F depends on E and so forth. If we think of it as a data processing pipeline, A and G are the sources of the data, which goes through several transformation steps until it gets to F, the final result.
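Dependency tracking can be tried out on the command line with tsort, a standard Unix utility that orders the nodes of a directed graph. The sketch below is only an illustration. The edges are an assumption on my part, chosen so that A and G are sources and F is the final result, and they are not necessarily the exact layout of the diagram.

```shell
#!/bin/bash
# Hypothetical edges, one "from to" pair per line. This layout is an
# assumption that keeps A and G as sources and F as the final result.
edges='A B
B C
C E
G D
D E
E F'

# tsort prints the nodes so that every node appears after the nodes it
# depends on. Such an ordering exists only when the graph has no cycle.
order=$(printf '%s\n' "$edges" | tsort)
printf '%s\n' "$order"
```

The exact order can differ between systems, but A always comes before B and E always comes before F. If an edge from F back to A is added, the graph is no longer acyclic and tsort complains about the loop.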

Now in the context of Spark, our nodes are RDDs and the arrows are Transformations. Below is an example of how DAG works in Spark:

This is a simple word count that will be done for a ReadMe file. Visually the process will look as follows:

The textFile is our first RDD, whose elements are lines. It is transformed using flatMap into another RDD whose elements are words, and then map transforms each word into a key-value pair. The reduceByKey then sums the counts of each word. Spark comes with its own DAG visualisation tool. From that tool the DAG diagram will look as follows:
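To get a feel for the three steps without a Spark cluster, the same flow can be imitated with a plain shell pipeline. This is only a local analogy and not Spark code. The sample lines and the function name word_count are made up for this example.

```shell
#!/bin/bash
# Made-up sample lines standing in for the ReadMe file.
lines='spark is fast
spark is general'

word_count() {
    # "flatMap": split every line into one word per line.
    tr -s ' ' '\n' |
    # "map": turn each word into a (word, 1) pair.
    awk '{ print $1, 1 }' |
    # "reduceByKey": add up the counts for each word.
    awk '{ total[$1] += $2 } END { for (w in total) print w, total[w] }' |
    # Sort only to make the output order predictable.
    sort
}

printf '%s\n' "$lines" | word_count
```

The first two steps handle each line and each word on its own, while the summing step needs all occurrences of a word in one place, which is the part that corresponds to moving data between nodes in Spark.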

Now, if we look at the first DAG diagram, where we were using letters as nodes, we established that letters A and G represent the sources of our data. The final result is then letter F, after the data has gone through some transformations. The same applies to the above example: the ReadMe file is the source, and as the data moves from Stage 0 to Stage 1 it goes through some transformations. But then one would ask, why the split between Stage 0 and 1? Why is reduceByKey in its own stage?

As I mentioned earlier, one of the advantages of a DAG is the tracking of dependencies. The textFile has to be flatMapped -> mapped -> reducedByKey. Each line from the ReadMe file is transformed into words, and each word does not have any dependency on the other words. However, when we want to find the number of occurrences of each word, identical words have to be brought together, which means moving data around. In Spark, this is known as shuffling, and wherever shuffling is needed, Spark sets a boundary between stages. That is why reduceByKey, which requires a shuffle, is in Stage 1. Once the stages are determined, the tasks in each stage are bundled together and sent to the executors.


In this article, I explained what a DAG is in the Spark context and gave an example of how it works in Spark. This is just scratching the surface, since more is involved when running a Spark application. For instance, when a Spark application is submitted, one needs to understand how to set Spark configurations for it to run efficiently. Questions such as how much driver memory to use, how many executors are needed and how many cores each executor should have will need to be answered.

When developing a Spark application, one also needs to understand what a transformation is and what an action is in the context of Spark. Understanding these terms will help in developing an application that will be executed efficiently.

I will try to cover these concepts in the next articles on Apache Spark.

Execution plan 



Apache Spark 1


In the article, Big Data Handled Using Apache Hadoop, I looked at what Hadoop is and how it is used to handle Big Data. In conclusion I noted that Hadoop is an aggregation of different modules whose purpose is to store and process Big Data in an efficient and secure manner. One of the modules that can be used on top of Hadoop is Apache Spark. In this article, which is the first in a series on Apache Spark, I am going to give a high-level overview of what Apache Spark is.

What is Apache Spark?

Spark, an open source project, is a fast, general-purpose cluster computing platform that takes advantage of parallelism by distributing processing across a cluster of nodes. What makes Spark fast is that it does the processing in the main memory of the worker nodes, and in doing so it avoids unnecessary input and output operations with the disks. According to the Apache Spark documentation, Spark is capable of running programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk.

Matei Zaharia started Spark in 2009 as a project within the AMPLab at the University of California, Berkeley. Spark was later donated to Apache in 2013 and was promoted to a Top-Level Apache Project in 2014. Spark is one of the most active projects managed by Apache, with more than 500 contributors from across 200 organizations responsible for code and a user base of 225 000+ members. Among the contributors are well-funded corporates such as IBM, Databricks and China’s Huawei.

Spark’s Architecture

Spark comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. On top of that engine, Spark has a stack of domain-specific libraries that provide different functionalities useful for different big data processing needs, as shown by the diagram below:

  • Spark SQL enables the use of SQL statements inside Spark applications.
  • Spark Streaming enables processing of live data streams.
  • Spark MLlib enables development of machine learning applications.
  • Spark GraphX enables graph processing and supports a growing library of graph algorithms.

Spark can run in:

  • a standalone mode on a single node having a supported OS.
  • a cluster mode on either Hadoop YARN or Apache Mesos.
  • the cloud, for example on Amazon EC2, and on Kubernetes (an open source system for automating deployment, scaling and management of containerized applications).

In a distributed application such as Spark, there is a driver program that controls the execution and there are one or more worker nodes. The driver program allocates the tasks to the appropriate workers. In Spark, the driver program creates a SparkContext, which communicates with the cluster manager (this can be YARN, Mesos or Spark’s standalone manager) to run the tasks, as shown below:



This is just a high-level overview of what Apache Spark is and its architecture. In the next articles in the series, I will take a deep dive into Apache Spark.

Apache Spark Docs
Apache Spark Architecture Explained


My experience with Docker – Part 2


In the article, My experience with Docker, I touched on what Docker is and how it works, but I didn’t get to actually building a Docker container. In this article, I am going to look at how one can get started with Docker, based on the research I did and on my personal experience.


First things first, you need to install the Docker engine. In order to install Docker CE, you need the 64-bit version of either Ubuntu Artful 17.10 (Docker CE 17.11 Edge and higher only), Xenial 16.04 (LTS) or Trusty 14.04 (LTS). Here is how you install Docker CE:

  • sudo apt-get update :> updates the apt package index
  • sudo apt-get install apt-transport-https ca-certificates curl software-properties-common -y :> install packages to allow apt to use a repository over HTTPS
  • curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - :> adds Docker’s official GPG key
  • sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" :> sets up the stable repository.
  • sudo apt-get update :> updates the apt package index
  • NOTE: to install a specific version of Docker, first list the available versions with the following command:
    • apt-cache madison docker-ce
  • sudo apt-get install docker-ce=17.12.1~ce-0~ubuntu -y :> installs Docker version 17.12.1. NOTE: the version string will vary depending on the version you selected from the output of the above command.
  • sudo docker run hello-world :> confirm that Docker installation was successful.

Docker concepts

The following are concepts one needs to know before getting one’s hands dirty with Docker:
Note: for a more detailed explanation, see the Docker overview listed at the end of this article.

Docker Engine is a client-server application with the following major components:

  • A server which is a type of long-running program called a daemon process.
  • A REST API which specifies interfaces that programs can use to talk to the daemon and instruct it what to do.
  • A command line interface (CLI) client -> the docker command.

Docker Daemon (dockerd) listens for Docker API requests and manages Docker objects such as images, containers, networks and volumes. It can also communicate with other daemons to manage Docker services.

Docker client (docker) is the means by which many Docker users interact with Docker. When a command such as docker run is executed, the client sends it to dockerd, which carries it out.

Docker image is a read-only template with instructions for creating a Docker container.

Docker container is a runnable instance of an image.

Docker registry stores Docker images.

Getting started

Now that Docker is installed and we have verified that the installation works, the first step in creating a container is to create a file called Dockerfile. This file is used to build the image from which the container is created, and it holds the instructions on what is needed for the container to be operational. For instance, here is what a simple file would look like:

# Use an official Ubuntu 14.04 runtime as a parent image
FROM ubuntu:14.04

# Run echo when the container starts
CMD echo "Hi there, just started my first container!!!"

The instructions here are simple, use Ubuntu 14.04 as a parent image and then output “Hi there, just started my first container!!!” when the container launches. Once the file has been created, the next step is to build an image by running the command below:

docker build -t my-first-image .

The command will build the image with the tag my-first-image, which gives the image a friendlier name. Once the build is done, run either of the following commands to list the images you have locally:

  • docker images
  • docker image ls

From the above, we can see that the image has been created. The next step is to run the image and create a container as follows:

docker run --name my-first-container my-first-image

This will output Hi there, just started my first container!!!

Running the command above creates a container. To see it, run the following command:

docker ps -a

This command lists the containers and provides details such as container id (auto-generated by Docker), image (the image used by the container), created (time the container was created), status (whether the container is running or not), ports (the ports that are exposed for the container) and name (container name).

So far we have created a Dockerfile, the file that has the instructions on what is needed to get the container operational, run the build command to create a Docker image and finally run the command to create and start a Docker container.
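The steps so far can be collected into one small script. This is only a sketch. It writes the Dockerfile into a temporary directory, and it only calls Docker when a variable I made up for this example, RUN_DOCKER, is set to 1 and the docker client is installed. Depending on your setup you may need to run it with sudo.

```shell
#!/bin/bash
# Write the Dockerfile from this article into a throwaway directory.
build_dir=$(mktemp -d)
cat > "$build_dir/Dockerfile" <<'EOF'
# Use an official Ubuntu 14.04 runtime as a parent image
FROM ubuntu:14.04

# Run echo when the container starts
CMD echo "Hi there, just started my first container!!!"
EOF

# Only talk to Docker when explicitly asked to (RUN_DOCKER=1) and when
# the docker client is installed. RUN_DOCKER is a made-up variable for
# this sketch, not a Docker setting.
if [ "${RUN_DOCKER:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
    docker build -t my-first-image "$build_dir"
    docker run --name my-first-container my-first-image
    docker ps -a
else
    echo "Dockerfile written to $build_dir/Dockerfile"
fi
```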

The commands we ran so far are as follows:

  • docker images or docker image ls :> lists docker images
  • docker build [options] :> builds the image from the Dockerfile
  • docker run [options] :> creates and starts a container from an image
  • docker ps -a :> lists the containers

These are just a few commands and there are more. Just run docker and it will list the commands you can run, with more information on what you can do with the Docker client.


In this article we saw how one can install Docker, create a Docker image and a Docker container. The next article will look at how Docker can be part of continuous integration.

Docker overview


My experience with Docker


This is a basic introduction to what Docker is based on personal experience and what I understood from online research.

What is Docker?

One needs to understand what containerization is in order to understand what Docker is. I am going to use an example here to explain some concepts first. Let’s say you are relocating to a new place. You start to pack your stuff in different boxes or containers. One container will have plates, another coffee mugs, another electric appliances and so forth. Once all the packing is done, you are going to put these containers in a vehicle and transport your stuff to the new place.

The act of putting stuff in containers and transporting them is known as containerization.  You cannot mix your plates with electric appliances, but you can put them in different containers and transport them all at once using one vehicle.

The same applies to operating systems: they cannot mix, or in other words two of them cannot boot simultaneously when a computer is powered on. But thanks to virtualization, you can have Windows as your base operating system and have Linux running on a virtual machine. On a single laptop or computer you then have two operating systems running. Back to our moving example, let’s say one of the containers is a cooling container for keeping meat fresh. But for this container to work, it must also get some of the fuel from the vehicle that is transporting these containers. The cooling container is sharing resources with the vehicle. This is not efficient, and the same is true of virtual machines (VMs). VMs share resources such as CPU, hard drive space and RAM with the host, and this is not efficient, especially if the VM needs to run heavy processes.

Now imagine if the cooling container has its own source of fuel and all it needs is to be placed on the transporting vehicle. At the end you have plates container, electric container and cooling container all being transported with one vehicle.

From the above example we can establish that a container is used to enclose and hold something, either for storage, packaging or transportation. Containerization is then a system of transporting containers. With that in mind, think of application containerization as enclosing or packaging the files and libraries needed to run a desired piece of software. This then brings us to Docker. In a nutshell, Docker is software used to containerize other software or applications. It is developed by Docker Inc and it is open source.

How Docker works

Using our delivery truck example, the image below shows how Docker works in a nutshell:

The image below is going to be more technical on how Docker works on a real machine:

As shown in the diagram above, Docker shares the host’s Linux kernel, which in turn enables the containers to run on the same kernel. Using our delivery analogy, think of the Docker container for HDP Sandbox as the cooling container that requires its own fuel to cool the meat. The Docker container for HDP Sandbox runs on CentOS and has Java (OpenJDK), and its other applications include Hive, Ambari, Spark and Yarn, just to mention a few. Its own operating system is CentOS, yet the base operating system is Ubuntu 16.04.4 LTS. In the same way, the cooling container, as a hypothetical example, requires petrol to operate its cooling engine whereas the delivery truck runs on diesel. In addition, this is different from the Docker containers for ElasticSearch and Spark. They have their own needs that are separate from and independent of those of the HDP Sandbox.

So Docker has containerized these applications down to the basics, that is, share the kernel and bring your own operating system files. The requirements for this are modest, hence more containers can be added. In addition, because the Docker engine allows applications to run on the same kernel as the base operating system, the applications start faster. Lastly, by virtue of these applications carrying their own operating system files, they can run anywhere as long as Docker is installed on the host / base operating system. Again, using the delivery analogy, if the new place you are moving to is, say, abroad, the delivery truck will deliver the containers to an airport. Once at the airport, the containers will be transported by airplane.

If you put the Docker containers for Spark, HDP Sandbox and ElasticSearch on a machine with Mac or Windows, they will run the same way they would on a machine with Linux.


In conclusion, Docker is software that is used for application containerization. It is not the only containerization software: CoreOS released Rocket, and Microsoft is working on its own containerization software called Drawbridge.

You might then be wondering how these containers are built and, once they are built, how they are managed. In future articles I will look at how I use Docker and what I use to manage the containers. Back to our moving analogy, at the airport, planes are managed so that they land on the right runway, avoid collisions and take off at the right time. This again is similar to Docker containers: they need to be managed so that those that need to communicate with each other can do so. Therefore in future articles I will also look at Kubernetes, an application that I use to manage Docker containers.

Containerization 1
Containerization 2

Big Data Handled Using Apache Hadoop


In the article, Big Data, what does it mean?, I spoke about what Big Data is and its characteristics. In conclusion, we noted that there are technologies that are used to handle Big Data and that one of those technologies is Hadoop. In this article we will look at what Hadoop is and give a high-level overview of how it is used to handle Big Data, based on the research I did online and what I understood.

What is Hadoop?

In order to understand what Hadoop is, one needs to understand the problem Hadoop is trying to address, and that problem is Big Data. We now know that Big Data is characterized by the 3 Vs: volume, velocity and variety. There is a need to store huge amounts of data, in its different forms, and that cannot be done on a single computer. There is also a need to process that huge amount of data, and again it cannot be handled by a single computer using the traditional software we already have.

Let’s use a hypothetical scenario in order to understand what Hadoop is. There is a bumper harvest on a farm and there is a need to store the harvest, say wheat. A single silo will not be able to store all the wheat, so more than one silo is needed, say 24. All the wheat is harvested and stored nicely in the silos. Now the wheat needs to be processed so that the farmer can produce flour for baking. Processing one silo takes about 24 hours, that is, one day. For all the silos, that will take 24 days, and the farmer’s clients are already waiting for the flour and cannot wait that long. The farmer then employs more advanced machinery that can process more wheat in less time, say 4 hours for one silo. If my math is right, one machine will take 4 days to process all the silos (24 silos x 4 hours = 96 hours), but this needs to be done in one day, so the farmer buys 3 more machines. With 4 machines working in parallel, all is good: the wheat is processed in a day and the following day all the flour has been packaged and is ready for delivery. All this is summarized below:
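The arithmetic in this scenario can be checked with a few lines of shell. The variable names are my own and the numbers come straight from the story.

```shell
#!/bin/bash
silos=24
hours_per_silo=4      # with the more advanced machine
machines=4            # the first machine plus the 3 extra ones

total_hours=$((silos * hours_per_silo))
days_one_machine=$((total_hours / 24))
days_all_machines=$((total_hours / machines / 24))

echo "One machine:  $total_hours hours, or $days_one_machine days"
echo "$machines machines: $days_all_machines day"
```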

Wheat can be thought of as data that needs to be stored, and one silo will not do. The same goes for data: one computer will not be sufficient and hence we need a number of computers. A collection of computers whose purpose is to store data and process it is known as a cluster, and each computer in that cluster is known as a node. The wheat can be thought of as being distributed across the silos. In the same way, data is distributed among the computers, and in Hadoop this is achieved using its distributed file system, known as the Hadoop Distributed File System, or HDFS for short. The wheat is also processed in parallel so that flour can be produced in a day. Likewise, data can be processed in parallel so that whatever computation is done produces results faster. In Hadoop this is done using the MapReduce programming model. For the machines to operate smoothly, the farmer needs to oversee the whole process, and in Hadoop that is done using YARN, short for Yet Another Resource Negotiator.
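The idea of splitting the data, processing the pieces in parallel and combining the results can be imitated on a single machine with a short shell script. This is only a local analogy and not Hadoop. The data set of 1000 numbered lines and the chunk size of 250 are made up for this example.

```shell
#!/bin/bash
# Made-up data set: 1000 numbered lines in a throwaway directory.
work=$(mktemp -d)
seq 1 1000 > "$work/data.txt"

# "Distribute" the data: split it into chunks of 250 lines each.
split -l 250 "$work/data.txt" "$work/chunk_"

# Process every chunk in parallel as a background job.
for chunk in "$work"/chunk_*; do
    wc -l < "$chunk" > "$chunk.count" &
done
wait    # block until all background jobs have finished

# Combine the partial results into one total.
total=$(cat "$work"/chunk_*.count | awk '{ sum += $1 } END { print sum }')
echo "Total lines: $total"
```

Each background job plays the role of one machine working on its own silo, and the final sum plays the role of collecting the flour.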

Hadoop, an Apache open-source project, is therefore a combination of modules whose purpose is to store and process huge amounts of data. At the core of Hadoop, there is HDFS, responsible for storing the data and MapReduce, responsible for processing the data. In addition to the core there is Hadoop YARN and Hadoop Common. The modules are summarized as follows:

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop MapReduce is an implementation of the MapReduce programming model for large scale data processing.

Hadoop YARN is a platform responsible for managing computing resources in a cluster and using them for scheduling applications.

Hadoop Common contains libraries and utilities needed by other Hadoop modules.

Back to our farm analogy, if the farmer wants to keep, say, chickens for poultry farming, the farmer will build a poultry cage on the very same land where the wheat is grown and kept. The farmer can do more than one activity on the farm; the limit is the size of the land and talent. The same goes for Hadoop. It is not limited to the aforementioned modules, and other modules can be added onto the platform, such as Hive, HBase, ZooKeeper, Kafka, Storm, Spark and so forth. All these modules perform different functions.

How it began

The co-founders of Hadoop are Doug Cutting and Mike Cafarella. The name Hadoop came from Doug’s son’s toy elephant. Doug and Mike were inspired by the “Google File System” paper that was published in October 2003. The initial development was on the Apache Nutch project but later moved to the new Hadoop project in January 2006. The first committer to add to the Hadoop project was Owen O’Malley in March 2006 and Hadoop 0.1.0 was released in April 2006. It continues to evolve through the many contributions that are being made to the project.


This was just a high-level overview of what Hadoop is and how it handles Big Data. From this, in short, we can say Hadoop is an aggregation of different modules whose purpose is to store and process Big Data in an efficient and secure manner. In future articles, I will look into the core modules of Hadoop in depth.

Wiki Apache Hadoop

Big Data, what does it mean?


The buzz these days is Big Data. From government institutions to private entities, they are all talking about Big Data. What is Big Data? In this article, I will try to explain what Big Data is based on the research I did and what I understood.

Big Data, what is it?

Before answering this question, the first question we must ask ourselves is: what is data? The dictionary definition states that data is facts and statistics collected together for reference or analysis. If you think about it, we were collecting data in files and books long before the dawn of computers. With computers, we moved from capturing facts and statistics on paper to using spreadsheets and databases. Once the information has been captured on a computer, it is stored in binary digital form.

Now let us look at the definition of big. The dictionary defines big as something of considerable size or extent. Synonyms of big include large, great, huge, immense and enormous. Combining the two, we can say that Big Data is an enormous collection of facts and statistics collected together for reference or analysis. Is it that simple? Not even close. To understand why, we need a little history. Remember that we shifted from paper to computers for capturing data. The first hard drive, which is the device a computer uses for storing data, could hold only about 5 MB. This is equivalent to Shakespeare’s complete works or a 30-second video clip of broadcast quality.

Today, a single drive in a computer can hold up to 15.36 terabytes (TB) of data. If 10 terabytes of data is equivalent to the printed collection of the U.S. Library of Congress, then 15.36 terabytes is a lot of data on a single computer. I am sure we can start to see why Big Data is so hyped these days. In terms of volume, computers now give us the capability of holding almost as much information as we want. But is that all? Is Big Data only about the amount of data we can hold? Not even close.

We are no longer capturing data using computers only. We now have smart devices such as our phones, refrigerators, airplanes and even motor vehicles, and all of these devices are capable of capturing data. Now imagine people on a platform like Twitter tweeting from their phones about something that has just happened, while people at work jump on the same bandwagon and post about the same thing. All of a sudden there is a lot of information coming through Twitter; say a hundred people are tweeting every two seconds. At this point, Twitter is not only experiencing a surge of data, but the speed, or velocity, at which the data is arriving is also high. People are not only posting text, they are also posting pictures and videos. So on top of the surge and the speed, the data is also arriving in different varieties.

At this point we can see that Big Data is not only about the volume of data but also about velocity and variety. In the Big Data world, these characteristics are known as the 3 Vs. Let’s look at each of the Vs.


Volume is one of the main characteristics of Big Data. The word volume means the amount of space that a substance or object occupies, or that is enclosed within a container. In this case we can rephrase it as the amount of space that data occupies on a hard drive. For instance, think about the fact that Twitter has more active users than South Africa has people. Each of those users has posted a whole lot of photographs, tweets and videos. Then there are platforms such as Instagram, which has reported that 80 million photos are shared on an average day, and Facebook, which is reported to store about 250 billion images.

Let’s try to quantify this data. Say you have a phone that has a resolution of 1440 x 2560 pixels. Multiplying 1440 by 2560 gives you 3,686,400 pixels. You then multiply this number by the number of bytes per pixel to get the size of one uncompressed image:
16 bit per pixel image: 3,686,400 x 2 bytes per pixel = 7,372,800 bytes = 7.37 MB approx
32 bit per pixel image: 3,686,400 x 4 bytes per pixel = 14,745,600 bytes = 14.75 MB approx

Now for argument’s sake, if 80 million photos are shared on Instagram on an average day, how much space is needed if, say, all users are using a phone with the above resolution and every photo is stored as an uncompressed 32 bit image? If my maths is right, that is about 1,180,000,000 MB, which is 1.18 petabytes. Take note here: that is per DAY! That is a lot of hardware needed to store that much information.
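
The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the text: every photo is an uncompressed 32 bit image at 1440 x 2560, and units are decimal (1 MB = 1,000,000 bytes, 1 PB = 10^15 bytes). The function and variable names are my own.

```python
WIDTH, HEIGHT = 1440, 2560       # phone resolution used in the text
PHOTOS_PER_DAY = 80_000_000      # reported Instagram daily photo count


def image_bytes(width, height, bits_per_pixel):
    # Size of one uncompressed image: pixels multiplied by bytes per pixel.
    return width * height * bits_per_pixel // 8


bytes_16 = image_bytes(WIDTH, HEIGHT, 16)
bytes_32 = image_bytes(WIDTH, HEIGHT, 32)

# Daily storage if every photo were an uncompressed 32 bit image.
daily_bytes = bytes_32 * PHOTOS_PER_DAY
daily_mb = daily_bytes / 10**6
daily_pb = daily_bytes / 10**15

print(f"16 bit image: {bytes_16:,} bytes")
print(f"32 bit image: {bytes_32:,} bytes")
print(f"Per day: {daily_mb:,.0f} MB = {daily_pb:.2f} PB")
```

Real photos are compressed, so the actual figure would be much smaller; the point is only to show how quickly the numbers grow.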


Velocity is also an important characteristic of Big Data. Why? Again, consider the meaning of the word velocity, which is the speed of something in a given direction. In this case, it is the speed at which data is flowing into a data center. Back to our Twitter example: on average, around 6,000 tweets are sent every second, which corresponds to over 350,000 tweets per minute, about 500 million tweets per day and around 200 billion tweets per year. That is a lot of information that needs to be handled without creating bottlenecks. When a user logs on to Twitter, the experience must be flawless.
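
As a quick sanity check on those rates, here is a small Python sketch that scales the quoted figure of roughly 6,000 tweets per second up to a minute, a day and a year. The constant and variable names are my own, and a year is assumed to have 365 days.

```python
TWEETS_PER_SECOND = 6_000  # average rate quoted in the text

per_minute = TWEETS_PER_SECOND * 60
per_day = per_minute * 60 * 24
per_year = per_day * 365  # assumes a 365-day year

print(f"Per minute: {per_minute:,}")
print(f"Per day:    {per_day:,}")
print(f"Per year:   {per_year:,}")
```

The results (360,000 per minute, about 518 million per day and about 189 billion per year) line up with the rounded figures quoted above.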


Variety is the quality or state of being different or diverse; the absence of uniformity or monotony. In other words, things are not all the same, and this is true for data as well. From the examples we used, people tweet using text, photos and videos, so the data is different in kind. Data can be structured or unstructured, which means that not all data fits easily into the fields of a spreadsheet or a database application. There have to be ways and means of storing data in its different forms.


Big Data is not only about how huge the data is; it is also about how fast the data is moving and what type it is. If there is a surge in the amount of data that needs to be processed, the next question we need to answer is: how is it all handled? When a user logs on to Twitter, Facebook or Instagram, visits a news site like the BBC or streams on YouTube, the experience is flawless, yet the amount of data coming through those platforms is huge, fast-moving and in different forms. This gave birth to various technologies that are used to manage the surge. These technologies include Hadoop, Spark, Hive and Kubernetes, just to mention a few. I will look into some of these technologies in future articles.


My experience with Scala – Part 1

This is going to be a series of theories and personal experience with Scala.  I am currently using Scala for developing Spark applications.

Scala is an abbreviation of the term SCAlable LAnguage. It was created by Professor Martin Odersky and his group at EPFL (École Polytechnique Fédérale de Lausanne), Switzerland, in 2003. Scala provides a high-performance, concurrent-ready environment for functional programming and object-oriented programming on the Java Virtual Machine (JVM) platform.

Installing Scala:
As a JVM language, Scala requires a Java runtime. Scala 2.11 needs at least Java 6, but for optimal performance it is better to install the Java 8 JDK. The download for the Java 8 JDK is available on Oracle’s website.

The following installation can be done on Ubuntu. Create a script file, paste the installation script into it, make it executable and run it. For instance, using nano or any editor of your choice, from the terminal:
sudo nano install_java_scala.sh  -> this will create a file called install_java_scala.sh
copy the script below and paste it into the editor
Ctrl + O, then Enter, to save
Ctrl + X to exit nano
sudo chmod +x install_java_scala.sh to make the file executable
run ./install_java_scala.sh

We are installing wget in order to download the installation files needed to install Java and Scala.
Also note that you might have to change the link used to download Java, because it changes over time.


Once it is done, you can test to see if both Java and Scala have been installed correctly. To test, from the terminal run java -version and you should get:

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

As for Scala, just run scala and you should get:

After running scala from the terminal, you will be in the Scala REPL and everything is ready for coding in Scala. If you have issues getting started with Scala, make sure that Java is installed correctly and that the system path points to the correct Java version.

The Scala REPL is a tool for evaluating expressions in Scala. For instance, if you type println("Welcome to Scala!"), it will print Welcome to Scala!. The scala command can also execute a source script, such as one containing the println call above, by wrapping it in a template and then compiling and executing the resulting program. A help system is available and can be started by entering the :help command.

Note: A Read-Eval-Print Loop (REPL), also known as an interactive top level or language shell, is a simple, interactive programming environment that takes single user inputs (i.e. single expressions), evaluates them and returns the result to the user.

The other tool to install is SBT, which is the de facto build tool for Scala. The instructions for installing it can be found on the SBT website (scala-sbt.org). Once the installation is done, you will be able to run the sbt command from the terminal. If you start sbt without specifying a task to run, SBT starts its own interactive shell (an SBT REPL, which is not the same as the Scala REPL). Below is a list of sbt tasks you can run, together with a description of each task:

      • clean      Delete all build artifacts
      • compile    Incrementally compile the code
      • console    Start the Scala REPL
      • eclipse    Generate Eclipse project files
      • exit       Quit the REPL (or press Ctrl+D)
      • help       Describe commands
      • run        Run one of the “main” routines in the project
      • show x     Show the definition of variable “x”
      • tasks      Show the most commonly-used available tasks
      • tasks -V   Show ALL the available tasks
      • test       Incrementally compile the code and run the tests
      • ~test      Run incremental compiles and tests whenever files are saved
                   (this works for any command prefixed by “~”)



This is a basic introduction to Scala. In the next part of the series I will look into useful Scala terms that need to be defined and explained.




Object Oriented Programming (OOP)

Object oriented programming (OOP) is not a programming language used to write code. Instead, it is a paradigm, or model, that helps you to design software. How does it help you? To understand that, you need to understand two things: an Object and a Class.

An object is something that you can see, feel and touch. A mobile phone, for instance, is something you can see and touch. Because you can see and touch a mobile phone, you can also describe it, for instance by its color, name, make, model and weight. These are the Attributes of an object, and in OOP they are known as Properties.

Not only can you describe your mobile phone, but you can also perform actions with it, for instance make a call, play music, watch movies or hang up. These are known as Behaviors, and in OOP as Methods. Now, when a call is being made and there is no answer on the other side after a certain period of time, the phone will hang up automatically and can also send an instant message to the person you were trying to call. This is a description of what should happen in response to something occurring during a call, and in OOP such occurrences are known as Events.

In short, the descriptive items of an object are known as Properties. The actions you can perform with an object are known as Methods and are described using verbs. Lastly, the things that can happen to an object are known as Events.

A class is simply a representation of a type of object. It is the blueprint, plan or template that describes the details of an object, and from which individual objects are created.

From the above definitions, it follows that a class is used to create a mobile phone object. A class will represent our real mobile phone in our software.
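
To tie properties, methods and events together, here is a minimal Python sketch of a mobile phone class. Everything in it is hypothetical and chosen for illustration: the class name MobilePhone, the on_no_answer callback mechanism used to model an event, and the sample make, model and number.

```python
class MobilePhone:
    """Blueprint (class) from which individual phone objects are created."""

    def __init__(self, make, model, color):
        # Properties: descriptive attributes of the object.
        self.make = make
        self.model = model
        self.color = color
        self.in_call = False
        # Callbacks to run when the "no answer" event occurs.
        self._no_answer_handlers = []

    def on_no_answer(self, handler):
        # Register a function to be called when a call goes unanswered.
        self._no_answer_handlers.append(handler)

    def make_call(self, number, answered):
        # Method: an action performed with the object.
        if answered:
            self.in_call = True
            return
        # Event: nobody answered, so hang up and notify every handler.
        self.hang_up()
        for handler in self._no_answer_handlers:
            handler(number)

    def hang_up(self):
        # Method: end the current call.
        self.in_call = False


messages = []
phone = MobilePhone("Acme", "X1", "black")  # an object created from the class
phone.on_no_answer(lambda number: messages.append(f"Missed call message sent to {number}"))
phone.make_call("555-0100", answered=False)
print(messages)
```

The class is written once, but any number of phone objects with different makes, models and colors can be created from it.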

A point to note: we have already established that OOP is not a programming language. Rather, there are programming languages that we can use to implement OOP, and these include C#, PHP and Java, to mention only a few.

Understanding OOP is not complete without discussing its four main concepts, commonly referred to as the four pillars of OOP, which are:

  • Encapsulation
  • Abstraction
  • Inheritance
  • Polymorphism

Encapsulation is when you hide a module’s internal data and all other implementation details from other modules. In other words, it is a means of hiding data, properties and methods from the outside world, that is, from anything outside the object’s scope, and only revealing what is necessary. Encapsulation works hand in hand with Abstraction, discussed below, in the sense that Abstraction exposes the essential features of an object while Encapsulation hides unwanted or private data from the outside of an object.

Encapsulation is achieved by using access specifiers, which define the scope and visibility of an object’s members. C#, as an example, has the following access specifiers:

  • Public
  • Private
  • Protected
  • Internal
  • Protected internal

Private: limits the accessibility of a member to within the defined type. If, for example, a variable or function is declared as private, it can only be accessed by code in the same class or struct.

Public: allows a class to expose its member variables and member functions to other functions and objects without any limits. The type or member can be accessed by any other code in the same assembly or in another assembly that references it.

Protected: the type or member can only be accessed by code in the same class or struct, or in a derived (inherited) class.

Internal: the type or member can be accessed by any code in the same assembly, but not from another assembly.

Protected internal: the type or member can be accessed by any code in the same assembly, or by any derived class in another assembly.
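
The idea of hiding internal data can also be sketched in Python, with one important caveat: Python has no access specifiers like the C# ones listed above. Privacy there is a convention (a leading double underscore triggers name mangling) rather than something the language strictly enforces. The BankAccount class and its members below are hypothetical names used only for illustration.

```python
class BankAccount:
    def __init__(self, opening_balance):
        # Double underscore: the attribute is name-mangled and hidden from casual outside access.
        self.__balance = opening_balance

    @property
    def balance(self):
        # Read-only view of the hidden data: only what is necessary is revealed.
        return self.__balance

    def deposit(self, amount):
        # The only supported way to change the balance, so the rule below is always applied.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.__balance += amount


account = BankAccount(100)
account.deposit(50)
print(account.balance)

# Assigning to the read-only property from outside the class is rejected.
try:
    account.balance = 1_000_000
    rejected = False
except AttributeError:
    rejected = True
print(rejected)
```

The outside world can read the balance and make deposits, but it cannot set the balance directly, which is the essence of revealing only what is necessary.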

Abstraction is the process of exposing the essential features of an entity while hiding other irrelevant detail. It places the emphasis on what an object is or does rather than on how it is represented or how it works. Thus, it is the primary means of managing complexity in large programs. In other words, it is the process of removing unwanted information or details and paying attention to the information that is important to the context or system under consideration.

For example:

A person has different characteristics, different roles in society and different relationships to other persons. At school a person is a Student, at work an Employee and from a business’s point of view a Client. It all comes down to the context in which we are looking at a person, entity or object. Therefore, when developing an Academic Registration System, one can look at the characteristics of a person as a Student, such as Gender, Name, Age and Course Enrolled. For a Payroll System a person is an Employee, and one can look at characteristics such as Identification Number, Tax Registration Number, Contract or Full time, and Designation. Lastly, for a Life Policy Insurance System a person is a Client, and one can look at characteristics such as Health Status, Age, Gender, Marital Status and so forth.

Take note of how the person is looked at from those different systems, and how the important characteristics are abstracted and used in relation to the system in question. Even though some of the same characteristics can be used across all systems, the main point is that not all information about a person is relevant; only the few characteristics that matter to the system are taken into consideration. Therefore, abstraction is describing a person in simpler terms.
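
The person example can be sketched in Python as follows. The full record, its field names and values, and the as_student and as_employee helper functions are all made up for illustration; the point is only that each system keeps the characteristics it cares about and ignores the rest.

```python
from dataclasses import dataclass


@dataclass
class Student:
    # What an Academic Registration System cares about.
    name: str
    age: int
    gender: str
    course_enrolled: str


@dataclass
class Employee:
    # What a Payroll System cares about.
    name: str
    identification_number: str
    tax_registration_number: str
    designation: str


# Everything we happen to know about one (fictional) person.
person = {
    "name": "Thandi",
    "age": 29,
    "gender": "F",
    "course_enrolled": "Computer Science",
    "identification_number": "ID-001",
    "tax_registration_number": "TAX-001",
    "designation": "Developer",
    "marital_status": "Single",
}


def as_student(record):
    # Keep only the fields that are relevant in the school context.
    return Student(record["name"], record["age"], record["gender"], record["course_enrolled"])


def as_employee(record):
    # Keep only the fields that are relevant in the payroll context.
    return Employee(
        record["name"],
        record["identification_number"],
        record["tax_registration_number"],
        record["designation"],
    )


student = as_student(person)
employee = as_employee(person)
print(student)
print(employee)
```

The same person yields two different, simpler descriptions, and neither system needs to know about details such as marital status.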

Inheritance is the ability to create a new class from an existing class. A class that is used as the basis for inheritance is called a superclass or base class, whereas the class that inherits from a superclass is called a subclass or derived class. Subclass and superclass can be understood in terms of the “is a” relationship. A subclass is a more specific instance of a superclass. For example, a cabbage is a brassica vegetable, which is a vegetable. A Siamese is a cat, which is an animal. If the “is a” relationship does not exist between a subclass and a superclass, then inheritance should not be used. A cabbage is a vegetable, so it would make sense to write a Cabbage class that is a subclass of a Vegetable class. However, a “has a” relationship indicates composition and not inheritance. For instance, a car has a gearbox, and it would not make sense to say that a gearbox is a car or that a car is a gearbox.
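
Here is a minimal Python sketch of the two relationships, using the cabbage and car examples from the paragraph above. The class bodies (the describe method, the gears attribute) are assumptions added only so that there is something to inherit and something to compose.

```python
class Vegetable:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a vegetable"


class Cabbage(Vegetable):
    # "is a" relationship: a Cabbage is a Vegetable, so it inherits from it.
    def __init__(self):
        super().__init__("Cabbage")


class Gearbox:
    def __init__(self, gears):
        self.gears = gears


class Car:
    # "has a" relationship: a Car has a Gearbox, so it holds one (composition).
    def __init__(self, gearbox):
        self.gearbox = gearbox


cabbage = Cabbage()
car = Car(Gearbox(6))

print(cabbage.describe())              # method inherited from Vegetable
print(isinstance(cabbage, Vegetable))  # a cabbage is a vegetable
print(isinstance(car, Gearbox))        # a car is not a gearbox
print(car.gearbox.gears)               # but it has one
```

Cabbage gets describe for free through inheritance, while Car reaches the gearbox through an attribute rather than by inheriting from it.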

Polymorphism is a generic term that means ‘many shapes’. More precisely, polymorphism means the ability to request that the same operation be performed by a wide range of different types of things. In OOP, polymorphism is achieved mainly through two techniques: method overloading and method overriding.

Method overloading is a feature that allows a class to have more than one method with the same name, provided their parameter lists are different. For instance, the method multiply(int x, int y), which has two parameters, is different from the method multiply(int x, int y, int z), which has three. A method can be overloaded in three ways: 1) by the number of parameters, as we saw with the multiply method, 2) by the data types of the parameters, for instance multiply(int x, int y) is different from multiply(int x, float y), and 3) by the sequence of the data types of the parameters, for instance multiply(float y, int x) is different from multiply(int x, float y).

Method overriding is a feature that allows a subclass to provide a specific implementation of a method that is already provided by one of its superclasses. The implementation in the subclass overrides, or in other words replaces, the implementation in the superclass by providing a method that has the same name, the same parameters and the same return type as the method in the superclass. For instance, say we have two classes, Duck and Dog, both inheriting from an Animal class that has a method called sound. Duck will override the method sound with println("quacks"), while Dog will override it with println("barks"). From this example, it is clear that method overriding enables a subclass to have its own specific implementation of an inherited method without modifying the superclass code.
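
The Duck and Dog example can be sketched in Python as shown below. The default sound returned by Animal, and the choice to return strings instead of printing them inside the methods, are my own assumptions made so the result is easy to check; the Dog is assumed to bark.

```python
class Animal:
    def sound(self):
        # Default implementation provided by the superclass.
        return "makes a sound"


class Duck(Animal):
    def sound(self):
        # Overrides Animal.sound: same name, same parameters.
        return "quacks"


class Dog(Animal):
    def sound(self):
        # Overrides Animal.sound with its own implementation.
        return "barks"


# The same call, animal.sound(), behaves differently depending on the object's class.
sounds = [animal.sound() for animal in (Animal(), Duck(), Dog())]
print(sounds)
```

The calling code never checks which kind of animal it has; each object answers with its own implementation, and the Animal class itself is left untouched.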


This is a basic introduction to object oriented programming concepts, namely its four main pillars. Having a good understanding of these concepts helps in designing effective object oriented solutions. There are many sources on the internet for further reading on OOP. I hope you found this useful as a starting point.



© 2020 Tendai

Theme by Tendai Bepete