Programming – Tendai

Category: Programming

Cannot cast exception while reading data from Hive in Spark

May 17, 2018 / rulanitee

Introduction

In this post I describe an exception I hit while reading data from Hive using Spark, and how I debugged and resolved it. I also explain how to reproduce the issue, since knowing how it arises also shows how to avoid it. The exception was:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritable

Spark code

Here is the snippet of code that I had in Spark:

The code snippet does a simple select from a table called geonames_data, which is in a database called tendai_test, and reads only 5 records. The exception is thrown when retrieving the data using the show function.
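The original snippet is not reproduced in this text. The query described above can be sketched roughly as follows, assuming a SparkSession created with Hive support. The object name and application name are hypothetical, and this is a reconstruction from the description rather than the original code.

```scala
import org.apache.spark.sql.SparkSession

object ReadGeonames {
  // Query as described in the text: a plain select limited to 5 records.
  val query: String = "select * from tendai_test.geonames_data limit 5"

  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark resolve tables through the Hive metastore.
    val spark = SparkSession.builder()
      .appName("read-geonames") // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql(query)
    // In the scenario described here, the ClassCastException surfaces at show(),
    // because that is when the rows are actually read.
    df.show()

    spark.stop()
  }
}
```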

Debugging

My first instinct was to check the schema of the table. Using the DataFrame’s printSchema function, I could see the fields and their datatypes, as shown below:

From the schema above, all the fields and their datatypes were what I expected. Also, the query I was using to retrieve the data does not do a cast anywhere. Therefore, there had to be another place where a difference could show up. The next thing I checked was the schema read directly from the ORC files, as shown below:

From the schema above, I noticed a difference in the datatype of a field called modificationdate: in ORC it is a string, whereas in Hive it is a date.
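The comparison described above can also be done programmatically. Below is a minimal sketch in plain Scala, assuming each schema has been reduced to (field name, datatype name) pairs. In Spark such pairs could be obtained with something like `df.schema.fields.map(f => f.name -> f.dataType.simpleString)`, once for the Hive table and once for a DataFrame read directly from the ORC files. The SchemaDiff object and its method name are hypothetical.

```scala
object SchemaDiff {
  // A schema is modelled here as (field name -> datatype name) pairs.
  type Schema = Seq[(String, String)]

  // Returns (field, typeInFirst, typeInSecond) for every field that is present
  // in both schemas but has a different datatype in each.
  def mismatches(first: Schema, second: Schema): Seq[(String, String, String)] = {
    val secondTypes = second.toMap
    first.collect {
      case (name, firstType) if secondTypes.get(name).exists(_ != firstType) =>
        (name, firstType, secondTypes(name))
    }
  }
}
```

For instance, comparing a Hive schema containing modificationdate as date against an ORC schema containing modificationdate as string would return a single entry for modificationdate, which is exactly the difference found here.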

Solution

A quick solution is to back up the table in Hive as follows:

create table tendai_test.geonames_data_backup as select * from tendai_test.geonames_data;

After backing up the data, drop the original table as follows:

drop table tendai_test.geonames_data purge;

Once the table is dropped, recreate geonames_data using the original script, which is as follows:

create table tendai_test.geonames_data(
geonameid int,
name string,
asciiname string,
alternatenames string,
latitude int,
longitude int,
featureclass string,
featurecode string,
countrycode string,
cc2 string,
admin1code int,
admin2code int,
admin3code int,
admin4code int,
population int,
elevation string,
dem int,
timezone string,
modificationdate date)
STORED AS ORC;

Once that is done, populate the table using the backed up data from geonames_data_backup as follows:

insert into tendai_test.geonames_data select * from tendai_test.geonames_data_backup;

Back to the Spark code

If I run the Spark code again, the exception no longer occurs, as shown below:

I made only a minor modification to the code: selecting a few fields instead of all of them.

Reproducing the issue

Using the backup table from above, we can recreate the issue as follows:
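The reproduction steps themselves are not included in this text. One way such a mismatch could arise, sketched under assumptions, is to write ORC files in which modificationdate is a string into the location of a table whose Hive definition still declares the field as a date. The path below is hypothetical, as are the object and application names, and whether the exception then appears may depend on the Spark version and configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReproduceCastIssue {
  // Hypothetical location of the table's ORC files; replace with the real one.
  val tablePath: String = "/path/to/warehouse/tendai_test.db/geonames_data"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reproduce-cast-issue") // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()

    // Read the backup, then change modificationdate from date to string.
    val changed = spark.table("tendai_test.geonames_data_backup")
      .withColumn("modificationdate", col("modificationdate").cast("string"))

    // Overwrite the ORC files behind geonames_data without touching the
    // metastore, which still declares modificationdate as date.
    changed.write.mode("overwrite").orc(tablePath)

    spark.stop()
  }
}
```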

Lessons

The lesson here is that, when writing data in Spark, make sure the datatypes are not changed. If a cast is needed, check first that it is to the right datatype.

Changing a datatype without checking can slow down production, especially when working in a big team. Say someone is reading data from geonames_data and expects modificationdate to be a date, then all of a sudden there is a cast exception. Getting to the bottom of the issue and resolving it will slow down progress.

Conclusion

This is what I experienced and I just wanted to share it. In future posts, I will talk about Hive, since there are some concepts in this post that I didn’t explain, for instance ORC and DataFrames.

My experience with Scala – Part 1

February 2, 2018 / rulanitee

Introduction:
This is going to be a series covering both theory and my personal experience with Scala. I am currently using Scala for developing Spark applications.

Background:
Scala is short for SCAlable LAnguage. It was created in 2003 by Professor Martin Odersky and his group at EPFL (École Polytechnique Fédérale de Lausanne), Switzerland. Scala provides a high-performance, concurrent-ready environment for functional and object-oriented programming on the Java Virtual Machine (JVM) platform.

Installing Scala:
As a JVM language, Scala requires a Java runtime. Scala 2.11 needs at least Java 6. For optimal performance, install the Java 8 JDK instead. The Java 8 JDK download is available on Oracle’s website.

The following installation can be done on Ubuntu. Create a file containing the script below, make it executable, and run it. For instance, using nano or any editor of your choice, from the terminal:
sudo nano install_java_scala.sh -> this will create a file called install_java_scala.sh
copy the script below and paste it into the editor
Ctrl+O, then Enter, to save
Ctrl+X to exit nano
sudo chmod +x install_java_scala.sh -> this makes the file executable
./install_java_scala.sh -> this runs the script

Notes:
We install wget in order to download the installation files needed for Java and Scala.
Also note that you might have to change the Java download link, because it changes over time.

Script:
install_java_scala

Once it is done, you can test to see if both Java and Scala have been installed correctly. To test, from the terminal run java -version and you should get:

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

As for Scala, just run scala and you should get:

After running scala from the terminal, you will be in the Scala REPL and ready to start coding in Scala. If you have issues getting started with Scala, make sure that Java is installed correctly and that the system path points to the correct Java version.

The Scala REPL is a tool for evaluating expressions in Scala. For instance, typing println("Welcome to Scala!") prints Welcome to Scala!. The scala command can also execute a source script, such as one containing the println call above, by wrapping it in a template and then compiling and executing the resulting program. A help system is available and can be started by entering the :help command.

Note: A Read-Eval-Print Loop (REPL), also known as an interactive top level or language shell, is a simple, interactive computer programming environment that takes single user inputs (i.e. single expressions), evaluates them and returns the result to the user.

The other tool to install is SBT, the de facto build tool for Scala. Instructions for installing it can be found at scala-sbt.org. Once the installation is done, you will be able to run the sbt command from the terminal. If you start sbt without specifying a task to run, SBT starts its interactive shell (its own REPL, distinct from the Scala REPL), where tasks can be entered. Below is a list of sbt tasks you can run, together with a description of each:

- clean: Delete all build artifacts
- compile: Incrementally compile
- console: Start the Scala REPL
- eclipse: Generate Eclipse project files
- exit: Quit the REPL (or press Ctrl+D)
- help: Describe commands
- run: Run one of the "main" routines in the project
- show x: Show the definition of variable "x"
- tasks: Show the most commonly used available tasks
- tasks -V: Show ALL the available tasks
- test: Incrementally compile the code and run the tests
- ~test: Run incremental compiles and tests whenever files are saved. This works for any command prefixed by "~"

Conclusion

This is a basic introduction to Scala. In the next part of this series I will look at useful Scala terms that need to be defined and explained.


Object Oriented Programming (OOP)

April 11, 2016 / rulanitee

Object oriented programming (OOP) is not a programming language used to write code. Instead, it is a paradigm: a model that helps you to design software. How does it help you? To understand that, you need to understand two things: an Object and a Class.

Object
An object is something that you can see, feel and touch, for instance a mobile phone. Because you can see and touch a mobile phone, you can also describe it: its color, name, make, model and weight. These are the Attributes of the object, and in OOP they are known as Properties.

Not only can you describe your mobile phone, you can also perform actions with it, for instance make a call, play music, watch movies or hang up. These are known as Behaviors, and in OOP as Methods. When a call is made and there is no answer after a certain period of time, the phone hangs up automatically and can also send an instant message to the person you were trying to call. This describes what should happen in response to something occurring during a call, and in OOP such occurrences are known as Events.

In short, descriptive items of an object are known as Properties. Actions you can perform with an object are known as Methods and are described using verbs. Lastly, things that can happen to an object are known as Events.

Class
A class is simply a representation of a type of object. It is the blueprint, plan or template that describes the details of an object, and from which individual objects are created.

From the above definitions, it follows that a class is used to create a mobile phone object. A class represents our real mobile phone in our software.

A point to note: we have already stipulated that OOP is not a programming language. There are, however, programming languages that we can use to implement OOP, and these include C#, PHP and Java, to mention only a few.

Understanding OOP is not complete without discussing its four main concepts, commonly referred to as the four pillars of OOP, which are:
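As a minimal sketch of the mobile phone example, here is a hypothetical MobilePhone class in Scala. The class, its fields and its methods are illustrative names only. The constructor parameters play the role of properties and the functions play the role of methods.

```scala
// Hypothetical MobilePhone class: the blueprint from which phone objects are created.
class MobilePhone(val make: String, val model: String, val color: String) {
  // Internal state that changes as methods are called.
  private var inCall: Boolean = false

  def isInCall: Boolean = inCall

  // Methods: actions that can be performed with the phone.
  def call(number: String): String = {
    inCall = true
    s"Calling $number from $make $model"
  }

  def hangUp(): Unit = {
    inCall = false
  }
}
```

Each `new MobilePhone(...)` creates a separate object from the same blueprint, with its own make, model, color and call state.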

Encapsulation
Abstraction
Inheritance
Polymorphism

Encapsulation
Encapsulation is when you hide your module’s internal data and all other implementation details from other modules. In other words, it is a means of hiding data, properties and methods from the outside world, or outside the object’s scope, and revealing only what is necessary. Encapsulation works hand in hand with Abstraction, discussed below: Abstraction exposes the essential features of an object, while Encapsulation hides unwanted or private data from outside the object.

Encapsulation is achieved by using access specifiers, which define the scope and visibility of an object’s members. C#, as an example, has the following access specifiers:

Public
Private
Protected
Internal
Protected internal

The Private access specifier limits the accessibility of a member to within the defined type. If, as an example, a variable or function is declared as private, it can only be accessed by code in the same class or struct.

The Public access specifier allows a class to expose its member variables and member functions to other functions and objects without limits. The type or member can be accessed by any other code in the same assembly or in another assembly that references it.

With the Protected access specifier, the type or member can only be accessed by code in the same class or struct, or in a derived class.

With the Internal access specifier, the type or member can be accessed by any code in the same assembly, but not from another assembly.

With the Protected Internal access specifier, the type or member can be accessed by any code in the same assembly, or by any derived class in another assembly.
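The specifiers above are C#’s. As a rough illustration of the same idea in Scala, here is a minimal sketch using a hypothetical BankAccount class. Scala’s modifiers are not identical to C#’s: members are public when no modifier is given, and there is no internal keyword.

```scala
class BankAccount(initialBalance: Int) {
  // private: only code inside BankAccount can read or write this field.
  private var balance: Int = initialBalance

  // protected: visible to BankAccount and its subclasses only.
  protected def log(message: String): Unit = println(message)

  // No modifier means public in Scala: any code can call these.
  def currentBalance: Int = balance

  def deposit(amount: Int): Unit = {
    require(amount > 0, "deposit must be positive")
    balance += amount
    log(s"Deposited $amount")
  }
}
```

Outside code can only change the balance through deposit, which validates the amount first. The field itself stays hidden.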

Abstraction
Abstraction is the process of exposing the essential features of an entity while hiding irrelevant detail. It places the emphasis on what an object is or does rather than how it is represented or how it works. Thus, it is the primary means of managing complexity in large programs. In other words, it is the process of removing unwanted information and paying attention to the information that is important to the context or system under consideration.

For example:

A person has different characteristics, different roles in society and different relationships to other persons. At school a person is a Student, at work an Employee and from a business’s point of view a Client, so it all comes down to the context in which we are looking at a person, entity or object. Therefore, when developing an Academic Registration System, one can look at the characteristics of a person as a Student, such as Gender, Name, Age and Course Enrolled. For a Payroll System, a person is an Employee and one can look at characteristics such as Identification Number, Tax Registration Number, Contract or Full time, and Designation. Lastly, for a Life Policy Insurance System, a person is a Client and one can look at characteristics such as Health Status, Age, Gender, Marital Status and so forth.

Take note of how the person is looked at from those different systems, and how the important characteristics are abstracted and used in relation to the system in question. Even though some characteristics are used across all the systems, the main point is that not all information about a person is relevant: only the few characteristics that matter to the system are taken into consideration. Abstraction, therefore, is describing a person in simpler terms.

Inheritance
Inheritance is the ability to create a new class from an existing class. A class that is used as the basis for inheritance is called a superclass or base class, whereas the class that inherits from a superclass is called a subclass or derived class. Subclass and superclass can be understood in terms of the "is a" relationship. A subclass is a more specific kind of its superclass. For example, a cabbage is a brassica vegetable, which is a vegetable. A Siamese is a cat, which is an animal. If the "is a" relationship does not exist between a subclass and a superclass, then inheritance should not be used. A cabbage is a vegetable, so it would make sense to write a Cabbage class that is a subclass of a Vegetable class. However, a "has a" relationship indicates composition and not inheritance. For instance, a car has a gearbox, and it would not make sense to say a gearbox is a car or that a car is a gearbox.
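The two relationships can be sketched in Scala as follows. The class names come from the examples above, while the fields and the describe method are hypothetical additions for illustration.

```scala
// Inheritance: a Cabbage "is a" Vegetable.
class Vegetable(val name: String) {
  def describe: String = s"$name is a vegetable"
}

class Cabbage extends Vegetable("Cabbage")

// Composition: a Car "has a" Gearbox, so Gearbox is a field, not a superclass.
class Gearbox(val gears: Int)

class Car(val gearbox: Gearbox)
```

Cabbage inherits describe from Vegetable without redefining it, whereas Car only holds a reference to a Gearbox and shares no type relationship with it.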

Polymorphism
Polymorphism is a generic term that means ‘many shapes’. More precisely, Polymorphism means the ability to request that the same operation be performed by a wide range of different types of things. In OOP, polymorphism is achieved mainly through two techniques: Method overloading and Method overriding.

Method overloading is a feature that allows a class to have more than one method with the same name, provided their parameter lists are different. For instance, the method multiply(int x, int y), which has two parameters, is different from the method multiply(int x, int y, int z), which has three. A method can be overloaded in three ways: 1) by the number of parameters, as we saw with the multiply method; 2) by the data type of the parameters, for instance, multiply(int x, int y) is different from multiply(int x, float y); and 3) by the sequence of the data types of the parameters, for instance, multiply(float y, int x) is different from multiply(int x, float y).
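The three ways of overloading can be sketched in Scala as follows. The Calculator object is a hypothetical container for the multiply methods named above.

```scala
object Calculator {
  // Baseline: two Int parameters.
  def multiply(x: Int, y: Int): Int = x * y

  // 1) Different number of parameters.
  def multiply(x: Int, y: Int, z: Int): Int = x * y * z

  // 2) Different parameter data types.
  def multiply(x: Int, y: Float): Float = x * y

  // 3) Different sequence of parameter data types.
  def multiply(y: Float, x: Int): Float = y * x
}
```

The compiler picks the method to call from the number and types of the arguments, so `Calculator.multiply(2, 3)` and `Calculator.multiply(2, 1.5f)` resolve to different methods even though the name is the same.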

Method overriding is a feature that allows a subclass to provide a specific implementation of a method that is already provided by one of its superclasses. The implementation in the subclass overrides, or in other words replaces, the implementation in the superclass by providing a method that has the same name, same parameters and same return type as the method in the superclass. For instance, say we have two classes, Duck and Dog, both inheriting from the Animal class, which has a method called sound. Duck will override the method sound as follows: println("quacks"), while Dog will override it as follows: println("barks"). From this example, it is clear that method overriding enables a subclass to have its own specific implementation of an inherited method without modifying the superclass code.
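The Duck and Dog example can be sketched in Scala as follows. In this sketch sound returns the text rather than printing it, which is a small change from the description above, and the default sound in Animal is a hypothetical placeholder.

```scala
class Animal {
  // Default implementation, replaced by subclasses that override it.
  def sound(): String = "some generic sound"
}

class Duck extends Animal {
  override def sound(): String = "quacks"
}

class Dog extends Animal {
  override def sound(): String = "barks"
}
```

Calling sound() on a value of type Animal runs the implementation of the object’s actual class, so a Duck held in an Animal variable still quacks.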

Conclusion

This is a basic introduction to object oriented programming concepts, namely its four main pillars. Having a good understanding of those concepts helps in designing effective object oriented solutions. There are many sources on the internet for further reading on OOP. I hope you found this useful as a starting point.

