The buzz these days is Big Data. From government institutions to private entities, everyone is talking about Big Data. What is Big Data? In this article, I will try to explain what Big Data is, based on my own research and understanding.
Big Data, what is it?
Before answering this question, the first question we must ask ourselves is: what is data? The dictionary defines data as facts and statistics collected together for reference or analysis. If you think about it, we had been collecting data for a very long time before the dawn of computers, using files and books. With the arrival of computers, we moved from capturing facts and statistics on paper to using spreadsheets and databases. Once the information has been captured on a computer, it is in binary digital form.
Now let us look at the definition of big. The dictionary defines big as something of considerable size or extent. Synonyms of big include large, great, huge, immense, enormous and so forth. Combining the two, it follows that Big Data is an enormous collection of facts and statistics collected together for reference or analysis. Is it that simple? Not even close. To understand why, we need a little history. Remember, we shifted from paper to computers for capturing data, and guess what: the first hard drive, the storage device on a computer for holding data, could only store 5 MB. That is roughly equivalent to Shakespeare's complete works, or a 30-second video clip of broadcast quality.
Today, a standard computer can have a hard drive that holds up to 15.36 terabytes (TB) of data. If 10 terabytes is equivalent to the printed collection of the U.S. Library of Congress, then 15.36 terabytes on a single computer is a lot of data. We can start to see why Big Data is such a hot topic. In terms of volume, we now have the capability of holding enormous amounts of information on computers. But is that all? Is Big Data only about the amount of data we can hold? Not even close.
We are no longer capturing data using computers alone; we now have smart devices such as phones, refrigerators, airplanes and even motor vehicles. All of these devices can capture data. Now imagine people on a platform like Twitter, tweeting from their phones about something that has just happened, while people at work jump on the same bandwagon and post about the same thing. Suddenly there is a lot of information coming through Twitter; say for every two seconds, a hundred people are tweeting. At this point, Twitter is not only experiencing a surge of data: the speed, or velocity, at which the data arrives is also high. And people are not only posting text; they are also posting pictures and videos. So on top of the surge and the speed, the data is also arriving in different varieties.
At this point we can see that Big Data is not only about the volume of data; it is also about velocity and variety. In the Big Data world, these characteristics are known as the 3 Vs. Let's look at each of the Vs.
Volume is one of the main characteristics of Big Data because of the meaning of the word itself: the amount of space that a substance or object occupies, or that is enclosed within a container. In our case, we can rephrase it as the amount of data occupying a hard drive. For instance, consider that Twitter has more active users than South Africa has people, and each of those users has posted a whole lot of photographs, tweets and videos. What about platforms such as Instagram? It has been reported that on an average day 80 million photos are shared there, and Facebook is reported to store about 250 billion images.
Let's try to quantify this data. Say you have a phone with a resolution of 1440 x 2560 pixels. Multiplying 1440 by 2560 gives you 3,686,400 pixels. You then multiply this number by the number of bytes per pixel:
16-bit-per-pixel image: 3,686,400 x 2 bytes per pixel = 7,372,800 bytes ≈ 7.37 MB
32-bit-per-pixel image: 3,686,400 x 4 bytes per pixel = 14,745,600 bytes ≈ 14.75 MB
Now, for argument's sake, if on an average day 80 million photos are shared on Instagram, how much space is needed if all users are using a phone with the above resolution? If my maths is right, that is about 1,180,000,000 MB, which is 1.18 petabytes. Take note: that is per DAY! That is a lot of hardware needed to store that much information.
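The arithmetic above can be sketched in a few lines of Python. This is a deliberately simplified back-of-envelope estimate: it assumes every photo is an uncompressed 32-bit 1440 x 2560 image (real photos are JPEG-compressed and much smaller), and it uses decimal units (1 MB = 1,000,000 bytes).

```python
# Back-of-envelope estimate of daily photo storage, assuming every
# photo is an uncompressed 32-bit 1440 x 2560 image (a simplification;
# real uploads are compressed).
PIXELS = 1440 * 2560               # 3,686,400 pixels per photo
BYTES_PER_PIXEL = 4                # 32-bit colour
PHOTOS_PER_DAY = 80_000_000        # reported daily Instagram uploads

bytes_per_photo = PIXELS * BYTES_PER_PIXEL       # 14,745,600 bytes
mb_per_photo = bytes_per_photo / 1_000_000       # ~14.75 MB
daily_mb = mb_per_photo * PHOTOS_PER_DAY         # ~1.18 billion MB
daily_pb = daily_mb / 1_000_000_000              # ~1.18 petabytes

print(f"{mb_per_photo:.2f} MB per photo -> {daily_pb:.2f} PB per day")
```

Running this confirms the figure in the text: roughly 14.75 MB per photo, or about 1.18 PB of raw image data per day.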
Velocity is also an important characteristic of Big Data. Why? Again, from the meaning of the word: the speed of something in a given direction. In this case, it is the speed at which data flows into a data center. Back to our Twitter analogy: every second, on average, around 6,000 tweets are sent, which corresponds to over 350,000 tweets per minute, about 500 million tweets per day and around 200 billion tweets per year. That is a lot of information that needs to be handled without bottlenecks; when a user logs on to Twitter, the experience must be flawless.
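We can sanity-check how the per-second rate scales up to the larger windows. The figures in the text are rounded public estimates, so the exact products below come out slightly higher than the rounded numbers quoted:

```python
# Scaling the oft-quoted "~6,000 tweets per second" figure up to
# larger time windows. The quoted per-day/per-year numbers in the
# text are rounded; these are the exact products.
TWEETS_PER_SECOND = 6_000

per_minute = TWEETS_PER_SECOND * 60            # 360,000 (quoted as "over 350,000")
per_day = TWEETS_PER_SECOND * 60 * 60 * 24     # 518,400,000 (~500 million)
per_year = per_day * 365                       # ~189 billion (~200 billion)

print(f"{per_minute:,}/min, {per_day:,}/day, {per_year:,}/year")
```

The point of the exercise is less the exact numbers than the order of magnitude a platform must sustain continuously.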
Variety is the quality or state of being different or diverse; the absence of uniformity or monotony. In other words, not all the same, and this is true for data as well. From the examples we used, people tweet using text, photos and videos; the data differs in form. Put another way, data can be structured or unstructured, which means that not all data fits easily into the fields of a spreadsheet or a database application. There have to be ways and means of storing data in its different forms.
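To make the structured/unstructured distinction concrete, here is an illustrative sketch in Python. The field names and values are hypothetical, purely to show the contrast: the first record has a fixed set of columns and slots neatly into a table, while the second is a free-form post whose shape varies, which is why such data is often stored as JSON documents or raw blobs instead.

```python
import json

# Structured: a fixed set of typed columns; fits a spreadsheet
# row or a relational database table. (Field names are made up
# for illustration.)
structured_row = {"user_id": 42, "followers": 1200, "joined": "2019-04-01"}

# Unstructured/semi-structured: free-form text plus media items of
# varying shape; there is no fixed schema, so the record is usually
# stored as a JSON document or a raw blob.
unstructured_post = {
    "text": "What a match!",
    "media": [
        {"type": "photo", "bytes": 14_745_600},
        {"type": "video", "duration_s": 30},
    ],
}

print(json.dumps(unstructured_post, indent=2))
```

Notice that every `structured_row` would have the same three fields, whereas each post can carry zero, one or many media items of different kinds.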
Big Data, then, is not only about how huge or enormous the data is; it is also about how fast the data is moving and the forms it takes. Given this surge in the amount of data that needs to be processed, the next question is: how is it all handled? When a user logs on to Twitter, Facebook or Instagram, reads a news site like the BBC, or streams on YouTube, the experience is flawless, yet the data flowing through those platforms is huge, fast-moving and varied. This has given birth to various technologies for managing the surge, including Hadoop, Spark, Hive and Kubernetes, to mention a few. We will look into some of these technologies in future articles.