
Harnessing Big Data

January 16, 2013 K. Piskorski

Five terabytes. That’s how much data will be created directly or indirectly by every man, woman and child on earth by 2020.

This number, which comes from an International Data Corporation (IDC) report, is only an average. A US citizen who works in IT, uploads YouTube videos, owns a DSLR, and maintains accounts on numerous online platforms might have a digital footprint ten times this size. On the other hand, some people in the developing world might be represented only by a small string of kilobytes – a single record in the tax system of their country.

But just for the sake of simplicity, imagine that everyone you meet in the street has a stack of 1TB hard drives floating over his or her head. That’s their digital footprint. Companies and businesses have footprints too. Let’s imagine them as huge piles of drives sitting next to their office buildings. All this data physically exists somewhere in the world: in a Malaysian data center, on your pen drive, on a server of your insurance company, in your account at the online store you often use.

Now imagine that all the stacks are doubling in size every two years.

It’s not a fantasy. According to an Aberdeen Group report, the average US company’s dataset grew by 60% last year, and many businesses across the pond double their storage every two years.

It is this growth that fuels many new sectors of IT. One of them has become an important buzzword, already known to almost every IT professional on the planet.

Big Data.

Contrary to what you might think after this introduction, Big Data is not only about storage and archiving. It’s more about finding innovative (and often very profitable) ways to use, analyze and interconnect the immense multi-format data collections – from database records and statistics to videos, pictures and social streams.

In 2010, spending in this sector was close to $3.2 billion. By 2015, it is expected to reach $16.9 billion – an astonishing annual growth rate of 40%. That’s seven times the growth rate of the general IT market.

This sector is rising so fast that IDC predicts an upcoming shortage of specialists with experience in Big Data projects. When local job markets dry out, many businesses will turn to outsourcing and offshore development – a fact that companies like PGS Software keenly take note of.

If 2012 was a year of Big Data, then 2013 is going to be a year of Big Data outsourcing.

What can we do with it?

Why is it happening? Is there really so much value in this market, or is Big Data just a buzzword propagated by technology journalists?

Let me give you a good example. Last year, a small team of Big Data enthusiasts created an application called TwitterHealth. Their software analyzed an enormous sea of Twitter feeds, looking for social updates that could indicate someone is suffering from the flu. As you would expect, Twitter users very often mention when they feel sick or intend to stay at home – the application takes advantage of that. Wherever the flu strikes, such tweets become much more prevalent. By cross-referencing this semantic analysis with geographical information, TwitterHealth was able to create a surprisingly good, real-time map of flu epidemics. And here’s the best part: the map proved just as good as the one prepared by the Centers for Disease Control and Prevention, which drew on information from hundreds of medical practitioners. But it was much faster, much cheaper, and worked on data available to everyone.
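
The TwitterHealth code itself isn’t public, but the core idea is simple enough to sketch. Here’s a deliberately naive Python illustration (keyword matching standing in for real semantic analysis, and a region label standing in for geolocation):

```python
# A minimal sketch of the flu-map idea: flag flu-related tweets,
# then aggregate the flagged tweets by region.
from collections import Counter

FLU_KEYWORDS = {"flu", "fever", "sick", "staying home", "sore throat"}

def looks_like_flu(tweet_text):
    """Very naive 'semantic analysis': keyword matching on lowercased text."""
    text = tweet_text.lower()
    return any(keyword in text for keyword in FLU_KEYWORDS)

def flu_map(tweets):
    """Count flu-like tweets per region. Each tweet is (text, region)."""
    counts = Counter()
    for text, region in tweets:
        if looks_like_flu(text):
            counts[region] += 1
    return counts

tweets = [
    ("Down with the flu, staying home today", "Boston"),
    ("Great coffee this morning!", "Boston"),
    ("Fever again... this winter is awful", "Chicago"),
]
print(flu_map(tweets))  # Counter({'Boston': 1, 'Chicago': 1})
```

A production system would obviously need a real language model instead of a keyword list, and per-capita normalization – but the shape of the pipeline is the same: classify, geolocate, aggregate.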

Another example comes from Japan, where a company introduced a custom application for decommissioning post-lease cars. By monitoring auction houses and prices at local used-vehicle dealerships, the system automatically finds the best place in the country to sell each car. At the same time, it uses a large technical database to virtually disassemble the car, look up current prices of used parts, put a value on specific components, check whether there’s a market for them, and adjust the vehicle’s price accordingly.

Overall, this automated system allowed the company to earn $150 more on each of the 250,000 cars it sold per year. Who would say no to an extra $37.5 million out of nowhere? That’s what Big Data is all about – using what’s already available in a smarter way.
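
To make the decision logic concrete, here’s a toy Python sketch of the comparison such a system performs. All the numbers, channel names and part names below are made up for illustration – this is not Opera Solutions’ actual system:

```python
# Illustrative sketch: decide between selling a car whole at the
# best-paying channel, or virtually disassembling it for parts.

def best_whole_car_offer(offers):
    """offers: {channel_name: price}. Return the (channel, price) with the max price."""
    channel = max(offers, key=offers.get)
    return channel, offers[channel]

def parts_value(parts, used_part_prices):
    """Sum the resale value of components that actually have a market."""
    return sum(used_part_prices[p] for p in parts if p in used_part_prices)

offers = {"Tokyo auction": 4200, "Osaka dealer": 4350}
parts = ["engine", "gearbox", "bumper"]
used_part_prices = {"engine": 2500, "gearbox": 1400, "catalytic converter": 300}

channel, whole = best_whole_car_offer(offers)
strip = parts_value(parts, used_part_prices)
decision = "sell whole via " + channel if whole >= strip else "disassemble"
print(decision, max(whole, strip))  # sell whole via Osaka dealer 4350
```

The real system feeds these two functions with live auction and parts-market data instead of hard-coded dictionaries – the Big Data part is keeping those inputs fresh at national scale.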

The system I have just mentioned, created by Opera Solutions, is just one of many economic and e-commerce projects based on Big Data appearing all around us. Many of them will prove important to our economy, or even to politics. JPMorgan Chase & Co., a powerful financial company that came into the spotlight during the investigation into the American financial crash of 2008, currently uses Big Data software to manage the processing of derivatives – a financial IT system fed by many inputs from the economy, market trends and even world news.

My guess is, come the next crash, we’ll be talking about algorithms and software engineers instead of bankers with fat paychecks.

But there are also many good non-commercial examples of Big Data application development. One of them is the 1000 Genomes Project. Genome mapping and research has one important trait – it generates A LOT of information. Now imagine dozens of teams across the world working simultaneously, mapping the genomes of many individuals and creating heaps of raw data. That’s why the project’s creators made the data available through Amazon Web Services, giving every genome researcher on the globe easy access to the data of all the other scientists. A real global hub for genome research – and an example of why Big Data is important not only for commercial ventures.

The last thing I’d like to highlight comes from Google. The company employs what’s called Statistical Machine Translation to fuel its popular Google Translate service. Google doesn’t really try to “understand” the grammar of the world’s languages or the context of the phrases it translates. It just takes an immense database of digitized texts in both languages, looks for established patterns, and then tries to guess which string in the foreign language most likely represents the string in the input language.
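
The pattern-counting idea at the heart of this approach fits in a few lines of Python. This is a toy sketch, of course – Google’s real system is vastly more sophisticated, and the phrase pairs below are invented – but it shows why more data means better guesses:

```python
# Toy Statistical Machine Translation: count which target phrase most often
# aligns with a source phrase in parallel texts, then translate by picking
# the most frequent match.
from collections import defaultdict, Counter

def build_phrase_table(aligned_pairs):
    """aligned_pairs: list of (source_phrase, target_phrase) from parallel texts."""
    table = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        table[src][tgt] += 1
    return table

def translate(phrase, table):
    """Guess the most frequently seen translation, or None if never seen."""
    if phrase not in table:
        return None
    return table[phrase].most_common(1)[0][0]

pairs = [
    ("dzien dobry", "good morning"),
    ("dzien dobry", "good day"),
    ("dzien dobry", "good morning"),
]
table = build_phrase_table(pairs)
print(translate("dzien dobry", table))  # good morning
```

Every new parallel text added to `pairs` sharpens the frequency counts – which is exactly the dynamic described below: the system gets better simply by being fed more data.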

Yes, Google Translate often gets it wrong. But the thing is, it becomes more accurate the more data you pump into it. That’s one of the reasons why Google is running its book digitization effort. Every new volume added to the database makes Google’s translation services a tiny bit better.

Potentially, once it gobbles up and digests all known written sources in every language on earth, it could make human translators and conventional translation software obsolete – a groundbreaking prospect that might come true in just a few years.

There are more examples. Greenhouses that connect to publicly available weather data to determine when they should open and when they should close. Stores that manage stock levels based on social media buzz.

And yes, sometimes it gets a bit scary. Take the algorithm created by scientists from the University of Birmingham. It overlaid data from the cell phones of 200 people on a map and then started to learn about them, taking note of all their movements during the day – meetings, social interactions, work patterns.

It proved almost too effective. Once the analysis was finished, the algorithm could predict with 93% certainty where any of those people would be at any given time and date in the foreseeable future, to within about 20 meters.

It’s hard to decide what’s more disturbing – the fact that our daily routines are so repeatable, or that it takes so little to track us.

The impact on IT outsourcing

Now you know why the Big Data sector is so important, and why it is growing so fast. But the true depth of this market is not created by end-user solutions. A big chunk of the soon-to-be $16.9 billion pie is occupied by middleware that connects different data assets. For example, MarkLogic’s software allows you to analyze unstructured data in formats that are hard to process, such as arbitrary documents or videos. Many leading companies also invest in building their own Big Data analysis platforms, and every one of them is a massive undertaking.

Today, too many people still think of Big Data as something dull, evoking images of endless server rooms, tape streamers and cloud data centers. But the truth is, this rapidly growing sector is all about creativity. We need to realize that it’s Big Data – not handheld gadgets, new phones, phablets or tablets – that’s going to really change our lives ten years down the line.

According to IDC, 25% of the information we currently have in the world is potentially useful (and valuable). It only needs to be tagged, analyzed and interconnected. But so far, we have only managed to process 0.5% of it.

That’s why teams like my PGS team are eagerly waiting for entrepreneurs with a vision on how to utilize the remaining 24.5%.

We know that Big Data is the oil of the 21st century – a new resource lying in wait for the people who will find a way to use it. We know that the push we currently see, with the Big Data sector growing by 40% each year, is a new gold rush. We have also worked on some interesting projects in this sector – for example, a piece of software that tracked the ebbs and flows of the phone market by analyzing raw data from cellular towers.

That’s why we’re all going to watch this field closely in 2013. Maybe in one of many upcoming Big Data projects, we’re going to see a glimpse of the future of the entire IT industry.
