Academics from the US and the UK may dispute the number of zeroes in a quintillion, but they would agree that 2.5 quintillion bytes of data is a lot. That, according to IBM, is the amount of data generated around the world every day from human interactions on the internet, the instrumentation and interconnectedness of sensors, the intelligence of devices and what some have termed ‘ubiquitous computing’. These and other data sources are combining in ever growing proportions, in an expanding network, to produce vast amounts of data, 80% of which, according to IBM, is unstructured. Unstructured data poses an analytical challenge to traditional relational database tools, but the volume and variety of the data are not the only issues giving organisations headaches: they also have to contend with the speed at which the data changes. It is this triple play of volume, variety and velocity that has led many to adopt the term ‘Big Data’.
But the expression “Big Data” usually prompts the question “how big is big?”, because if Moore’s Law holds, what was big yesterday turns out to be small today. Not long ago Big Data might have referred to a handful of floppies or a CD-ROM. Nowadays personal computers come with hundreds of gigabytes of storage as standard, and corporations have systems that deal with data in the order of terabytes or petabytes, as exemplified by spy agencies monitoring global communications and research labs involved in genomic sequencing.
Anyone who sees similarities between Big Data and a data warehouse would not be far off, but there is a difference: data warehouses tend to be associated with structured data, whereas Big Data implies both structured and unstructured data on a “web scale”. Not all captured data is valuable, however, so organisations need to prospect for the high-value data items before Big Data can have any commercial use. Many commentators have drawn an analogy between mining ores for precious metals and querying Big Data for value, but it unfortunately implies that Big Data is the preserve of those with the wherewithal for investment.
Big Data use cases are many and include navigation, logging, fraud detection, social media and finance. Irrespective of the use case, the problems in dealing with the volume, variety and speed of the data are consistent. In creating Big Data solutions, organisations therefore invest in technology and resources that give them the capacity and capability to store and cross-reference vast amounts of continually changing data.
Historically, organisations have used data to report and explain what happened in the past quarter or perhaps year. Before any analysis and reporting, they also use data to understand the current state of affairs in the company. Business Intelligence solutions like Hyperion and Cognos have been great for handling financial and enterprise data, but organisations are now also sometimes required to make predictions via complex modelling to improve decision making, and this is where Big Data solutions from suppliers like IBM, Microsoft, Oracle and Dell are proving invaluable. Many corporations are also building their own bespoke Big Data solutions away from traditional suppliers. In general, the McKinsey Global Institute reports that these solutions empower corporations to pull financial returns from Big Data levers in marketing, merchandising, operations, supply chains and new business models.
Those who have shopped online in shops like Amazon.com know that upon purchasing one item they are often presented with options to purchase recommended items which are usually in tune with their preferences or tastes. Amazon has attributed 30% of its sales to such cross-selling made possible through information mined from Big Data. Thus Big Data insights serve to bring corporations ever closer to their customers through deeper segmentation and increasingly personalised services. Where such ‘micro-segmentation’ is not possible, corporations are still able to use Big Data to understand how their customers have taken to particular marketing initiatives through real-time analysis of social media feeds.
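The idea of cross-selling from purchase data can be illustrated with a very small co-purchase counter. This is only a sketch of the general "bought together" principle, not a description of Amazon's actual recommendation system; the function names and the order history are hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_purchase_counts(baskets):
    """Count how often each ordered pair of distinct items shares a basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        for item, other in permutations(set(basket), 2):
            counts[item][other] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Return up to top_n items most often bought together with `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts[item].items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical order history, invented for illustration only.
baskets = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["camera", "camera bag"],
    ["tripod", "camera bag"],
]
counts = build_co_purchase_counts(baskets)
print(recommend(counts, "camera"))
```

At web scale the same counting would be spread across many machines rather than held in one in-memory dictionary, which is where the Big Data platforms discussed later come in.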
The customer is king, and being attuned to their needs via the product, its placement or its pricing is good business. Many corporations are finding that, through information gleaned from Big Data, they can not only better tailor their products but also optimise their operations and supply chains.
Harvesting gains from the value chain through Big Data is not the preserve of service organisations alone; it extends to manufacturing as well. In their need to produce desirable products, many manufacturers are using consumer feedback to close the loop in collaborative product design and development, constantly refining their products using real-time data from products in use. But ‘collaborative development’ and ‘sensor-driven operations’ are not the only opportunities: the McKinsey Global Institute identifies other areas across the manufacturing value chain, like capacity planning and after-sales care, where manufacturing corporations stand to make gains by investing in Big Data.
The gains attributed to Big Data are not confined to the private sector. Based on its analysis of OECD data, the McKinsey Global Institute finds that Big Data has the potential to create value of up to 300 billion Euros in the EU via operational efficiency savings, fraud reduction and increases in tax receipts.
As the predictions of Big Data gains across retail, manufacturing and public administration pile in, one area in which consumers are already finding benefits is location-based applications and services. Thus, as dawn breaks on autonomous driving, Big Data also promises smarter routing, socialising and shopping.
If these are some of the benefits of Big Data, it makes sense to peek at the other side of the equation: the costs of deploying the technology that underpins Big Data.
Little can currently be said about the technology of Big Data without mentioning Apache’s Hadoop, a scalable, fault-tolerant, distributed, Java-based open source platform for handling very large datasets. Hadoop clusters balance storage, parallel processing and network resources to achieve phenomenal processing power and speed, as exemplified in 2013 when a Hadoop cluster performed a general purpose sort on 102.5 terabytes of data at a rate of 1.42 TB/min to take the Daytona Sort Benchmark. Key components in the cluster’s success are Hadoop’s MapReduce (comprising Mapping and Reducing functions) and the Hadoop Distributed File System (HDFS), which stores data in blocks of 64MB or, in the case of IBM’s BigInsights, 128MB.
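To make the Mapping and Reducing functions more concrete, the sketch below imitates the MapReduce flow in plain, single-process Python using the classic word count example. It does not use the Hadoop API: the function names are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the two phases on a real cluster. A small helper also shows how block size determines the number of blocks a file occupies, using the 64MB and 128MB figures quoted above.

```python
from collections import defaultdict
import math


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


def blocks_needed(file_size_mb, block_size_mb=64):
    """Number of fixed-size blocks a file occupies (the last may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


# Illustrative input only.
records = ["big data is big", "data is everywhere"]
print(word_count(records))
print(blocks_needed(1000), blocks_needed(1000, block_size_mb=128))
```

On a cluster, each mapper would run against a block stored on its own node and the reducers would run in parallel, which is what lets the same simple pattern scale to sorts like the one mentioned above (102.5 TB at 1.42 TB/min works out at roughly 72 minutes).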
IBM’s BigInsights is a shrink-wrapped enterprise Big Data solution which builds on an un-forked Hadoop to deliver an enhanced platform that can be easily deployed through an installer. According to IBM, BigInsights can be used to flatten the Big Data analytics time-to-value curve through a runtime environment which encourages the development and deployment of advanced analytical tools. But IBM is by no means the only supplier using this popular open source project to provide Big Data solutions for the enterprise. Microsoft also provides Hadoop distributions for Windows which integrate with traditional tools like Excel and Business Intelligence tools to help corporations quickly mine their Big Data for value.
The value conferred on corporations includes not only better segmentation of their customers, but also the ability to make effective decisions and the insight to better understand sources of variability in their operations, with a view to delivering improved quality.
There is, however, a disparity in the global spread of these gains, as different regions of the world stand to make varying returns from leveraging Big Data. More will go to the developed economies, where continuous investment in ICT infrastructure has meant that the 200 petabytes of data stored in the Middle East and Africa in 2010 was a tenth of that stored in Europe (IDC Storage Reports 2010).
Nor are Big Data benefits uniform across all sectors of an economy: the McKinsey Global Institute predicts the most gains in data-centric sectors like manufacturing, transport, real estate, finance, insurance, health care providers and retail. Deriving maximum benefit from Big Data requires businesses in these sectors to become data-centric, and recommendations from the Chartered Global Management Accountant (CGMA) note that firms have to understand what new data is relevant for their business and create quick-win initiatives which they can rapidly expand on to develop a data-driven culture in their organisations.
Irrespective of the sector or business, decision makers in corporations about to invest in a Big Data solution via Hadoop are faced with a continuum of deployment options, from building the cluster themselves (using physical machines) to leasing Hadoop-as-a-Service, a cloud-based and virtualised model. Accenture believes the choice of deployment model will be contingent on issues like the price/performance ratio, data privacy considerations, the feasibility of moving data across platforms, how easy it is to cross-link datasets and how easily data scientists can convert the raw data into value. But Hadoop is designed to take advantage of whatever hardware is available, and the exact specification for a Hadoop cluster is specific to the corporation and its context.
There are therefore many possible hardware configurations for Hadoop, but based on general user preferences Apache has identified three broad categories with costs in the order of 20, 5 and 1 thousand pounds respectively: quad x quad-core database-class machines, generic production dual x dual-core machines and single dual-core PCs. While putting the high-end machines together may top £50k, this is by no means the maximum that corporations spend on the ‘bare metal’ for their Big Data solutions. Beyond hardware, the total cost of ownership (TCO) for Hadoop also includes datacentre costs (rack space, energy, cooling equipment), the costs of technical support and maintenance, and the costs of operational staff (administrators and data scientists). In a review of the costs making up the TCO of a Hadoop cluster, Accenture concluded that the price-performance ratio was better under Hadoop-as-a-Service, for example Amazon EMR.
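The components of total cost of ownership listed above can be put together in a rough calculation. The sketch below is illustrative only: it assumes the 20:5:1 figures are per-machine costs in thousands of pounds, and the cluster size and annual running costs are hypothetical numbers chosen to show the arithmetic, not figures from Apache or Accenture.

```python
# Approximate hardware cost per machine in pounds, taken from the 20:5:1
# ratio in the text (assumed here to be per-machine figures).
HARDWARE_COST_GBP = {
    "database_class": 20_000,  # quad x quad-core machines
    "production": 5_000,       # dual x dual-core machines
    "desktop_pc": 1_000,       # single dual-core PCs
}


def hardware_cost(node_counts):
    """One-off hardware outlay for a cluster described as {machine class: count}."""
    return sum(HARDWARE_COST_GBP[kind] * n for kind, n in node_counts.items())


def total_cost_of_ownership(node_counts, annual_running_costs, years):
    """Hardware outlay plus recurring costs over the given number of years."""
    return hardware_cost(node_counts) + years * sum(annual_running_costs.values())


# Hypothetical cluster and running costs, invented for illustration.
cluster = {"database_class": 2, "production": 4}
running = {"datacentre": 8_000, "support": 6_000, "staff": 90_000}

print(hardware_cost(cluster))                              # 60000
print(total_cost_of_ownership(cluster, running, years=3))  # 372000
```

Even with these made-up numbers, the recurring costs quickly outweigh the initial hardware outlay, which is why a comparison with a leased, cloud-based service has to look at the whole TCO rather than the price of the machines alone.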
The business case for these costs is made through corporate gains such as improved customer segmentation via retail customer loyalty schemes like those at Tesco or Sainsbury’s. Since customers in the schemes benefit from discounts and redeemable reward points, they have been quite relaxed about handing over personal data to retailers. Nevertheless, there is still a great deal of public suspicion and concern over privacy. Cases in point are the recent NHS data sharing scheme in England, which in February 2014 had to be delayed for six months so that public fears could be allayed, and a recent 1 million Euro fine imposed on Google in Italy for its Street View cars’ infringements on the privacy of Italians, a marked increase from the 150 thousand Euro fine the company faced for similar infringements in Germany.
Notwithstanding the public doubts and occasional fines, Big Data presents plenty of opportunity, not only for businesses and their consumers but also for professionals, ranging from business analysts who can create value from new analytical insights to data scientists who manage the data journey from acquisition through ingestion, storage, visual analysis, modelling and dissemination. The internet of things and social media are spreading, and connected cars and many other Big Data levers are churning out astronomic amounts of diverse data. Coupled with the fact that 51% of corporate leaders consider Big Data a corporate priority, this means the number and roles of data professionals in organisations are set to increase, as will the range of new software tools, databases and platforms created to increase our understanding of the world around us as captured by ubiquitous sensors.
Paul C. Zikopoulos et al, Understanding Big Data, Analytics for Enterprise Class Hadoop and Streaming Data, 2012, McGraw Hill
Paul C. Zikopoulos et al, Harness the Power of Big Data, The IBM Big Data Platform, 2012, McGraw Hill
Michael Chui et al, The Internet of Things, 2010, McKinsey Quarterly, Number 2
CGMA Report, From insight to impact, Unlocking opportunities in big data, October 2013
James Manyika et al, Big data: The next frontier for innovation, competition, and productivity, May 2011, McKinsey Global Institute
Big Data Now, Current Perspectives From O’Reilly Media, 2013 Edition, O’Reilly Media.
Scott W. Kurth, Michael E. Wendt, Hadoop Deployment Comparison Study: Price-performance comparison between a bare-metal Hadoop cluster and Hadoop-as-a-Service, 2013, Accenture Technology Labs
Data Scientist: The Sexiest Job of the 21st Century
What is Big Data
Big Data at the Speed Of Business