Author: James M. Tien
It is projected that about 5 zettabytes of digital data (1 zettabyte is 10^21 bytes) are being generated each year by everything from retail transactions to underground physics experiments to global positioning systems to the “Internet of Things.” In the United States, public and private sector programs are being funded to deal with Big Data in all five sectors of the economy: services, manufacturing, construction, agriculture, and mining. This article presents an overview and analysis of Big Data from a US perspective.
The term Big Data applies to datasets whose size exceeds the capacity of available tools to perform acquisition, access, analytics, and/or application in a reasonable amount of time.
“Data rich, information poor” (DRIP) problems have been pervasive since the advent of large-scale data collections or warehouses (Tien 2003), but they are somewhat mitigated by the Big Data approach, which supports informed (though not necessarily defensible or valid) decisions or choices. Big Data are useful for decision making only to the extent that they are analyzed or processed to yield critical information.
Thanks to technological and computing advances, Big Data are poised to
Definitions and Origin of the Term Big Data
It is helpful to first define data: they are values of qualitative or quantitative variables, typically the result of measurements, belonging to a set of items. They are considered raw before they are processed; in fact, the processed data from one stage may be the raw data for the next stage. Metadata (sometimes referred to as data about data) describe the content and context of a set or file of data; for example, a photo file’s metadata would identify the photographer, the camera settings, the date taken, and so forth.
In this paper the definition of data includes measurements, raw values, processed values, and metavalues. More specifically, the focus is on digital data, whose basic unit of measurement is a bit, an abbreviation for a binary digit that can be stored in a device and that has two possible distinct values or levels (say, 0 and 1). A byte is a basic unit of information containing 8 bits, which can represent 2^8, or 256, values (say, 0 to 255). Digital data can be measured in kilobytes (1000^1 bytes), megabytes (1000^2 bytes), gigabytes (1000^3 bytes), terabytes (1000^4 bytes), petabytes (1000^5 bytes), exabytes (1000^6 bytes), zettabytes (1000^7 bytes), and, for now, up to yottabytes (1000^8 bytes).
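The unit ladder above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the decimal (power-of-1000) convention used in this article; the names `UNITS`, `bytes_in`, and `bits_in` are illustrative and not taken from any library.

```python
# Decimal data units, smallest to largest, as listed in the text.
UNITS = [
    "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte",
]

BITS_PER_BYTE = 8


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (powers of 1000)."""
    power = UNITS.index(unit) + 1  # kilobyte -> 1000^1, megabyte -> 1000^2, ...
    return 1000 ** power


def bits_in(unit: str) -> int:
    """Return the number of bits in one `unit`."""
    return BITS_PER_BYTE * bytes_in(unit)


if __name__ == "__main__":
    # A byte of 8 bits can take 2^8 distinct values.
    print("values per byte:", 2 ** BITS_PER_BYTE)
    for unit in UNITS:
        print(f"1 {unit} = {bytes_in(unit):.0e} bytes = {bits_in(unit):.0e} bits")
```

Under this convention a zettabyte is 10^21 bytes, or 8 x 10^21 bits.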
Clearly, this assortment of data can add up very quickly; the International Data Corporation estimates that, on a worldwide basis, the total amount of digital data created and replicated each year will grow exponentially from 1 zettabyte in 2010 to 35 zettabytes in 2020! Thus it is projected that about 5 zettabytes of digital data are being generated in 2014.
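As a rough plausibility check on these figures, one can interpolate between the two endpoint estimates. The sketch below assumes a constant annual growth factor between 2010 and 2020, which is a simplification introduced here and not necessarily the International Data Corporation's method; the function names are hypothetical.

```python
def annual_growth_factor(start: float, end: float, years: int) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years)


def projected_volume(start: float, end: float,
                     start_year: int, end_year: int, year: int) -> float:
    """Volume in `year`, assuming constant exponential growth between the endpoints."""
    factor = annual_growth_factor(start, end, end_year - start_year)
    return start * factor ** (year - start_year)


if __name__ == "__main__":
    factor = annual_growth_factor(1.0, 35.0, 10)
    zb_2014 = projected_volume(1.0, 35.0, 2010, 2020, 2014)
    print(f"implied annual growth factor: {factor:.3f}")
    print(f"interpolated 2014 volume: {zb_2014:.1f} zettabytes")
```

Under this assumption the volume grows by a little over 40 percent per year, and the interpolated 2014 value comes out at roughly 4 zettabytes, the same order of magnitude as the figure of about 5 zettabytes cited above.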
According to the current version of Wikipedia (accessed on October 15, 2014), the term Big Data “usually includes datasets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.” Obviously, the definition of what constitutes Big Data is shifting as software tools become more powerful; today, depending on the nature and mix of the data, a dataset is considered big if its contents range from a few terabytes to many petabytes.
Who coined the term Big Data? Mayer-Schönberger and Cukier (2013) credit computer scientist Oren Etzioni with having the foresight to see the world as a series of Big Data problems before the concept of Big Data was even introduced. Etzioni received his bachelor’s degree in computer science from Harvard University in 1986 and his PhD from Carnegie Mellon University in 1991. In 1994 he helped build one of the Web’s first search engines, MetaCrawler, and later he cofounded Netbot, the first major Web shopping site, as well as ClearForest (for abstracting meaning from text documents) and Farecast (a predictive model for airline and other fares).
Mayer-Schönberger and Cukier (2013) provide not only a historical perspective on the birth and evolution of Big Data but also a persuasive argument about its importance, with insights about “what” is in the data (not necessarily about the “whys” behind the insights). In short, they consider Big Data a revolution that can transform how people live, work, and think.
Evolution of the Impacts and Uses of Big Data
At the beginning of the 21st century the growing volumes of data presented a seemingly insoluble problem; storage and central processing unit (CPU) technologies were overwhelmed by the terabytes of data being generated. Fortunately, Moore’s law came to the rescue and helped to make storage and CPUs larger, faster, smarter, and cheaper.
Today, Big Data are no longer a technical problem: they are a competitive advantage. As indicated earlier, enterprises are developing and using Big Data tools to explore their data troves, to discover insights that could help them develop better relationships with their customers, to identify new areas for business opportunities, and to better manage their supply chains, all in an increasingly competitive business environment. Big Data critically affect decisions, risk, informatics, services, goods, and customization or personalization. In short, Big Data can be used to improve services, products, and processes, especially by supporting timely decisions.
The next section details the four major components of Big Data—acquisition, access, analytics, and application—and the final section offers critical observations about benefits and concerns associated with Big Data.
It is helpful to begin by comparing the Big Data approach with the traditional data processing approach (Table 1), as there are major differences between the two in the four steps, or components, of data processing:
In particular, in contrast to the traditional, mostly statistical, approach, Big Data seek to unleash information in a manner that can support informed decisions by compensating for data quality issues with data quantity, data access restrictions with on-demand cloud computing, causative analysis with correlative data analytics, and model-driven with evidence-driven applications. This somewhat expedient (but not necessarily valid) approach can result in further problems or concerns, which are discussed in the concluding section.
On the other hand, the feasibility (or “good enough”) focus of Big Data is usually more realistic than the optimality focus of traditional operations research methods. In fact, the steady-state assumption that underpins optimality is, for the most part, unrealistic, especially in real-time environments where values are changing and agent-negotiated solutions are messy and at best only feasible.
No matter what the purpose of processing or analyzing data, it is critical that the data contain the insights or answers being sought; otherwise, the exercise involves no more than “garbage in, garbage out.” Metadata can therefore play an important role in ascertaining the scope, validity, and viability of the data.
The following sections present the current particulars of each component (e.g., associated terms, providers, recent developments, and projections) as well as remarks about the astounding growth in each area that warrant consideration, concern, and/or further research.
Advances in digital sensors, communications, computation, and storage have yielded zettabytes of data from customer order transactions, emails, radio frequency identification (RFID) sensors, smartphones, films, video recordings, audio recordings (including ecosystem recordings; e.g., from crickets and rainstorms), genetic sequences, and the Internet of Things. Both the Internet of Things and its data are proliferating because of Internet Protocol version 6 (IPv6), launched worldwide in 2012, which allows for trillions of devices to be connected to the Internet.
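The claim that trillions of devices can be connected follows from address-space arithmetic. The sketch below assumes that the Internet Protocol referred to above is IPv6, with 128-bit addresses, and compares it with the 32-bit addresses of IPv4; the constant names are illustrative.

```python
IPV4_ADDRESS_BITS = 32
IPV6_ADDRESS_BITS = 128
TRILLION = 10 ** 12

# Number of distinct addresses each protocol version can represent.
ipv4_addresses = 2 ** IPV4_ADDRESS_BITS
ipv6_addresses = 2 ** IPV6_ADDRESS_BITS

if __name__ == "__main__":
    print(f"IPv4 addresses: {ipv4_addresses:,}")
    print(f"IPv6 addresses: {ipv6_addresses:.2e}")
    # IPv4 cannot give even one trillion devices a unique address; IPv6 can.
    print("IPv4 covers a trillion devices:", ipv4_addresses >= TRILLION)
    print("IPv6 covers a trillion devices:", ipv6_addresses >= TRILLION)
```

The 32-bit space tops out at about 4.3 billion addresses, which is why a larger address space was needed before every sensor and appliance could be individually addressable.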
Methods of data capture include such stealth approaches as keystroke loggers and clickstreams (both of which can provide real-time insights into consumer behavior); smart sensors (which are becoming ubiquitous in devices, products, buildings, and even cities); health monitors (for both humans and animals in the monitoring of body temperature, blood pressure, etc.); and drones (including wing-flapping ornithopters and stamp-sized Memoto cameras).
Remarks about the astounding growth in Big Data acquisition:
In regard to data service, platform as a service (PaaS) consists of a computing platform and a solution stack as a service; together with software as a service (SaaS) and infrastructure as a service (IaaS), it is now typically associated with cloud computing.
In this service schema, the consumer creates the software using tools and/or libraries from the provider and is also able to control software deployment and configuration settings. Powerhouses such as Google, VMware, Amazon, Microsoft, HP, and Oracle provide the networks, servers, storage, and related services. Thus, as an example, Netflix uses Amazon to stream its on-demand videos and movies to more than 35 million subscribers in North and South America, the Caribbean, and several countries in Europe; indeed, on a typical weeknight Netflix movies account for about one-third of all downstream Internet traffic in North America.
Cloud computing—provided by companies such as Microsoft, Google, OpenStack, Amazon, and Rackspace—is growing in size as technical and security issues are being resolved and enterprises become dependent on the cloud for their growth, system efficiencies, and new product processes. It is projected that by 2015 more than 2.5 billion users and 15 billion devices will be accessing cloud services.
Remarks about the astounding growth in Big Data access:
Analysis of Big Data necessarily involves the computer, resulting in data analytics, the application of computer technology, operational research, and statistics to solve problems in business and industry. Analytics is carried out in a computer-based information system; of course, mathematics underpins the methods and algorithms used in analytics, and the science of analytics is concerned with extracting useful insights or properties of the data using computable functions. In addition to a decision-driven focus that permeates business and engineering, Big Data analytics has been used to gain scientific insights concerning, say, the laws of nature, genomics, and human behavior.
With the advent of Big Data analytics, a number of niche analytics have been developed—in retail sales, financial services, risk and credit, marketing, buying behavior, loan collections, fraud, pricing, telecommunications, supply chain, demand chain, transportation, and visualization. Early efforts at risk management were focused on risk avoidance; however, lost opportunities should also be factored into such efforts.
To process Big Data within tolerable elapsed times, Manyika and colleagues (2011) suggest a variety of techniques: association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis, and visualization.
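To make one item on this list concrete, the following is a minimal sketch of association rule learning restricted to item pairs. The transactions, thresholds, and the function name `pair_rules` are hypothetical and chosen only for illustration; production systems use algorithms such as Apriori or FP-growth over far larger datasets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; not data from the article.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) for single-item pairs.

    support    = share of transactions containing both items
    confidence = share of transactions with lhs that also contain rhs
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
        # Sort so each unordered pair is always counted under one key.
        pair_counts.update(combinations(sorted(basket), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    for lhs, rhs, support, confidence in pair_rules(TRANSACTIONS):
        print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Such rules describe what co-occurs in the data, not why, which is consistent with the correlative rather than causative emphasis of Big Data analytics noted earlier.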
Additional technologies being applied to Big Data include massively parallel processing databases, search-based applications, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
Remarks about the astounding growth in Big Data analytics:
It is, of course, difficult to separate Big Data analytics from its applications; nevertheless, it is helpful to identify applications in terms of enterprises that use Big Data techniques in their day-to-day activities, as shown in Table 3.
Among established companies, IBM (with its artificially intelligent Watson computer system) is credited with coining the term smarter planet; Google (with Glass, its wearable head-mounted display) is supporting augmented reality and cognition, yielding a range of data-driven technologies; and Microsoft (with its HDInsight) is empowering organizations with new insights on previously unstructured data.
Among growing companies, Cloudera uses Apache Hadoop to extract the most value from data, presumably at the least cost; Splunk provides a platform for real-time operational intelligence; and MongoDB Inc. develops MongoDB, an open source NoSQL database system.
Among academic institutes and within the past few years, the Simons Foundation selected the University of California, Berkeley, to host an ambitious $60 million Simons Institute for the Theory of Computing, where an interdisciplinary team of scientists and engineers is tackling problems in fields as diverse as health care, astrophysics, genetics, and economics; Boston University received $15 million to establish the Rafik B. Hariri Institute for Computing and Computational Science and Engineering; and the University of Rochester is investing $100 million in its Institute for Data Science to support data-informed, personalized medicine, national security, and online commerce.
Remarks about the astounding growth in Big Data application:
It is helpful to briefly consider the benefits and concerns associated with Big Data. With respect to benefits, Big Data allow for
Table 4 presents a summary of potential concerns about the focus, emphasis, and scope of the four data processing components: acquisition, access, analytics, and application. Other concerns include surveillance by autocratic governments and the processing of data in an increasingly unfocused, unproductive, and generally “shallow manner” (Carr 2010). Even Google’s vaunted flu prediction algorithm, which in 2009 was able to predict and locate the H1N1 flu spread on a near real-time basis, failed in 2012, when it predicted more than double the proportion of doctor visits for influenza-like illness estimated by the Centers for Disease Control and Prevention (which bases its estimates on a survey of clinics located throughout the United States). Lazer and colleagues (2014) blame this failure on Big Data hubris and algorithm dynamics.
Of course, potential Big Data concerns or problems can be mitigated with thoughtful and effective approaches and practices; for example, legislation could be passed to forbid the invasion of privacy and to impose severe sanctions on those who break the law or knowingly publish false findings. Alternatively, a watchdog organization can be created to discover such findings, much like the recently established METRICS (Meta-Research Innovation Center at Stanford), whose mission is “identifying and minimizing persistent threats to medical research quality.”
Finally, as suggested throughout this article, Big Data have to be regarded as a permanent disruptive innovation or transformation. That is, data must be constantly acquired, accessed, analyzed, and applied, resulting in new—and changing—insights that might be disruptive in nature. To profit from Big Data, one must accept uncertainty and change as a permanent state of affairs, as part of any enterprise’s DNA. Indeed, some companies invite such changes by adopting processes that enable variation, not eliminate it, and by valuing disruptions over the relentless pursuit of a single vision (e.g., efficiency). As an example, Google encourages some of its workers to spend 20 percent of their time on projects of their own choosing and provides additional resources to those with the most merit.
In short, change is the only constant; companies that do not embrace it will face the same demise as Kodak, Digital Equipment Corporation, and Atari. On the other hand, those—such as GE, IBM, and Intel—that allow for disruptive innovations have not only survived but thrived.
Carr N. 2010. The Shallows: What the Internet Is Doing to Our Brains. New York: Norton.
Lazer D, Kennedy R, King G, Vespignani A. 2014. The parable of Google flu: Traps in Big Data analysis. Science 343:1203–1206.
Manyika J, Chui M, Bughin J, Brown B, Dobbs R, Roxbury C, Byers AH. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York: McKinsey Global Institute.
Mayer-Schönberger V, Cukier K. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt.
McAfee A, Brynjolfsson E. 2012. Big Data: The management revolution. Harvard Business Review, October 3–9.
Pentland A. 2014. Social Physics: How Good Ideas Spread—The Lessons from a New Science. New York: Penguin Press.
Taleb NN. 2010. The Black Swan, 2nd ed. New York: Random House.
Tien JM. 2003. Toward a decision informatics paradigm: A real-time information-based approach to decision making. IEEE Transactions on Systems, Man and Cybernetics, Part C 33(1):102–113.
Tien JM. 2012. The next industrial revolution: Integrated services and goods. Journal of Systems Science and Systems Engineering 21(3):257–296.
Tien JM, Berg D. 2003. A case for service systems engineering. International Journal of Systems Engineering 12(1):13–39.
Tien JM, Krishnamurthy A, Yasar A. 2004. Toward real-time management of supply and demand chains. Journal of Systems Science and Systems Engineering 13(3):257–278.
Turing AM. 1950. Computing machinery and intelligence. Mind 59:433–460.
This article draws liberally from earlier papers by the author (Tien 2003, 2012; Tien and Berg 2003; Tien et al. 2004).