Winter Bridge: A Global View of Big Data
December 15, 2014 Volume 44 Issue 4

Overview of Big Data: A US Perspective


Author: James M. Tien

It is projected that about 5 zettabytes (a zettabyte is 10²¹ bytes) of digital data are being generated each year by everything from retail transactions to underground physics experiments to global positioning systems to the “Internet of Things.” In the United States, public and private sector programs are being funded to deal with Big Data in all five sectors of the economy—services, manufacturing, construction, agriculture, and mining. This article presents an overview and analysis of Big Data from a US perspective.

Introduction

The term Big Data applies to datasets whose size exceeds the capacity of available tools to perform acquisition, access, analytics, and/or application in a reasonable amount of time.

“Data rich, information poor” (DRIP) problems have been pervasive since the advent of large-scale data collections or warehouses (Tien 2003); they are somewhat mitigated by the Big Data approach, which supports informed (though not necessarily defensible or valid) decisions or choices. Big Data are useful for decision making only to the extent that they are analyzed or processed to yield critical information.

Thanks to technological and computing advances, Big Data are poised to

  • add greater value to businesses, which can plumb their transactional data to detect patterns suggesting the effectiveness of their pricing, marketing, and supply chain strategies;
  • enhance understanding of planet Earth, which is being extensively monitored on the ground, in the air, and in the water;
  • support efforts to solve science and engineering problems, which increasingly require data-driven solutions;
  • support modern medicine, which is collecting and mining large amounts of image scans and genetic markers;
  • enhance the World Wide Web, which is amassing terabytes of textual, audio, and visual material made available through search engines such as Google, Yahoo, and Bing; and
  • aid national security agencies, which are collecting and mining satellite and thermal imagery, audio intercepts, and other readily available digital information.

Definitions and Origin of the Term Big Data

It is helpful to first define data: they are values of qualitative or quantitative variables, typically the result of measurements, belonging to a set of items. They are considered raw before they are processed; in fact, the processed data from one stage may be the raw data for the next stage. Metadata (sometimes referred to as data about data) describe the content and context of a set or file of data; for example, a photo file’s metadata would identify the photographer, the camera settings, the date taken, and so forth.
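
To make the photo example concrete, the following is a minimal sketch, in Python, of how such metadata might be represented alongside the raw image bits; the field names are illustrative rather than any standard (e.g., EXIF) schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PhotoMetadata:
    """Illustrative 'data about data' for a photo file (hypothetical fields)."""
    photographer: str     # who took the picture
    camera_settings: str  # e.g., aperture, shutter speed, ISO
    date_taken: date      # when the picture was taken

# The raw data are the image bits; the metadata describe their content and context.
photo_meta = PhotoMetadata(
    photographer="A. Photographer",
    camera_settings="f/8, 1/125 s, ISO 100",
    date_taken=date(2014, 10, 15),
)
print(photo_meta)
```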

In this paper the definition of data includes measurements, raw values, processed values, and metavalues. More specifically, the focus is on digital data, whose basic unit of measurement is a bit, an abbreviation for binary digit, which can be stored in a device and has two possible distinct values or levels (say, 0 and 1). A byte is a basic unit of information containing 8 bits, which can take on 2⁸, or 256, values (say, 0 to 255). Digital data can be measured in kilobytes (1000¹ bytes), megabytes (1000² bytes), gigabytes (1000³ bytes), terabytes (1000⁴ bytes), petabytes (1000⁵ bytes), exabytes (1000⁶ bytes), zettabytes (1000⁷ bytes), and, for now, up to yottabytes (1000⁸ bytes).
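
As a quick sketch of these decimal (SI) units, the short Python function below (the name format_bytes is illustrative) expresses an arbitrary byte count in the largest applicable unit:

```python
# Decimal (SI) units of digital data: each step up is a factor of 1000 bytes.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def format_bytes(n_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    for i, unit in enumerate(UNITS):
        if n_bytes < 1000 ** (i + 1) or unit == UNITS[-1]:
            return f"{n_bytes / 1000 ** i:,.2f} {unit}"

print(format_bytes(5e21))    # five zettabytes -> "5.00 ZB"
print(format_bytes(2.5e13))  # "25.00 TB"
```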

Clearly, this assortment of data can add up very quickly; the International Data Corporation estimates that, on a worldwide basis, the total amount of digital data created and replicated each year will grow exponentially from 1 zettabyte in 2010 to 35 zettabytes in 2020! Thus it is projected that about 5 zettabytes of digital data are being generated in 2014.
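
The arithmetic behind these figures can be checked directly: growing from 1 zettabyte in 2010 to 35 zettabytes in 2020 implies a compound annual growth rate of 35^(1/10) − 1, or about 43 percent, which after four years (in 2014) yields roughly 4 zettabytes, in line with the “about 5 zettabytes” projection. A minimal sketch:

```python
# Implied compound annual growth rate: 1 ZB (2010) -> 35 ZB (2020).
start_zb, end_zb, years = 1.0, 35.0, 10
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"Implied annual growth: {cagr:.1%}")        # ~42.7% per year

# Interpolated volume for 2014, four years after 2010.
zb_2014 = start_zb * (1 + cagr) ** 4
print(f"Projected 2014 volume: {zb_2014:.1f} ZB")  # ~4.1 ZB
```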

According to the current version of Wikipedia (accessed on October 15, 2014), the term Big Data “usually includes datasets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.” Obviously, the definition of what constitutes Big Data is shifting as software tools become more powerful; today, depending on the nature and mix of the data, a dataset is considered big if its contents range from a few terabytes to many petabytes.

Who coined the term Big Data? Mayer-Schönberger and Cukier (2013) credit computer scientist Oren Etzioni with having the foresight to see the world as a series of Big Data problems before the concept was even introduced. Etzioni received his bachelor’s degree in computer science from Harvard University in 1986 and his PhD from Carnegie Mellon University in 1991. In 1994 he helped build one of the Web’s first search engines, MetaCrawler; he later cofounded Netbot, the first major Web shopping site, as well as ClearForest (for abstracting meaning from text documents) and Farecast (a predictive model for airline and other fares).

Mayer-Schönberger and Cukier (2013) provide not only a historical perspective on the birth and evolution of Big Data but also a persuasive argument about its importance, with insights about “what” is in the data (not necessarily about the “whys” behind the insights). In short, they consider Big Data a revolution that can transform how people live, work, and think.

Evolution of the Impacts and Uses of Big Data

At the beginning of the 21st century the growing volumes of data presented a seemingly insoluble problem; storage and central processing unit (CPU) technologies were overwhelmed by the terabytes of data being generated. Fortunately, Moore’s law came to the rescue and helped to make storage and CPUs larger, faster, smarter, and cheaper.

Today, Big Data are no longer a technical problem: they are a competitive advantage. As indicated earlier, enterprises are developing and using Big Data tools to explore their data troves, to discover insights that could help them develop better relationships with their customers, to identify new areas for business opportunities, and to better manage their supply chains, all in an increasingly competitive business environment. Big Data critically affect decisions, risk, informatics, services, goods, and customization or personalization. In short, Big Data can be used to improve services, products, and processes, especially by supporting timely decisions.

The next section details the four major components of Big Data—acquisition, access, analytics, and application—and the final section offers critical observations about benefits and concerns associated with Big Data.

Components

It is helpful to begin by comparing the Big Data approach with the traditional data processing approach (Table 1), as there are major differences between the two in the four steps, or components, of data processing:

  • acquisition (including data capture);
  • access (including data indexing, storage, sharing, and archiving);
  • analytics (including data analysis and manipulation); and
  • application (including data publication).

Table 1

In particular, in contrast to the traditional, mostly statistical, approach, Big Data seek to unleash information in a manner that can support informed decisions by compensating for data quality issues with data quantity, data access restrictions with on-demand cloud computing, causative analysis with correlative data analytics, and model-driven with evidence-driven applications. This somewhat expedient (but not necessarily valid) approach can result in further problems or concerns, which are discussed in the concluding section.

On the other hand, the feasibility (or “good enough”) focus of Big Data is usually more realistic than the optimality focus of traditional operations research methods. In fact, the steady-state assumption that underpins optimality is, for the most part, unrealistic, especially in real-time environments where values are changing and agent-negotiated solutions are indeed messy and at best only feasible.

No matter what the purpose of processing or analyzing data, it is critical that the data contain the insights or answers being sought; otherwise, the exercise involves no more than “garbage in, garbage out.” Metadata can therefore play an important role in ascertaining the scope, validity, and viability of the data.

The following sections present the current particulars of each component (e.g., associated terms, providers, recent developments, and projections) as well as remarks about aspects of the astounding growth in each area that warrant consideration, concern, and/or further research.

Acquisition

Advances in digital sensors, communications, computation, and storage have yielded zettabytes of data—from customer order transactions, emails, radio frequency identification (RFID) sensors, smartphones, films, video recordings, audio recordings (including ecosystem recordings, e.g., of crickets and rainstorms), genetic sequences, and the Internet of Things (both the Internet of Things and its data are proliferating because of Internet Protocol version 6, deployed worldwide in 2012, which allows for trillions of devices to be connected to the Web).

Methods of data capture include such stealth approaches as keystroke loggers and clickstreams (both of which can provide real-time insights into consumer behavior); smart sensors (which are becoming ubiquitous in devices, products, buildings, and even cities); health monitors (for both humans and animals in the monitoring of body temperature, blood pressure, etc.); and drones (including wing-flapping ornithopters and stamp-sized Memoto cameras). 
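
As a concrete illustration of one capture method, the sketch below packages a single clickstream observation as a JSON record; the schema and field names are hypothetical, not those of any particular analytics product.

```python
import json
import time

def capture_click_event(user_id: str, page: str, element: str) -> str:
    """Package one clickstream observation as a JSON record (hypothetical schema)."""
    event = {
        "user_id": user_id,        # pseudonymous visitor identifier
        "page": page,              # page on which the click occurred
        "element": element,        # what was clicked
        "timestamp": time.time(),  # Unix time of the event
    }
    return json.dumps(event)

# One captured event, ready to be streamed to downstream analytics.
print(capture_click_event("u-1029", "/products/cameras", "add-to-cart"))
```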

Remarks about the astounding growth in Big Data acquisition:

  • In order to become truly smart in, say, a smart city sense, all the sensors must be connected or electronically fused onto a common platform that can streamline both the data gathering and the resultant analyses.
  • Inasmuch as all sensors must communicate—and they do, mostly in a wireless manner—the question remains: What is the potential health effect from long-term exposure to radio frequency (RF) energy emitted by these sensors? At present, there is no long-term health study that can provide a definitive answer.
  • The speed of data acquisition is accelerating; for example, when the human genome was first decoded in 2003, it had required almost a decade to sequence one person’s 3.2 billion base pairs; today, a single facility can sequence an individual’s complete genome in a day!
  • Both the acquisition and use of personal data raise a number of privacy issues, from misuse to abuse; yet, these same data can save lives or at least help to make lives better, if not safer. Clearly, there are tradeoffs to be considered.

Access

In regard to data service, platform as a service (PaaS) consists of a computing platform and a solution stack as a service; together with software as a service (SaaS) and infrastructure as a service (IaaS), it is now typically associated with cloud computing.

In this service schema, the consumer creates the software using tools and/or libraries from the provider and is also able to control software deployment and configuration settings. Powerhouses such as Google, VMware, Amazon, Microsoft, HP, and Oracle provide the networks, servers, storage, and related services. Thus, as an example, Netflix uses Amazon to stream its on-demand videos and movies to more than 35 million subscribers in North and South America, the Caribbean, and several countries in Europe; indeed, on a typical weeknight Netflix movies account for about one-third of all downstream Internet traffic in North America.

Cloud computing—provided by companies such as Microsoft, Google, OpenStack, Amazon, and Rackspace—is growing in size as technical and security issues are being resolved and enterprises become dependent on the cloud for their growth, system efficiencies, and new product processes. It is projected that by 2015 more than 2.5 billion users and 15 billion devices will be accessing cloud services.

Remarks about the astounding growth in Big Data access:

  • Big media firms are worried that broadband access may cause greater video piracy, as was the case in South Korea where the home entertainment industry was decimated by digital piracy that was supposedly enabled by the widely available high-speed Internet. Obviously, piracy must be prevented, most likely by a technological solution that is yet to be developed.
  • There remains a policy question regarding cybersecurity and whether the US government is responsible for protecting commerce (especially financial businesses) from cyberattacks, just as the US military is responsible for defending the homeland from an invasion.
  • As with Big Data acquisition, Big Data access is subject to the same privacy and confidentiality concerns.

Analytics

Analysis of Big Data necessarily involves computers, resulting in data analytics: the application of computer technology, operations research, and statistics to solve problems in business and industry. Analytics is carried out in a computer-based information system; of course, mathematics underpins the methods and algorithms used in analytics, and the science of analytics is concerned with extracting useful insights or properties of the data using computable functions. In addition to the decision-driven focus that permeates business and engineering, Big Data analytics has been used to gain scientific insights concerning, say, the laws of nature, genomics, and human behavior.

With the advent of Big Data analytics, a number of niche analytics have been developed—in retail sales, financial services, risk and credit, marketing, buying behavior, loan collections, fraud, pricing, telecommunications, supply chain, demand chain, transportation, and visualization. Early efforts at risk management were focused on risk avoidance; however, lost opportunities should also be factored into such efforts.

To process Big Data within tolerable elapsed times, Manyika and colleagues (2011) suggest a variety of technologies: association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis, and visualization.
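
To make one of these techniques concrete, here is a minimal, self-contained sketch of cluster analysis (a basic k-means loop written in plain Python); a production system would instead use an optimized, distributed library.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: assign points to the nearest center, then re-center."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers

data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
print(kmeans(data, k=2))  # two centers, near (1.2, 1.5) and (8.5, 8.3)
```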

Additional technologies being applied to Big Data include massively parallel processing databases, search-based applications, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
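
A toy illustration of the divide-and-conquer pattern behind several of these technologies is a MapReduce-style word count, sketched below in plain Python and run serially rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(pairs):
    """Shuffle/reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# On a real cluster, each document's map phase would run on a separate node.
documents = ["big data are big", "data about data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'are': 1, 'about': 1}
```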

Remarks about the astounding growth in Big Data analytics:

  • There is a concern that modeling or design support software (e.g., Ansys and SolidWorks) may undermine the need for young engineers to engage in hands-on activities; on the other hand, such support may provide more time for aspiring engineers to become more involved in understanding the physical complexities of their software-supported designs.
  • A similar concern is that powerful machines such as IBM’s Watson might displace human workers, but machines that can satisfy the Turing (1950) test of artificial intelligence are yet to be built; in the meantime, existing machines are, for the most part, only doing the drudge work that humans dislike, including searching, mining, and matching. In fact, the new Watson Engagement Advisor will give customer service transactions a layer of cognitive computing help, leveraging Watson’s unique skills to semantically answer questions.
  • It is appropriate to assess the usefulness or impact of Big Data analytics. Table 2 presents the author’s ranking of the potential impact of Big Data in regard to the 14 Grand Challenges (promulgated by the US National Academy of Engineering and conveniently grouped into three categories: health care and technobiology, informatics and risk, and sustainable and smart systems; www.engineeringchallenges.org)—resulting in an overall impact valuation of 1.9 (medium on a 3-point scale).
  • As with acquisition and access, Big Data analytics is subject to the same privacy and confidentiality concerns.

Table 2

Application

It is, of course, difficult to separate Big Data analytics from its applications; nevertheless, it is helpful to identify applications in terms of enterprises that use Big Data techniques in their day-to-day activities, as shown in Table 3.

Table 3

Among established companies, IBM (with its artificially intelligent Watson computer system) is credited with coining the term smarter planet; Google (with its wearable Glass and head-mounted display) is supporting augmented reality and cognition, yielding a range of data-driven technologies; and Microsoft (with its HDInsight) is empowering organizations with new insights on previously unstructured data.

Among growing companies, Cloudera uses Apache Hadoop to extract the most value from data, presumably at the least cost; Splunk provides a platform for real-time operational intelligence; and MongoDB Inc. develops MongoDB, an open source NoSQL database system.

Among academic institutions, and within the past few years, the Simons Foundation selected the University of California, Berkeley, to host an ambitious $60 million Simons Institute for the Theory of Computing, where an interdisciplinary team of scientists and engineers is tackling problems in fields as diverse as health care, astrophysics, genetics, and economics; Boston University received $15 million to establish the Rafik B. Hariri Institute for Computing and Computational Science and Engineering; and the University of Rochester is investing $100 million in its Institute for Data Science to support data-informed, personalized medicine, national security, and online commerce.

Remarks about the astounding growth in Big Data application:

  • There is a concern that some smart innovations (e.g., smart power grids) are compromising privacy and raising costs.
  • A similar concern is that smart innovations such as robots and driverless cars will be difficult to accept. It should be noted, however, that robots, airplanes, and bullet trains are all subject to autonomous control—accidents may still happen but their occurrence is much less frequent.
  • McAfee and Brynjolfsson (2012) caution that an enterprise’s decision-making culture must change before the benefits of Big Data can revolutionize company management and performance. Thus as more local, state, and federal agencies make their data public or digitally available, more new Big Data–related businesses will flourish (e.g., car navigation, precision farming, property valuation, or matching suppliers and consumers). Recent research by Pentland (2014), for example, has found that social or group incentives are much more effective than individual incentives to motivate disruptive changes in behavior.

Conclusion

It is helpful to briefly consider the benefits and concerns associated with Big Data. With respect to benefits, Big Data allow for

  • better integration or fusion and subsequent analysis of both quantitative and qualitative data;
  • better observation of rare but high-impact events or “black swans” (Taleb 2010);
  • greater system and system-of-systems efficiency and effectiveness;
  • better evidence-based—“data rich, information unleashed” (DRIU)—decisions that can overcome the prejudices of the unconscious mind; and
  • messier findings that are nonetheless good enough to support informed decisions.

Table 4

Table 4 presents a summary of potential concerns about the focus, emphasis, and scope of the four data processing components: acquisition, access, analytics, and application. Other concerns include surveillance by autocratic governments and the processing of data in an increasingly unfocused, unproductive, and generally “shallow” manner (Carr 2010). Even Google’s vaunted flu prediction algorithm, which in 2009 was able to predict and locate the spread of H1N1 flu on a near real-time basis, failed in 2012, when it predicted more than double the proportion of doctor visits for influenza-like illness that the Centers for Disease Control and Prevention eventually reported (based on surveys of clinics located throughout the United States). Lazer and colleagues (2014) blame this failure on Big Data hubris and algorithm dynamics.

Of course, potential Big Data concerns or problems can be mitigated with thoughtful and effective approaches and practices; for example, legislation could be passed to forbid the invasion of privacy and to impose severe sanctions on those who break the law or knowingly publish false findings. Alternatively, a watchdog organization could be created to uncover such false findings, much like the recently established METRICS (Meta-Research Innovation Center at Stanford), whose mission is “identifying and minimizing persistent threats to medical research quality.”

Finally, as suggested throughout this article, Big Data have to be regarded as a permanent disruptive innovation or transformation. That is, data must be constantly acquired, accessed, analyzed, and applied, resulting in new—and changing—insights that might be disruptive in nature. To profit from Big Data, one must accept uncertainty and change as a permanent state of affairs, as part of any enterprise’s DNA. Indeed, some companies invite such changes by adopting processes that enable variation, not eliminate it, and by valuing disruptions over the relentless pursuit of a single vision (e.g., efficiency). As an example, Google encourages some of its workers to spend 20 percent of their time on projects of their own choosing and provides additional resources to those with the most merit.

In short, change is the only constant; companies that do not embrace it will face the same demise as Kodak, Digital Equipment Corporation, and Atari. On the other hand, those—such as GE, IBM, and Intel—that allow for disruptive innovations have not only survived but thrived.

References

Carr N. 2010. The Shallows: What the Internet Is Doing to Our Brains. New York: Norton.

Lazer D, Kennedy R, King G, Vespignani A. 2014. The parable of Google flu: Traps in Big Data analysis. Science 343:1203–1206.

Manyika J, Chui M, Bughin J, Brown B, Dobbs R, Roxburgh C, Byers AH. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York: McKinsey Global Institute.

Mayer-Schönberger V, Cukier K. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt.

McAfee A, Brynjolfsson E. 2012. Big Data: The management revolution. Harvard Business Review, October 3–9.

Pentland A. 2014. Social Physics: How Good Ideas Spread—The Lessons from a New Science. New York: Penguin Press.

Taleb NN. 2010. The Black Swan, 2nd ed. New York: Random House.

Tien JM. 2003. Toward a decision informatics paradigm: A real-time information-based approach to decision making. IEEE Transactions on Systems, Man and Cybernetics, Part C 33(1):102–113.

Tien JM. 2012. The next industrial revolution: Integrated services and goods. Journal of Systems Science and Systems Engineering 21(3):257–296.

Tien JM, Berg D. 2003. A case for service systems engineering. International Journal of Systems Engineering 12(1):13–39.

Tien JM, Krishnamurthy A, Yasar A. 2004. Toward real-time management of supply and demand chains. Journal of Systems Science and Systems Engineering 13(3):257–278.

Turing AM. 1950. Computing machinery and intelligence. Mind 59:433–460.


FOOTNOTES

This article draws liberally from earlier papers by the author (Tien 2003, 2012; Tien and Berg 2003; Tien et al. 2004).

About the Author: James M. Tien (NAE) is a distinguished professor and dean, College of Engineering, University of Miami.