Author: Yong Shi
The concept of Big Data comprises applications, engineering, and scientific aspects, but there is not yet a unified definition of Big Data; it varies among academic and business communities. In some academic communities the term refers to information technology applications for dealing with massive data problems in business, and the scientific components or research aspects of Big Data are called data science. In some professional communities, the terms business intelligence and business analytics are used to mean Big Data analytics or Big Data mining (Chen et al. 2012). The National Science Foundation describes Big Data as “large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future” (NSF 2012).
In May 2013 a group of international scholars brainstormed two definitions of Big Data in a session (that I cochaired) on Data Science and Big Data at the Xiangshan Science Conference (XSSC 2013) in Beijing. The first definition, for academic and business communities, is “a collection of data with complexity, diversity, heterogeneity, and high potential value that are difficult to process and analyze in reasonable time,” and the second, for policymakers, is “a new type of strategic resource in the digital era and the key factor to drive innovation, which is changing the way of humans’ current production and living.” In addition, the “4Vs” (volume, velocity, variety, and veracity) are used to capture the main characteristics of Big Data; the first three were introduced by Laney (2001).
In this paper I sketch the early beginnings of efforts to analyze quantities of information and then review current areas of professional and academic activity in Big Data, including measures by international governments. There remain three particular challenges associated with Big Data; attention to these problems will help to ensure progress toward the full use of Big Data for all its social and economic benefits.
Historic Events Related to Big Data
The history of data analysis can be traced back 250 years, to the early use of statistics to solve real-life problems. In the area of statistics, Bayes’ theorem has played a key role in the development of probability theory and statistical applications. However, it was Richard Price (1723–1791), the famous statistician, who edited the theorem after Thomas Bayes’ death in 1761 (Bayes and Price 1763). Price was also one of the scientists who initiated the use of statistics in analyzing social and economic datasets.
In 1783 Price published the “Northampton Table,” calculations of the probability of the duration of human life in England based on his observations as an actuary. The observations were shown in tables with rows for records and columns for attributes as the basis of statistical analysis. Such tables are now commonly used in data mining as multidimensional tables. Therefore, from a historical point of view, the multidimensional table should be called the “Richard Price Table” and Price should be honored as a father of data analysis and data mining.
Since the 1950s, as computing technology has gradually been used in commercial applications, many corporations have developed databases to store and analyze collected data. Mathematical tools for handling datasets have evolved from statistics to methods of artificial intelligence, including neural networks and decision trees. In the 1990s the database community started using the term data mining, which is interchangeable with the term knowledge discovery in databases (Fayyad et al. 1996). Data mining, which intersects human intervention, machine learning, mathematical modeling, and databases, is now the common approach to data analysis. The investigation of theoretical components of Big Data, or data science, calls for interdisciplinary efforts from mathematics, sociology, economics, computational science, and management science.
The key value of Big Data analytics or data mining is to obtain intelligent knowledge.
Current State of Big Data
It is not easy to describe how deeply and quickly Big Data are influencing the world. However, four recent developments should be mentioned: Big Data associations, conferences, journals, and access to government sources. In addition, Big Data account for a growing share of countries’ economies.
Associations, Conferences, and Journals
Academic and professional communities around the world have established a number of Big Data–related nonprofit organizations to exchange and disseminate theoretical findings, practical experience, and case studies about Big Data as well as data science. These include China’s Research Centers on Fictitious Economy and Data Science and for Dataology and Data Science, the Data Science Consortium in Japan, the International Council for Science: Committee on Data for Science and Technology (CODATA; based in France), and the UK Data Science Institute. In the United States alone, more than a dozen groups and institutes are located across the country.
In 2013–2014 numerous Big Data conferences were held around the world, organized by professional societies and universities to address Big Data writ large and specific aspects such as technology, algorithms, nonparametric statistics, and cloud computing. These conferences have attracted thousands of scholars, engineers, and practitioners for their common interests in Big Data problems.
There are two categories of Big Data–related academic journals: one addresses Big Data and the other data science. They publish articles on research, business, intelligence, and society. Most of the journals are new and feature cutting-edge research findings and technological advances in Big Data areas.
Role of Government
Governments play a key role in promoting Big Data applications. In the United States, President Barack Obama has proposed opening governmental data sources to increase citizen participation, collaboration, and transparency in government; the website Data.gov is part of this effort. At the June 2013 G-8 Summit the countries agreed on an “open government plan” that encourages governments to open their data to the public according to five principles: “open data by default, quality and quantity, usable by all, releasing data for improved governance, and releasing data for innovation” (Cabinet Office 2013). Other countries have also set up data.gov-style websites; for example, on December 20, 2013, the Japanese government launched data.go.jp. And on February 28, 2014, China announced that President Xi Jinping would head China’s central Internet security and information group, to demonstrate the country’s resolve to build itself into a strong cyberpower. The open government project of China is part of a broader agenda.
In addition to the G-8 countries, many more are adopting open government initiatives, as shown in Table 1. Big Data made available to the public by government agencies span a very wide range of categories that include agriculture, infrastructure, climate and weather, energy, jobs and employment, public safety and security, science and technology, education, and transportation.
Big Data in the International Market
According to an IDC report, as of 2012 the United States had 32 percent of the international Big Data market, western Europe had 19 percent, China 13 percent, India 4 percent, and the rest of the world 32 percent (Gantz and Reinsel 2012). By 2020 the emerging markets will account for 62 percent of the digital universe and China alone will generate 21 percent of the Big Data in the world. This prediction seems plausible given China’s population of 1.3 billion, with 564 million Internet users and 420 million cellular phone users. More people generate more data.
Big Data Challenges
There are many challenges for Big Data analytics (Tien 2014). Three problems in particular must be solved urgently to realize the benefits of Big Data in science, engineering, and business applications.
Transformation of Semi- and Unstructured Data to Structured Data
In the academic field of Big Data, the principles, basic rules, and properties of data, especially semi- and unstructured data, are yet to be elucidated because of the complexity of such data. This complexity reflects not only the variety of the objects that the data represent but also the fact that each dataset can present only a partial image for a given object: although a dataset may accurately represent an aspect of the object, it cannot convey the whole picture. Thus the relationship between data representation and a real object is like that of the blind men and the elephant: the resulting perceived image will depend greatly on the particular aspect viewed.
Thanks to recent advances, technologies such as Hadoop and MapReduce make it possible to collect a large amount of semistructured and unstructured data in a reasonable amount of time. The key engineering challenge is how to effectively analyze these data and extract knowledge from them within a specific amount of time. The likely first step is to transform the semi- and/or unstructured data to structured data, and then apply data mining algorithms developed for the structured data.
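The transformation step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a method from the article: the input is assumed to be semistructured JSON records (one per line) with inconsistent and nested fields, and the names `RAW_LINES`, `flatten`, and `to_table` are hypothetical. The output is a fixed set of columns with one row per record, the structured form on which standard data mining algorithms operate.

```python
import json

# Hypothetical semistructured input: one JSON object per line, with
# inconsistent fields and nesting (illustrative data, not from the article).
RAW_LINES = [
    '{"user": "a01", "event": "click", "meta": {"page": "home", "ms": 120}}',
    '{"user": "a02", "event": "purchase", "amount": 19.5}',
    '{"user": "a01", "event": "click", "meta": {"page": "cart"}}',
]


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def to_table(lines):
    """Return (columns, rows): a fixed schema plus one row per record."""
    flat_records = [flatten(json.loads(line)) for line in lines]
    # The schema is the union of all keys seen; missing values become None.
    columns = sorted({key for rec in flat_records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in flat_records]
    return columns, rows


columns, rows = to_table(RAW_LINES)
```

Taking the union of all observed keys as the schema is one simple design choice; it keeps every attribute but produces sparse rows, which a real pipeline would have to handle (for example by imputation or by dropping rare columns).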
Once the data are structured, known data mining algorithms can produce rough knowledge. This stage of the process can be regarded as first-order mining. The structured rough knowledge may reflect new properties that decision makers can use if it is then upgraded to intelligent knowledge. This upgrade requires analysts to draw on human knowledge such as experience, common sense, and subject matter expertise. This stage is called second-order mining (Zhang et al. 2009). Because the knowledge changes with the individual and situation, the human-machine interface (Big Data mining vs. human knowledge) plays a key role in Big Data analytics.
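The two stages can be illustrated with a toy sketch. It is not the formulation of Zhang et al. (2009); the records, the expert scores, the blending weight `data_weight`, and the function names are all assumptions made for illustration. First-order mining here is a simple frequency count over structured records, and second-order mining is a weighted blend of that rough knowledge with scores supplied by a human expert.

```python
from collections import Counter

# Hypothetical structured records: (customer segment, bought_product) pairs.
RECORDS = [
    ("student", True), ("student", False), ("student", True),
    ("retiree", False), ("retiree", False), ("retiree", True),
    ("professional", True), ("professional", True),
]


def first_order_mining(records):
    """Rough knowledge: purchase rate per segment, computed from data alone."""
    totals, hits = Counter(), Counter()
    for segment, bought in records:
        totals[segment] += 1
        hits[segment] += int(bought)
    return {seg: hits[seg] / totals[seg] for seg in totals}


def second_order_mining(rough, expert_scores, data_weight=0.5):
    """Blend each data-driven rate with a human expert's score in [0, 1].

    Segments the expert did not score keep their data-driven rate.
    Returns (segment, blended score) pairs, highest score first.
    """
    blended = {
        seg: data_weight * rate + (1 - data_weight) * expert_scores.get(seg, rate)
        for seg, rate in rough.items()
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


rough = first_order_mining(RECORDS)
# Assumed expert judgment, e.g. "retirees matter more than the data suggest".
expert = {"student": 0.3, "retiree": 0.9, "professional": 0.8}
ranked = second_order_mining(rough, expert)
```

In this example the data alone rank students above retirees, while the blended ranking reverses that order, which shows how the same rough knowledge can lead to different intelligent knowledge depending on the individual supplying the judgment.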
Complexity, Uncertainty, and Systematic Modeling
As mentioned above, any data representation of a given object is only a partial picture. The complexity of Big Data is caused by the quantity and variety of the data, and the uncertainty comes from changes in the nature and variety of data representations.
When a certain analytical method is applied to Big Data, the resulting knowledge is specific to that particular angle or aspect of the real object. Once the angle is changed, by either the means of collection or the analytical method, the knowledge is no longer as useful. For example, in petroleum exploration engineering, which involves Big Data, data mining has been applied to a spatial database generated from seismic tests and well log data. The underground geological structure itself is complicated, and the nonlinear patterns in the data change with the dimensions and angles from which they are viewed. Thus any results of data mining or analysis yield knowledge only for the given surface. If the surface changes, the result also changes. The challenge is to determine how to derive meaningful knowledge from different surfaces of spatial data (Ouyang and Shi 2011).
To address this challenge, systematic modeling of the complexity and uncertainty of Big Data is needed. It may be difficult to establish a comprehensive mathematical system that is broadly applicable to Big Data, but by understanding the particular complexity or uncertainty of given subjects or areas it may be possible to create domain-based systematic modeling for specific Big Data representation. A series of such modeling structures could simulate Big Data analytics for different subjects or areas.
If engineers can determine some general approaches to deal with the complexity and uncertainty of Big Data in a certain field—say, the financial market (with data stream and media news) or Internet shopping (images and media evaluations)—this will be of great benefit to societal and economic development. Many known techniques in engineering (e.g., optimization, utility theory, expectation analysis) can be used to measure how the rough knowledge gained from Big Data is efficiently combined with human judgment in the second-order mining process of eliciting the intelligent knowledge needed for decision support.
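As one possible reading of how utility theory and expectation analysis could combine rough knowledge with human judgment, the sketch below picks the action with the highest expected utility. Everything in it is an assumption for illustration: the outcome probabilities stand in for rough knowledge from first-order mining, the utilities stand in for values elicited from a human decision maker, and the credit scenario and all names are hypothetical.

```python
def expected_utility(probabilities, utilities):
    """Sum of P(outcome) * U(outcome) over all outcomes of one action."""
    return sum(p * utilities[outcome] for outcome, p in probabilities.items())


def best_action(actions, utilities):
    """Return the action with the highest expected utility, plus all scores."""
    scores = {
        name: expected_utility(dist, utilities) for name, dist in actions.items()
    }
    return max(scores, key=scores.get), scores


# Outcome probabilities per action: assumed to come from first-order mining.
ACTIONS = {
    "approve": {"repaid": 0.9, "default": 0.1},
    "decline": {"no_loan": 1.0},
}
# Utilities per outcome: assumed to be elicited from a human decision maker.
UTILITIES = {"repaid": 100.0, "default": -500.0, "no_loan": 0.0}

choice, scores = best_action(ACTIONS, UTILITIES)
```

With these assumed numbers, approving has the higher expected utility; a more risk-averse decision maker who assigned a much larger loss to a default would tip the same rough knowledge toward declining.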
Data Heterogeneity, Knowledge Heterogeneity, and Decision Heterogeneity
Big Data present decision makers with problems of data heterogeneity, knowledge heterogeneity, and decision heterogeneity. Decision making has traditionally depended on knowledge learned from others and from experience. Knowledge acquisition is now increasingly based on data analysis and data mining.
Like the data, decision making can be classified as structured, semistructured, or unstructured depending on the allocation of responsibilities in an organization (Laudon and Laudon 2012). The needs of decision makers for (quantitative) data or information and (qualitative) knowledge differ according to their level of responsibility. Operational staff handling routine work make structured decisions. Managers’ decisions are based on a combination of subordinates’ reports (most of them structured) and their own judgment and are thus semistructured. Top-level managers or chief executive officers (CEOs) make final decisions that are unstructured.
Big Data are disruptively changing the decision-making process. Using Big Data analytics, the functions of operational staff, managers, and CEOs can be combined for streamlined decision making. For instance, a salesperson may use a real-time credit card approval system based on Big Data mining technology to quickly approve a credit limit for a customer without reporting to a supervisor. Such a decision has almost zero risk. The sales associate is the final decision maker, representing both manager and CEO.
In a data mining process using structured data, the rough knowledge normally is structured knowledge, given its numerical formats. In Big Data mining, although rough knowledge in the first-order mining is derived from heterogeneous data, it can be viewed as structured knowledge since the data mining is carried out in a structured data–like format. At the second-order mining stage, the structured knowledge is combined with the semistructured or unstructured domain knowledge of the manager or CEO and gradually upgraded to intelligent knowledge. Intelligent knowledge thus becomes a representation of unstructured knowledge.
If business operations involve only semistructured and/or unstructured data, the result is either unstructured knowledge without data analysis or structured knowledge from data mining. Such structured or unstructured knowledge can affect semistructured or unstructured decisions depending on the levels of management involved.
Based on rough knowledge from first-order mining, searching for intelligent knowledge through second-order mining is key to understanding the relationship between data heterogeneity, knowledge heterogeneity, and decision heterogeneity. Efforts to learn how decision making can be changed by Big Data require an understanding of the relationships among the processing of heterogeneous data, Big Data mining, the domain knowledge of decision makers, and their involvement in decision making.
Theoretical contributions and engineering breakthroughs on the above three challenges can enhance the application of Big Data. Such work will require interdisciplinary efforts from mathematics, sociology, economics, computational science, and management science. With such progress the use of Big Data will spread widely from the field of information technology to multimedia, finance, insurance, education, and a host of other areas for the formulation of new business models, boosting investment, driving consumption, improving production, and increasing productivity.
Data scientists and engineers can support such efforts by identifying and addressing the challenges and opportunities of Big Data. To that end, they need to provide more theoretical findings and creative or innovative techniques to support Big Data development into the future.
Looking around the world, Big Data development is just beginning. Big Data are a treasure created by the people and should be used to benefit the people. All governments should develop strategic planning for Big Data, allow public use of Big Data to improve productivity, and establish laws or regulations to push enterprises to share their Big Data for better business applications.
The author thanks Managing Editor Cameron H. Fletcher for her excellent editing of the original version of this manuscript. This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 70921061 and 71331005).
Bayes T, Price R. 1763. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53:370–418.
Cabinet Office. 2013. Policy Paper: G8 Open Data Charter and Technical Annex. London. Available at https://www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex.
Chen H, Chiang RHL, Storey VC. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly 36(4):1165–1188.
Fayyad UM, Piatetsky-Shapiro G, Smyth P. 1996. From data mining to knowledge discovery: An overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, eds. Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press/MIT Press. pp. 1–34.
Gantz J, Reinsel D. 2012. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Framingham, MA: International Data Corporation (IDC). Available at www.emc.com/leadership/digital-universe/index.htm.
Laney D. 2001. 3D Data Management: Controlling Data Volume, Velocity, and Variety. Stamford, CT: META Group.
Laudon KC, Laudon JP. 2012. Management Information Systems. Upper Saddle River, NJ: Pearson.
NSF [National Science Foundation]. 2012. Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA). Washington. Available at www.nsf.gov/pubs/2012/nsf12499/nsf12499.htm.
Ouyang ZB, Shi Y. 2011. A fuzzy clustering algorithm for petroleum data. Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. 03:233–236. Lyon, August 22–27.
Tien J. 2014. Overview of big data: A US perspective. Bridge 44(4):12–19.
XSSC. 2013. Report on the 462nd Session: Data Science and Big Data. Xiangshan Science Conference, May 29–31, Chinese Academy of Sciences, Beijing.
Zhang L, Li J, Shi Y, Liu X. 2009. Foundations of intelligent knowledge management. Journal of Human Systems Management 28(4):145–161.