Perspectives to Definition of Big Data: A Mapping Study and Discussion

Big data is an emerging research area where common terminology is still evolving. Different perspectives to the research area and terminology exist, but a common definition for big data does not exist. We have performed a systematic mapping study in order to identify different big data definitions and their perspectives. As a result, we present a state-of-the-art review of the current status in big data definitions, discuss the shortcomings of the current definitions, and propose possible solutions for the shortcomings. The paper contributes to the emerging big data research by analyzing current definitions of big data from different perspectives, suggesting enhancement to the terminology as well as pointing out new research avenues. In addition, the article helps new researchers and practitioners to understand what big data is, and bridges the knowledge between theory and practice.


Introduction
Digitization is a current megatrend, meaning that digital technologies are integrated into our everyday life. The use of digital technologies enables the connection of different services and automation of many processes. Although digitization itself is an important technological (r)evolution, it enables even more fundamental change: datafication. An increasing number of devices and sensors are constantly connected to the Internet. Cameras, mobile phones, tablets, various applications and services running on them produce wide varieties of digital data. This data generation phenomenon is called datafication (Mayer-Schönberger and Cukier, 2013). Lycett (2013) defines datafication as a "sense-making process", which emphasizes the value generation aspect. Digitization and datafication make it possible to capture different situations, actions, or even series of events in the form of data. A vague term "big data" describes the data resulting from datafication. This phenomenon has widespread effects.
As an example, let us consider quadcopters. Amazon and DHL, among others, are prototyping these small flying devices for delivering goods to customers. Quadcopters [...] society, organizations, and individuals - to shed light on the definition of big data. As the method, we use a systematic mapping study. According to Kitchenham (2007), mapping studies are designed to give a broad overview of a research area. Mapping studies typically have broad research questions. Our research questions are:
• What kind of definitions of big data exist in research papers and among practitioners?
• How has the definition of big data evolved?
• How do the definitions reflect the different characteristics and perspectives of big data?

Literature Search
Our initial search covered three major reference databases: Scopus, ProQuest, and Web of Science. We considered this a good starting point, as these databases index a broad range of papers, covering both technical and business fields. Figure 1 gives an overview of the search process. In addition to wide research questions, Kitchenham (2007) suggests that mapping studies should use rather loose search criteria. We searched the databases (title, keywords, abstract) by using ("big data" and "definition") as a search string. All papers written in English and indexed up to 02-Sep-2015 were included in the initial result set. No additional limitations were set. A total of 479 papers were identified. Next, we removed duplicate articles (117). After removing the duplicates, we read the abstracts, and where necessary, the whole text of each of the resulting papers. We categorized the papers by using the following inclusion/exclusion criterion: if the paper contains a definition of big data, include it; otherwise reject it. Due to the loose search criteria, a number of papers defining things other than big data were included in the initial search. Papers that obviously did not meet the eligibility criteria were rejected. If the decision was not clear, we performed a full-text review, and the paper was either included or excluded on the basis of the review. An additional 17 papers were excluded because they were either commercial, high-price reports or they could not be found. As a result of this phase, 27 papers were included in the result set.
In the reference-tracking phase, we searched for additional papers on the basis of citations in the included papers (backward snowballing). Possibly interesting references were checked in the article context, and if still promising, they were tracked from databases or web sources, including Google Scholar and various web pages. If the article met the eligibility criteria, it was included. We identified an additional 35 papers in this phase.
At the end of the search process phase, we had identified 62 papers that contained a definition of big data. The year-wise distribution of these papers is presented in Figure 2. It seems that although the first definition was presented more than 10 years ago, the discussion of the definition of big data started only a few years back. These papers and their definitions were examined further.
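The counts reported for the search process can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative, and the number of papers rejected for lacking a definition is not stated in the text. It is derived here under the assumption that the 17 unavailable papers and the 27 included papers both come from the de-duplicated set.

```python
# Counts reported in the description of the search process.
initial_hits = 479               # papers returned by the database search
duplicates = 117                 # duplicate records removed
unavailable = 17                 # commercial reports or papers that could not be found
included_from_search = 27        # papers containing a definition of big data
included_from_snowballing = 35   # papers added through backward snowballing

# Papers left for abstract and full-text screening after de-duplication.
screened = initial_hits - duplicates

# Derived value (assumption, not reported in the text): papers rejected
# because they did not contain a definition of big data.
rejected_without_definition = screened - unavailable - included_from_search

# Final result set examined in the analysis.
total_included = included_from_search + included_from_snowballing

print(screened, rejected_without_definition, total_included)
```

Under these assumptions, the final result set matches the 62 papers reported at the end of the search.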

Analysis of the Definitions
The first part of the analysis covers the evolution of the definition of big data. Definitions, their existence in time, as well as similarities and differences are presented. This analysis reveals what perspectives (or components) various participants have added to the definition over time. The second part of our analysis identifies gaps between the current definitions and big data value propositions, in order to find out what perspectives are still missing.
The term "big data" is not new. It has been used both in research and non-research papers for quite a long time. Back in 1997 it was used in the context of visualizing large data sets (Cox and Ellsworth, 1997). In 1998 it was used in a hardware-related presentation (Mashey, 1998) and also in the data mining context (Weiss and Indurkhya, 1998), and in 2003 in combination with statistics (Diebold, 2003). In the beginning, "big" referred to size, and all these sources associated big data with increasing volumes of data. However, the year 2001 can be considered a major milestone in the definition of big data. Laney (2001) described three essential dimensions of big data: volume, velocity and variety. Operating with a swarm of autonomous quadcopters requires the management of high-volume, high-velocity (real-time) data that have many types (variety).
During the following decade, trailblazers like Google and Amazon developed practical big data solutions. These solutions have proved to add value to their businesses. In fact, the trailblazers built their business models on big data solutions. An article published in 2008 in the Wired magazine (Anderson, 2008) aroused public interest in the use of big data and its effects in science. The next significant milestone was 2011, when McKinsey Global Institute and IDC published reports (Gantz and Reinsel, 2011; Manyika et al., 2011) that drew wide public attention to the potential value of big data. Since then a number of newspaper articles, scientific big data papers and books have been published. We considered Laney (2001) to be the one to offer the first real definition, although the term big data had been used earlier. In our analysis of the studies, we could not identify references to earlier papers. Naturally, Laney must have been influenced by earlier work, but his paper was the first to introduce the three big data dimensions: volume, variety and velocity. Most of the definitions rely at least partly on the 3V definition by Laney (2001). Figure 3 shows the most common characteristics used in the definitions of big data (see Appendix 1 for details of the definitions). 95% (59 occurrences out of 62) of the papers identified volume as a key characteristic of big data. In addition, the papers considered variety (55 occurrences) and velocity (46) to be typical big data factors. Value (17) and veracity (14) had also caught attention. These five dimensions dominate the current definitions of big data.
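The occurrence counts above can be turned into shares of the 62 included papers. The following Python sketch is illustrative only; the counts are taken from the text, while the rounding to whole percentages is an assumption made to match the reported 95% for volume.

```python
# Occurrence counts reported in the text, out of 62 included papers.
total_papers = 62
occurrences = {
    "volume": 59,
    "variety": 55,
    "velocity": 46,
    "value": 17,
    "veracity": 14,
}

# Share of papers mentioning each characteristic, rounded to whole percent.
shares = {
    name: round(100 * count / total_papers)
    for name, count in occurrences.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Rounded this way, the three original Vs each appear in roughly three quarters or more of the papers, while value and veracity appear in fewer than a third.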
The included 62 papers (see Appendix 1 for details) were arranged by their publishing date, and each paper was inspected against previously published definitions. If the paper contained a new definition or added some new elements to the existing definitions, it was considered to be a new definition. This analysis resulted in 17 different definitions. These 17 definitions have similarities in the sense that many of them aim to widen the 3V definition to cover technical and especially business aspects. This is quite a natural consequence with regard to the big data value proposition. However, wide definitions can be problematic, and some essential aspects of big data are still lacking. We will discuss these aspects below. The rest of the papers (45) contained definitions essentially covered in earlier papers. Appendix 1 presents details of the definitions. The 3V definition was the de facto big data standard until 2011, when both Manyika et al. (2011) and Gantz and Reinsel (2011) published their reports. Manyika et al. (2011) emphasize the potential value of big data, but curiously enough, their definition focuses on data volume including only a hint ("analyze") of the value. Also, when compared to Laney (2001), Manyika et al. (2011) have left out velocity and variety. Gantz and Reinsel (2011) include the three Vs, and add value extraction and new architectures. They have also decided to define big data technologies instead of big data. This approach allows them to balance the definition between data, technology and business components with a reasonable logic.
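The chronological comparison described above can be illustrated with a small sketch. It is a simplification, not the procedure used in the study: the inspection there was a manual reading of each paper, whereas the sketch assumes each definition can be reduced to a set of component labels. The `Paper` class, the `find_new_definitions` function, and the sample papers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    key: str               # hypothetical paper identifier
    published: str         # ISO date string, so plain sorting is chronological
    components: frozenset  # component labels found in the paper's definition


def find_new_definitions(papers):
    """Return (paper key, added components) for papers that add something new."""
    seen = set()
    new_definitions = []
    for paper in sorted(papers, key=lambda p: p.published):
        added = paper.components - seen
        if added:
            new_definitions.append((paper.key, sorted(added)))
        seen |= paper.components
    return new_definitions


# Hypothetical sample data, not taken from Appendix 1.
papers = [
    Paper("C", "2012-03", frozenset({"volume", "variety"})),
    Paper("A", "2001-02", frozenset({"volume", "velocity", "variety"})),
    Paper("D", "2013-01", frozenset({"volume", "veracity"})),
    Paper("B", "2011-05", frozenset({"volume", "velocity", "variety", "value"})),
]

result = find_new_definitions(papers)
print(result)
```

In this toy data, paper C repeats components already covered by paper A and is therefore not counted as a new definition, which mirrors how 45 of the 62 papers were treated.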
The big data hype was at its peak in the years 2012 and 2013. Several aspects of big data were discussed, such as privacy, security, (business) value, and veracity. We identified seven definitions from 2012 that were either completely new, like the one by Microsoft (2012), or added new components to existing definitions (Gartner, 2012; Schroeck et al., 2012; Fan and Bifet, 2013), and three from the year 2013. After that date we identified four more additions. Two of these later definitions (Demchenko, DeLaat, et al., 2014; Baro et al., 2015) note the importance of delivering the results to consumers. This analysis showed that the evolution of the definition started with data and especially data volumes, and then the discussion shifted to infrastructure topics, followed by the (business) value of data. Finally, more fine-grained aspects, like data delivery and collaboration, appeared.

Definitions vs. Big Data Value Chain
An interesting question is how the 17 different big data definitions reflect the significant value proposition of big data. Several frameworks explain how data adds value. One of the first of such models is the Virtual Value Creation (VVC) framework presented by Rayport and Sviokla (1995). This framework describes five steps that are required to create value from data: gather, organize, select, synthesize, and distribute (see Figure 5). The steps gather and organize are data-related, and they cover aspects like data acquisition from sensors, integration with other data, and data storing. The steps select, synthesize and distribute depend on data usage. They are activities like filtering data for analysis, or are represented as artifacts like analytical models, data visualization, and information delivery tools. Value is expected to increase as data items from various sources are combined to form meaningful information chunks in the VVC process.
A quadcopter reads its current location from the GPS sensor and combines it with the destination information (gather, organize). Based on the analysis, it may take a decision to change its direction (select, synthesize). At frequent intervals, the copter sends data (e.g. location) to the command center (distribute). This simple VVC process adds value, as it enables the copter to work autonomously. However, taking a helicopter view by looking at the whole fleet instead of one quadcopter, it becomes clear that much more value is available. The command center system gathers data from each of the copters and other sources, e.g. from delivery orders (gather, organize). An analytical model calculates the routes (select, synthesize) and sends instructions (like pick-up and delivery addresses) to each of the copters (distribute). This automated VVC process creates value from the data by producing optimal routes, maximizing the number of deliveries and increasing efficiency.
Table 1 maps the 17 different definitions to the big data value chain. Together the current big data definitions cover all phases of the value chain. However, most of the definitions cover only parts of the chain. There are two definitions that consider all five phases, those of Demchenko, DeLaat, et al. (2014) and Baro et al. (2015). Note that the table shows which phases of the value chain the new perspective of each definition emphasizes. This is for clarity: many of the definitions also cover other phases, e.g. Demchenko, DeLaat, et al. (2014) have also covered other steps. However, the new perspective of their definition is the delivery aspect, and therefore only the distribute phase is included in Table 1.
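Returning to the quadcopter fleet, the five VVC steps can be sketched as a chain of small functions. This is an illustrative Python sketch, not part of the VVC framework itself: the function bodies, the copter identifiers, the coordinates, and the instruction format are all hypothetical, and the straight-line distance stands in for whatever analytical model a real command center would use.

```python
import math


def gather():
    """Gather: collect raw data (hypothetical copter positions and delivery orders)."""
    positions = {"copter-1": (0.0, 0.0), "copter-2": (3.0, 4.0)}
    orders = {"copter-1": (3.0, 4.0), "copter-2": (3.0, 4.0)}
    return positions, orders


def organize(positions, orders):
    """Organize: integrate both sources into one record per copter."""
    return [
        {"copter": c, "position": positions[c], "destination": orders[c]}
        for c in sorted(positions)
    ]


def select(records):
    """Select: keep only copters that have not yet reached their destination."""
    return [r for r in records if r["position"] != r["destination"]]


def synthesize(records):
    """Synthesize: derive new information, here the remaining straight-line distance."""
    return [
        {**r, "distance": math.dist(r["position"], r["destination"])}
        for r in records
    ]


def distribute(records):
    """Distribute: turn the results into one instruction per copter."""
    return {
        r["copter"]: f"fly {r['distance']:.1f} units to {r['destination']}"
        for r in records
    }


positions, orders = gather()
instructions = distribute(synthesize(select(organize(positions, orders))))
print(instructions)
```

The first two functions only handle data, while the last three depend on how the data is used, which is the same split between data and usage that the VVC steps make.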

Discussion
As can be seen in the definitions and analysis, big data can mean different things, depending on the selected viewpoint. Some perceive big data as a technical challenge, others view it as a vehicle to increase efficiency or profits. In this section we will show that combining data and its intended usage leads to vague definitions, and consider how the disruptive nature of big data should be taken into account.

Separate Data and Its Usage
Our analysis revealed that several definitions have logical incoherencies. Value, for example, must be derived from the data by using analytics; there is no value in plain data as such (Ackoff, 1989). Value is also case-dependent. A certain piece of information may be worthless to one company but highly valued by some other firm or in another situation. For example, quadcopter flight details are much more valuable in case of an accident than in a normal situation. This is emphasized by Mayer-Schönberger and Cukier (2013), who state that the value of big data is in the secondary uses of the data. For veracity, analytics is required to determine whether the data is relevant for the planned usage. Important as value and veracity are in practice, they do not define the characteristics of big data; instead, they reflect the usage of the data. Vague definitions are typically hard to understand, as they raise questions that cannot be answered coherently. This will lead to different interpretations and misunderstandings.
The original 3V definition (Laney, 2001) left the business effects out. This is one of the main reasons why many new definitions have emerged. Both technology vendors and enterprises have an interest in adding a value proposition. Companies see big data as a vehicle to gain value, and vendors naturally like to justify the costs of their offerings with potential benefits. A natural tendency would be to add a value component to the definition. However, as discussed above, value is not a characteristic of data. Definitions should be clear and unambiguous. Therefore, adding data usage to the definition is not a good idea, as the definition would become ambiguous, and coherency would be lost. Our suggested solution to the problem is that the data and its usage should be separated. Data is similar to oil: when combined with data management and analytics processes it provides organizations with value. Analytics and data usage are of course essential elements in successful big data exploitation. However, from the definition point of view, combining data and its usage is like combining oil and engine into one single definition. Separating big data from its intended usage resolves the inconsistencies of the definitions and helps us to understand the plain characteristics of big data. As the purpose of data usage is to realize the potential value of the data, we propose that the term big data insights be used for data usage-related activities (see also Figure 5).
In addition to technical and value aspects, scholars have focused on several other perspectives to big data, such as privacy, security (Altshuler, 2011; Berghel, 2013; Lu et al., 2014), and policy-making (Keen et al., 2013; Blume et al., 2014; Truyens and Van Eecke, 2014). None of the current definitions of big data consider these. These aspects are not characteristics of big data; we do not suggest that these aspects should be included in the definition. Instead, they are aspects that help to understand big data as a phenomenon. Moreover, these perspectives are important, as failing to consider them can drive an organization to difficulties.
Another, even more important aspect is that the current definitions neglect the disruptive nature of big data. On the basis of the literature it seems obvious that in the future, big data will have significant impacts on businesses (Manyika et al., 2011; Schmarzo, 2013; Davenport, 2014). Big data is seen as a technology that can have huge impacts on most industries and enterprises. Data-driven companies can achieve significant benefits (McAfee et al., 2012), but transformational business changes (Dehning et al., 2003) are required to achieve full competitive advantage from big data. The impact of big data will be significant, but the nature of the change is even more important. The effects of big data on firms, ecosystems and industries will be disruptive (Earley, 2014; Fan and Gordon, 2014; Kim et al., 2014). Industry structures are changing, and new business opportunities are emerging. On the other hand, this means that competitors may also be able to invent new business models, not to speak of new entrants, which will further increase the turbulence. The impacts of big data may - and will - be positive for some organizations, negative for others. Due to the disruptive nature of big data, companies must review their business models in order to reveal possible threats and opportunities. Moreover, as the disruptive drivers are technological by nature, these technologies and their potential effects must be linked with strategy.
We suggest that a new definition for big data as a phenomenon should be considered. For clarity and coherency, the definition of big data should cover only data and data management aspects (like the 3V definition). The phenomenon of big data is a broad concept that deserves a definition of its own. Instead of defining big data, the definition should consider several important perspectives of it. In our opinion, this definition should include the disruptive nature and strategic importance of the phenomenon. Adding these elements would help managers to understand the importance of the matter. This opens a new research avenue. Discussing and defining the nature and relations between various perspectives would build understanding of the broader context of big data, big data as a phenomenon.

Conclusions
Our aim was to shed light on the concept of big data, especially from the following viewpoints:
• What kind of definitions of big data exist in research papers and among practitioners?
• How has the definition of big data evolved?
• How do the definitions reflect the different characteristics and perspectives of big data?
A systematic mapping study was conducted in order to find answers to these questions. We made a search in major reference databases, search engines, and web sources containing both technical and business topics. A total of 62 sources were included in the result set. With regard to our research questions, we chose a broad search strategy in order to cover a wide range of possible sources. We identified 17 different definitions of big data that together presented a clear picture of the current situation and evolution of the definition, thus providing answers to our first and second research questions. We also compared the current definitions with various characteristics of big data. We found that the current definitions do not cover several perspectives that are discussed among big data scholars and practitioners, which answers our third research question. In addition, we identified several logically incoherent definitions. This clouds the matter further, as these definitions raise new questions, which will typically lead to ambiguous answers.

Results
This study revealed 17 different big data definitions from 62 relevant source papers. Each of the papers was analyzed against previously published definitions. If the paper contained a new definition or added some new elements to the existing definitions, it was considered as a new definition. The key contributions of this study are:
• Although there are various opinions on what big data is, the 3V definition by Laney (2001) contains three dimensions (volume, velocity, variety), which are common to most definitions. In addition to these dimensions, many definitions include technical parts and components related to the intended usage of the data, such as analysis or decision-making.
• Many of the definitions are logically inconsistent, which is one reason for the vagueness of the term big data. A typical flaw is to include both the data and its intended usage in the definition. We suggest that they should be separated. The term big data should cover data-related aspects, whereas a new term big data insights should be used when discussing data usage-related activities.
• The current definitions do not consider several important aspects of the big data phenomenon, such as security and privacy, or its disruptive nature. These are not characteristics of big data, but they are important factors of the big data phenomenon that both scholars and practitioners must consider. We suggest that a new definition for big data as a phenomenon should be developed.
In addition, this study bridges the knowledge between theory and practice. We have presented the history and the state-of-the-art of the definition of big data. This will help new researchers and practitioners to understand the different perspectives of big data, as well as the limitations of the current definitions. Therefore, we hope that this paper will also stimulate discussion about the terminology and help parties coming from different backgrounds to understand each other and communicate their reasoning clearly.

Limitations
We recognize that an uncountable number of various definitions of big data exist in the "Internet jungle", e.g. in blog postings and discussion forums.However, due to limited resources, identifying and analyzing all or even most of them would be impossible, and therefore we have filtered blogs and forums out.Another limitation is that we have excluded all non-English language sources.

Suggestions for Further Studies
There are several possible topics for further studies, including the following. It is clear that there is a need to develop the terminology and taxonomy further (including related terms, such as big data analytics, big data phenomenon, and veracity) in order to create common understanding of the key concepts and their relationships in the area of big data. Another interesting research avenue would be to investigate the effects of big data on organizations' business models or decision-making processes, organizational structures, and culture.

2013-May
"Extensions to the (3V) model that take Value into account are then proposed and discussed. … However recording the data does not bring any value to the company. It only becomes valuable once that data is used or processed. … High Value Data (HVD) is data that has a known benefit from its storage. … Low Value Data (LVD) is data that is stored in the anticipation that value will be drawn from it in the future." High & low value data (Sagiroglu and Sinanc, 2013)

2013-Aug
Big data is about "building new analytic applications based on new types of data, in order to better serve your customers and drive a better competitive advantage." competitive advantage (Ward and Barker, 2013)

2013-Sep
"Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning." - (Stonebraker and Robertson, 2013)

2013-Sep
In summary, big data can mean big volume, big velocity, or big variety.

2013-Nov
"A simple definition is that it gives organisations insights into data which they don't already have and does that in a way that helps them improve their operational efficiency and helps them make better decisions." - (Vossen, 2014)

2013-Nov
Volume, velocity, variety, veracity -

Fig. 4. Evolution of the definition of big data.
The fishbone diagram in Figure 4 gives an overview of the evolution. The bones show essential additions of all 17 different definitions, i.e. new aspects or components that each definition adds. Laney (2001) presented the original, so-called 3V definition of big data. The Vs come from volume, velocity and variety. Volume refers to ever-increasing amounts of data. Velocity indicates the need to capture and analyze high-speed or bursts of data in (near) real-time, or else the value may be lost. Variety is related to different types of data, be it structured or non-structured, such as social media posts or a video.
refers to large, diverse, complex, longitudinal, and distributed data sets generated from instruments, sensors, Internet transactions, e-mail, video, click streams, and other digital sources available today and in the future".
we intend to propose wider definition of Big Data as 5 Vs: Volume, Velocity, Variety and additionally Value and Veracity." - (Membrey et al., 2013)

Table 1. Mapping the new perspectives of the big data definitions to the value chain.
http://www.open-jim.org