Frequently Asked Questions about Big Data

Many questions about big data have yet to be answered in a vendor-neutral way. With so many definitions, opinions run the gamut. Here I will attempt to cut to the heart of the matter by addressing some key questions I often get from readers, clients and industry analysts.

1) What is Big Data?
“Big data” is an all-inclusive term used to describe vast amounts of information. In contrast to traditional structured data, which is typically stored in a relational database, big data is distinguished by its volume, velocity, and variety. Big data is characteristically generated in large volumes, on the order of terabytes to exabytes per individual data set (an exabyte is a 1 followed by 18 zeros in bytes, or 1 million terabytes). Big data is also generated with high velocity: it is collected at frequent intervals, which makes it difficult to analyze (though analyzing it rapidly makes it more valuable).

Or, in simple words: “Big Data includes data sets whose size is beyond the ability of traditional software tools to capture, manage, and process the data in a reasonable time.”

2) How much data does it take to be called Big Data?

There is no absolute answer to this question. Based on the infrastructure on the market, the lower threshold is about 1 to 3 terabytes.

But using Big Data technologies can be sensible for smaller databases as well, for example if complex mathematical or statistical analyses are run against a database. Netezza, for instance, offers about 200 built-in functions and supports languages such as Revolution R and Python, which can be used in such cases.

3) What is the role of intuition in the era of big data? Have machines and data supplanted the human mind?

Contrary to what some people believe, intuition is as important as ever. When looking at massive, unprecedented datasets, you need someplace to start. In Too Big to Ignore, I argue that intuition is more important than ever precisely because there’s so much data now. We are entering an era in which more and more things can be tested.

Big data has not replaced intuition — at least not yet; the latter merely complements the former. The relationship between the two is a continuum, not a binary.

4) A key piece of big data is its reliance on “unstructured” and “semi-structured” data. Can you explain what’s going on here?

Roughly 80% of the information generated today is of an unstructured variety. Small data is still very important — e.g., lists of customers, sales, employees and the like. Think Excel spreadsheets and database tables. However, tweets, blog posts, Facebook likes, YouTube videos, pictures and other forms of unstructured data have become too big to ignore.

Again, big data here serves as a complement to — not a substitute for — small data. When used right, big data can reduce uncertainty, not eliminate it. We can know more about previously unknowable things. We can solve previously vexing problems. And finally, there’s the Holy Grail: Big data is helping organizations make better predictions and better business decisions.

5) Is it a new trend?
Not exactly. Though there is a lot of buzz around the topic, big data has been around a long time. Think back to when you first heard of scientific researchers using supercomputers to analyze massive amounts of data. The difference now is that big data is accessible to regular BI users and is applicable to the enterprise. It is gaining traction because there are more public use cases of companies getting real value from big data (like Walmart analyzing real-time social media data for trends, then using that information to guide online ad purchases). Though big data adoption is limited right now, IDC determined that the big data technology and services market was worth $3.2B USD in 2010 and projected that it would skyrocket to $16.9B by 2015.

6) Where does big data come from?
Big data is often boiled down to a few varieties including social data, machine data, and transactional data. Social media data is providing remarkable insights to companies on consumer behavior and sentiment that can be integrated with CRM data for analysis, with 230 million tweets posted on Twitter per day, 2.7 billion Likes and comments added to Facebook every day, and 60 hours of video uploaded to YouTube every minute (this is what we mean by velocity of data). Machine data consists of information generated from industrial equipment, real-time data from sensors that track parts and monitor machinery (often also called the Internet of Things), and even web logs that track user behavior online. Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants like US pizza chain Domino’s, which serves over 1 million customers per day, are generating petabytes of transactional big data. The thing to note is that big data can resemble traditional structured data or unstructured, high frequency information.

7) Where is the big data trend going?
Eventually the big data hype will wear off, but studies show that big data adoption will continue to grow. With a projected $16.9B market by 2015 (Wikibon goes even further to say $50B by 2017), it is clear that big data is here to stay. However, the big data talent pool is lagging behind and will need to catch up to the pace of the market. McKinsey & Company estimated in May 2011 that by 2018, the US alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

The emergence of big data analytics has permanently altered many businesses’ way of looking at data. Big data can take companies down a long road of staff, technology, and data storage augmentation, but the payoff – rapid insight into never-before-examined data – can be huge. As more use cases come to light over the coming years and technologies mature, big data will undoubtedly reach critical mass and will no longer be labeled a trend. Soon it will simply be another mechanism in the BI ecosystem.

8) Who are some of the BIG DATA users?
From cloud companies like Amazon to healthcare companies to financial firms, it seems as if everyone is developing a strategy to use big data. For example, every mobile phone user has a monthly bill which catalogs every call and every text; processing the sheer volume of that data can be challenging. Software logs, remote sensing technologies, and information-sensing mobile devices all pose a challenge in terms of the volumes of data created. The size of Big Data can be relative to the size of the enterprise. For some, hundreds of gigabytes may be enough to warrant consideration; for others, it takes tens or hundreds of terabytes.

9) Data visualization is becoming more popular than ever.

In my opinion, it is absolutely essential for organizations to embrace interactive data visualization tools. Blame or thank big data for that; either way, these tools are amazing. They are helping employees make sense of the never-ending stream of data hitting them faster than ever. Our brains respond much better to visuals than to rows on a spreadsheet.

Companies like Amazon, Apple, Facebook, Google, Twitter, Netflix and many others understand the cardinal need to visualize data. And this goes way beyond Excel charts, graphs or even pivot tables. Companies like Tableau Software have allowed non-technical users to create very interactive and imaginative ways to visually represent information.

10) Data science, some say, is actually a mix of art and science — the art of knowing what to look at amidst a profusion of information. Can you explain a bit about this? How people can develop those skills?

The data scientist is one of the hottest jobs in the world right now. In a recent report, McKinsey estimated that the U.S. will soon face a shortage of approximately 175,000 data scientists. Demand far exceeds supply, especially given the hype around big data.

However, to become a data scientist one does not necessarily follow a linear path. There are many myths surrounding data scientists. True data scientists possess a wide variety of skills. Most come from backgrounds in statistics, data modeling, computer science and general business. Above all, however, they are a curious lot. They are never really satisfied. They enjoy looking at data and running experiments.

11) We seem to be entering an era of exponential growth of data. Is there a point at which many enterprise systems will cease to operate?

It’s an interesting point, and I discuss it in Chapter 4 of Too Big to Ignore. If we look at the relational databases that organizations have historically used to store and retrieve enterprise information, then you are absolutely right. However, new tools like MapReduce, Hadoop, NoSQL, NewSQL, Amazon Web Services (AWS) and others allow organizations to store much larger data sets. The old boss is not the same as the new boss.

12) How will big data impact small businesses? Will we see an era where every business (even barbershops or corner stores) will somehow be leveraging big data?

A few relatively small organizations have taken advantage of big data; Quantcast is one of them. There’s no shortage of myths around big data, and one of the most pernicious is that an organization needs thousands of employees and billions in revenue to take advantage of it. Simply not true.

I don’t know whether my electrician or my barber will embrace big data in the near future. However, we are living in an era of ubiquitous and democratized technology.

13) How will big data trickle down and impact individuals? Are there direct ways this will impact our day-to-day lives in the coming years?

It’s already happening. Big data is affecting our lives in more ways than we can possibly fathom. The recent NSA Prism scandal shed light on the fact that governments are tracking what we’re doing. Companies like Amazon, Apple, Facebook, Google, Twitter and others would not be nearly as effective without big data.

As you know, most people don’t work in data centers. Rather, it’s better for people to know about the companies whose services they use. Are those companies using big data? These days, the answer is probably yes. By extension, then, big data is affecting you whether you know it or not.

In addition, as more and more companies embrace big data, there will be major disruption in the workforce.

14) Are “big data skills” something that everyone will need to learn moving forward? Or will it become simple enough over time that anyone can do it — much like anyone who knows Microsoft Word can update a website now versus needing to know HTML 15 years ago? What skills do workers need to sharpen to prepare for the era of big data?

I hesitate to say that everyone will need to learn data-related skills. Dataphobes will always exist, for better or worse. (Again, the barber example is a good one.) However, knowledge workers will have to follow, lead or get out of the way. Based upon my research, we have entered a more data-oriented world. Millennials are particularly comfortable with data. They are constantly interacting with technology and data. Wearable technology and the Internet of Things are coming, and soon.

“Get used to big data. It really has become too big to ignore.”

15) What tools do I need to analyze it?
Another reason big data is starting to go mainstream is the fact the tools to analyze it are becoming more accessible. For decades, arcplan partners Teradata (NYSE: TDC), IBM (NYSE: IBM), and Oracle (NasdaqGS: ORCL) have provided thousands of companies with terabyte scale data warehouses, but there is a new trend of big data being stored across multiple servers that can handle unstructured data and scale easily. This is due to the increasing use of open source technologies like Hadoop, a framework for distributing data processing across multiple nodes, which allows for fast data loading and real-time analytic capabilities. In effect, Hadoop allows the analysis to occur where the data resides, but it does require specific skills and is not an easy technology to adopt. Analytic platforms like arcplan, which connects to Teradata and SAP HANA, SAP’s (NYSE: SAP) big data appliance, allow data analysis and visualization on big data sets. So in order to make use of big data, companies may need to implement new technologies, but some traditional BI solutions can make the move with you. Big data is simply a new data challenge that requires leveraging existing systems in a different way.

16) What is Hadoop?
The Apache Hadoop software library allows for the distributed processing of large data sets across clusters of computers using a simple programming model. The software library is designed to scale from single servers to thousands of machines; each server using local computation and storage. Instead of relying on hardware to deliver high-availability, the library itself handles failures at the application layer. As a result, the impact of failures is minimized by delivering a highly-available service on top of a cluster of computers. For more info, see this Hadoop FAQ.


Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce.

17) Who supports and funds Hadoop?

Hadoop is one of the projects of the Apache Software Foundation. The main Hadoop project is contributed to by a global network of developers. Sub-projects of Hadoop are supported by the world’s largest Web companies, including Facebook and Yahoo.

18) Why is Hadoop popular?

Hadoop’s popularity is partly due to the fact that it is used by some of the world’s largest Internet businesses to analyze unstructured data. Hadoop enables distributed applications to handle data volumes on the order of thousands of exabytes.

19) Where does Hadoop find applicability in business?

Hadoop, as a scalable system for parallel data processing, is useful for analyzing large data sets. Examples are search algorithms, market risk analysis, data mining on online retail data, and analytics on user behavior data.

Hadoop’s scalability makes it attractive to businesses because of the exponentially increasing nature of the data they handle. Another core strength of Hadoop is that it can handle structured as well as unstructured data, from a variable number of sources.

20) What are the enterprise adoption challenges associated with Hadoop?

To many enterprises, the Hadoop framework is attractive because it gives them the power to analyze their data, regardless of volume. Not all enterprises, however, have the expertise to drive that analysis such that it delivers business value. Scaling up and optimizing Hadoop computing clusters involves custom coding, which can mean a steep learning curve for data analytics developers.

Hadoop was not originally designed with the security functionalities typically required for sensitive enterprise data. Other potential problem areas for enterprise adoption of Hadoop include integration with existing databases and applications, and the absence of industry-wide best practices.

21) How has Hadoop evolved over the years?

Hadoop originally derives from Google’s implementation of a programming model called MapReduce. Google’s MapReduce framework could break down a program into many parallel computations, and run them on very large data sets, across a large number of computing nodes. An example use for such a framework is search algorithms running on Web data.
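The programming model described above can be illustrated with a toy word count. This is only a single-process sketch of the map, shuffle, and reduce steps, not Hadoop's or Google's actual API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse one key's values into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document could be mapped on a different node;
    # this sketch simply loops over them in one process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can in principle be split across many machines, which is the property the framework exploits.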

Hadoop, initially associated only with web indexing, evolved rapidly to become a leading platform for analyzing big data. Cloudera, an enterprise software company, began providing Hadoop-based software and services in 2008.

In 2012, GoGrid, a cloud infrastructure company, partnered with Cloudera to accelerate the adoption of Hadoop-based business applications. Also in 2012, Dataguise, a data security company, launched a data protection and risk assessment tool for Hadoop.

22) Is there an easy way to migrate data from Hadoop into a relational database?

You can use the Hadoop JDBC driver to pull data out of Hadoop, and then use the DataDirect JDBC Driver to bulk load the data into Oracle, DB2, SQL Server, Sybase, and other relational databases.

23) When loading results from a Big Data reduction into a relational database with indexing we see some really slow results dealing with such a large index. How can we make it more manageable?

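The load operation is actually updating the index while you’re loading. The key is to make sure you’re not indexing while loading, as it causes too many collisions and slows the whole process down. In practice that usually means dropping or disabling the index before the load and rebuilding it once afterward.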
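As a rough illustration of loading without indexing, the sketch below drops an index, bulk inserts the rows, and then rebuilds the index once. It uses Python's built-in sqlite3 module purely as a stand-in for whatever relational database you are loading into; the table name (results), index name (idx_results_key), and the bulk_load helper are assumptions made for the example.

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert rows with the index dropped, then rebuild the index once."""
    cur = conn.cursor()
    # Drop the index so it is not updated row by row during the load.
    cur.execute("DROP INDEX IF EXISTS idx_results_key")
    cur.executemany("INSERT INTO results (key, value) VALUES (?, ?)", rows)
    # Rebuild the index in a single pass over the loaded data.
    cur.execute("CREATE INDEX idx_results_key ON results (key)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key TEXT, value INTEGER)")
conn.execute("CREATE INDEX idx_results_key ON results (key)")

rows = [(f"key{i}", i) for i in range(10000)]
bulk_load(conn, rows)

loaded = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list(results)")]
print(loaded, index_names)
```

Other databases expose the same idea through their own statements (for example, disabling and rebuilding an index rather than dropping it), so check your vendor's bulk-load guidance before applying this pattern.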

24) What is big data security analytics?

Add the words “information security” (or “cybersecurity” if you like) before the term “data sets” in the definition above. Security and IT operations tools spit out an avalanche of data like logs, events, packets, flow data, asset data, configuration data, and an assortment of other things on a daily basis. Security professionals need to be able to access and analyze this data in real-time in order to mitigate risk, detect incidents, and respond to breaches. These tasks have come to the point where they are “difficult to process using on-hand data management tools or traditional (security) data processing applications.”

25) What is a security analytic?

First, security analysis is the examination of a multitude of phenomena for the purpose of detecting and/or responding to security incidents capable of impacting the confidentiality, integrity, or availability of IT assets. I would then define a security analytic as: A deduction based upon the results of interactions of multiple simultaneous security phenomena. The thing that big data security analytics technologies allow us to do is capture more data and perform multi-variable security analytics. In the past we relied on simple security analytics to help us trigger a response. For example: “Trigger a security alarm when someone has 3 failed log-in attempts on a critical system.” Effective but too simple and way too many false positives. With big data security analytics, we can generate security analytics that get much deeper: “Trigger a security alert when someone has 3 failed log-in attempts on a critical system when this activity is executed after hours from an employee device, the employee’s job responsibility is such that he or she should not be logging into this system, and the physical security system indicates that the employee is not in the building.” This is the kind of stuff that companies like Click Security, Lancope, and Solera Networks are working on.
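The difference between the simple rule and the multi-variable rule quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the LoginContext fields and both function names are hypothetical, and a real product would derive these signals from logs, HR data, and the physical security system rather than from hand-built records.

```python
from dataclasses import dataclass


@dataclass
class LoginContext:
    # All fields are hypothetical signals assumed to be available.
    failed_attempts: int
    system_is_critical: bool
    after_hours: bool
    from_employee_device: bool
    role_permits_access: bool
    employee_in_building: bool


def simple_alert(ctx):
    """Simple analytic: 3 failed log-ins on a critical system."""
    return ctx.system_is_critical and ctx.failed_attempts >= 3


def multi_variable_alert(ctx):
    """Multi-variable analytic: the simple rule plus surrounding context."""
    return (
        simple_alert(ctx)
        and ctx.after_hours
        and ctx.from_employee_device
        and not ctx.role_permits_access
        and not ctx.employee_in_building
    )


# An administrator mistyping a password during office hours.
typo_prone_admin = LoginContext(3, True, False, True, True, True)
# The scenario described in the text.
suspicious = LoginContext(3, True, True, True, False, False)

print(simple_alert(typo_prone_admin), multi_variable_alert(typo_prone_admin))
print(simple_alert(suspicious), multi_variable_alert(suspicious))
```

The first case trips the simple rule but not the multi-variable one, which is the kind of false positive the richer analytic is meant to filter out.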


26) Do big data security analytics require Hadoop?

No. Hadoop technologies are certainly built into some big data security analytics solutions from vendors like IBM and RSA, but there is no requirement for Hadoop per se. Lots of vendors have developed their own data repositories (in lieu of Hadoop) that collect, store, and analyze security data. In the future, it is likely that Hadoop and other big data technologies will find their way into big data security analytics solutions but there are plenty of leading big data security analytics solutions that don’t use or integrate with Hadoop at this time.

27) Isn’t big data security analytics only good for analysis of massive amounts of historical data?

This is certainly one of the primary use cases but there are others as well. Many big data security analytics solutions are built using “stream processing” to accommodate the high I/O rate needed to process massive amounts of security data. In simple terms, stream processing distributes the processing load over a number of distributed nodes. Each node can provide local security analytics and the nodes combine to form a computing grid for more global security data analysis value. Big data security analytics built using this type of stream processing and grid architecture are designed for instant event detection and forensics. ESG calls this model, “real-time big data security analytics solutions.” ESG calls big data security analytics designed for the historical use “asymmetric big data security analytics solutions.”
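To make the node-and-grid idea concrete, here is a toy simulation in which events are routed to nodes, each node keeps its own local counts, and the local results are merged into a global view. It runs sequentially in one process, so it only mimics the data flow of stream processing; the node count, the event format, and the function names are all assumptions for illustration and do not reflect any vendor's design.

```python
import zlib
from collections import Counter

NUM_NODES = 4  # hypothetical grid size


def node_for(source_ip):
    """Route an event to a node using a stable hash of its source IP."""
    return zlib.crc32(source_ip.encode("utf-8")) % NUM_NODES


def process_stream(events):
    # One Counter per node holds that node's local analytics.
    local_counts = [Counter() for _ in range(NUM_NODES)]
    for source_ip, event_type in events:
        if event_type == "failed_login":
            local_counts[node_for(source_ip)][source_ip] += 1
    # The grid-level view merges every node's local result.
    global_counts = Counter()
    for counts in local_counts:
        global_counts.update(counts)
    return local_counts, global_counts


events = [
    ("10.0.0.5", "failed_login"),
    ("10.0.0.5", "failed_login"),
    ("10.0.0.9", "login_ok"),
    ("10.0.0.5", "failed_login"),
    ("10.0.0.7", "failed_login"),
]
local_counts, global_counts = process_stream(events)
print(global_counts.most_common(1))
```

Hashing on the source IP keeps all events for one address on the same node, so each node can raise local alerts without consulting the others.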

28) Isn’t big data security analytics for big companies with lots of security skills and resources?

Yes, those are the types of organizations on the leading edge but I would argue that all medium to large organizations need this type of security intelligence. Big companies will likely buy products and solutions while smaller companies will reach out to service providers like Arbor Networks (PacketLoop), Dell/SecureWorks, or the new SAIC spin-out Leidos. The best products and services will bake-in intelligent algorithms, intuitive visualization, and process automation.

29) How do I get started with big data security analytics? 

My suggestion is to download open source tools like BigSnarf, PacketPig, or sqrrl. This isn’t an exhaustive list but I’ve hit the major areas. Hopefully, this will help security professionals move beyond the hype and start to understand how big data security analytics can deliver real value.

30) How well does Hadoop scale?

Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and improving using these non-default configuration values:

  • block.size = 134217728
  • namenode.handler.count = 40
  • reduce.parallel.copies = 20
  • child.java.opts = -Xmx512m
  • inmemory.size.mb = 200
  • sort.factor = 100
  • sort.mb = 200
  • file.buffer.size = 131072

Sort performance on 1400 nodes and 2000 nodes is pretty good too: sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration are:

  • job.tracker.handler.count = 60
  • reduce.parallel.copies = 50
  • http.threads = 50
  • child.java.opts = -Xmx1024m
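The three sort results quoted above can be turned into throughput figures with some simple arithmetic. The sketch below uses only the node counts, data sizes, and run times given in the text; it assumes decimal units (1 TB = 1,000,000 MB), and the helper name per_node_mb_per_sec is made up for the example.

```python
# (nodes, terabytes sorted, hours taken), as quoted in the text.
runs = [
    (900, 9, 1.8),
    (1400, 14, 2.2),
    (2000, 20, 2.5),
]


def per_node_mb_per_sec(nodes, terabytes, hours):
    """Average sort throughput per node, assuming 1 TB = 1,000,000 MB."""
    megabytes = terabytes * 1_000_000
    return megabytes / nodes / (hours * 3600)


for nodes, terabytes, hours in runs:
    rate = per_node_mb_per_sec(nodes, terabytes, hours)
    print(
        f"{nodes} nodes: {terabytes / hours:.1f} TB/hour overall, "
        f"{rate:.2f} MB/s per node"
    )
```

Each run sorts 10 GB per node, so the rise in run time from 1.8 to 2.5 hours reflects the overhead of the larger cluster, while total throughput still grows from about 5 TB/hour to 8 TB/hour.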

31) What kind of hardware scales best for Hadoop?

The short answer is dual-processor/dual-core machines with 4-8GB of ECC RAM, depending upon workflow needs. To be most cost-effective, machines should be moderately high-end commodity machines; they typically cost 1/2 to 2/3 as much as normal production application servers but are not desktop-class machines. This cost tends to be $2-5K.

32) Does Hadoop require SSH?

Hadoop-provided scripts (e.g., start-mapred.sh and start-dfs.sh) use ssh to start and stop the various daemons and some other utilities. The Hadoop framework itself does not require ssh. Daemons (e.g., TaskTracker and DataNode) can also be started manually on each node without the scripts’ help.

33) Is a file merger being used? If so, what steps are followed (a tool, or a Python program)?

Yes. Depending on your Hadoop flavor, you can use Hadoop’s own merge functionality. For more details you can refer to the following URL.

34) Monitoring in Nagios/Ganglia: what does it include (disk full, corrupt blocks, or other types)?

You can use Nagios/Ganglia to collect all of these metrics (disk full, CPU, RAM, etc.) and display them on a dashboard such as Ambari. Note, however, that Hortonworks stopped using Nagios and Ganglia for Ambari because it now uses the Ambari Metrics Collector.

35) What role does an admin perform during software upgrades on the Hadoop side?

The Hadoop admin has to do everything related to the Hadoop upgrade except the yum part. You can refer to the following link for more details: http://www.hadoopadmin.co.in/uncategorized/rolling-upgrade-hdp-2-2-to-hdp-2-3/