Monthly Archives: July 2016


Application Timeline Server (ATS) issue error code: 500, message: Internal Server Error

I have seen an issue with the Application Timeline Server (ATS). ATS uses a LevelDB database stored in the location specified by yarn.timeline-service.leveldb-timeline-store.path in yarn-site.xml. All metadata is stored in *.sst files under that location.

Because of this we may face a disk space issue. But it is not good practice to delete the *.sst files directly. An *.sst file is a sorted table of key/value entries sorted by key, and entries are partitioned into different *.sst files by key rather than by timestamp, so there is actually no "old" *.sst file to delete.

To control the space used by the LevelDB store, you can instead enable TTL (time to live). Once it is enabled, timeline entities older than the TTL are discarded, and you can set the TTL lower than the default to give timeline entities a shorter lifetime.

<property>
<description>Enable age off of timeline store data.</description>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>

<property>
<description>Time to live for timeline store data in milliseconds.</description>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>

But if you delete these files manually by mistake, as I did, you may see an ATS issue or get errors like the following.

error code: 500, message: Internal Server Error{"message":"Failed to fetch results by the proxy from url: http://server:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11&_=1469716920323&primaryFilter=user:$user&","status":500,"trace":"{\"exception\":\"WebApplicationException\",\"message\":\"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /hadoop/yarn/timeline/leveldb-timeline-store.ldb/6378017.sst: No such file or directory\",\"javaClassName\":\"javax.ws.rs.WebApplicationException\"}"}

Or

(AbstractService.java:noteFailure(272)) – Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst

 

Resolution: 

  • Go to the configured location /hadoop/yarn/timeline/leveldb-timeline-store.ldb, where you will see a text file named “CURRENT”
    • cd /hadoop/yarn/timeline/leveldb-timeline-store.ldb
    • ls -ltrh | grep -i CURRENT
  • Copy your CURRENT file to some temporary location
    • cp /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT /tmp 
  • Now you need to remove this file
    • rm /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT
  • Restart the YARN service via Ambari
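The steps above can be scripted. Here is a minimal sketch, demonstrated on a scratch directory so it is safe to run anywhere; on a real node you would point STORE_DIR at the path configured in yarn.timeline-service.leveldb-timeline-store.path instead, and then restart YARN via Ambari:

```shell
# Sketch of the recovery steps, run against a scratch directory for safety.
# On a real node, set STORE_DIR to the configured leveldb path, e.g.
#   STORE_DIR=/hadoop/yarn/timeline/leveldb-timeline-store.ldb
STORE_DIR=$(mktemp -d)
echo "MANIFEST-000001" > "$STORE_DIR/CURRENT"   # stand-in for the real CURRENT file

cp "$STORE_DIR/CURRENT" /tmp/CURRENT.bak        # 1. back up CURRENT to a temporary location
rm "$STORE_DIR/CURRENT"                         # 2. remove it from the store
ls "$STORE_DIR"                                 # 3. confirm it is gone, then restart YARN via Ambari
```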

These steps resolved the issue for me. I hope they help you as well.

Please feel free to give your feedback.



Real time use cases of Hadoop

Category : Bigdata

As data continues to grow, businesses now have access to (or generate) more data than ever before–much of which goes unused. How can you turn this data into a competitive advantage? In this article, we explore different ways businesses are capitalizing on data.

We keep hearing statistics about the growth of data. For instance:

  • Data volume in the enterprise is going to grow 50x year-over-year between now and 2020.
  • The volume of business data worldwide, across all companies, doubles every 1.2 years.
  • Back in 2010, Eric Schmidt famously stated that every 2 days, we create as much information as we did from the dawn of civilization up until 2003.

The big questions: Where is this data? How can you use it to your advantage?

If you want to capitalize on this data, you must first begin storing it somewhere. But how can you store and process massive data sets without spending a fortune on storage? That’s where Hadoop comes into play.

Hadoop is an open-source software framework for storing and processing large data sets. It stores data in a distributed fashion on clusters of commodity hardware, and is designed to scale up easily as needed. Hadoop helps businesses store and process massive amounts of data without purchasing expensive hardware.

The great advantage of Hadoop: It lets you collect data now and ask questions later. You don’t need to know every question you want answered before you start using Hadoop.

Once you begin storing data in Hadoop, the possibilities are endless. Companies across the globe are using this data to solve big problems, answer pressing questions, improve revenue, and more. How? Here are some real-life examples of ways other companies are using Hadoop to their advantage.

1. Analyze life-threatening risks: Suppose you’re a doctor in a busy hospital. How can you quickly identify patients with the biggest risks? How can you ensure that you’re treating those with life-threatening issues, before spending your time on minor problems? Here’s a great example of one hospital using big data to determine risk–and make sure they’re treating the right patients.

“Patients in a New York hospital with suspicion of heart attack were submitted to series of tests, and the results were analyzed with use of big data – history of previous patients,” says Agnieszka Idzik, Senior Product Manager at SALESmanago. “Whether a patient was admitted or sent home depended on the algorithm, which was more efficient than human doctors.”

2. Identify warning signs of security breaches: What if you could stop security breaches before they happened? What if you could identify suspicious employee activity before they took action? The solution lies in data.

As explained below, security breaches usually come with early warning signs. Storing and analyzing data in Hadoop is a great way to identify these problems before they happen.

“Data breaches like we saw with Target, Sony, and Anthem never just happen; there are typically early warning signs – unusual server pings, even suspicious emails, IMs or other forms of communication that could suggest internal collusion,” according to Kon Leong, CEO, ZL Technologies. “Fortunately, with the ability to now mine and correlate people, business, and machine-generated data all in one seamless analytics environment, we can get a far more complete picture of who is doing what and when, including the detection of collusion, bribery, or an Ed Snowden in progress even before he has left the building.”

3. Prevent hardware failure: Machines generate a wealth of information, much of which goes unused. Once you start collecting that data with Hadoop, you’ll learn just how useful this data can be.

For instance, this recent webinar on “Practical Uses of Hadoop” explores one great example: capturing data from HVAC systems helps a business identify potential problems with products and locations.

Here’s another great example: One power company combined sensor data from the smart grid with a map of the network to predict which generators in the grid were likely to fail, and how that failure would affect the network as a whole. Using this information, they could react to problems before they happened.

4. Understand what people think about your company: Do you ever wonder what customers and prospects say about your company? Is it good or bad? Just imagine how useful that data could be if you captured it.

With Hadoop, you can mine social media conversations and figure out what people think of you and your competition. You can then analyze this data and make real-time decisions to improve user perception.

For instance, this article explains how one company used Hadoop to track user sentiment online. It gave their marketing teams the ability to assess external perception of the company (positive, neutral, or negative), and make adjustments based on that data.

5. Understand when to sell certain products:

“Done well, data can help companies uncover, quantitatively, both pain points and areas of opportunity,” says Mark Schwarz, VP of Data Science at Square Root. “For example, tracking auto sales across dealerships may highlight that red cars are selling and blue cars are not. Knowing this, the company could adjust inventory to avoid the cost of blue cars sitting on the lot and increase revenue from having more red cars. It’s a data-driven way to understand what’s working and what’s not in a business and helps eliminate ‘gut reaction’ decision making.”

Of course, this can go far beyond determining which product is selling best. Using Hadoop, you can analyze sales data against any number of factors.

For instance, if you analyzed sales data against weather data, you could determine which products sell best on hot days, cold days, or rainy days.

Or, what if you analyzed sales data by time and day? Do certain products sell better in specific weeks, days, or hours?

Those are just a couple of examples, but I’m sure you get the point. If you know when products are likely to sell, you can better promote those products.

6. Find your ideal prospects: Chances are, you know what makes a good customer. But, do you know exactly where they are? What if you could use freely available data to identify and target your best prospects?

There’s a great example in this article. It explains how one company compared their customer data with freely available census data. They identified the location of their best prospects, and ran targeted ads at them. The results: Increased conversions and sales.

7. Gain insight from your log files: Just like your hardware, your software generates lots of useful data. One of the most common examples: Server log files. Server logs are computer-generated log files that capture network and server operations data. How can this data help? Here are a couple examples:

Security: What happens if you suspect a security breach? The server log data can help you identify and repair the vulnerability.

Usage statistics: As demonstrated in this webinar, server log data provides valuable insight into usage statistics. You can instantly see which applications are most popular, and which users are most active.

8. Threat Analysis: How can companies detect threats and fraudulent activity?

Businesses have struggled with theft, fraud, and abuse since long before computers existed. Computers and on-line systems create new opportunities for criminals to act swiftly, efficiently, and anonymously. On-line businesses use Hadoop to monitor and combat criminal behavior.

Challenge: Online criminals write viruses and malware to take over individual computers and steal valuable data. They buy and sell using fraudulent identities and use scams to steal money or goods. They lure victims into scams by sending email or other spam over networks. In “pay-per-click” systems like online advertising, they use networks of compromised computers to automate fraudulent activity, bilking money from advertisers or ad networks. Online businesses must capture, store, and analyze both the content and the pattern of messages that flow through the network to tell the difference between a legitimate transaction and fraudulent activity by criminals.

Solution: 
One of the largest users of Hadoop, and in particular of HBase, is a global developer of software and services to protect against computer viruses. Many detection systems compute a “signature” for a virus or other malware, and use that signature to spot instances of the virus in the wild. Over the decades, the company has built up an enormous library of malware indexed by signatures. HBase provides an inexpensive and high-performance storage system for this data. The vendor uses MapReduce to compare instances of malware to one another, and to build higher-level models of the threats that the different pieces of malware pose. The ability to examine all the data comprehensively allows the company to build much more robust tools for detecting known and emerging threats.

A large online email provider has a Hadoop cluster that provides a similar service. Instead of detecting viruses, though, the system recognizes spam messages. Email flowing through the system is examined automatically. New spam messages are properly flagged, and the system detects and reacts to new attacks as criminals create them.

Sites that sell goods and services over the internet are particularly vulnerable to fraud and theft. Many use web logs to monitor user behavior on the site. By tracking that activity, tracking IP addresses and using knowledge of the location of individual visitors, these sites are able to recognize and prevent fraudulent activity. The same techniques work for online advertisers battling click fraud. Recognizing patterns of activity by individuals permits the ad networks to detect and reject fraudulent activity.

Hadoop is a powerful platform for dealing with fraudulent and criminal activity like this. It is flexible enough to store all of the data (message content, relationships among people and computers, patterns of activity) that matters. It is powerful enough to run sophisticated detection and prevention algorithms and to create complex models from historical data to monitor real-time activity.

 

9. Ad Targeting:  How can companies increase campaign efficiency?

Two leading advertising networks use Hadoop to choose the best ad to show to any given user.

Challenge:  Advertisement targeting is a special kind of recommendation engine. It selects ads best suited to a particular visitor. There is, though, an additional twist: each advertiser is willing to pay a certain amount to have its ad seen. Advertising networks auction ad space, and advertisers want their ads shown to the people most likely to buy their products. This creates a complex optimization challenge. Ad targeting systems must understand user preferences and behavior, estimate how interested a given user will be in the different ads available for display, and choose the one that maximizes revenue to both the advertiser and the advertising network. The data managed by these systems is simple and structured. The ad exchanges, however, provide services to a large number of advertisers, deliver advertisements on a wide variety of Web properties and must scale to millions of end users browsing the web and loading pages that must include advertising. The data volume is enormous. Optimization requires examining both the relevance of a given advertisement to a particular user, and the collection of bids by different advertisers who want to reach that visitor. The analytics required to make the correct choice are complex, and running them on the large dataset requires a large-scale, parallel system.

Solution: One advertising exchange uses Hadoop to collect the stream of user activity coming off of its servers. The system captures that data on the cluster, and runs analyses continually to determine how successful the system has been at displaying ads that appealed to users. Business analysts at the exchange are able to see reports on the performance of individual ads, and to adjust the system to improve relevance and increase revenues immediately. A second exchange builds sophisticated models of user behavior in order to choose the right ad for a given visitor in real time. The model uses large amounts of historical data on user behavior to cluster ads and users, and to deduce preferences. Hadoop delivers much better-targeted advertisements by steadily refining those models and delivering better ads.

Article credit: Joe Stangarone and Cloudera



How to change knox heap size

Category : Bigdata

Sometimes, due to heavy load, you may need to increase the Knox JVM heap size so that the gateway can handle more requests and respond in time.

In that case, you can change the Knox JVM heap size as follows.

  1. Open /usr/hdp/current/knox-server/bin/gateway.sh and search for the APP_MEM_OPTS string.
  2. Once you find it, change it accordingly:

             APP_MEM_OPTS="-Xms2g -Xmx2g"

Restart the Knox gateway service for the change to take effect.
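The edit can also be done with sed. A sketch, demonstrated here on a temporary file so it can be run safely; on a real gateway node GATEWAY_SH would be /usr/hdp/current/knox-server/bin/gateway.sh, and the existing value in your file may differ from the stand-in used here:

```shell
# Sketch: set APP_MEM_OPTS via sed. Demonstrated on a temp copy; on a real
# node: GATEWAY_SH=/usr/hdp/current/knox-server/bin/gateway.sh
GATEWAY_SH=$(mktemp)
echo 'APP_MEM_OPTS=""' > "$GATEWAY_SH"    # stand-in for the real script line

sed -i 's/^APP_MEM_OPTS=.*/APP_MEM_OPTS="-Xms2g -Xmx2g"/' "$GATEWAY_SH"
grep '^APP_MEM_OPTS' "$GATEWAY_SH"
# Then restart the Knox gateway service (e.g. via Ambari) to pick up the change.
```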



Analyze your jobs running on top of Tez

Category : Bigdata

Sometimes we have to analyze our jobs, either to tune them or to prepare reports. For jobs running on the Tez execution engine, the following method shows the running time of each and every step of a job.

You can achieve this by setting the hive.tez.exec.print.summary=true property.
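The transcript below sets the property per session with `set`. To get the summary in every session, the property can also be added to hive-site.xml; a config sketch following the same property format used earlier in this post (the description text is my own wording):

```xml
<property>
<description>Print the Tez DAG summary (method and vertex timings) after each query.</description>
<name>hive.tez.exec.print.summary</name>
<value>true</value>
</property>
```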

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080520_e282c377-5607-4cf4-bcda-bd7010918f9c

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ……….   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ……   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.65 s     

--------------------------------------------------------------------------------

OK

6

Time taken: 11.027 seconds, Fetched: 1 row(s)

hive> set hive.tez.exec.print.summary=true;

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080557_28453c83-9e17-4874-852d-c5e13dd97f82

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ……….   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ……   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.36 s    

--------------------------------------------------------------------------------

Status: DAG finished successfully in 15.36 seconds

METHOD                         DURATION(ms)

parse                                    2

semanticAnalyze                        130

TezBuildDag                            229

TezSubmitToRunningDag                   13

TotalPrepTime                          979

VERTICES         TOTAL_TASKS  FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS   OUTPUT_RECORDS

Map 1                      1                0            0            10.64              9,350                299              6                1

Reducer 2                  1                0            0             0.41                760                  0              1                0

OK

6

Time taken: 16.478 seconds, Fetched: 1 row(s)



Import & Export in Hive

Category : Bigdata

When we work with Hive, there are many scenarios in which we need to move data (i.e., tables) from one cluster to another.

For example, sometimes we need to copy a production table from one cluster to another. Hive now has very good functionality that gives us two easy commands to do this.

From version 0.8 onwards, Hive supports EXPORT and IMPORT features that allow us to export the metadata as well as the data for a table to a directory in HDFS, which can then be imported back into another database or Hive instance.
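In outline, the whole move is three commands. They are shown as a commented sketch since they need two live clusters to run; table, database, and cluster names are taken from the walkthrough below:

```shell
# Outline of the copy (names from the example in this post):
# 1. On cluster A, export the table's data and metadata to an HDFS directory:
#      hive -e "EXPORT TABLE samplebeelinetest.cars_beeline TO '/tmp/cars_beeline';"
# 2. Copy that directory from cluster A to cluster B:
#      hadoop distcp hdfs://HDPCLUSTERAHA/tmp/cars_beeline hdfs://HDPCLUSTERBHA/tmp/cars_beeline
# 3. On cluster B, import it into the target database:
#      hive -e "USE testing; IMPORT TABLE cars_beeline FROM '/tmp/cars_beeline';"
```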

The following example copies cars_beeline from cluster A to cluster B:

Cluster A: 

hive>show databases;

OK

admintestdb

default

kmsdatabase

kmstest

samplebeelinetest

samplehivetest

sampletest

sampletest1

Time taken: 2.911 seconds, Fetched: 14 row(s)

hive> use samplebeelinetest;

OK

Time taken: 0.287 seconds

hive> show tables;

OK

cars

cars_beeline

cars_internal

i0014_itm_typ

test

Time taken: 0.295 seconds, Fetched: 5 row(s)

hive> select * from cars_beeline limit 1;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

Time taken: 1.118 seconds, Fetched: 1 row(s)

hive> select * from cars_beeline limit 10;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

“chevrolet chevelle malibu” 18 8 307 130 3504 12 1970-01-01 A

“buick skylark 320” 15 8 350 165 3693 12 1970-01-01 A

“plymouth satellite” 18 8 318 150 3436 11 1970-01-01 A

“amc rebel sst” 16 8 304 150 3433 12 1970-01-01 A

“ford torino” 17 8 302 140 3449 11 1970-01-01 A

Time taken: 0.127 seconds, Fetched: 6 row(s)

hive> export table cars_beeline to '/tmp/cars_beeline';

Copying data from file:/tmp/s0998dnz/0bd6949f-c28c-4113-a9ab-eeaea4dcd434/hive_2016-07-11_05-41-39_786_4427147069708259788-1/-local-10000/_metadata

Copying file: file:/tmp/s0998dnz/0bd6949f-c28c-4113-a9ab-eeaea4dcd434/hive_2016-07-11_05-41-39_786_4427147069708259788-1/-local-10000/_metadata

Copying data from hdfs://HDPCLUSTERAHA/zone_encr2/data

Copying file: hdfs://HDPCLUSTERAHA/zone_encr2/data/cars.csv

OK

Time taken: 0.52 seconds

hive> dfs -ls /tmp/cars_beeline;

Found 2 items

-rwxrwxrwx   3 s0998dnz hdfs       1701 2016-07-11 05:41 /tmp/cars_beeline/_metadata

drwxrwxrwx   - s0998dnz hdfs          0 2016-07-11 05:41 /tmp/cars_beeline/data

Now use distcp to copy that dir from Cluster A to Cluster B: 

[root@server1 ~]$ hadoop distcp hdfs://HDPCLUSTERAHA/tmp/cars_beeline hdfs://HDPCLUSTERBHA/tmp/cars_beeline

16/07/11 05:43:09 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://HDPCLUSTERAHA/tmp/cars_beeline], targetPath=hdfs://HDPCLUSTERAHA/tmp/cars_beeline, targetPathExists=false, preserveRawXattrs=false}

16/07/11 05:43:09 INFO impl.TimelineClientImpl: Timeline service address: http://server2:8188/ws/v1/timeline/

16/07/11 05:43:11 INFO impl.TimelineClientImpl: Timeline service address: http://server2:8188/ws/v1/timeline/

16/07/11 05:43:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

16/07/11 05:43:11 INFO mapreduce.JobSubmitter: number of splits:4

16/07/11 05:43:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468229364042_0002

16/07/11 05:43:11 INFO impl.YarnClientImpl: Submitted application application_1468229364042_0002

16/07/11 05:43:11 INFO mapreduce.Job: The url to track the job: http://server1:8088/proxy/application_1468229364042_0002/

16/07/11 05:43:11 INFO tools.DistCp: DistCp job-id: job_1468229364042_0002

16/07/11 05:43:11 INFO mapreduce.Job: Running job: job_1468229364042_0002

16/07/11 05:43:25 INFO mapreduce.Job: Job job_1468229364042_0002 running in uber mode : false

16/07/11 05:43:25 INFO mapreduce.Job:  map 0% reduce 0%

16/07/11 05:43:31 INFO mapreduce.Job:  map 75% reduce 0%

16/07/11 05:43:35 INFO mapreduce.Job:  map 100% reduce 0%

16/07/11 05:43:36 INFO mapreduce.Job: Job job_1468229364042_0002 completed successfully

Run following steps on target cluster B:

[root@server1ofclusterb ~]$ hadoop fs -ls /tmp/cars_beeline;

Found 2 items

-rw-r--r--   3 s0998dnz hdfs       1701 2016-07-11 05:43 /tmp/cars_beeline/_metadata

drwxr-xr-x   - s0998dnz hdfs          0 2016-07-11 05:43 /tmp/cars_beeline/data

[root@server1ofclusterb ~]$ hive

WARNING: Use "yarn jar" to launch YARN applications.

16/07/11 05:45:22 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist

16/07/11 05:45:22 WARN conf.HiveConf: HiveConf of name hive.server2.http.port does not exist

Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties

hive> create database testing;

OK

Time taken: 0.342 seconds

hive> use testing;

OK

Time taken: 0.411 seconds

hive> IMPORT TABLE cars_beeline from '/tmp/cars_beeline';

Copying data from hdfs://HDPCLUSTERBHA/tmp/cars_beeline/data

Copying file: hdfs://HDPCLUSTERBHA/tmp/cars_beeline/data/cars.csv

Loading data to table testing.cars_beeline

OK

Time taken: 1.196 seconds

hive> show tables;

OK

cars_beeline

Time taken: 0.246 seconds, Fetched: 1 row(s)

hive> select * from cars_beeline limit 10;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

“chevrolet chevelle malibu” 18 8 307 130 3504 12 1970-01-01 A

“buick skylark 320” 15 8 350 165 3693 12 1970-01-01 A

“plymouth satellite” 18 8 318 150 3436 11 1970-01-01 A

“amc rebel sst” 16 8 304 150 3433 12 1970-01-01 A

“ford torino” 17 8 302 140 3449 11 1970-01-01 A

Time taken: 0.866 seconds, Fetched: 6 row(s)

hive>

I hope this helps you move tables from one cluster to another. Please feel free to give your suggestions to improve this article.