Category Archives: Bigdata

  • 0

Application Timeline Server (ATS) issue error code: 500, message: Internal Server Error

I have seen an issue with the Application Timeline Server (ATS). ATS uses a LevelDB database, which is stored at the location specified by yarn.timeline-service.leveldb-timeline-store.path in yarn-site.xml. All metadata is stored in *.sst files under that location.

Because of this, we may run into a disk space issue. However, it is not good practice to delete *.sst files directly. An *.sst file is a sorted table of key/value entries sorted by key, and key/value entries are partitioned into different *.sst files by key rather than by timestamp, so there is actually no "old" *.sst file that can safely be deleted.

To control the space used by the LevelDB store, you can instead enable TTL (time to live). Once it is enabled, timeline entities older than the TTL are discarded, and you can set the TTL to a smaller value than the default to give timeline entities a shorter lifetime.

<property>
<description>Enable age off of timeline store data.</description>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>

<property>
<description>Time to live for timeline store data in milliseconds.</description>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>

But if you deleted these files manually by mistake, as I did, then you may see ATS issues or get one of the following errors.

error code: 500, message: Internal Server Error{"message":"Failed to fetch results by the proxy from url: http://server:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11&_=1469716920323&primaryFilter=user:$user&","status":500,"trace":"{\"exception\":\"WebApplicationException\",\"message\":\"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /hadoop/yarn/timeline/leveldb-timeline-store.ldb/6378017.sst: No such file or directory\",\"javaClassName\":\"javax.ws.rs.WebApplicationException\"}"}

Or

(AbstractService.java:noteFailure(272)) – Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst

 

Resolution: 

  • Go to the configured location /hadoop/yarn/timeline/leveldb-timeline-store.ldb, where you will see a text file named "CURRENT"
    • cd /hadoop/yarn/timeline/leveldb-timeline-store.ldb
    • ls -ltrh | grep -i CURRENT
  • Copy your CURRENT file to some temporary location
    • cp /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT /tmp 
  • Now you need to remove this file
    • rm /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT
  • Restart the YARN service via Ambari

With the help of the above steps I resolved this issue. I hope it will help you as well.
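If you want to confirm that ATS is healthy after the restart, a minimal check (assuming the default ATS web port 8188 and the LevelDB path used in this article; replace the placeholder host with your ATS server) could look like this:

# The timeline REST API should answer without a 500 once the store is rebuilt
curl -s "http://<ats-host>:8188/ws/v1/timeline/TEZ_DAG_ID?limit=1"

# New *.sst files should start appearing under the configured store path
ls /hadoop/yarn/timeline/leveldb-timeline-store.ldb/ | grep -i sst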

Please feel free to give your feedback.


  • 0

Real time use cases of Hadoop

Category : Bigdata

As data continues to grow, businesses now have access to (or generate) more data than ever before–much of which goes unused. How can you turn this data into a competitive advantage? In this article, we explore different ways businesses are capitalizing on data.

We keep hearing statistics about the growth of data. For instance:

  • Data volume in the enterprise is going to grow 50x year-over-year between now and 2020.
  • The volume of business data worldwide, across all companies, doubles every 1.2 years.
  • Back in 2010, Eric Schmidt famously stated that every 2 days, we create as much information as we did from the dawn of civilization up until 2003.

The big questions: Where is this data? How can you use it to your advantage?

If you want to capitalize on this data, you must first begin storing it somewhere. But, how can you store and process massive data sets without spending a fortune on storage? “That’s where Hadoop comes into play”

Hadoop is an open-source software framework for storing and processing large data sets. It stores data in a distributed fashion on clusters of commodity hardware, and is designed to scale up easily as needed. Hadoop helps businesses store and process massive amounts of data without purchasing expensive hardware.

The great advantage of Hadoop: It lets you collect data now and ask questions later. You don’t need to know every question you want answered before you start using Hadoop.

Once you begin storing data in Hadoop, the possibilities are endless. Companies across the globe are using this data to solve big problems, answer pressing questions, improve revenue, and more. How? Here are some real-life examples of ways other companies are using Hadoop to their advantage.

1. Analyze life-threatening risks: Suppose you’re a doctor in a busy hospital. How can you quickly identify patients with the biggest risks? How can you ensure that you’re treating those with life-threatening issues, before spending your time on minor problems? Here’s a great example of one hospital using big data to determine risk–and make sure they’re treating the right patients.

“Patients in a New York hospital with suspicion of heart attack were submitted to series of tests, and the results were analyzed with use of big data – history of previous patients,” says Agnieszka Idzik, Senior Product Manager at SALESmanago. “Whether a patient was admitted or sent home depended on the algorithm, which was more efficient than human doctors.”

2. Identify warning signs of security breaches: What if you could stop security breaches before they happened? What if you could identify suspicious employee activity before they took action? The solution lies in data.

As explained below, security breaches usually come with early warning signs. Storing and analyzing data in Hadoop is a great way to identify these problems before they happen.

“Data breaches like we saw with Target, Sony, and Anthem never just happen; there are typically early warning signs – unusual server pings, even suspicious emails, IMs or other forms of communication that could suggest internal collusion,” according to Kon Leong, CEO, ZL Technologies. “Fortunately, with the ability to now mine and correlate people, business, and machine-generated data all in one seamless analytics environment, we can get a far more complete picture of who is doing what and when, including the detection of collusion, bribery, or an Ed Snowden in progress even before he has left the building.”

3. Prevent hardware failure: Machines generate a wealth of information–much of which goes unused. Once you start collecting that data with Hadoop, you’ll learn just how useful this data can be.

For instance, this recent webinar on “Practical Uses of Hadoop,” explores one great example. Capturing data from HVAC systems helps a business identify potential problems with products and locations.

Here’s another great example: One power company combined sensor data from the smart grid with a map of the network to predict which generators in the grid were likely to fail, and how that failure would affect the network as a whole. Using this information, they could react to problems before they happened.

4. Understand what people think about your company: Do you ever wonder what customers and prospects say about your company? Is it good or bad? Just imagine how useful that data could be if you captured it.

With Hadoop, you can mine social media conversations and figure out what people think of you and your competition. You can then analyze this data and make real-time decisions to improve user perception.

For instance, this article explains how one company used Hadoop to track user sentiment online. It gave their marketing teams the ability to assess external perception of the company (positive, neutral, or negative), and make adjustments based on that data.

5. Understand when to sell certain products:

“Done well, data can help companies uncover, quantitatively, both pain points and areas of opportunity,” says Mark Schwarz, VP of Data Science, at Square Root. “For example, tracking auto sales across dealerships may highlight that red cars are selling and blue cars are not. Knowing this, the company could adjust inventory to avoid the cost of blue cars sitting on the lot and increase revenue from having more red cars. It’s a data-driven way to understand what’s working and what’s not in a business and helps eliminate “gut reaction” decision making.”

Of course, this can go far beyond determining which product is selling best. Using Hadoop, you can analyze sales data against any number of factors.

For instance, if you analyzed sales data against weather data, you could determine which products sell best on hot days, cold days, or rainy days.

Or, what if you analyzed sales data by time and day? Do certain products sell better on specific weeks/days/hours?

Those are just a couple of examples, but I’m sure you get the point. If you know when products are likely to sell, you can better promote those products.

6. Find your ideal prospects: Chances are, you know what makes a good customer. But, do you know exactly where they are? What if you could use freely available data to identify and target your best prospects?

There’s a great example in this article. It explains how one company compared their customer data with freely available census data. They identified the location of their best prospects, and ran targeted ads at them. The results: Increased conversions and sales.

7. Gain insight from your log files: Just like your hardware, your software generates lots of useful data. One of the most common examples: Server log files. Server logs are computer-generated log files that capture network and server operations data. How can this data help? Here are a couple examples:

Security: What happens if you suspect a security breach? The server log data can help you identify and repair the vulnerability.

Usage statistics: As demonstrated in this webinar, server log data provides valuable insight into usage statistics. You can instantly see which applications are most popular, and which users are most active.

8. Threat Analysis:  How can companies detect threats and fraudulent activity?
Businesses have struggled with theft, fraud, and abuse since long before computers existed. Computers and on-line systems create new opportunities for criminals to act swiftly, efficiently, and anonymously. On-line businesses use Hadoop to monitor and combat criminal behavior.
Challenge: Online criminals write viruses and malware to take over individual computers and steal valuable data. They buy and sell using fraudulent identities and use scams to steal money or goods. They lure victims into scams by sending email or other spam over networks. In “pay-per-click” systems like online advertising, they use networks of compromised computers to automate fraudulent activity, bilking money from advertisers or ad networks. Online businesses must capture, store, and analyze both the content and the pattern of messages that flow through the network to tell the difference between a legitimate transaction and fraudulent activity by criminals.

Solution: 
One of the largest users of Hadoop, and in particular of HBase, is a global developer of software and services to protect against computer viruses. Many detection systems compute a “signature” for a virus or other malware, and use that signature to spot instances of the virus in the wild. Over the decades, the company has built up an enormous library of malware indexed by signatures. HBase provides an inexpensive and high-performance storage system for this data. The vendor uses MapReduce to compare instances of malware to one another, and to build higher-level models of the threats that the different pieces of malware pose. The ability to examine all the data comprehensively allows the company to build much more robust tools for detecting known and emerging threats.

A large online email provider has a Hadoop cluster that provides a similar service. Instead of detecting viruses, though, the system recognizes spam messages. Email flowing through the system is examined automatically. New spam messages are properly flagged, and the system detects and reacts to new attacks as criminals create them.

Sites that sell goods and services over the internet are particularly vulnerable to fraud and theft. Many use web logs to monitor user behavior on the site. By tracking that activity, tracking IP addresses and using knowledge of the location of individual visitors, these sites are able to recognize and prevent fraudulent activity. The same techniques work for online advertisers battling click fraud. Recognizing patterns of activity by individuals permits the ad networks to detect and reject fraudulent activity.

Hadoop is a powerful platform for dealing with fraudulent and criminal activity like this. It is flexible enough to store all of the data—message content, relationships among people and computers, patterns of activity—that matters. It is powerful enough to run sophisticated detection and prevention algorithms and to create complex models from historical data to monitor real-time activity.

 

9. Ad Targeting:  How can companies increase campaign efficiency?

Two leading advertising networks use Hadoop to choose the best ad to show to any given user.

Challenge:  Advertisement targeting is a special kind of recommendation engine. It selects ads best suited to a particular visitor. There is, though, an additional twist: each advertiser is willing to pay a certain amount to have its ad seen. Advertising networks auction ad space, and advertisers want their ads shown to the people most likely to buy their products. This creates a complex optimization challenge. Ad targeting systems must understand user preferences and behavior, estimate how interested a given user will be in the different ads available for display, and choose the one that maximizes revenue to both the advertiser and the advertising network. The data managed by these systems is simple and structured. The ad exchanges, however, provide services to a large number of advertisers, deliver advertisements on a wide variety of Web properties and must scale to millions of end users browsing the web and loading pages that must include advertising. The data volume is enormous. Optimization requires examining both the relevance of a given advertisement to a particular user, and the collection of bids by different advertisers who want to reach that visitor. The analytics required to make the correct choice are complex, and running them on the large dataset requires a large-scale, parallel system.

Solution: One advertising exchange uses Hadoop to collect the stream of user activity coming off of its servers. The system captures that data on the cluster, and runs analyses continually to determine how successful the system has been at displaying ads that appealed to users. Business analysts at the exchange are able to see reports on the performance of individual ads, and to adjust the system to improve relevance and increase revenues immediately. A second exchange builds sophisticated models of user behavior in order to choose the right ad for a given visitor in real time. The model uses large amounts of historical data on user behavior to cluster ads and users, and to deduce preferences. Hadoop delivers much better-targeted advertisements by steadily refining those models and delivering better ads.

Article credit: Joe Stangarone and Cloudera


  • 1

How to change knox heap size

Category : Bigdata

Sometimes, due to heavy load, you may need to increase the Knox JVM heap size so that it can handle more requests and respond in time.

In that case, you can change the Knox JVM size as follows.

  1. Open /usr/hdp/current/knox-server/bin/gateway.sh and search for the APP_MEM_OPTS string.
  2. Once you find it, change it accordingly:

             APP_MEM_OPTS="-Xms2g -Xmx2g"

You need to restart the Knox gateway service for the change to take effect.
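For example, a minimal shell sketch (the gateway.sh path is the one from step 1; the sed line is just one way to apply the change and assumes APP_MEM_OPTS is set on a single line):

# Find the current setting
grep -n "APP_MEM_OPTS" /usr/hdp/current/knox-server/bin/gateway.sh

# Set 2 GB min/max heap (or edit the file by hand)
sed -i 's|APP_MEM_OPTS=.*|APP_MEM_OPTS="-Xms2g -Xmx2g"|' /usr/hdp/current/knox-server/bin/gateway.sh

# Then restart the Knox gateway (for example via Ambari) so the new heap size is picked up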


  • 0

Analyze your jobs running on top of Tez

Category : Bigdata

Sometimes we have to analyze our jobs in order to tune them or to prepare reports. We can use the following method to get the running time of each step of a job on the Tez execution engine.

You can achieve this by setting the hive.tez.exec.print.summary=true property.

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080520_e282c377-5607-4cf4-bcda-bd7010918f9c

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

——————————————————————————–

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

——————————————————————————–

Map 1 ……….   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ……   SUCCEEDED      1          1        0        0       0       0

——————————————————————————–

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.65 s     

——————————————————————————–

OK

6

Time taken: 11.027 seconds, Fetched: 1 row(s)

hive> set hive.tez.exec.print.summary=true;

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080557_28453c83-9e17-4874-852d-c5e13dd97f82

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

——————————————————————————–

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

——————————————————————————–

Map 1 ……….   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ……   SUCCEEDED      1          1        0        0       0       0

——————————————————————————–

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.36 s    

——————————————————————————–

Status: DAG finished successfully in 15.36 seconds

METHOD                         DURATION(ms)

parse                                    2

semanticAnalyze                        130

TezBuildDag                            229

TezSubmitToRunningDag                   13

TotalPrepTime                          979

VERTICES         TOTAL_TASKS  FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS   OUTPUT_RECORDS

Map 1                      1                0            0            10.64              9,350                299              6                1

Reducer 2                  1                0            0             0.41                760                  0              1                0

OK

6

Time taken: 16.478 seconds, Fetched: 1 row(s)
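If you run your queries through beeline instead of the Hive CLI, you can pass the same property per session. A sketch, assuming a plain (non-secured) HiveServer2 URL; replace the placeholder host with your own:

beeline -u "jdbc:hive2://<hiveserver2-host>:10000" \
        --hiveconf hive.tez.exec.print.summary=true \
        -e "select count(*) from cars_beeline;"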


  • 1

Import & Export in Hive

Category : Bigdata

When we work with Hive, there are many scenarios in which we need to move data (i.e. tables) from one cluster to another.

For example, sometimes we need to copy a production table from one cluster to another. Hive now gives us two easy commands to do this.

From version 0.8 onwards, Hive supports EXPORT and IMPORT statements, which allow us to export the metadata as well as the data for a table to a directory in HDFS, and then import it into another database or Hive instance.
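In short, the whole workflow is three commands: EXPORT on the source cluster, distcp of the exported directory, and IMPORT on the target. A condensed sketch of the walkthrough below, using the database, table, and HA nameservice names from this example:

# On cluster A (source): export table metadata + data to an HDFS directory
hive -e "USE samplebeelinetest; EXPORT TABLE cars_beeline TO '/tmp/cars_beeline';"

# Copy the exported directory across clusters
hadoop distcp hdfs://HDPCLUSTERAHA/tmp/cars_beeline hdfs://HDPCLUSTERBHA/tmp/cars_beeline

# On cluster B (target): import into the destination database
hive -e "USE testing; IMPORT TABLE cars_beeline FROM '/tmp/cars_beeline';"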

With the help of the following example, I have copied the table cars_beeline from cluster A to cluster B:

Cluster A: 

hive>show databases;

OK

admintestdb

default

kmsdatabase

kmstest

samplebeelinetest

samplehivetest

sampletest

sampletest1

Time taken: 2.911 seconds, Fetched: 14 row(s)

hive> use samplebeelinetest;

OK

Time taken: 0.287 seconds

hive> show tables;

OK

cars

cars_beeline

cars_internal

i0014_itm_typ

test

Time taken: 0.295 seconds, Fetched: 5 row(s)

hive> select * from cars_beeline limit 1;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

Time taken: 1.118 seconds, Fetched: 1 row(s)

hive> select * from cars_beeline limit 10;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

“chevrolet chevelle malibu” 18 8 307 130 3504 12 1970-01-01 A

“buick skylark 320” 15 8 350 165 3693 12 1970-01-01 A

“plymouth satellite” 18 8 318 150 3436 11 1970-01-01 A

“amc rebel sst” 16 8 304 150 3433 12 1970-01-01 A

“ford torino” 17 8 302 140 3449 11 1970-01-01 A

Time taken: 0.127 seconds, Fetched: 6 row(s)

hive> export table cars_beeline to '/tmp/cars_beeline';

Copying data from file:/tmp/s0998dnz/0bd6949f-c28c-4113-a9ab-eeaea4dcd434/hive_2016-07-11_05-41-39_786_4427147069708259788-1/-local-10000/_metadata

Copying file: file:/tmp/s0998dnz/0bd6949f-c28c-4113-a9ab-eeaea4dcd434/hive_2016-07-11_05-41-39_786_4427147069708259788-1/-local-10000/_metadata

Copying data from hdfs://HDPCLUSTERAHA/zone_encr2/data

Copying file: hdfs://HDPCLUSTERAHA/zone_encr2/data/cars.csv

OK

Time taken: 0.52 seconds

hive> dfs -ls /tmp/cars_beeline;

Found 2 items

-rwxrwxrwx   3 s0998dnz hdfs       1701 2016-07-11 05:41 /tmp/cars_beeline/_metadata

drwxrwxrwx   – s0998dnz hdfs          0 2016-07-11 05:41 /tmp/cars_beeline/data

Now use distcp to copy that dir from Cluster A to Cluster B: 

[root@server1 ~]$ hadoop distcp hdfs://HDPCLUSTERAHA/tmp/cars_beeline hdfs://HDPCLUSTERBHA/tmp/cars_beeline

16/07/11 05:43:09 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile=’null’, copyStrategy=’uniformsize’, sourceFileListing=null, sourcePaths=[hdfs://HDPCLUSTERAHA/tmp/cars_beeline], targetPath=hdfs://HDPCLUSTERAHA/tmp/cars_beeline, targetPathExists=false, preserveRawXattrs=false}

16/07/11 05:43:09 INFO impl.TimelineClientImpl: Timeline service address: http://server2:8188/ws/v1/timeline/

16/07/11 05:43:11 INFO impl.TimelineClientImpl: Timeline service address: http://server2:8188/ws/v1/timeline/

16/07/11 05:43:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

16/07/11 05:43:11 INFO mapreduce.JobSubmitter: number of splits:4

16/07/11 05:43:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468229364042_0002

16/07/11 05:43:11 INFO impl.YarnClientImpl: Submitted application application_1468229364042_0002

16/07/11 05:43:11 INFO mapreduce.Job: The url to track the job: http://server1:8088/proxy/application_1468229364042_0002/

16/07/11 05:43:11 INFO tools.DistCp: DistCp job-id: job_1468229364042_0002

16/07/11 05:43:11 INFO mapreduce.Job: Running job: job_1468229364042_0002

16/07/11 05:43:25 INFO mapreduce.Job: Job job_1468229364042_0002 running in uber mode : false

16/07/11 05:43:25 INFO mapreduce.Job:  map 0% reduce 0%

16/07/11 05:43:31 INFO mapreduce.Job:  map 75% reduce 0%

16/07/11 05:43:35 INFO mapreduce.Job:  map 100% reduce 0%

16/07/11 05:43:36 INFO mapreduce.Job: Job job_1468229364042_0002 completed successfully

Run the following steps on target cluster B:

[root@server1ofclusterb ~]$ hadoop fs -ls /tmp/cars_beeline;

Found 2 items

-rw-r–r–   3 s0998dnz hdfs       1701 2016-07-11 05:43 /tmp/cars_beeline/_metadata

drwxr-xr-x   – s0998dnz hdfs          0 2016-07-11 05:43 /tmp/cars_beeline/data

[root@server1ofclusterb ~]$ hive

WARNING: Use “yarn jar” to launch YARN applications.

16/07/11 05:45:22 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist

16/07/11 05:45:22 WARN conf.HiveConf: HiveConf of name hive.server2.http.port does not exist

Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties

hive> create database testing;

OK

Time taken: 0.342 seconds

hive> use testing;

OK

Time taken: 0.411 seconds

hive> IMPORT TABLE cars_beeline from '/tmp/cars_beeline';

Copying data from hdfs://HDPCLUSTERBHA/tmp/cars_beeline/data

Copying file: hdfs://HDPCLUSTERBHA/tmp/cars_beeline/data/cars.csv

Loading data to table testing.cars_beeline

OK

Time taken: 1.196 seconds

hive> show tables;

OK

cars_beeline

Time taken: 0.246 seconds, Fetched: 1 row(s)

hive> select * from cars_beeline limit 10;

OK

Name NULL NULL NULL NULL NULL NULL NULL O

“chevrolet chevelle malibu” 18 8 307 130 3504 12 1970-01-01 A

“buick skylark 320” 15 8 350 165 3693 12 1970-01-01 A

“plymouth satellite” 18 8 318 150 3436 11 1970-01-01 A

“amc rebel sst” 16 8 304 150 3433 12 1970-01-01 A

“ford torino” 17 8 302 140 3449 11 1970-01-01 A

Time taken: 0.866 seconds, Fetched: 6 row(s)

hive>

I hope this helps you move tables from one cluster to another. Please feel free to give your suggestions to improve this article.


  • 0

Ambari shows all services down though hadoop services running

Category : Bigdata

We have seen many times that our Hadoop services are up and running, but when we open Ambari it shows them all as down. This usually means the services themselves do not have any issue; it is a problem with the ambari-agent.

The Ambari server typically learns about service availability from the Ambari agent, using the '*.pid' files created in /var/run.

Suspected problem 1:

[root@sandbox ambari-agent]# ambari-agent status

Found ambari-agent PID: 12112

ambari-agent running.

Agent PID at: /var/run/ambari-agent/ambari-agent.pid

Agent out at: /var/log/ambari-agent/ambari-agent.out

Agent log at: /var/log/ambari-agent/ambari-agent.log

Now check the PID in the process list as well and compare, as below:

[root@sandbox ambari-agent]# ps -ef | grep 'ambari_agent'

root     12104     1  0 12:32 pts/0    00:00:00 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/AmbariAgent.py start

root     12112 12104  6 12:32 pts/0    00:01:28 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start

If the agent process id and /var/run/ambari-agent/ambari-agent.pid are matching, then possibly there is no issue with the agent process itself.

In that case the issue is usually with /var/lib/ambari-agent/data/structured-out-status.json, so cat this file to review its content. Typical content looks like the following:

cat structured-out-status.json
{"processes": [], "securityState": "UNKNOWN"}

or

[root@sandbox ambari-agent]# cat /var/lib/ambari-agent/data/structured-out-status.json

{"processes": [], "securityState": "UNSECURED"}

Compare the content with the same file in another node which is working fine.

Resolution :

Now delete this .json file, restart ambari-agent once again, and check that the content of the regenerated file matches the expected output shown above:

[root@sandbox ambari-agent]# rm /var/lib/ambari-agent/data/structured-out-status.json

rm: remove regular file `/var/lib/ambari-agent/data/structured-out-status.json'? y

[root@sandbox ambari-agent]# ll /var/lib/ambari-agent/data/structured-out-status.json

ls: cannot access /var/lib/ambari-agent/data/structured-out-status.json: No such file or directory

[root@sandbox ambari-agent]# ambari-agent restart

Restarting ambari-agent

Verifying Python version compatibility…

Using python  /usr/bin/python2

Found ambari-agent PID: 13866

Stopping ambari-agent

Removing PID file at /var/run/ambari-agent/ambari-agent.pid

ambari-agent successfully stopped

Verifying Python version compatibility…

Using python  /usr/bin/python2

Checking for previously running Ambari Agent…

Starting ambari-agent

Verifying ambari-agent process status…

Ambari Agent successfully started

Agent PID at: /var/run/ambari-agent/ambari-agent.pid

Agent out at: /var/log/ambari-agent/ambari-agent.out

Agent log at: /var/log/ambari-agent/ambari-agent.log

[root@sandbox ambari-agent]# ll /var/lib/ambari-agent/data/structured-out-status.json

-rw-r--r-- 1 root root 73 2016-06-29 12:59 /var/lib/ambari-agent/data/structured-out-status.json

[root@sandbox ambari-agent]# cat /var/lib/ambari-agent/data/structured-out-status.json

{"processes": [], "securityState": "UNSECURED"}

Suspected Problem 2: Ambari Agent is good, but the HDP services are still shown to be down

If only a few services are shown as down, it could be because the /var/run/PRODUCT/product.pid file does not match the process actually running on the node.

For example, if the HiveServer2 service is shown as down in Ambari while Hive is actually working fine, check the following files:

# cd /var/run/hive
# ls -lrt
-rw-r--r-- 1 hive hadoop 6 Feb 17 07:15 hive.pid
-rw-r--r-- 1 hive hadoop 6 Feb 17 07:16 hive-server.pid

Check the content of these files. For eg,

# cat hive-server.pid
31342
# ps -ef | grep 31342
hive 31342 1 0 Feb17 ? 00:14:36 /usr/jdk64/jdk1.7.0_67/bin/java -Xmx1024m -Dhdp.version=2.2.9.0-3393 -Djava.net.preferIPv4Stack=true -Dhdp.version=2.2.9.0-3393 -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.2.9.0-3393/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/2.2.9.0-3393/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -XX:MaxPermSize=512m -Xmx1437m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/hdp/2.2.9.0-3393/hive/lib/hive-service-0.14.0.2.2.9.0-3393.jar org.apache.hive.service.server.HiveServer2 -hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar -hiveconf hive.metastore.uris= -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive

If the content of hive-server.pid and the process running for HiveServer2 aren’t matching, then Ambari wouldn’t report the status correctly.

Ensure that these files have the correct ownership and permissions. For example, the PID files for Hive should be owned by hive:hadoop with mode 644. In this situation, correct the ownership/permissions and update the file with the correct PID of the Hive process. This ensures that Ambari shows the status correctly.
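A minimal sketch of that fix, assuming ps showed the real HiveServer2 running as PID 31342 (as in the output above):

# Write the real PID into the file Ambari checks, then fix ownership and mode
echo 31342 > /var/run/hive/hive-server.pid
chown hive:hadoop /var/run/hive/hive-server.pid
chmod 644 /var/run/hive/hive-server.pid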

Take care while doing the above: make sure this is the only HiveServer2 process running on the system and that HiveServer2 is indeed working fine. If there are multiple HiveServer2 processes, some of them could be stray processes that need to be killed.

After this, if possible, also restart the affected services and confirm that their statuses are shown correctly.


  • 0

How to enable debug logging for HDFS

Category : Bigdata

I have seen many times that an error does not give a clear picture of the issue and can even mislead us, and we then waste a lot of time investigating it. I have found that enabling debug mode is an easy way to troubleshoot almost any Hadoop problem, as it gives a detailed, step-by-step view of what your task is doing.

In this article I have tried to explain a process to enable debug mode.

There are two methods to enable debug mode :

  1. You can enable it at runtime, only for a specific command or job, like the following:

[root@sandbox ~]# export HADOOP_ROOT_LOGGER=DEBUG,console

[root@sandbox ~]# echo $HADOOP_ROOT_LOGGER;

DEBUG,console

[root@sandbox ~]# hadoop fs -ls /

16/06/29 10:35:34 DEBUG util.Shell: setsid exited with exit code 0

16/06/29 10:35:34 DEBUG conf.Configuration: parsing URL jar:file:/usr/hdp/2.4.0.0-169/hadoop/hadoop-common-2.7.1.2.4.0.0-169.jar!/core-default.xml

16/06/29 10:35:34 DEBUG conf.Configuration: parsing input stream sun.net.www.protocol.jar.JarURLConnection$JarURLInputStream@d1e67eb

16/06/29 10:35:34 DEBUG conf.Configuration: parsing URL file:/etc/hadoop/2.4.0.0-169/0/core-site.xml

16/06/29 10:35:34 DEBUG conf.Configuration: parsing input stream java.io.BufferedInputStream@38509e85

16/06/29 10:35:34 DEBUG security.Groups:  Creating new Groups object

16/06/29 10:35:34 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library…

16/06/29 10:35:34 DEBUG util.NativeCodeLoader: Loaded the native-hadoop library

16/06/29 10:35:34 DEBUG security.JniBasedUnixGroupsMapping: Using JniBasedUnixGroupsMapping for Group resolution

16/06/29 10:35:34 DEBUG security.JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMapping

16/06/29 10:35:34 DEBUG security.Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000; warningDeltaMs=5000

16/06/29 10:35:34 DEBUG security.UserGroupInformation: hadoop login

16/06/29 10:35:34 DEBUG security.UserGroupInformation: hadoop login commit

16/06/29 10:35:34 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: root

16/06/29 10:35:34 DEBUG security.UserGroupInformation: Using user: “UnixPrincipal: root” with name root

16/06/29 10:35:34 DEBUG security.UserGroupInformation: User entry: “root”

16/06/29 10:35:34 DEBUG security.UserGroupInformation: UGI loginUser:root (auth:SIMPLE)

16/06/29 10:35:34 DEBUG hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false

16/06/29 10:35:34 DEBUG hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = true

16/06/29 10:35:34 DEBUG hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false

16/06/29 10:35:34 DEBUG hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket

16/06/29 10:35:34 DEBUG retry.RetryUtils: multipleLinearRandomRetry = null

16/06/29 10:35:34 DEBUG ipc.Server: rpcKind=RPC_PROTOCOL_BUFFER, rpcRequestWrapperClass=class org.apache.hadoop.ipc.ProtobufRpcEngine$RpcRequestWrapper, rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@4215232f

16/06/29 10:35:34 DEBUG ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@3253bcf3

16/06/29 10:35:34 DEBUG azure.NativeAzureFileSystem: finalize() called.

16/06/29 10:35:34 DEBUG azure.NativeAzureFileSystem: finalize() called.

16/06/29 10:35:35 DEBUG unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@20282aa5: starting with interruptCheckPeriodMs = 60000

16/06/29 10:35:35 DEBUG shortcircuit.DomainSocketFactory: The short-circuit local reads feature is enabled.

16/06/29 10:35:35 DEBUG sasl.DataTransferSaslUtil: DataTransferProtocol not using SaslPropertiesResolver, no QOP found in configuration for dfs.data.transfer.protection

16/06/29 10:35:35 DEBUG ipc.Client: The ping interval is 60000 ms.

16/06/29 10:35:35 DEBUG ipc.Client: Connecting to sandbox.hortonworks.com/172.16.162.136:8020

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root: starting, having connections 1

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root sending #0

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root got value #0

16/06/29 10:35:35 DEBUG ipc.ProtobufRpcEngine: Call: getFileInfo took 52ms

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root sending #1

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root got value #1

16/06/29 10:35:35 DEBUG ipc.ProtobufRpcEngine: Call: getListing took 3ms

Found 11 items

drwxrwxrwx   – yarn   hadoop          0 2016-03-11 10:12 /app-logs

drwxr-xr-x   – hdfs   hdfs            0 2016-03-11 10:18 /apps

drwxr-xr-x   – yarn   hadoop          0 2016-03-11 10:12 /ats

drwxr-xr-x   – hdfs   hdfs            0 2016-03-11 10:41 /demo

drwxr-xr-x   – hdfs   hdfs            0 2016-03-11 10:12 /hdp

drwxr-xr-x   – mapred hdfs            0 2016-03-11 10:12 /mapred

drwxrwxrwx   – mapred hadoop          0 2016-03-11 10:12 /mr-history

drwxr-xr-x   – hdfs   hdfs            0 2016-03-11 10:33 /ranger

drwxrwxrwx   – spark  hadoop          0 2016-06-29 10:35 /spark-history

drwxrwxrwx   – hdfs   hdfs            0 2016-03-11 10:23 /tmp

drwxr-xr-x   – hdfs   hdfs            0 2016-03-11 10:24 /user

16/06/29 10:35:35 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@3253bcf3

16/06/29 10:35:35 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@3253bcf3

16/06/29 10:35:35 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3253bcf3

16/06/29 10:35:35 DEBUG ipc.Client: Stopping client

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root: closed

16/06/29 10:35:35 DEBUG ipc.Client: IPC Client (1548560986) connection to sandbox.hortonworks.com/172.16.162.136:8020 from root: stopped, remaining connections 0
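Because HADOOP_ROOT_LOGGER set this way only affects your current shell, remember to put it back once you are done troubleshooting:

# Restore the default client-side logging level
export HADOOP_ROOT_LOGGER=INFO,console
# or simply unset the variable (or open a new shell)
unset HADOOP_ROOT_LOGGER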

The other option is to edit the logger property in Ambari itself; then, whenever the service starts, it will keep writing debug logs into its respective log file.

1. Edit the hadoop-env template section

2. Define this environment variable to enable debug logging for NameNode:

export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Dhadoop.root.logger=DEBUG,DRFA"

3. Define this environment variable to enable debug logging for DataNode:

export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS} -Dhadoop.root.logger=DEBUG,DRFA"

4. Save the configuration and restart the required HDFS services as suggested by Ambari.
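After the restart you can confirm that debug logging is active by checking the service log. A sketch, assuming the usual HDP log location (adjust the path and file name to your environment):

# Count DEBUG entries written by the NameNode since the restart
grep -c DEBUG /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log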


  • 0

Backup and Restore of Postgres Database

How To Backup Postgres Database

1. Backup a single postgres database

This example backs up the erp database, which belongs to the user geekstuff, to the file mydb.sql.

$ pg_dump -U geekstuff erp -f mydb.sql


It prompts for a password; after authentication, mydb.sql is created with CREATE TABLE, ALTER TABLE, and COPY statements for all the tables in the erp database. The following is a partial output of mydb.sql showing the dump information for the employee_details table.

--
-- Name: employee_details; Type: TABLE; Schema: public; Owner: geekstuff; Tablespace:
--

CREATE TABLE employee_details (
employee_name character varying(100),
emp_id integer NOT NULL,
designation character varying(50),
comments text
);

ALTER TABLE public.employee_details OWNER TO geekstuff;

--
-- Data for Name: employee_details; Type: TABLE DATA; Schema: public; Owner: geekstuff
--
COPY employee_details (employee_name, emp_id, designation, comments) FROM stdin;
geekstuff 1001 trainer
ramesh 1002 author
sathiya 1003 reader
\.
--
-- Name: employee_details_pkey; Type: CONSTRAINT; Schema: public; Owner: geekstuff; Tablespace:
--
ALTER TABLE ONLY employee_details

ADD CONSTRAINT employee_details_pkey PRIMARY KEY (emp_id);
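If the database is large, you can compress the dump on the fly. A small sketch, assuming gzip is available on the machine running pg_dump:

$ pg_dump -U geekstuff erp | gzip > mydb.sql.gz

# Restore later by streaming it back into psql
$ gunzip -c mydb.sql.gz | psql -U geekstuff -d erp_devel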

2. Backup all postgres databases

To backup all databases, list out all the available databases as shown below.

Login as postgres / psql user:

$ su postgres

List the databases:

$ psql -l

List of databases
Name | Owner | Encoding
-----------+-----------+----------
article | sathiya | UTF8
backup | postgres | UTF8
erp | geekstuff | UTF8
geeker | sathiya | UTF8

Backup all postgres databases using pg_dumpall:

You can back up all the databases using the pg_dumpall command.

$ pg_dumpall > all.sql

Verify the backup:

Verify whether all the databases were backed up:

$ grep "^[\]connect" all.sql
\connect article
\connect backup
\connect erp
\connect geeker

3. Backup a specific postgres table

$ pg_dump --table products -U geekstuff article -f onlytable.sql

To back up a specific table, use the --table TABLENAME option of the pg_dump command. If the same table name exists in different schemas, use the --schema SCHEMANAME option.
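Another way to disambiguate tables with the same name is to schema-qualify the table name passed to --table. A sketch with a hypothetical schema called sales:

$ pg_dump -U geekstuff --table 'sales.products' article -f onlytable.sql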

How To Restore Postgres Database

1. Restore a postgres database

$ psql -U erp -d erp_devel -f mydb.sql

This restores the dumped database to the erp_devel database.

Restore error messages

While restoring, you may see the following errors and warnings, which can be ignored.

psql:mydb.sql:13: ERROR:  must be owner of schema public
psql:mydb.sql:34: ERROR:  must be member of role "geekstuff"
psql:mydb.sql:59: WARNING:  no privileges could be revoked
psql:mydb.sql:60: WARNING:  no privileges could be revoked
psql:mydb.sql:61: WARNING:  no privileges were granted
psql:mydb.sql:62: WARNING:  no privileges were granted

2. Backup a local postgres database and restore to remote server using single command:

$ pg_dump dbname | psql -h hostname dbname

The above dumps the local database and restores it into the database of the same name on the given remote host.

3. Restore all the postgres databases

$ su postgres
$ psql -f alldb.sql

4. Restore a single postgres table

The following psql command restores the product table into the geekstuff database.

$ psql -f producttable.sql geekstuff



  • 0

Hive Cross Cluster replication

Hive Cross-Cluster Replication

Here I have tried to explain cross-cluster replication with a Feed entity. This is a simple way to enforce disaster recovery policies or to aggregate data from multiple clusters into a single cluster for enterprise reporting. To further illustrate Apache Falcon’s capabilities, we will use an HCatalog/Hive table as the Feed entity.

Step 1: First create databases/tables on source and target clusters:

-- Run on primary cluster
create database landing_db;
use landing_db;
CREATE TABLE summary_table(id int, value string) PARTITIONED BY (ds string);
ALTER TABLE summary_table ADD PARTITION (ds = '2014-01');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-02');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-03');

 

insert into summary_table PARTITION(ds) values (1,'abc1',"2014-01");
insert into summary_table PARTITION(ds) values (2,'abc2',"2014-02");
insert into summary_table PARTITION(ds) values (3,'abc3',"2014-03");

 

-- Run on secondary cluster

create database archive_db;
use archive_db;
CREATE TABLE summary_archive_table(id int, value string) PARTITIONED BY (ds string);
Step 2: Now create falcon staging and working directories on both clusters:

 

hadoop fs -mkdir /apps/falcon/staging

hadoop fs -mkdir /apps/falcon/working

hadoop fs -chown falcon /apps/falcon/staging

hadoop fs -chown falcon /apps/falcon/working

hadoop fs -chmod 777 /apps/falcon/staging

hadoop fs -chmod 755 /apps/falcon/working

 

Step 3: Configure your source and target cluster for Distcp in NN High Availability. 
http://www.hadoopadmin.co.in/bigdata/distcp-between-high-availability-enabled-cluster/
In order to run distcp between two HDFS HA clusters (for example A and B), modify the following in hdfs-site.xml on both clusters:

For example, the nameservices for cluster A and cluster B are CLUSTERAHA and CLUSTERBHA respectively.
– Add both nameservices to dfs.nameservices on both clusters

dfs.nameservices = CLUSTERAHA,CLUSTERBHA

– Add property dfs.internal.nameservices
In cluster A:
dfs.internal.nameservices =CLUSTERAHA
In cluster B:
dfs.internal.nameservices =CLUSTERBHA

– Add dfs.ha.namenodes.<nameservice>
In cluster A
dfs.ha.namenodes.CLUSTERBHA = nn1,nn2
In cluster B
dfs.ha.namenodes.CLUSTERAHA = nn1,nn2

– Add property dfs.namenode.rpc-address.<cluster>.<nn>
In cluster A
dfs.namenode.rpc-address.CLUSTERBHA.nn1 =server1:8020
dfs.namenode.rpc-address.CLUSTERBHA.nn2 =server2:8020
In cluster B
dfs.namenode.rpc-address.CLUSTERAHA.nn1 =server1:8020
dfs.namenode.rpc-address.CLUSTERAHA.nn2 =server2:8020

– Add property dfs.client.failover.proxy.provider.<remote nameservice, i.e. CLUSTERAHA or CLUSTERBHA>
In cluster A
dfs.client.failover.proxy.provider.CLUSTERBHA = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
In cluster B
dfs.client.failover.proxy.provider.CLUSTERAHA = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

– Restart HDFS service.

Once complete you will be able to run the distcp command using the nameservice similar to:
hadoop distcp hdfs://falconG/tmp/testDistcp hdfs://falconE/tmp/

hadoop distcp hdfs://CLUSTERAHA/user/s0998dnz/input.txt hdfs://CLUSTERBHA/tmp/
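A quick sanity check before moving on is to confirm that each cluster can resolve the other cluster's nameservice. A sketch using the nameservices from this example:

# Should list both nameservices on either cluster
hdfs getconf -confKey dfs.nameservices

# From cluster A, the remote HA nameservice should be reachable
hadoop fs -ls hdfs://CLUSTERBHA/tmp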

Step 4: Now create the cluster entities and submit them, using cluster definitions like the samples below for the source and target clusters.

 

[s0998dnz@server1 hiveReplication]$ ll

total 24

-rw-r--r-- 1 s0998dnz hdpadm 1031 Jun 15 06:43 cluster1.xml

-rw-r--r-- 1 s0998dnz hdpadm 1030 Jun 15 05:11 cluster2.xml

-rw-r--r-- 1 s0998dnz hdpadm 1141 Jun 1 05:44 destinationCluster.xml

-rw-r--r-- 1 s0998dnz hdpadm 794 Jun 15 05:05 feed.xml

-rw-r--r-- 1 s0998dnz hdpadm 1114 Jun 1 06:36 replication-feed.xml

-rw-r--r-- 1 s0998dnz hdpadm 1080 Jun 15 05:07 sourceCluster.xml

[s0998dnz@server1 hiveReplication]$ cat cluster1.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="source" description="primary" colo="primary" xmlns="uri:falcon:cluster:0.1">
    <tags>EntityType=Cluster</tags>
    <interfaces>
        <interface type="readonly" endpoint="hdfs://CLUSTERAHA" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://CLUSTERAHA" version="2.2.0"/>
        <interface type="execute" endpoint="server2:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://server1:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://server2:61616?daemon=true" version="5.1.6"/>
        <interface type="registry" endpoint="thrift://server2:9083" version="1.2.1" />
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/working"/>
    </locations>
</cluster>

 

[s0998dnz@server2 hiveReplication]$ cat cluster2.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="target" description="target" colo="backup" xmlns="uri:falcon:cluster:0.1">
    <tags>EntityType=Cluster</tags>
    <interfaces>
        <interface type="readonly" endpoint="hdfs://CLUSTERBHA" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://CLUSTERBHA" version="2.2.0"/>
        <interface type="execute" endpoint="server2:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://server2:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://server2:61616?daemon=true" version="5.1.6"/>
        <interface type="registry" endpoint="thrift://server2:9083" version="1.2.1" />
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/working"/>
    </locations>
</cluster>

 

falcon entity -type cluster -submit -file cluster1.xml

falcon entity -type cluster -submit -file cluster2.xml
Step 5: Copy the updated configuration files (/etc/hadoop/conf/*) from the source cluster to the target cluster's Oozie server.

 

zip -r sourceClusterConf1.zip /etc/hadoop/conf/

scp sourceClusterConf1.zip s0998dnz@server1:/home/s0998dnz/
Step 6: On the target Oozie server, run the following commands.

 

mkdir -p /hdptmp/hadoop_primary/conf

chmod 777 /hdptmp/hadoop_primary/conf

unzip sourceClusterConf1.zip

cp etc/hadoop/conf/* /hdptmp/hadoop_primary/conf/

cp -r etc/hadoop/conf/* /hdptmp/hadoop_primary/conf/
Step 7: Once you have copied the configuration, modify the below property in your target cluster's Oozie configuration.

<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*={{hadoop_conf_dir}},server2:8050=/hdptmp/hadoop_primary/conf,server1:8050=/hdptmp/hadoop_primary/conf,server1:8020=/hdptmp/hadoop_primary/conf,server2:8020=/hdptmp/hadoop_primary/conf</value>
</property>

Note: You can change /hdptmp/hadoop_primary/conf to a directory of your choice; however, Oozie must have access to that path.
Step 8: Finally, submit and schedule the feed definition using the feed.xml file below.

 

[s0998dnz@server1 hiveReplication]$ cat feed.xml

<?xml version="1.0" encoding="UTF-8"?>
<feed description="Monthly Analytics Summary" name="replication-feed" xmlns="uri:falcon:feed:0.1">
    <tags>EntityType=Feed</tags>
    <frequency>months(1)</frequency>
    <clusters>
        <cluster name="source" type="source">
            <validity start="2014-01-01T00:00Z" end="2015-03-31T00:00Z"/>
            <retention limit="months(36)" action="delete"/>
        </cluster>
        <cluster name="target" type="target">
            <validity start="2014-01-01T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="months(180)" action="delete"/>
            <table uri="catalog:archive_db:summary_archive_table#ds=${YEAR}-${MONTH}" />
        </cluster>
    </clusters>
    <table uri="catalog:landing_db:summary_table#ds=${YEAR}-${MONTH}" />
    <ACL owner="falcon" />
    <schema location="hcat" provider="hcat"/>
</feed>

falcon entity -type feed -submit -file feed.xml

falcon entity -type feed -schedule -name replication-feed
This feed has been scheduled from 2014-01, so insert values into your source table (as in Step 1) to trigger replication.
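To check that the feed was accepted and scheduled, the Falcon CLI can be queried (a sketch; exact options and output vary slightly by Falcon version):

falcon entity -type feed -list
falcon entity -type feed -name replication-feed -status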

 


  • 0

How to read compressed data from hdfs through hadoop command

Category : Bigdata

Sometimes we have a requirement to read compressed data from HDFS through the hadoop command line, and there are many compression formats such as .gz, .snappy, .lzo, and .bz2.

I have tried to explain how we can achieve this in the following steps:

Step 1: Copy any compressed file to your HDFS directory:

[s0998dnz@hdpm1 ~]$ hadoop fs -put logs.tar.gz /tmp/

Step 2: Now you can use the built-in hdfs text command to read this .gz file. This command automatically finds the right decompressor for any simple text file and prints the uncompressed data to standard output:

[user1@hdpm1 ~]$ hadoop fs -text /tmp/logs.tar.gz

var/log/hadoop/hdfs/gc.log-2016052306240000644002174336645170000001172412720563430016153 0ustar   hdfshadoop
2016-05-23T06:24:03.539-0400: 2.104: [GC2016-05-23T06:24:03.540-0400: 2.104: [ParNew: 163840K->14901K(184320K), 0.0758510 secs] 163840K->14901K(33533952K), 0.0762040 secs] [Times: user=0.51 sys=0.01, real=0.08 secs]
2016-05-23T06:24:04.613-0400: 3.178: [GC2016-05-23T06:24:04.613-0400: 3.178: [ParNew: 178741K->16370K(184320K), 0.1591140 secs] 965173K->882043K(33533952K), 0.1592230 secs] [Times: user=1.21 sys=0.03, real=0.16 secs]
2016-05-23T06:24:06.121-0400: 4.686: [GC2016-05-23T06:24:06.121-0400: 4.686: [ParNew: 180210K->11741K(184320K), 0.0811950 secs] 1045883K->887215K(33533952K), 0.0813160 secs] [Times: user=0.63 sys=0.00, real=0.09 secs]
2016-05-23T06:24:12.313-0400: 10.878: [GC2016-05-23T06:24:12.313-0400: 10.878: [ParNew: 175581K->9827K(184320K), 0.0751580 secs] 1051055K->892704K(33533952K), 0.0752800 secs] [Times: user=0.56 sys=0.01, real=0.07 secs]
2016-05-23T06:24:13.881-0400: 12.445: [GC2016-05-23T06:24:13.881-0400: 12.445: [ParNew: 173667K->20480K(184320K), 0.0810330 secs] 1056544K->920485K(33533952K), 0.0812040 secs] [Times: user=0.58 sys=0.01, real=0.08 secs]
2016-05-23T06:24:16.515-0400: 15.080: [GC2016-05-23T06:24:16.515-0400: 15.080: [ParNew: 184320K->13324K(184320K), 0.0867770 secs] 1084325K->931076K(33533952K), 0.0870140 secs] [Times: user=0.63 sys=0.01, real=0.08 secs]
2016-05-23T06:24:17.268-0400: 15.833: [GC2016-05-23T06:24:17.268-0400: 15.833: [ParNew: 177164K->11503K(184320K), 0.0713880 secs] 1094916K->929256K(33533952K), 0.0715820 secs] [Times: user=0.55 sys=0.00, real=0.07 secs]
2016-05-23T06:25:14.412-0400: 72.977: [GC2016-05-23T06:25:14.412-0400: 72.977: [ParNew: 175343K->18080K(184320K), 0.0779040 secs] 1093096K->935833K(33533952K), 0.0781710 secs] [Times: user=0.59 sys=0.01, real=0.07 secs]
2016-05-23T06:26:49.597-0400: 168.161: [GC2016-05-23T06:26:49.597-0400: 168.162: [ParNew: 181920K->13756K(184320K), 0.0839120 secs] 1099673K->941811K(33533952K), 0.0841350 secs] [Times: user=0.62 sys=0.01, real=0.08 secs]
2016-05-23T06:26:50.126-0400: 168.691: [GC2016-05-23T06:26:50.127-0400: 168.691: [ParNew: 177596K->9208K(184320K), 0.0641380 secs] 1105651K->937264K(33533952K), 0.0644310 secs] [Times: user=0.50 sys=0.00, real=0.07 secs]
2016-05-23T06:27:19.282-0400: 197.846: [GC2016-05-23T06:27:19.282-0400: 197.847: [ParNew: 173048K->10010K(184320K), 0.0687210 secs] 1101104K->938065K(33533952K), 0.0689210 secs] [Times: user=0.54 sys=0.00, real=0.07 secs]
2016-05-23T06:30:45.428-0400: 403.992: [GC2016-05-23T06:30:45.428-0400: 403.992: [ParNew: 173850K->9606K(184320K), 0.0723210 secs] 1101905K->937661K(33533952K), 0.0726160 secs] [Times: user=0.56 sys=0.00, real=0.07 secs]
2016-05-23T06:37:15.629-0400: 794.193: [GC2016-05-23T06:37:15.629-0400: 794.193: [ParNew: 173446K->9503K(184320K), 0.0723460 secs] 1101501K->937558K(33533952K), 0.0726260 secs] [Times: user=0.57 sys=0.0

In the above example I read a .gz file. The same approach should also work for .snappy, .lzo, and .bz2 files, as long as the corresponding codecs are available on the cluster.
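For example, a bzip2-compressed file can be read the same way, since the bzip2 codec ships with Hadoop (the file name here is just an example):

[s0998dnz@hdpm1 ~]$ hadoop fs -put application.log.bz2 /tmp/
[s0998dnz@hdpm1 ~]$ hadoop fs -text /tmp/application.log.bz2 | head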

This is an important feature because Hadoop uses a custom file format for Snappy files. This is the only direct way to uncompress a Hadoop-created Snappy file.

Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.