How to read compressed data from hdfs through hadoop command


Category : Bigdata

Sometimes we have a requirement to read compressed data from HDFS directly through a hadoop command. There are many compression formats in use, such as .gz, .snappy, .lzo, and .bz2.

Below I explain how we can achieve this with the following steps:

Step 1: Copy any compressed file to your HDFS directory:

[s0998dnz@hdpm1 ~]$ hadoop fs -put logs.tar.gz /tmp/
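If you need a test file to follow along, a minimal sketch is shown below; it assumes some Hadoop log files exist locally under /var/log/hadoop/hdfs (an assumption), and the last command simply confirms the upload:

tar -czf logs.tar.gz /var/log/hadoop/hdfs/*.log    # build a test archive from local logs (assumed paths)
hadoop fs -ls /tmp/logs.tar.gz                     # verify the file landed in HDFS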

Step 2: Now you can use the built-in hadoop fs -text command to read this .gz file. The command automatically finds the right decompressor for any simple text file and prints the uncompressed data to standard output:

[user1@hdpm1 ~]$ hadoop fs -text /tmp/logs.tar.gz
var/log/hadoop/hdfs/gc.log-2016052306240000644002174336645170000001172412720563430016153 0ustar   hdfshadoop
2016-05-23T06:24:03.539-0400: 2.104: [GC2016-05-23T06:24:03.540-0400: 2.104: [ParNew: 163840K->14901K(184320K), 0.0758510 secs] 163840K->14901K(33533952K), 0.0762040 secs] [Times: user=0.51 sys=0.01, real=0.08 secs]
2016-05-23T06:24:04.613-0400: 3.178: [GC2016-05-23T06:24:04.613-0400: 3.178: [ParNew: 178741K->16370K(184320K), 0.1591140 secs] 965173K->882043K(33533952K), 0.1592230 secs] [Times: user=1.21 sys=0.03, real=0.16 secs]
2016-05-23T06:24:06.121-0400: 4.686: [GC2016-05-23T06:24:06.121-0400: 4.686: [ParNew: 180210K->11741K(184320K), 0.0811950 secs] 1045883K->887215K(33533952K), 0.0813160 secs] [Times: user=0.63 sys=0.00, real=0.09 secs]
2016-05-23T06:24:12.313-0400: 10.878: [GC2016-05-23T06:24:12.313-0400: 10.878: [ParNew: 175581K->9827K(184320K), 0.0751580 secs] 1051055K->892704K(33533952K), 0.0752800 secs] [Times: user=0.56 sys=0.01, real=0.07 secs]
2016-05-23T06:24:13.881-0400: 12.445: [GC2016-05-23T06:24:13.881-0400: 12.445: [ParNew: 173667K->20480K(184320K), 0.0810330 secs] 1056544K->920485K(33533952K), 0.0812040 secs] [Times: user=0.58 sys=0.01, real=0.08 secs]
2016-05-23T06:24:16.515-0400: 15.080: [GC2016-05-23T06:24:16.515-0400: 15.080: [ParNew: 184320K->13324K(184320K), 0.0867770 secs] 1084325K->931076K(33533952K), 0.0870140 secs] [Times: user=0.63 sys=0.01, real=0.08 secs]
2016-05-23T06:24:17.268-0400: 15.833: [GC2016-05-23T06:24:17.268-0400: 15.833: [ParNew: 177164K->11503K(184320K), 0.0713880 secs] 1094916K->929256K(33533952K), 0.0715820 secs] [Times: user=0.55 sys=0.00, real=0.07 secs]
2016-05-23T06:25:14.412-0400: 72.977: [GC2016-05-23T06:25:14.412-0400: 72.977: [ParNew: 175343K->18080K(184320K), 0.0779040 secs] 1093096K->935833K(33533952K), 0.0781710 secs] [Times: user=0.59 sys=0.01, real=0.07 secs]
2016-05-23T06:26:49.597-0400: 168.161: [GC2016-05-23T06:26:49.597-0400: 168.162: [ParNew: 181920K->13756K(184320K), 0.0839120 secs] 1099673K->941811K(33533952K), 0.0841350 secs] [Times: user=0.62 sys=0.01, real=0.08 secs]
2016-05-23T06:26:50.126-0400: 168.691: [GC2016-05-23T06:26:50.127-0400: 168.691: [ParNew: 177596K->9208K(184320K), 0.0641380 secs] 1105651K->937264K(33533952K), 0.0644310 secs] [Times: user=0.50 sys=0.00, real=0.07 secs]
2016-05-23T06:27:19.282-0400: 197.846: [GC2016-05-23T06:27:19.282-0400: 197.847: [ParNew: 173048K->10010K(184320K), 0.0687210 secs] 1101104K->938065K(33533952K), 0.0689210 secs] [Times: user=0.54 sys=0.00, real=0.07 secs]
2016-05-23T06:30:45.428-0400: 403.992: [GC2016-05-23T06:30:45.428-0400: 403.992: [ParNew: 173850K->9606K(184320K), 0.0723210 secs] 1101905K->937661K(33533952K), 0.0726160 secs] [Times: user=0.56 sys=0.00, real=0.07 secs]
2016-05-23T06:37:15.629-0400: 794.193: [GC2016-05-23T06:37:15.629-0400: 794.193: [ParNew: 173446K->9503K(184320K), 0.0723460 secs] 1101501K->937558K(33533952K), 0.0726260 secs] [Times: user=0.57 sys=0.0
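Note that because this file is a .tar.gz, hadoop fs -text only strips the gzip layer, which is why the tar header (var/log/hadoop/hdfs/gc.log-... 0ustar hdfshadoop) is still visible at the start of the output. If you want the individual files rather than the raw tar stream, you can pipe the output into tar, for example:

hadoop fs -text /tmp/logs.tar.gz | tar -tvf -    # list the archive members
hadoop fs -text /tmp/logs.tar.gz | tar -xvf -    # extract them into the current local directory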

In the above example I read a .gz file. The same command works for .snappy, .lzo, and .bz2 files as well, provided the corresponding codecs are available on the cluster.
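For example, assuming files with these extensions already exist under /tmp (the paths below are only placeholders), the calls look identical:

hadoop fs -text /tmp/part-00000.snappy | head    # hypothetical Snappy file
hadoop fs -text /tmp/data.bz2 | head             # hypothetical bzip2 file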

This is an important feature because Hadoop uses its own container format for Snappy files. hadoop fs -text is the only direct way to uncompress a Hadoop-created Snappy file from the command line.
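If you want to keep the uncompressed data rather than just view it, you can redirect the output to a local file or stream it back into HDFS; the target paths below are assumptions:

hadoop fs -text /tmp/data.snappy > data.txt                          # save the decompressed data locally
hadoop fs -text /tmp/data.snappy | hadoop fs -put - /tmp/data.txt    # or write it back to HDFS uncompressed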

Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.
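If a particular format fails with a codec error, you can check which native compression libraries are available on the node (output varies by installation):

hadoop checknative -a    # reports whether zlib, snappy, lz4, bzip2, etc. are loaded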

