Hadoop Archive Files – HAR

Hadoop archive files, or HAR files, are a facility for packing HDFS files into archives. They are a good option for storing a large number of small files in HDFS, because storing many small files directly is inefficient: every file, however small, consumes memory in the NameNode.

The advantage of HAR files is that they can be used directly as input to MapReduce jobs.
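For example, assuming the archive built later in this post (/test/testing.har) exists, the stock WordCount example can read it directly through the har:// scheme. The examples jar path below is typical of an HDP install and the output directory is arbitrary; adjust both for your environment:

# Jar path and output directory are assumptions; adjust for your cluster.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount har:///test/testing.har /tmp/har_wordcount_out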

 

Suppose we have two files in /user/saurkuma/ and we want to archive them.

[root@m1 ~]# hadoop fs -ls /user/saurkuma/

Found 2 items

-rw-r--r--   3 root hdfs        234 2016-09-20 20:42 /user/saurkuma/test.json

-rw-r--r--   3 root hdfs          9 2016-09-20 20:42 /user/saurkuma/users.txt

[root@m1 ~]# hadoop fs -cat /user/saurkuma/users.txt

saurkuma

[root@m1 ~]# hadoop fs -cat /user/saurkuma/test.json

{"scedulerInfo": {

         "type": "capacityScheduler",

         "capacity": 100,

         "usedCapacity": 0,

         "maxCapacity": 100,

         "queueName": "root",

         "queues": "test1",

         "health": "test"

}}

Hadoop archive files are created with the hadoop archive command, which runs a MapReduce job behind the scenes.
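The general form of the command is shown below; <src> may appear multiple times, all source paths are resolved relative to the directory given with -p, and if no <src> is supplied the whole parent directory is archived (as in the example that follows):

# Placeholders in angle brackets; the archive is written as a directory named <name>.har under <dest>.
hadoop archive -archiveName <name>.har -p <parent path> <src>* <dest>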

[root@m1 ~]# hadoop archive -archiveName testing.har -p /user/saurkuma/ /test

16/09/20 20:36:59 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

You can see the archived data at the target location with the command below.

[root@m1 ~]# hadoop fs -ls /test/

Found 1 items

drwxr-xr-x   - root hdfs          0 2016-09-20 20:37 /test/testing.har

[root@m1 ~]# hadoop fs -ls /test/testing.har

Found 4 items

-rw-r--r--   3 root hdfs          0 2016-09-20 20:37 /test/testing.har/_SUCCESS

-rw-r--r--   5 root hdfs        474 2016-09-20 20:37 /test/testing.har/_index

-rw-r--r--   5 root hdfs         23 2016-09-20 20:37 /test/testing.har/_masterindex

-rw-r--r--   3 root hdfs   20710951 2016-09-20 20:37 /test/testing.har/part-0

The part files contain the contents of the original files concatenated together, while the _index and _masterindex files record the offset and length of each original file within the part file.
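Both index files are ordinary HDFS files and can be inspected directly if you are curious about the mapping; their exact line format is an internal detail that varies between Hadoop versions, so the output is not reproduced here:

# The index files are plain HDFS files; output omitted (format is version-dependent).
hadoop fs -cat /test/testing.har/_index
hadoop fs -cat /test/testing.har/_masterindex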

We can see the contents of part-0 as the concatenation of test.json and users.txt:

[root@m1 ~]# hadoop fs -cat /test/testing.har/part-0

{"scedulerInfo": {

         "type": "capacityScheduler",

         "capacity": 100,

         "usedCapacity": 0,

         "maxCapacity": 100,

         "queueName": "root",

         "queues": "test1",

         "health": "test"

}}

saurkuma
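Even though part-0 stores everything back to back, the archive can still be browsed file by file through the har:// URI scheme, which uses the index files to locate each original file (paths match this example; output omitted):

# List and read individual files inside the archive via the HarFileSystem.
hadoop fs -ls -R har:///test/testing.har
hadoop fs -cat har:///test/testing.har/users.txt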

To delete a HAR file, we need to use the recursive form of remove, because the archive is actually a directory:

[root@m1 ~]# hadoop fs -rmr /test/testing.har

rmr: DEPRECATED: Please use 'rm -r' instead.

16/09/20 20:43:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.

Moved: 'hdfs://HDPTSTHA/test/testing.har' to trash at: hdfs://HDPTSTHA/user/root/.Trash/Current
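As the output shows, the archive is only moved to the trash directory. To free the space immediately, the standard -skipTrash option can be added:

# Permanently delete the archive instead of moving it to .Trash.
hadoop fs -rm -r -skipTrash /test/testing.har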

Limitations of HAR Files:
  • Creating a HAR file makes a copy of the original files, so you need as much additional disk space as the files being archived occupy. The original files can be deleted after the archive is created to reclaim that space.
  • Archives are immutable. Once an archive is created, adding or removing files requires re-creating the archive (see the sketch after this list for one way to unpack an archive before rebuilding it).
  • HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that packs multiple archived files into a single split, so processing many small files, even inside a HAR, still spawns many map tasks, which is inefficient.
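Because of the second limitation, a common workaround is to unpack the archive in parallel with DistCp, modify the files, and then re-create the archive. The destination directory below is only illustrative:

# Unpack the archive contents to a regular HDFS directory (destination path assumed).
hadoop distcp har:///test/testing.har /user/saurkuma/restored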