Hadoop Archive Files – HAR
Category : HDFS
Hadoop archive files, or HAR files, are a facility for packing HDFS files into archives. They are a good option for storing a large number of small files, because keeping many small files directly in HDFS is inefficient: every file, directory, and block occupies memory in the NameNode.
Another advantage of HAR files is that they can be used directly as input to MapReduce jobs.
Suppose we have two files in /user/saurkuma/ and we want to archive them.
[root@m1 ~]# hadoop fs -ls /user/saurkuma/
Found 2 items
-rw-r--r-- 3 root hdfs 234 2016-09-20 20:42 /user/saurkuma/test.json
-rw-r--r-- 3 root hdfs 9 2016-09-20 20:42 /user/saurkuma/users.txt
[root@m1 ~]# hadoop fs -cat /user/saurkuma/users.txt
saurkuma
[root@m1 ~]# hadoop fs -cat /user/saurkuma/test.json
{"scedulerInfo": {
"type": "capacityScheduler",
"capacity": 100,
"usedCapacity": 0,
"maxCapacity": 100,
"queueName": "root",
"queues": "test1",
"health": "test"
}}
A Hadoop archive can be created with the command below; it triggers a MapReduce job.
[root@m1 ~]# hadoop archive -archiveName testing.har -p /user/saurkuma/ /test
16/09/20 20:36:59 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/
16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/
16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/
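For reference, the general form of the archive command is shown below: -p sets the parent path against which the source paths are resolved, and on recent Hadoop versions an optional -r sets the replication factor of the archive.
hadoop archive -archiveName <name>.har -p <parent path> [-r <replication>] <src>* <dest>
In the run above no explicit source path was given, so the whole parent directory /user/saurkuma/ was archived into /test/testing.har.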
You can see the archived data at the target location with the command below.
[root@m1 ~]# hadoop fs -ls /test/
Found 1 items
drwxr-xr-x - root hdfs 0 2016-09-20 20:37 /test/testing.har
[root@m1 ~]# hadoop fs -ls /test/testing.har
Found 4 items
-rw-r--r-- 3 root hdfs 0 2016-09-20 20:37 /test/testing.har/_SUCCESS
-rw-r--r-- 5 root hdfs 474 2016-09-20 20:37 /test/testing.har/_index
-rw-r--r-- 5 root hdfs 23 2016-09-20 20:37 /test/testing.har/_masterindex
-rw-r--r-- 3 root hdfs 20710951 2016-09-20 20:37 /test/testing.har/part-0
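The _index, _masterindex, and part files are internal to the archive. To browse it with the original file names, we can go through the har:// URI scheme instead; a quick sketch (output omitted):
[root@m1 ~]# hadoop fs -ls har:///test/testing.har
[root@m1 ~]# hadoop fs -cat har:///test/testing.har/users.txt
The first command should list test.json and users.txt inside the archive, and the second should print the original contents of users.txt.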
The part files contain the contents of the original files concatenated together, and the index files (_index and _masterindex) record the offset and length of each original file within the part file.
We can see that part-0 holds the concatenated data from test.json and users.txt.
[root@m1 ~]# hadoop fs -cat /test/testing.har/part-0
{"scedulerInfo": {
"type": "capacityScheduler",
"capacity": 100,
"usedCapacity": 0,
"maxCapacity": 100,
"queueName": "root",
"queues": "test1",
"health": "test"
}}
saurkuma
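If the original files are ever needed back, they can be copied out of the archive through the same har:// scheme; the destination directory /user/saurkuma/restored below is only an example:
[root@m1 ~]# hadoop fs -mkdir /user/saurkuma/restored
[root@m1 ~]# hadoop fs -cp har:///test/testing.har/users.txt /user/saurkuma/restored/
[root@m1 ~]# hadoop distcp har:///test/testing.har /user/saurkuma/restored
A plain cp is fine for one or two files; distcp runs a MapReduce job and is the usual way to unarchive a large HAR in parallel.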
To delete a HAR file, we need to use the recursive form of remove, as shown below.
[root@m1 ~]# hadoop fs -rmr /test/testing.har
rmr: DEPRECATED: Please use 'rm -r' instead.
16/09/20 20:43:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://HDPTSTHA/test/testing.har' to trash at: hdfs://HDPTSTHA/user/root/.Trash/Current
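Note that trash is enabled on this cluster, so the archive was only moved to .Trash rather than deleted immediately. To bypass trash and release the space right away, -skipTrash can be used, for example:
[root@m1 ~]# hadoop fs -rm -r -skipTrash /test/testing.har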
Limitations of HAR Files:
- Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the total size of the files being archived. The original files can be deleted after the archive is created to reclaim space.
- Archives are immutable. Once an archive has been created, adding or removing files requires re-creating the whole archive.
- HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple files into a single split, so processing lots of small files, even inside a HAR file, still requires many map tasks, which is inefficient (see the example below).
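To illustrate that last point, a HAR is passed to a MapReduce job through the har:// scheme like any other input directory. A quick sketch using the bundled wordcount example, assuming the archive still exists (the examples jar path is distribution-specific; the one below is typical for HDP):
[root@m1 ~]# hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount har:///test/testing.har /test/wc-out
Each archived file still produces at least one input split, which is why a HAR helps with NameNode memory but not with the per-file overhead at processing time.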