“INSERT OVERWRITE” functional details

If the OVERWRITE keyword is used, the contents of the target table (or partition) are deleted and replaced by the files referred to by filepath; otherwise, the files referred to by filepath are added to the table.

  • Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, the existing file will be replaced with the new file.
  • When Hive runs “INSERT OVERWRITE” against a partition of an external table whose directory already exists, its behavior depends on whether the partition definition already exists in the metastore:

1) If the partition definition does not exist in the metastore, Hive will not try to guess where the target partition directories are (whether the partitions are static or dynamic), so it will not be able to delete the existing files under the partitions that are being written to

2) If the partition definition does exist, Hive will remove all files under the target partition directory before writing new data into it
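
A quick way to check which case applies is to list the partition definitions Hive knows about (a minimal sketch, using the partition_test table created in Step 3 below):

SHOW PARTITIONS partition_test;
-- If p=p1 is not listed, Hive falls into case 1) above and will not
-- clean up existing files under that partition directory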

You can reproduce this issue with the following steps.

Step 1: Log in as the “hdfs” user and run the following commands:

hdfs dfs -mkdir test
hdfs dfs -mkdir test/p=p1
echo test > test.txt   # writes a 5-byte file, matching the listing in Step 2
hdfs dfs -put test.txt test/p=p1

Step 2: Confirm that there is one file under test/p=p1:

hdfs dfs -ls test/p=p1
Found 1 items
-rw-r--r-- 3 hdfs supergroup 5 2015-05-04 17:30 test/p=p1/test.txt

Step 3: Start “hive” and run the following queries:

DROP TABLE IF EXISTS partition_test;
CREATE EXTERNAL TABLE partition_test (a int) PARTITIONED BY (p string) LOCATION '/user/hdfs/test';
INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT <int_column> FROM <existing_table>;
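
Here <int_column> and <existing_table> stand for any existing table with an int column. For a self-contained run, a hypothetical helper table could be created first (a sketch; the VALUES clause assumes Hive 0.14 or later):

CREATE TABLE IF NOT EXISTS src_ints (a int);      -- hypothetical helper table
INSERT INTO TABLE src_ints VALUES (1), (2), (3);  -- VALUES needs Hive 0.14+
INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT a FROM src_ints;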

The output from the above “INSERT OVERWRITE”:
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430100146027_0004, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job -kill job_1430100146027_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:15:35,220 Stage-1 map = 0%, reduce = 0%
2015-05-05 00:15:48,740 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.19 sec
MapReduce Total cumulative CPU time: 3 seconds 190 msec
Ended Job = job_1430100146027_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-13-47_253_4887262776207257351-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Partition default.partition_test{p=p1} stats: [numFiles=2, numRows=33178, totalSize=194973, rawDataSize=161787]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.19 sec HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 190 msec

Confirm that test.txt was not removed:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x 3 hdfs supergroup 194965 2015-05-05 00:15 test/p=p1/000000_0
-rw-r--r-- 3 hdfs supergroup 8 2015-05-05 00:10 test/p=p1/test.txt

Rename 000000_0 to 11111111:

hdfs dfs -mv test/p=p1/000000_0 test/p=p1/11111111

Confirm that there are now two files under test/p=p1:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x 3 hdfs supergroup 194965 2015-05-05 00:15 test/p=p1/11111111
-rw-r--r-- 3 hdfs supergroup 8 2015-05-05 00:10 test/p=p1/test.txt

Step 4: Run the following query again:

INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT <int_column> FROM <existing_table>;

The output from second “INSERT OVERWRITE”:

Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430100146027_0005, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job -kill job_1430100146027_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:23:39,298 Stage-1 map = 0%, reduce = 0%
2015-05-05 00:23:48,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.92 sec
MapReduce Total cumulative CPU time: 2 seconds 920 msec
Ended Job = job_1430100146027_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-21-58_505_3688057093497278728-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/11111111' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/test.txt' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Partition default.partition_test{p=p1} stats: [numFiles=1, numRows=33178, totalSize=194965, rawDataSize=161787]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.92 sec HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 920 msec

Step 5: Finally, confirm that only one file remains under the test/p=p1 directory; both 11111111 and test.txt were moved to the .Trash directory:

hdfs dfs -ls test/p=p1
Found 1 items
-rwxr-xr-x 3 hdfs supergroup 4954 2015-05-04 17:36 test/p=p1/000000_0

The above test confirms that existing files remain in the target partition directory when the table is newly created and its partition definitions are missing from the metastore. The first “INSERT OVERWRITE” added the p=p1 definition to the metastore, which is why the second run was able to clean up the old files.
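
This can be verified from the metastore side: after the first “INSERT OVERWRITE”, the partition shows up (a minimal check against the repro table):

SHOW PARTITIONS partition_test;
-- expected output:
-- p=p1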

Resolution: To fix this issue, run the following Hive query before the “INSERT OVERWRITE” to recover the missing partition definitions:

MSCK REPAIR TABLE partition_test;

OK
Partitions not in metastore: partition_test:p=p1
Repair: Added partition to metastore partition_test:p=p1
Time taken: 0.486 seconds, Fetched: 2 row(s)
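
Alternatively, when only a known set of partitions is missing, they can be registered explicitly instead of having MSCK REPAIR scan the entire table location (a sketch using the same table and partition):

ALTER TABLE partition_test ADD IF NOT EXISTS PARTITION (p = 'p1');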

Ref: http://www.ericlin.me/hive-insert-overwrite-does-not-remove-existing-data

