Monthly Archives: September 2016


“INSERT OVERWRITE” functional details

If the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred to by filepath will be added to the table.

  • Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, then the existing file will be replaced with the new file.
  • When Hive runs “INSERT OVERWRITE” into a partition of an external table whose directory already exists, it behaves differently depending on whether the partition definition already exists in the metastore:

1) If the partition definition does not exist, Hive will not try to guess where the target partition directories are (for either static or dynamic partitions), so it will not be able to delete the existing files under the partitions being written to.

2) If the partition definition does exist, Hive will attempt to remove all files under the target partition directory before writing new data into it.

You can reproduce this issue with the following steps.

Step 1: Log in as the “hdfs” user and run the following commands:

hdfs dfs -mkdir test
hdfs dfs -mkdir test/p=p1
touch test.txt
hdfs dfs -put test.txt test/p=p1

Step 2: Confirm that there is one file under test/p=p1

hdfs dfs -ls test/p=p1
Found 1 items
-rw-r--r-- 3 hdfs supergroup 5 2015-05-04 17:30 test/p=p1/test.txt

Step 3: Start “hive” and run the following statements:
DROP TABLE IF EXISTS partition_test;
CREATE EXTERNAL TABLE partition_test (a int) PARTITIONED BY (p string) LOCATION '/user/hdfs/test';
INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT <int_column> FROM <existing_table>;

The output from the above “INSERT OVERWRITE”:
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_1430100146027_0004, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job -kill job_1430100146027_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:15:35,220 Stage-1 map = 0%, reduce = 0%
2015-05-05 00:15:48,740 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.19 sec
MapReduce Total cumulative CPU time: 3 seconds 190 msec
Ended Job = job_1430100146027_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-13-47_253_4887262776207257351-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Partition default.partition_test{p=p1} stats: [numFiles=2, numRows=33178, totalSize=194973, rawDataSize=161787]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.19 sec HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 190 msec

Confirm that test.txt was not removed:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x 3 hdfs supergroup 194965 2015-05-05 00:15 test/p=p1/000000_0
-rw-r--r-- 3 hdfs supergroup 8 2015-05-05 00:10 test/p=p1/test.txt

Rename 000000_0 to 11111111:

hdfs dfs -mv test/p=p1/000000_0 test/p=p1/11111111
Confirm that there are now two files under test/p=p1:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x 3 hdfs supergroup 194965 2015-05-05 00:15 test/p=p1/11111111
-rw-r--r-- 3 hdfs supergroup 8 2015-05-05 00:10 test/p=p1/test.txt

Step 4: Run the following query again:

INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT <int_column> FROM <existing_table>;

The output from second “INSERT OVERWRITE”:

Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_1430100146027_0005, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job -kill job_1430100146027_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:23:39,298 Stage-1 map = 0%, reduce = 0%
2015-05-05 00:23:48,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.92 sec
MapReduce Total cumulative CPU time: 2 seconds 920 msec
Ended Job = job_1430100146027_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-21-58_505_3688057093497278728-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/11111111' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/test.txt' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Partition default.partition_test{p=p1} stats: [numFiles=1, numRows=33178, totalSize=194965, rawDataSize=161787]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.92 sec HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 920 msec

Step 5: Finally, confirm that only one file remains under the test/p=p1 directory; both 11111111 and test.txt were moved to the .Trash directory.

hdfs dfs -ls test/p=p1
Found 1 items
-rwxr-xr-x 3 hdfs supergroup 4954 2015-05-04 17:36 test/p=p1/000000_0

The above test confirms that existing files remain in the target partition directory when the table is newly created and the partition definition is not yet in the metastore; once the partition has been registered, a subsequent “INSERT OVERWRITE” removes them.

Resolution: To fix this issue, you can run the following Hive query before the “INSERT OVERWRITE” to recover the missing partition definitions:

MSCK REPAIR TABLE partition_test;

OK
Partitions not in metastore: partition_test:p=p1
Repair: Added partition to metastore partition_test:p=p1
Time taken: 0.486 seconds, Fetched: 2 row(s)
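Alternatively, if only one partition is involved, you can register it explicitly instead of scanning the whole table. A minimal sketch, using the table and path from this example:

ALTER TABLE partition_test ADD IF NOT EXISTS PARTITION (p='p1') LOCATION '/user/hdfs/test/p=p1';

Once the partition definition exists in the metastore, the next “INSERT OVERWRITE” cleans up the old files as shown in Step 4.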

Ref : http://www.ericlin.me/hive-insert-overwrite-does-not-remove-existing-data



Ranger admin install fails with “007-updateBlankPolicyName.sql import failed”

If you see the following error during the Ranger install, there is no need to worry; you can solve it with a single step.

2016-03-18 16:10:44,048 [JISQL] /usr/jdk64/jdk1.8.0_60/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/hdp/current/ranger-admin/jisql/lib/* org.apache.util.sql.Jisql -driver mysqlconj -cstring jdbc:mysql://mysqldb/ranger -u 'user' -p '********' -noheader -trim -c \; -input /usr/hdp/current/ranger-admin/db/mysql/patches/007-updateBlankPolicyName.sql

Resolution:

  1. Set log_bin_trust_function_creators = 1 on the MySQL server that hosts the Ranger database (see the sketch after this list).
  2. Re-run the Ranger service installation.
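For example, the variable can be set from the command line on the MySQL host (a sketch; adjust the user and host for your environment):

mysql -u root -p -e "SET GLOBAL log_bin_trust_function_creators = 1;"

Note that a GLOBAL variable set this way does not survive a MySQL restart; to make the change permanent, add log_bin_trust_function_creators=1 to my.cnf.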

 



Enable ‘Job Error Log’ in oozie

In the Oozie UI, ‘Job Error Log’ is a tab introduced in HDP 2.3 with Oozie 4.2. By default it is disabled; the following steps show how to enable it.

This is the simplest way to find the errors for a specific Oozie job in the Oozie log files.

To enable Oozie’s Job Error Log, make the following changes in the Oozie log4j properties file:

1. Add the following lines after the log4j.appender.oozie appender definition and before log4j.appender.oozieops:

log4j.appender.oozieError=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.oozieError.RollingPolicy=org.apache.oozie.util.OozieRollingPolicy
log4j.appender.oozieError.File=${oozie.log.dir}/oozie-error.log
log4j.appender.oozieError.Append=true
log4j.appender.oozieError.layout=org.apache.log4j.PatternLayout
log4j.appender.oozieError.layout.ConversionPattern=%d{ISO8601} %5p %c{1}:%L - SERVER[${oozie.instance.id}] %m%n
log4j.appender.oozieError.RollingPolicy.FileNamePattern=${log4j.appender.oozieError.File}-%d{yyyy-MM-dd-HH}
log4j.appender.oozieError.RollingPolicy.MaxHistory=720
log4j.appender.oozieError.filter.1 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.1.levelToMatch = WARN
log4j.appender.oozieError.filter.2 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.2.levelToMatch = ERROR
log4j.appender.oozieError.filter.3 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.3.levelToMatch = FATAL
log4j.appender.oozieError.filter.4 = org.apache.log4j.varia.DenyAllFilter

2. Change the logger line from log4j.logger.org.apache.oozie=WARN, oozie to log4j.logger.org.apache.oozie=ALL, oozie, oozieError, as sketched below.
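In other words, the logger entry in the same file changes as follows:

# before
log4j.logger.org.apache.oozie=WARN, oozie
# after
log4j.logger.org.apache.oozie=ALL, oozie, oozieError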

3. Restart the Oozie service. The Job Error Log will be populated for new jobs launched after the restart.

I hope this helps you enable the error log in Oozie.



After upgrading ambari it is not coming up (hostcomponentdesiredstate.admin_state)

If you upgrade Ambari and see the following error, don’t worry; the steps below will help you bring your cluster back into a running state.

Issue: After upgrading the cluster and restarting, no services or their metrics show up in Ambari.

You may also see the following error in your ambari-server logs.

23 Sep 2016 05:08:13,966 ERROR [main] ViewRegistry:1695 – Caught exception loading view TEZ{0.7.0.2.5.0.0-1}
Local Exception Stack:
Exception [EclipseLink-116] (Eclipse Persistence Services - 2.6.2.v20151217-774c696): org.eclipse.persistence.exceptions.DescriptorException
Exception Description: No conversion value provided for the value [NULL] in field [hostcomponentdesiredstate.admin_state].
Mapping: org.eclipse.persistence.mappings.DirectToFieldMapping[adminState-->hostcomponentdesiredstate.admin_state]
Descriptor: RelationalDescriptor(org.apache.ambari.server.orm.entities.HostComponentDesiredStateEntity --> [DatabaseTable(hostcomponentdesiredstate)])
at org.eclipse.persistence.exceptions.DescriptorException.noFieldValueConversionToAttributeValueProvided(DescriptorException.java:1066)
at org.eclipse.persistence.mappings.converters.ObjectTypeConverter.convertDataValueToObjectValue(ObjectTypeConverter.java:226)
at org.eclipse.persistence.mappings.converters.EnumTypeConverter.convertDataValueToObjectValue(EnumTypeConverter.java:141)
at org.eclipse.persistence.mappings.foundation.AbstractDirectMapping.getObjectValue(AbstractDirectMapping.java:616)
at org.eclipse.persistence.mappings.foundation.AbstractDirectMapping.valueFromRow(AbstractDirectMapping.java:1220)
at org.eclipse.persistence.mappings.DatabaseMapping.readFromRowIntoObject(DatabaseMapping.java:1539)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildAttributesIntoObject(ObjectBuilder.java:462)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildObject(ObjectBuilder.java:1005)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildWorkingCopyCloneNormally(ObjectBuilder.java:899)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildObjectInUnitOfWork(ObjectBuilder.java:852)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildObject(ObjectBuilder.java:735)
at org.eclipse.persistence.internal.descriptors.ObjectBuilder.buildObject(ObjectBuilder.java:689)
at org.eclipse.persistence.queries.ObjectLevelReadQuery.buildObject(ObjectLevelReadQuery.java:805)
at org.eclipse.persistence.queries.ReadObjectQuery.registerResultInUnitOfWork(ReadObjectQuery.java:895)
at org.eclipse.persistence.queries.ReadObjectQuery.executeObjectLevelReadQuery(ReadObjectQuery.java:562)
at org.eclipse.persistence.queries.ObjectLevelReadQuery.executeDatabaseQuery(ObjectLevelReadQuery.java:1175)
at org.eclipse.persistence.queries.DatabaseQuery.execute(DatabaseQuery.java:904)
at org.eclipse.persistence.queries.ObjectLevelReadQuery.execute(ObjectLevelReadQuery.java:1134)
at org.eclipse.persistence.queries.ReadObjectQuery.execute(ReadObjectQuery.java:441)
at org.eclipse.persistence.queries.ObjectLevelReadQuery.executeInUnitOfWork(ObjectLevelReadQuery.java:1222)

Resolution:

Step 1: Check the admin_state of all installed service components in the hostcomponentdesiredstate table. There should not be any NULL value in admin_state for any component.

mysql> select * from hostcomponentdesiredstate;

+------------+------------------------+---------------+----------------+-------------+-------------------+------------------+----------------+---------+------------------+

| cluster_id | component_name         | desired_state | service_name   | admin_state | maintenance_state | restart_required | security_state | host_id | desired_stack_id |

+------------+------------------------+---------------+----------------+-------------+-------------------+------------------+----------------+---------+------------------+

|          2 | APP_TIMELINE_SERVER    | STARTED       | YARN           | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | ATLAS_SERVER           | STARTED       | ATLAS          | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       3 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       4 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       5 |                1 |

|          2 | DRPC_SERVER            | STARTED       | STORM          | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FALCON_CLIENT          | INSTALLED     | FALCON         | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | FALCON_CLIENT          | INSTALLED     | FALCON         | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FALCON_SERVER          | STARTED       | FALCON         | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       3 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       4 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       5 |                1 |

|          2 | HBASE_CLIENT           | INSTALLED     | HBASE          | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | HBASE_CLIENT           | INSTALLED     | HBASE          | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | HBASE_MASTER           | STARTED       | HBASE          | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | HBASE_REGIONSERVER     | STARTED       | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       3 |                1 |

|          2 | HBASE_REGIONSERVER     | STARTED       | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       4 |                1 |

|          2 | HBASE_REGIONSERVER     | STARTED       | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       5 |                1 |

|          2 | HCAT                   | INSTALLED     | HIVE           | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | HCAT                   | INSTALLED     | HIVE           | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | HDFS_CLIENT            | INSTALLED     | HDFS           | NULL        | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | HDFS_CLIENT            | INSTALLED     | HDFS           | NULL        | OFF               |                0 | UNSECURED      |       2 |                1 |

Step 2: If any rows have NULL in admin_state, update the table manually:

mysql> update hostcomponentdesiredstate set admin_state='INSERVICE';

Query OK, 58 rows affected (0.01 sec)

Rows matched: 92  Changed: 58  Warnings: 0
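The update above touches every row. If you prefer to change only the rows that are actually NULL, the same statement can be restricted with a WHERE clause (a minimal sketch):

mysql> update hostcomponentdesiredstate set admin_state='INSERVICE' where admin_state is NULL;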

Step 3: Check once again that admin_state is now INSERVICE for all rows:

mysql> select * from hostcomponentdesiredstate;

+------------+------------------------+---------------+----------------+-------------+-------------------+------------------+----------------+---------+------------------+

| cluster_id | component_name         | desired_state | service_name   | admin_state | maintenance_state | restart_required | security_state | host_id | desired_stack_id |

+------------+------------------------+---------------+----------------+-------------+-------------------+------------------+----------------+---------+------------------+

|          2 | APP_TIMELINE_SERVER    | STARTED       | YARN           | INSERVICE   | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | ATLAS_SERVER           | STARTED       | ATLAS          | INSERVICE   | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       3 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       4 |                1 |

|          2 | DATANODE               | STARTED       | HDFS           | INSERVICE   | OFF               |                0 | UNSECURED      |       5 |                1 |

|          2 | DRPC_SERVER            | STARTED       | STORM          | INSERVICE   | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FALCON_CLIENT          | INSTALLED     | FALCON         | INSERVICE   | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | FALCON_CLIENT          | INSTALLED     | FALCON         | INSERVICE   | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FALCON_SERVER          | STARTED       | FALCON         | INSERVICE   | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       3 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       4 |                1 |

|          2 | FLUME_HANDLER          | STARTED       | FLUME          | INSERVICE   | OFF               |                0 | UNSECURED      |       5 |                1 |

|          2 | HBASE_CLIENT           | INSTALLED     | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       1 |                1 |

|          2 | HBASE_CLIENT           | INSTALLED     | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       2 |                1 |

|          2 | HBASE_MASTER           | STARTED       | HBASE          | INSERVICE   | OFF               |                0 | UNSECURED      |       1 |                1 |

Step 4: Now restart Ambari; your cluster should come up fine.

[ambari@server1 ~]$ ambari-server restart

Using python  /usr/bin/python

Restarting ambari-server

Using python  /usr/bin/python

Stopping ambari-server

Ambari Server stopped

Using python  /usr/bin/python

Starting ambari-server

Organizing resource files at /var/lib/ambari-server/resources…

Unable to check firewall status when starting without root privileges.

Please do not forget to disable or adjust firewall if needed

Ambari database consistency check started…

No errors were found.

Ambari database consistency check finished

Server PID at: /var/run/ambari-server/ambari-server.pid

Server out at: /var/log/ambari-server/ambari-server.out

Server log at: /var/log/ambari-server/ambari-server.log

Waiting for server start………………..

Ambari Server ‘start’ completed successfully.



Hadoop Archive Files – HAR

Hadoop archive files, or HAR files, are a facility to pack HDFS files into archives. They are a good option for storing a large number of small files in HDFS, because storing many small files directly in HDFS is not very efficient (each file consumes NameNode memory).

The advantage of HAR files is that they can be used directly as input files in MapReduce jobs.

 

Suppose we have two files in /user/saurkuma/ and we want to archive them.

[root@m1 ~]# hadoop fs -ls /user/saurkuma/

Found 2 items

-rw-r--r--   3 root hdfs        234 2016-09-20 20:42 /user/saurkuma/test.json

-rw-r--r--   3 root hdfs          9 2016-09-20 20:42 /user/saurkuma/users.txt

[root@m1 ~]# hadoop fs -cat /user/saurkuma/users.txt

saurkuma

[root@m1 ~]# hadoop fs -cat /user/saurkuma/test.json

{"scedulerInfo": {

         "type": "capacityScheduler",

         "capacity": 100,

         "usedCapacity": 0,

         "maxCapacity": 100,

         "queueName": "root",

         "queues": "test1",

         "health": "test"

}}

A Hadoop archive can be created with the command below; it triggers a MapReduce job.

[root@m1 ~]# hadoop archive -archiveName testing.har -p /user/saurkuma/ /test

16/09/20 20:36:59 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/09/20 20:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

You can see the archived data at the target location via the command below.

[root@m1 ~]# hadoop fs -ls /test/

Found 1 items

drwxr-xr-x   – root hdfs          0 2016-09-20 20:37 /test/testing.har

[root@m1 ~]# hadoop fs -ls /test/testing.har

Found 4 items

-rw-r--r--   3 root hdfs          0 2016-09-20 20:37 /test/testing.har/_SUCCESS

-rw-r--r--   5 root hdfs        474 2016-09-20 20:37 /test/testing.har/_index

-rw-r--r--   5 root hdfs         23 2016-09-20 20:37 /test/testing.har/_masterindex

-rw-r--r--   3 root hdfs   20710951 2016-09-20 20:37 /test/testing.har/part-0

The part files contain the contents of the original files concatenated together, and the index files contain the offset and length of each file within the part file.

We can see the data in part-0 as the concatenated content of users.txt and test.json.

[root@m1 ~]# hadoop fs -cat /test/testing.har/part-0

{"scedulerInfo": {

         "type": "capacityScheduler",

         "capacity": 100,

         "usedCapacity": 0,

         "maxCapacity": 100,

         "queueName": "root",

         "queues": "test1",

         "health": "test"

}}

saurkuma
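Instead of reading part-0 directly, the archive can also be browsed through the har:// filesystem, which preserves the original file layout. A short sketch using the paths from the example above:

hadoop fs -ls har:///test/testing.har
hadoop fs -cat har:///test/testing.har/users.txt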

To delete a HAR file, we need to use the recursive form of remove, as mentioned below.

[root@m1 ~]# hadoop fs -rmr /test/testing.har

rmr: DEPRECATED: Please use ‘rm -r’ instead.

16/09/20 20:43:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.

Moved: ‘hdfs://HDPTSTHA/test/testing.har’ to trash at: hdfs://HDPTSTHA/user/root/.Trash/Current

Limitations of HAR Files:
  • Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the size of the files being archived. The original files can be deleted after the archive is created to release that space.
  • Archives are immutable. Once an archive is created, adding or removing files requires re-creating the archive.
  • HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even inside a HAR file, still requires lots of map tasks, which is inefficient.


Falcon MQ log files location

Sometimes we see that Falcon uses 90-100% of the root (/) filesystem, as in the following example.

[user1@server localhost]$ du -sh /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB

67M     /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB

[users1@server localhost]$ du -sh /hadoop/falcon/embeddedmq/data/localhost/KahaDB/

849M   /hadoop/falcon/embeddedmq/data/localhost/KahaDB/

This happens because Falcon is installed in embedded mode and falcon.embeddedmq.data points to that location. The Falcon server starts an embedded ActiveMQ broker whenever Falcon starts. Ideally this directory should not keep filling up, because entries are cleaned up as soon as each feed or process instance completes.

To fix this issue, we need to move the directory to another partition with the help of the following steps.

1. Stop Falcon Service from Ambari.

2. Copy /hadoop/falcon to the new partition /lowes/ and correct the ownership: cp -r /hadoop/falcon /lowes/; chown -R falcon:hadoop /lowes/falcon/*

3. Change the following Falcon configuration to point to the new location (a sketch of the resulting property values follows this list):

   a. Falcon data directory(/hadoop/falcon to /lowes/falcon) and Falcon store URI under Falcon Server

   b. *.config.store.uri, *.falcon.graph.serialize.path and *.falcon.graph.storage.directory under Falcon startup.properties

  c. falcon.embeddedmq.data under Advanced falcon-env
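As a rough sketch, the updated properties could look like the following; the exact sub-directories under /lowes/falcon depend on your original layout, so treat these values as assumptions:

*.config.store.uri=file:///lowes/falcon/store
*.falcon.graph.serialize.path=/lowes/falcon/data/lineage
*.falcon.graph.storage.directory=/lowes/falcon/data/lineage/graphdb
falcon.embeddedmq.data=/lowes/falcon/embeddedmq/data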

4. Start Falcon service from Ambari.

I hope this will help you to serve your purpose.



Pig script with HCatLoader on Hive ORC table

Category : Pig

Sometimes we have to run Pig commands on Hive ORC tables; this article shows how to do that.

Step 1: First create a Hive ORC table:

hive> CREATE TABLE orc_table (col1 BIGINT, col2 STRING) CLUSTERED BY (col1) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC TBLPROPERTIES ('transactional'='true');

Step 2: Now insert data into the table:

hive> insert into orc_table values(122342,'test');

hive> insert into orc_table values(231232,'rest');

hive> select * from orc_table;

OK

122342 test

231232 rest

Time taken: 1.663 seconds, Fetched: 2 row(s)

Step 3: Now create a Pig script:

[user1@server1 ~]$ cat  myscript.pig

A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

Dump A;

Step 4: Now run the Pig script:

[user1@server1 ~]$ pig -useHCatalog -f myscript.pig

WARNING: Use “yarn jar” to launch YARN applications.

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2016-09-16 03:31:02,440 [main] INFO  org.apache.pig.Main – Apache Pig version 0.15.0.2.3.4.0-3485 (rexported) compiled Dec 16 2015, 04:30:33

2016-09-16 03:31:02,440 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/user1/pig_1474011062438.log

2016-09-16 03:31:03,233 [main] INFO  org.apache.pig.impl.util.Utils – Default bootup file /home/user1/.pigbootup not found

2016-09-16 03:31:03,386 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://HDPINFHA

2016-09-16 03:31:04,269 [main] INFO  org.apache.pig.PigServer – Pig Script ID for the session: PIG-myscript.pig-eb253b46-2d2e-495c-9149-ef305ee4e408

2016-09-16 03:31:04,726 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:04,726 [main] INFO  org.apache.pig.backend.hadoop.ATSService – Created ATS Hook

2016-09-16 03:31:05,618 [main] INFO  hive.metastore – Trying to connect to metastore with URI thrift://server2:9083

2016-09-16 03:31:05,659 [main] INFO  hive.metastore – Connected to metastore.

2016-09-16 03:31:06,209 [main] INFO  org.apache.pig.tools.pigstats.ScriptState – Pig features used in the script: UNKNOWN

2016-09-16 03:31:06,247 [main] INFO  org.apache.pig.data.SchemaTupleBackend – Key [pig.schematuple] was not set… will not generate code.

2016-09-16 03:31:06,284 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer – {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}

2016-09-16 03:31:06,384 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler – File concatenation threshold: 100 optimistic? false

2016-09-16 03:31:06,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size before optimization: 1

2016-09-16 03:31:06,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size after optimization: 1

2016-09-16 03:31:06,576 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:06,758 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState – Pig script settings are added to the job

2016-09-16 03:31:06,762 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2016-09-16 03:31:06,999 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – This job cannot be converted run in-process

2016-09-16 03:31:07,292 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-metastore-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp428549735/hive-metastore-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,329 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libthrift-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp568922300/libthrift-0.9.2.jar

2016-09-16 03:31:07,542 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-exec-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1007595209/hive-exec-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,577 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libfb303-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp-1039107423/libfb303-0.9.2.jar

2016-09-16 03:31:07,609 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-1473630461/tmp-1375931436/jdo-api-3.0.1.jar

2016-09-16 03:31:07,642 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-893657730/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,674 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1850340790/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,705 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp58999520/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,775 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/pig-0.15.0.2.3.4.0-3485-core-h2.jar to DistributedCache through /tmp/temp-1473630461/tmp-422634726/pig-0.15.0.2.3.4.0-3485-core-h2.jar

2016-09-16 03:31:07,808 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1473630461/tmp1167068812/automaton-1.11-8.jar

2016-09-16 03:31:07,840 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1473630461/tmp708151030/antlr-runtime-3.4.jar

2016-09-16 03:31:07,882 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Setting up single store job

2016-09-16 03:31:07,932 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 1 map-reduce job(s) waiting for submission.

2016-09-16 03:31:08,248 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader – No job jar file set.  User classes may not be found. See Job or Job#setJar(String).

2016-09-16 03:31:08,351 [JobControl] INFO  org.apache.hadoop.hive.ql.log.PerfLogger – <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

2016-09-16 03:31:08,355 [JobControl] INFO  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – ORC pushdown predicate: null

2016-09-16 03:31:08,416 [JobControl] INFO  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – FooterCacheHitRatio: 0/0

2016-09-16 03:31:08,416 [JobControl] INFO  org.apache.hadoop.hive.ql.log.PerfLogger – </PERFLOG method=OrcGetSplits start=1474011068351 end=1474011068416 duration=65 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

2016-09-16 03:31:08,421 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1

2016-09-16 03:31:08,514 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter – number of splits:1

2016-09-16 03:31:08,612 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter – Submitting tokens for job: job_1472564332053_0029

2016-09-16 03:31:08,755 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner – Job jar is not present. Not adding any jar to the list of resources.

2016-09-16 03:31:08,947 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl – Submitted application application_1472564332053_0029

2016-09-16 03:31:08,989 [JobControl] INFO  org.apache.hadoop.mapreduce.Job – The url to track the job: http://server2:8088/proxy/application_1472564332053_0029/

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – HadoopJobId: job_1472564332053_0029

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Processing aliases A

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – detailed locations: M: A[1,4] C:  R:

2016-09-16 03:31:09,007 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 0% complete

2016-09-16 03:31:09,007 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]

2016-09-16 03:31:28,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 50% complete

2016-09-16 03:31:28,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]

2016-09-16 03:31:29,251 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:29,258 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate – Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server

2016-09-16 03:31:30,186 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.1.2.3.4.0-3485 0.15.0.2.3.4.0-3485 s0998dnz 2016-09-16 03:31:06 2016-09-16 03:31:30 UNKNOWN

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_1472564332053_0029 1 0 5 5 5 5 0 0 0 0 A MAP_ONLY hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076,

Input(s):

Successfully read 2 records (28587 bytes) from: “test1.orc_table”

Output(s):

Successfully stored 2 records (32 bytes) in: “hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076”

Counters:

Total records written : 2

Total bytes written : 32

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_1472564332053_0029

2016-09-16 03:31:30,822 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Success!

2016-09-16 03:31:30,825 [main] WARN  org.apache.pig.data.SchemaTupleBackend – SchemaTupleBackend has already been initialized

2016-09-16 03:31:30,834 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1

2016-09-16 03:31:30,834 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1

(122342,test)

(231232,rest)

2016-09-16 03:31:30,984 [main] INFO  org.apache.pig.Main – Pig script completed in 28 seconds and 694 milliseconds (28694 ms)



hive date time issue

Many times, when we load data into Hive tables that contain a date & time field, the timestamp column ends up NULL when queried. To solve this issue I have created this article and explained the steps in detail.

I have the following sample input file (a.txt):

a,20-11-2015 22:07

b,17-08-2015 09:45

I created the table in Hive:

hive> create table mytime(a string, b timestamp) row format delimited fields terminated by ',';

hive> load data local inpath '/root/a.txt' overwrite into table mytime;

hive> select * from mytime;

(screenshot: the query returns NULL in the timestamp column b for both rows)

As you can see, we get only NULL values for the timestamp column. To overcome this, we need an additional temporary table to read the input file, and then a date conversion:

hive> create table tmp(a string, b string) row format delimited fields terminated by ',';
hive> load data local inpath 'a.txt' overwrite into table tmp;
hive> create table mytime(a string, b timestamp);
hive> insert into table mytime select a, from_unixtime(unix_timestamp(b, 'dd-MM-yyyy HH:mm')) from tmp;
hive> select * from mytime;
a 2015-11-20 22:07:00
b 2015-08-17 09:45:00
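Once the value is stored as a real timestamp, the standard Hive date functions work on it directly; for example (a hypothetical query against the table above):

hive> select a, year(b), month(b), hour(b) from mytime;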

I hope this article helps you get your work done in an effective manner.



Compression in Hadoop

Category : HDFS

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

1. What to compress?

1) Compressing input files
If the input file is compressed, then the bytes read from HDFS are reduced, which means less time spent reading data. This time saving is beneficial to job execution performance.

If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.
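For example, a text file compressed with gzip before being uploaded is decompressed transparently by any MapReduce job that reads it, based purely on the .gz extension (a minimal sketch with a hypothetical file name and path):

gzip -9 access_log              # produces access_log.gz
hdfs dfs -put access_log.gz /data/logs/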

2) Compressing output files
Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store history results for future use, then these accumulated results will take extensive amount of HDFS space. However, these history files may not be used very frequently, resulting in a waste of HDFS space. Therefore, it is necessary to compress the output before storing on HDFS.

3) Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO or Snappy, you can get performance gains simply because the volume of data to transfer is reduced.
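On Hadoop 2.x this is controlled by two job properties, which can be set per job or cluster-wide in mapred-site.xml (shown here as plain properties, as a sketch):

mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec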

2. Common compression formats

Compression format   Tool    Algorithm   File extension   Splittable
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             Yes, if indexed
Snappy               N/A     Snappy      .snappy          No

gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman Coding.

bzip2:
bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.

LZO:
The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries.  Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds.  It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version.  But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.

Snappy:
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.

Some tradeoffs:
All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in the table above typically give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed and -9 means optimize for space.

The different tools have very different compression characteristics. Gzip is a general purpose compressor, and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO and Snappy, on the other hand, both optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy is also significantly faster than LZO for decompression.

3. Issues about compression and input split

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.

If the file in our hypothetical example were an LZO file, we would have the same problem since the underlying compression format does not provide a way for a reader to synchronize itself with the stream. However, it is possible to preprocess LZO files using an indexer tool that comes with the Hadoop LZO libraries. The tool builds an index of split points, effectively making them splittable when the appropriate MapReduce input format is used.

A bzip2 file, on the other hand, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.

4. IO-bound and CPU bound

Storing compressed data in HDFS allows your hardware allocation to go further since compressed data is often 25% of the size of the original data.  Furthermore, since MapReduce jobs are nearly always IO-bound, storing compressed data means there is less overall IO to do, meaning jobs run faster.  There are two caveats to this, however: some compression formats cannot be split for parallel processing, and others are slow enough at decompression that jobs become CPU-bound, eliminating your gains on IO.

The gzip compression format illustrates the first caveat. Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block size.  This file will be split into 9 chunks of size approximately 128 MB.  In order to process these in parallel in a MapReduce job, a different mapper will be responsible for each chunk. But this means that the second mapper will start on an arbitrary byte about 128MB into the file.  The contextful dictionary that gzip uses to decompress input will be empty at this point, which means the gzip decompressor will not be able to correctly interpret the bytes.  The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism.

Bzip2 compression format illustrates the second caveat in which jobs become CPU-bound. Bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs.  While Bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting on the CPU to finish decompressing data, which slows them down and offsets the other gains.

5. Summary

Reasons to compress:
a) The data is mostly stored and not frequently processed. This is the usual DWH scenario, where the space saving can be much more significant than the processing overhead.
b) The compression factor is very high, so we save a lot of IO.
c) Decompression is very fast (as with Snappy), so we get some gain at little cost.
d) The data already arrives compressed.

Reasons not to compress:
a) The compressed data is not splittable. Note that many modern formats are built with block-level compression to enable splitting and other partial processing of the files.
b) The data is created in the cluster and compression takes significant time. Note that compression is usually much more CPU-intensive than decompression.
c) The data has little redundancy, so compression gives little gain.

 

For example, Pig reads gzip-compressed input and writes gzip-compressed output transparently when the paths end in .gz:

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

PerUser = GROUP MyData BY user;

UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

Another way to compress your data is to set the following properties, so that it is done for every job.

Set compression method in your script.

set output.compression.enabled true;

set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

 

ref : http://comphadoop.weebly.com/



Change default permission of hive database

Category : Hive

When you create a database or an internal table from the Hive CLI, by default it is created with 777 permissions. Even if you have a umask configured in HDFS, the warehouse directory still gets this permission. You can change it with the help of the following steps.

1. From the command line on the Ambari server node, edit the file:

vi /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive.py

Search for hive_apps_whs_dir, which should lead you to this block:

params.HdfsResource(params.hive_apps_whs_dir,

type="directory",

action="create_on_execute",

owner=params.hive_user,

group=params.user_group,

mode=0755

)

2. Modify the value for mode to the desired permission, for example 0750. Save and close the file.

3. Restart the Ambari server to propagate the change to all nodes in the cluster:

ambari-server restart

4. From the Ambari UI, restart HiveServer2 to apply the new permission to the warehouse directory. If multiple HiveServer2 instances are configured, any one instance can be restarted.

hive> create database test2;

OK

Time taken: 0.156 seconds

hive> dfs -ls /apps/hive/warehouse;

Found 9 items

drwxrwxrwx   - hdpuser hdfs          0 2016-09-08 01:54 /apps/hive/warehouse/test.db

drwxr-xr-x   - hdpuser hdfs          0 2016-09-08 02:04 /apps/hive/warehouse/test1.db

drwxr-x---   - hdpuser hdfs          0 2016-09-08 02:09 /apps/hive/warehouse/test2.db

I hope this will help you to serve your purpose.