Spark job runs successfully in client mode but fails in cluster mode
You may build a PySpark application that runs successfully in both local and yarn-client modes, yet when you try to run it in yarn-cluster mode, you receive errors such as the following:
- Error 1: Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o52))
- Error 2: INFO Client: Deleting staging directory .sparkStaging/application_1476997468030_139760
Exception in thread "main" org.apache.spark.SparkException: Application application_1476997468030_139760 finished
at org.apache.spark.deploy.yarn.Client.run(Client.scala:974)
- Error 3: ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
- Error 4: INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
17/08/22 04:56:19 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:766)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
Root Cause: If you are using the HDP stack, you might be hitting a bug in HDP 2.3.2 with Ambari 2.2.1 (https://hortonworks.jira.com/browse/BUG-56393): starting from Ambari 2.2.1, Ambari no longer manages the Spark version if the HDP stack is older than HDP 2.3.4.
If not, you are missing the DataNucleus driver jars and the Hive configuration, which you need to pass on the command line when you run spark-submit in cluster mode.
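To narrow down which piece is missing, you can first submit a small diagnostic script in yarn-cluster mode. This is a minimal sketch (the file name check_env.py and the print statements are just illustrative): it checks whether hive-site.xml was shipped into the container working directory and whether the driver can actually see your Hive databases.

# check_env.py (illustrative name) - run with the same --files/--jars options shown below
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf())

# When hive-site.xml is passed with --files, YARN copies it into the
# application's container working directory on the cluster.
print('hive-site.xml in working dir:', os.path.exists('hive-site.xml'))

# If the DataNucleus jars are missing, creating the HiveContext itself
# fails with the "Unable to instantiate ... HiveMetaStoreClient" error.
hiveCtx = HiveContext(sc)

# If only the 'default' database is listed, the driver is talking to a
# local Derby metastore instead of your Hive metastore, which means
# hive-site.xml was not picked up.
hiveCtx.sql('show databases').show()

Submit it with the same --files and --jars options shown in the resolution below; in cluster mode the printed output ends up in the YARN application logs.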
Resolution: You can use the following steps to resolve this issue:
- Check the contents of hive-site.xml; for Spark it should look like the example below.
- Add hive-site.xml to the driver classpath (pass it with --files) so that Spark can read the Hive configuration. Make sure --files comes before your .jar file.
- Add the DataNucleus jars using the --jars option when you submit the job.
- Contents of hive-site.xml:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://sandbox.hortonworks.com:9083</value>
</property>
</configuration>
- The spark-submit command:
spark-submit \
--class <Your.class.name> \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
target/YOUR_JAR-1.0.0-SNAPSHOT.jar "show tables"
Or the complete command can be:
spark-submit --master yarn --deploy-mode cluster --queue di \
--jars /usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar \
--conf "spark.yarn.appMasterEnv.PATH=/opt/rh/rh-python34/root/usr/bin${PATH:+:${PATH}}" \
--conf "spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" \
--conf "spark.yarn.appMasterEnv.MANPATH=/opt/rh/rh-python34/root/usr/share/man:${MANPATH}" \
--conf "spark.yarn.appMasterEnv.XDG_DATA_DIRS=/opt/rh/rh-python34/root/usr/share${XDG_DATA_DIRS:+:${XDG_DATA_DIRS}}" \
--conf "spark.yarn.appMasterEnv.PKG_CONFIG_PATH=/opt/rh/rh-python34/root/usr/lib64/pkgconfig${PKG_CONFIG_PATH:+:${PKG_CONFIG_PATH}}" \
--conf "spark.executorEnv.PATH=/opt/rh/rh-python34/root/usr/bin${PATH:+:${PATH}}" \
--conf "spark.executorEnv.LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" \
--conf "spark.executorEnv.MANPATH=/opt/rh/rh-python34/root/usr/share/man:${MANPATH}" \
--conf "spark.executorEnv.XDG_DATA_DIRS=/opt/rh/rh-python34/root/usr/share${XDG_DATA_DIRS:+:${XDG_DATA_DIRS}}" \
--conf "spark.executorEnv.PKG_CONFIG_PATH=/opt/rh/rh-python34/root/usr/lib64/pkgconfig${PKG_CONFIG_PATH:+:${PKG_CONFIG_PATH}}" \
hive.py
where hive.py has the following contents:
[adebatch@server1 ~]$ cat hive.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
import json
import sys

conf = SparkConf()
sc = SparkContext(conf=conf)
hiveCtx = HiveContext(sc)
result = hiveCtx.sql('show databases')
#result = hiveCtx.sql('select * from default.table1 limit 1')
result.show()
result.write.save('/tmp/pyspark', format='text', mode='overwrite')
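If your cluster runs Spark 2.x, HiveContext is deprecated in favor of a Hive-enabled SparkSession. Below is a minimal sketch of the equivalent script (the application name is just illustrative; whether you still need the --files/--jars options depends on how your Spark distribution packages its Hive and DataNucleus dependencies):

from pyspark.sql import SparkSession

# Build a SparkSession with Hive support; this replaces the
# SparkContext + HiveContext pair used in hive.py above.
spark = SparkSession.builder \
    .appName('hive-cluster-check') \
    .enableHiveSupport() \
    .getOrCreate()

result = spark.sql('show databases')
result.show()
result.write.save('/tmp/pyspark', format='text', mode='overwrite')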
Please feel free to give your valuable feedback.