Category Archives: Spark


Install and configure Spark History Server (SHS) on Kubernetes K8s

Many of us struggle with how to install and configure SHS on Kubernetes with a GCS event log. So here is your solution.

Create an shs-gcs.yaml values file, which will be used to deploy the SHS service:

pvc:
  enablePVC: false
  existingClaimName: nfs-pvc
  eventsDir: "/"
nfs:
  enableExampleNFS: false
  pvName: nfs-pv
  pvcName: nfs-pvc
gcs:
  enableGCS: true
  secret: history-secrets
  key: tc-sc-bi-bigdata-ifwk-new-dev-48a2f0a984bb.json
  logDirectory: gs://tc-sc-bi-bigdata-ingestion-dev-spark-on-k8s/eventsLogs/
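
For SHS to have anything to display, your Spark jobs must write their event logs to the same GCS location. Below is a minimal sketch of such a submission, assuming your Spark image bundles the GCS connector and the service-account credentials (the container image name is a placeholder):

spark-submit \
  --master k8s://https://10.2.4.110 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=gs://tc-sc-bi-bigdata-ingestion-dev-spark-on-k8s/eventsLogs/ \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar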

******************************** Step 1 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % gcloud container clusters get-credentials spark-on-gke
Fetching cluster endpoint and auth data.
kubeconfig entry generated for spark-on-gke.

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % kubectl cluster-info
Kubernetes master is running at https://10.2.4.110
GLBCDefaultBackend is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
KubeDNS is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

******************************** Step 2 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % kubectl get secrets
NAME                  TYPE                                  DATA   AGE
default-token-2v6p5   kubernetes.io/service-account-token   3      71d
spark-sa              Opaque                                1      70d
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl create secret generic history-secrets --from-file=gcp-project-48a2f0a984bb.json
secret/history-secrets created
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get secrets

NAME                                                    TYPE                                  DATA   AGE
default-token-2v6p5                                     kubernetes.io/service-account-token   3      71d
history-secrets                                         Opaque                                1      5s
sh.helm.release.v1.spark-history-server-1624358382.v1   helm.sh/release.v1                    1      11m
spark-history-server-1624358382-token-mlh5j             kubernetes.io/service-account-token   3      11m
spark-sa                                                Opaque                                1      70d

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl describe secrets/history-secrets
Name: history-secrets
Namespace: default
Labels: <none>
Annotations: <none>

Type: Opaque

Data
====
gcp-project-48a2f0a984bb.json: 2358 bytes
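
To double-check that the key file made it into the secret intact, you can decode it back out of the cluster. A sketch, using the key name shown above (note the escaped dot in the jsonpath):

kubectl get secret history-secrets -o jsonpath='{.data.gcp-project-48a2f0a984bb\.json}' | base64 --decode | head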

******************************** Step 3 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % helm repo add stable https://charts.helm.sh/stable
"stable" already exists with the same configuration, skipping

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm list -n ifw-reloaded
NAME                              NAMESPACE      REVISION   UPDATED                                STATUS     CHART                        APP VERSION
spark-history-server-1616415984   ifw-reloaded   1          2021-03-22 17:56:34.463601 +0530 IST   deployed   spark-history-server-1.4.3   2.4.0

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm install stable/spark-history-server --values shs-gcs.yaml --generate-name
WARNING: This chart is deprecated
NAME: spark-history-server-1624360585
LAST DEPLOYED: Tue Jun 22 16:46:32 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running the following commands. Note that the UI would take a minute or two to show up after the pods and services are ready.
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status by running 'kubectl -n default get svc -w spark-history-server-1624360585'
export SERVICE_IP=$(kubectl get svc --namespace default spark-history-server-1624360585 -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
NOTE: If on OpenShift, run the following command instead:
export SERVICE_IP=$(oc get svc --namespace default spark-history-server-1624360585 -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo http://$SERVICE_IP:map[name:http-historyport number:18080]
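
If the LoadBalancer IP stays pending, a port-forward works just as well for a quick look at the UI; a sketch, substituting your own release name:

kubectl -n default port-forward svc/spark-history-server-1624360585 18080:18080
# then open http://localhost:18080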

******************************** Step 4 ********************************
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get svc
NAME                              TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
kubernetes                        ClusterIP      10.1.0.1      <none>        443/TCP           71d
spark-history-server-1624360585   LoadBalancer   10.1.255.20   <pending>     18080:31739/TCP   17s

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get svc
NAME                              TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
kubernetes                        ClusterIP      10.1.0.1      <none>        443/TCP           71d
spark-history-server-1624360585   LoadBalancer   10.1.255.20   10.1.0.113    18080:31739/TCP   54s
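
Once the EXTERNAL-IP is assigned, the UI is reachable at http://10.1.0.113:18080. You can also poke the history server's REST API from the shell to confirm it sees your event logs (a sketch, assuming the same IP and port):

curl -s http://10.1.0.113:18080/api/v1/applications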
******************************** Step 5 ********************************

This step shows how to uninstall SHS in one go.
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm uninstall spark-history-server-1616415984 -n ifw-reloaded
Error: uninstallation completed with 2 error(s): clusterrolebindings.rbac.authorization.k8s.io "spark-history-server-1616415984-crb" is forbidden: User "system:serviceaccount:default:ifw-team" cannot delete resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope; clusterroles.rbac.authorization.k8s.io "spark-history-server-1616415984-cr" is forbidden: User "system:serviceaccount:default:ifw-team" cannot delete resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
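
The uninstall failed part-way because the ifw-team service account is not allowed to delete cluster-scoped RBAC objects. A user with cluster-admin rights can remove the leftovers, using the resource names from the error above:

kubectl delete clusterrolebinding spark-history-server-1616415984-crb
kubectl delete clusterrole spark-history-server-1616415984-cr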


Please feel free to give your valuable feedback.



Attempt to add *.jar multiple times to the distributed cache

When we submit a Spark2 action via Oozie, we may see the following exception in the logs, and the job will fail:

java.lang.IllegalArgumentException: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache.

The above error occurs because the same jar files exist in both locations (/user/oozie/share/lib/lib_20171129113304/oozie/ and /user/oozie/share/lib/lib_20171129113304/spark2/).

Solution:

You need to delete the duplicate jars from the spark2 directory so that only one copy remains, in the oozie directory.

  1. Identify the current Oozie sharelib by running:
    hdfs dfs -ls /user/oozie/share/lib/
  2. Use the following command to list all jar files in the oozie directory:
    hdfs dfs -ls /user/oozie/share/lib/lib_<timestamp>/oozie | awk -F \/ '{print $8}' > /tmp/list
  3. Use the following command to delete the jar files in the spark2 directory that also exist in the oozie directory:
    for f in $(cat /tmp/list); do echo $f; hdfs dfs -rm -skipTrash /user/oozie/share/lib/lib_<timestamp>/spark2/$f; done
  4. Restart the Oozie service (or refresh the sharelib without a restart, as sketched below).
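
On recent Oozie versions, asking the server to reload its sharelib may be enough instead of a full service restart. A sketch, assuming the Oozie server runs on m1 with the default port 11000:

oozie admin -oozie http://m1:11000/oozie -sharelibupdate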

Thanks for visiting this blog, please feel free to give your valuable feedback.



Spark job runs successfully in client mode but fails in cluster mode

You may build a PySpark application which runs successfully in both local and yarn-client modes. However, when you try to run it in cluster mode, you may receive the following errors:

  1. Error 1: Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o52))
  2. Error 2: INFO Client: Deleting staging directory .sparkStaging/application_1476997468030_139760
    Exception in thread "main" org.apache.spark.SparkException: Application application_1476997468030_139760 finished at org.apache.spark.deploy.yarn.Client.run(Client.scala:974)
  3. Error 3: ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
  4. Error 4: INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
    17/08/22 04:56:19 ERROR ApplicationMaster: Uncaught exception:
    org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:766)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
    Caused by: org.apache.spark.SparkUserAppException: User application exited with 1

Root Cause: If you are using the HDP stack, you might be hitting a bug in HDP 2.3.2 with Ambari 2.2.1 (https://hortonworks.jira.com/browse/BUG-56393), where starting from Ambari 2.2.1, Ambari does not manage the Spark version if the HDP stack is < HDP 2.3.4.

If not, then you are missing some drivers and Hive parameters which you need to pass on the command line during spark-submit in cluster mode.

Resolution: You can use the following steps to solve this issue:

  • Check the contents of hive-site.xml. For Spark it should look like the snippet below.
  • Add hive-site.xml to the driver classpath so that Spark can read the Hive configuration. Make sure --files comes before your .jar file.
  • Add the datanucleus jars using the --jars option when you submit.
  • The contents of hive-site.xml:
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://sandbox.hortonworks.com:9083</value>
      </property>
    </configuration>
  • The sequence of the command:
    spark-submit \
    --class <Your.class.name> \
    --master yarn-cluster \
    --num-executors 1 \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    --files /usr/hdp/current/spark-client/conf/hive-site.xml \
    --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
    target/YOUR_JAR-1.0.0-SNAPSHOT.jar "show tables"

Or the complete command can be:

spark-submit --master yarn --deploy-mode cluster --queue di \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar \
  --conf "spark.yarn.appMasterEnv.PATH=/opt/rh/rh-python34/root/usr/bin${PATH:+:${PATH}}" \
  --conf "spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" \
  --conf "spark.yarn.appMasterEnv.MANPATH=/opt/rh/rh-python34/root/usr/share/man:${MANPATH}" \
  --conf "spark.yarn.appMasterEnv.XDG_DATA_DIRS=/opt/rh/rh-python34/root/usr/share${XDG_DATA_DIRS:+:${XDG_DATA_DIRS}}" \
  --conf "spark.yarn.appMasterEnv.PKG_CONFIG_PATH=/opt/rh/rh-python34/root/usr/lib64/pkgconfig${PKG_CONFIG_PATH:+:${PKG_CONFIG_PATH}}" \
  --conf "spark.executorEnv.PATH=/opt/rh/rh-python34/root/usr/bin${PATH:+:${PATH}}" \
  --conf "spark.executorEnv.LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" \
  --conf "spark.executorEnv.MANPATH=/opt/rh/rh-python34/root/usr/share/man:${MANPATH}" \
  --conf "spark.executorEnv.XDG_DATA_DIRS=/opt/rh/rh-python34/root/usr/share${XDG_DATA_DIRS:+:${XDG_DATA_DIRS}}" \
  --conf "spark.executorEnv.PKG_CONFIG_PATH=/opt/rh/rh-python34/root/usr/lib64/pkgconfig${PKG_CONFIG_PATH:+:${PKG_CONFIG_PATH}}" \
  hive.py

where hive.py has the following content:

[adebatch@server1 ~]$ cat hive.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

# Build a SparkContext and wrap it in a HiveContext so that SQL queries
# go through the Hive metastore configured in hive-site.xml.
conf = SparkConf()
sc = SparkContext(conf=conf)
hiveCtx = HiveContext(sc)

# Any metastore-backed query proves connectivity; 'show databases' is the simplest.
result = hiveCtx.sql('show databases')
#result = hiveCtx.sql('select * from default.table1 limit 1')
result.show()

# Persist the result on HDFS so it can be inspected after the YARN app finishes.
result.write.save('/tmp/pyspark', format='text', mode='overwrite')
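
If the job succeeds, the query output lands on HDFS where the last line of hive.py wrote it. A quick check, assuming the /tmp/pyspark path above:

[adebatch@server1 ~]$ hdfs dfs -cat /tmp/pyspark/part-* | head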

Please feel free to give your valuable feedback.



Exception in thread "main" org.apache.spark.SparkException: Application

When you run a Python script on top of Hive, it may fail with the following error:

$ spark-submit --master yarn --deploy-mode cluster --queue ado --num-executors 60 --executor-memory 3G --executor-cores 5 --py-files argparse.py,load_iris_2.py --driver-memory 10G load_iris.py -p ado_secure.iris_places -s ado_secure.iris_places_stg -f /user/admin/iris/places/2016-11-30-place.csv

Exception in thread “main” org.apache.spark.SparkException: Application application_1476997468030_142120 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:974)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1020)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

When I checked the Spark logs, I found the following error:
16/12/22 07:35:49 WARN metadata.Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:193)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:162)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:415)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:414)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:296)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

Root Cause: 

It can be caused by a bug in Ambari (BUG-56393) and by the format of the Spark job submission in cluster mode.

Resolutions: 

You can resolve it with the help of the following steps:

  • Add spark.driver.extraJavaOptions=-Dhdp.version={{hdp_full_version}} -XX:MaxPermSize=1024m -XX:PermSize=256m and spark.yarn.am.extraJavaOptions=-Dhdp.version={{hdp_full_version}}, since we suspected the old Ambari bug above.
  • The custom Python script was running without hive-site.xml, so it could not connect to the Hive metastore. Add --files /etc/spark/conf/hive-site.xml to make the metastore connection.
  • Add the --jars option with the path to the datanucleus jars: /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar.
$ spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.3.4.0-3485 -XX:MaxPermSize=1024m -XX:PermSize=256m" \
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.4.0-3485" \
  --queue ado --executor-memory 3G --executor-cores 5 \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
  --py-files argparse.py,load_iris_2.py --driver-memory 10G \
  --files /etc/spark/conf/hive-site.xml \
  load_iris.py -p ado_secure.iris_places -s ado_secure.iris_places_stg -f /user/admin/iris/places/2016-11-30-place.csv
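
If the application still fails, the real stack trace is usually buried in the YARN application logs. A quick way to pull it, substituting your own application ID:

$ yarn logs -applicationId application_1476997468030_142120 | grep -B 2 -A 20 ERROR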
Please feel free to reach out to us in case of any further assistance.