Author Archives: admin


Do you think file format matters in Big Data technology?

Category : Bigdata

Yes, it matters a lot, for the following main reasons.

By using the correct file format for your use case, you can achieve the following:

1. Less storage:
If we select a proper file format with a compatible, effective compression technique, it requires less storage.

2. Faster processing of data:
Based on our use case, if we select the correct file format (row-based or column-based), we can achieve high performance while processing the data.

3. Reduced disk I/O cost:
If processing is efficient and paired with a good compression method, the I/O cost is also optimized.

There are also multiple factors to think about while selecting a file format for our use case (a small PySpark sketch follows this list):
• whether the file is splittable
• schema evolution support
• Predicate Pushdown / Filter Pushdown
• compression technique
• row-based or column-based layout
• support for serialization/deserialization
• support for metadata
• whether the file format is supported by the source and target systems
• support for column types
• ingestion latency
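For illustration, here is a minimal PySpark sketch (the DataFrame, output paths and compression codec are made-up examples, not part of the original post) that writes the same data as plain CSV and as Parquet with Snappy compression. Comparing the output sizes and the run time of a column-selective query is a simple way to see the storage, processing and I/O differences described above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("file-format-demo").getOrCreate()

# A small, made-up DataFrame purely for illustration.
df = spark.range(0, 1000000).select(
    F.col("id").alias("order_id"),
    (F.col("id") % 100).alias("store_id"),
    (F.rand(seed=42) * 500).alias("amount"),
)

# Row-oriented text format, no compression.
df.write.mode("overwrite").csv("/tmp/orders_csv")

# Columnar format with Snappy compression: smaller on disk, and it
# supports column pruning and predicate pushdown on read.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/orders_parquet")

# Reading a single column with a filter benefits from the columnar layout.
spark.read.parquet("/tmp/orders_parquet") \
    .filter(F.col("store_id") == 42) \
    .select("amount") \
    .count()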

 



Install and configure Spark History Server (SHS) on Kubernetes K8s

We always struggle with how to install and configure SHS on Kubernetes with a GCS event log. So here is your solution.

Create an shs-gcs.yaml values file, which will be used to deploy the SHS service with the Helm chart.

 

 

 

pvc:
  enablePVC: false
  existingClaimName: nfs-pvc
  eventsDir: "/"
nfs:
  enableExampleNFS: false
  pvName: nfs-pv
  pvcName: nfs-pvc
gcs:
  enableGCS: true
  secret: history-secrets
  key: tc-sc-bi-bigdata-ifwk-new-dev-48a2f0a984bb.json
  logDirectory: gs://tc-sc-bi-bigdata-ingestion-dev-spark-on-k8s/eventsLogs/
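For the History Server to have anything to show, the Spark jobs themselves must write their event logs to the same GCS path configured in logDirectory above. Here is a minimal sketch of that configuration from a PySpark application (the application name is a hypothetical example; spark.eventLog.enabled and spark.eventLog.dir are the standard Spark properties, and the same values can instead be passed as --conf flags to spark-submit):

from pyspark.sql import SparkSession

# Event logging must be configured when the session is created (or passed
# as --conf flags to spark-submit); it cannot be enabled after the job starts.
spark = (
    SparkSession.builder
    .appName("shs-event-log-demo")  # hypothetical application name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "gs://tc-sc-bi-bigdata-ingestion-dev-spark-on-k8s/eventsLogs/")
    .getOrCreate()
)
# The driver and executors also need the GCS connector and credentials that
# can write to this bucket, e.g. the same service-account key used above.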

******************************** Step 1 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % gcloud container clusters get-credentials spark-on-gke
Fetching cluster endpoint and auth data.
kubeconfig entry generated for spark-on-gke.

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % kubectl cluster-info
Kubernetes master is running at https://10.2.4.110
GLBCDefaultBackend is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
KubeDNS is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.2.4.110/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

******************************** Step 2 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % kubectl get secrets
NAME TYPE DATA AGE
default-token-2v6p5 kubernetes.io/service-account-token 3 71d
spark-sa Opaque 1 70d
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl create secret generic history-secrets --from-file=gcp-project-48a2f0a984bb.json
secret/history-secrets created
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get secrets

NAME TYPE DATA AGE
default-token-2v6p5 kubernetes.io/service-account-token 3 71d
history-secrets Opaque 1 5s
sh.helm.release.v1.spark-history-server-1624358382.v1 helm.sh/release.v1 1 11m
spark-history-server-1624358382-token-mlh5j kubernetes.io/service-account-token 3 11m
spark-sa Opaque 1 70d

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl describe secrets/history-secrets
Name: history-secrets
Namespace: default
Labels: <none>
Annotations: <none>

Type: Opaque

Data
====
gcp-project-48a2f0a984bb.json: 2358 bytes

******************************** Step 3 ********************************

(base) saurabhkumar@Saurabhs-MacBook-Pro stats % helm repo add stable https://charts.helm.sh/stable
“stable” already exists with the same configuration, skipping

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm list -n ifw-reloaded
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
spark-history-server-1616415984 ifw-reloaded 1 2021-03-22 17:56:34.463601 +0530 IST deployed spark-history-server-1.4.3 2.4.0

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm install stable/spark-history-server --values shs-gcs.yaml --generate-name
WARNING: This chart is deprecated
NAME: spark-history-server-1624360585
LAST DEPLOYED: Tue Jun 22 16:46:32 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running the following commands. Note that the UI would take a minute or two to show up after the pods and services are ready.
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status by running 'kubectl -n default get svc -w spark-history-server-1624360585'
export SERVICE_IP=$(kubectl get svc --namespace default spark-history-server-1624360585 -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
NOTE: If on OpenShift, run the following command instead:
export SERVICE_IP=$(oc get svc --namespace default spark-history-server-1624360585 -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo http://$SERVICE_IP:map[name:http-historyport number:18080]

******************************** Step 4 ********************************
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.1.0.1 <none> 443/TCP 71d
spark-history-server-1624360585 LoadBalancer 10.1.255.20 <pending> 18080:31739/TCP 17s

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.1.0.1 <none> 443/TCP 71d
spark-history-server-1624360585 LoadBalancer 10.1.255.20 10.1.0.113 18080:31739/TCP 54s
******************************** Step 5 ********************************

This is to uninstall SHS in one go. Note: if Helm runs under a service account without cluster-scope RBAC permissions, the uninstall completes with errors like the one below, and a cluster admin then has to delete the leftover ClusterRole and ClusterRoleBinding manually.
(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % helm uninstall spark-history-server-1616415984 -n ifw-reloaded
Error: uninstallation completed with 2 error(s): clusterrolebindings.rbac.authorization.k8s.io “spark-history-server-1616415984-crb” is forbidden: User “system:serviceaccount:default:ifw-team” cannot delete resource “clusterrolebindings” in API group “rbac.authorization.k8s.io” at the cluster scope; clusterroles.rbac.authorization.k8s.io “spark-history-server-1616415984-cr” is forbidden: User “system:serviceaccount:default:ifw-team” cannot delete resource “clusterroles” in API group “rbac.authorization.k8s.io” at the cluster scope

 

Please feel free to give your valuable feedback.



Install Airflow on your local MacBook

****************************** Step 1 *****************************

Create a new airflow directory anywhere on your laptop.

(base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % cd ~/Documents

(base) saurabhkumar@Saurabhs-MacBook-Pro Documents % mkdir airflow-tutorial

(base) saurabhkumar@Saurabhs-MacBook-Pro Documents % cd airflow-tutorial

 

************************** Step 2 *******************************

Create a python virtual env

(base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % conda create --name airflow-tutorial1 python=3.7

Collecting package metadata (current_repodata.json): done

 

************************* Step 3 **************************

Activate python virtual env

(base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % conda activate airflow-tutorial1

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % pwd

/Users/saurabhkumar/Documents/airflow-tutorial

 

*********************** Step 4 *****************************

Export AIRFLOW_HOME pointing to the absolute path of this directory

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % export AIRFLOW_HOME=/Users/saurabhkumar/Documents/airflow-tutorial

 

************************* Step 5 *****************************

Now install airflow 1.10.10

 

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % pip install 'apache-airflow[gcp,statsd,sentry]'==1.10.10

Collecting apache-airflow[gcp,sentry,statsd]==1.10.10

Downloading apache_airflow-1.10.10-py2.py3-none-any.whl (4.7 MB)

|████████████████████████████████| 4.7 MB 554 kB/s

Collecting pandas<1.0.0,>=0.17.1

done..

installed successfully

 

************************* Step 6 *********************************

Install SQLAlchemy==1.3.23 and Flask-SQLAlchemy==2.4.4

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % pip install SQLAlchemy==1.3.23

Collecting SQLAlchemy==1.3.23

Downloading SQLAlchemy-1.3.23-cp37-cp37m-macosx_10_14_x86_64.whl (1.2 MB)

|████████████████████████████████| 1.2 MB 2.5 MB/s

Installing collected packages: SQLAlchemy

Attempting uninstall: SQLAlchemy

Found existing installation: SQLAlchemy 1.4.9

Uninstalling SQLAlchemy-1.4.9:

Successfully uninstalled SQLAlchemy-1.4.9

Successfully installed SQLAlchemy-1.3.23

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % pip install Flask-SQLAlchemy==2.4.4

Collecting Flask-SQLAlchemy==2.4.4

Downloading Flask_SQLAlchemy-2.4.4-py2.py3-none-any.whl (17 kB)

Requirement already satisfied: Flask>=0.10 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask-SQLAlchemy==2.4.4) (1.1.2)

Requirement already satisfied: SQLAlchemy>=0.8.0 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask-SQLAlchemy==2.4.4) (1.3.23)

Requirement already satisfied: Werkzeug>=0.15 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask>=0.10->Flask-SQLAlchemy==2.4.4) (0.16.1)

Requirement already satisfied: Jinja2>=2.10.1 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask>=0.10->Flask-SQLAlchemy==2.4.4) (2.10.3)

Requirement already satisfied: click>=5.1 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask>=0.10->Flask-SQLAlchemy==2.4.4) (7.1.2)

Requirement already satisfied: itsdangerous>=0.24 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Flask>=0.10->Flask-SQLAlchemy==2.4.4) (1.1.0)

Requirement already satisfied: MarkupSafe>=0.23 in /Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=0.10->Flask-SQLAlchemy==2.4.4) (1.1.1)

Installing collected packages: Flask-SQLAlchemy

Attempting uninstall: Flask-SQLAlchemy

Found existing installation: Flask-SQLAlchemy 2.5.1

Uninstalling Flask-SQLAlchemy-2.5.1:

Successfully uninstalled Flask-SQLAlchemy-2.5.1

Successfully installed Flask-SQLAlchemy-2.4.4

 

************************ Step 7 *******************************

initialize the airflow database

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % airflow initdb   

DB: sqlite:////Users/saurabhkumar/Documents/airflow-tutorial/airflow.db

[2021-04-18 21:15:50,507] {db.py:378} INFO – Creating tables

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.

INFO  [alembic.runtime.migration] Will assume non-transactional DDL.

INFO  [alembic.runtime.migration] Running upgrade  -> e3a246e0dc1, current schema

INFO  [alembic.runtime.migration] Running upgrade e3a246e0dc1 -> 1507a7289a2f, create is_encrypted

/Users/saurabhkumar/opt/anaconda3/envs/airflow-tutorial1/lib/python3.7/site-packages/alembic/ddl/sqlite.py:44: UserWarning: Skipping unsupported ALTER for creation of implicit constraintPlease refer to the batch mode feature which allows for SQLite migrations using a copy-and-move strategy.

“Skipping unsupported ALTER for ”

INFO  [alembic.runtime.migration] Running upgrade 1507a7289a2f -> 13eb55f81627, maintain history for compatibility with earlier migrations

INFO  [alembic.runtime.migration] Running upgrade 13eb55f81627 -> 338e90f54d61, More logging into task_instance

INFO  [alembic.runtime.migration] Running upgrade 338e90f54d61 -> 52d714495f0, job_id indices

INFO  [alembic.runtime.migration] Running upgrade 52d714495f0 -> 502898887f84, Adding extra to Log

INFO  [alembic.runtime.migration] Running upgrade 502898887f84 -> 1b38cef5b76e, add dagrun

INFO  [alembic.runtime.migration] Running upgrade 1b38cef5b76e -> 2e541a1dcfed, task_duration

INFO  [alembic.runtime.migration] Running upgrade 2e541a1dcfed -> 40e67319e3a9, dagrun_config

INFO  [alembic.runtime.migration] Running upgrade 40e67319e3a9 -> 561833c1c74b, add password column to user

INFO  [alembic.runtime.migration] Running upgrade 561833c1c74b -> 4446e08588, dagrun start end

INFO  [alembic.runtime.migration] Running upgrade 4446e08588 -> bbc73705a13e, Add notification_sent column to sla_miss

INFO  [alembic.runtime.migration] Running upgrade bbc73705a13e -> bba5a7cfc896, Add a column to track the encryption state of the ‘Extra’ field in connection

INFO  [alembic.runtime.migration] Running upgrade bba5a7cfc896 -> 1968acfc09e3, add is_encrypted column to variable table

INFO  [alembic.runtime.migration] Running upgrade 1968acfc09e3 -> 2e82aab8ef20, rename user table

INFO  [alembic.runtime.migration] Running upgrade 2e82aab8ef20 -> 211e584da130, add TI state index

INFO  [alembic.runtime.migration] Running upgrade 211e584da130 -> 64de9cddf6c9, add task fails journal table

INFO  [alembic.runtime.migration] Running upgrade 64de9cddf6c9 -> f2ca10b85618, add dag_stats table

INFO  [alembic.runtime.migration] Running upgrade f2ca10b85618 -> 4addfa1236f1, Add fractional seconds to mysql tables

INFO  [alembic.runtime.migration] Running upgrade 4addfa1236f1 -> 8504051e801b, xcom dag task indices

INFO  [alembic.runtime.migration] Running upgrade 8504051e801b -> 5e7d17757c7a, add pid field to TaskInstance

INFO  [alembic.runtime.migration] Running upgrade 5e7d17757c7a -> 127d2bf2dfa7, Add dag_id/state index on dag_run table

INFO  [alembic.runtime.migration] Running upgrade 127d2bf2dfa7 -> cc1e65623dc7, add max tries column to task instance

INFO  [alembic.runtime.migration] Running upgrade cc1e65623dc7 -> bdaa763e6c56, Make xcom value column a large binary

INFO  [alembic.runtime.migration] Running upgrade bdaa763e6c56 -> 947454bf1dff, add ti job_id index

INFO  [alembic.runtime.migration] Running upgrade 947454bf1dff -> d2ae31099d61, Increase text size for MySQL (not relevant for other DBs’ text types)

INFO  [alembic.runtime.migration] Running upgrade d2ae31099d61 -> 0e2a74e0fc9f, Add time zone awareness

INFO  [alembic.runtime.migration] Running upgrade d2ae31099d61 -> 33ae817a1ff4, kubernetes_resource_checkpointing

INFO  [alembic.runtime.migration] Running upgrade 33ae817a1ff4 -> 27c6a30d7c24, kubernetes_resource_checkpointing

INFO  [alembic.runtime.migration] Running upgrade 27c6a30d7c24 -> 86770d1215c0, add kubernetes scheduler uniqueness

INFO  [alembic.runtime.migration] Running upgrade 86770d1215c0, 0e2a74e0fc9f -> 05f30312d566, merge heads

INFO  [alembic.runtime.migration] Running upgrade 05f30312d566 -> f23433877c24, fix mysql not null constraint

INFO  [alembic.runtime.migration] Running upgrade f23433877c24 -> 856955da8476, fix sqlite foreign key

INFO  [alembic.runtime.migration] Running upgrade 856955da8476 -> 9635ae0956e7, index-faskfail

INFO  [alembic.runtime.migration] Running upgrade 9635ae0956e7 -> dd25f486b8ea, add idx_log_dag

INFO  [alembic.runtime.migration] Running upgrade dd25f486b8ea -> bf00311e1990, add index to taskinstance

INFO  [alembic.runtime.migration] Running upgrade 9635ae0956e7 -> 0a2a5b66e19d, add task_reschedule table

INFO  [alembic.runtime.migration] Running upgrade 0a2a5b66e19d, bf00311e1990 -> 03bc53e68815, merge_heads_2

INFO  [alembic.runtime.migration] Running upgrade 03bc53e68815 -> 41f5f12752f8, add superuser field

INFO  [alembic.runtime.migration] Running upgrade 41f5f12752f8 -> c8ffec048a3b, add fields to dag

INFO  [alembic.runtime.migration] Running upgrade c8ffec048a3b -> dd4ecb8fbee3, Add schedule interval to dag

INFO  [alembic.runtime.migration] Running upgrade dd4ecb8fbee3 -> 939bb1e647c8, task reschedule fk on cascade delete

INFO  [alembic.runtime.migration] Running upgrade 939bb1e647c8 -> 6e96a59344a4, Make TaskInstance.pool not nullable

INFO  [alembic.runtime.migration] Running upgrade 6e96a59344a4 -> d38e04c12aa2, add serialized_dag table

Revision ID: d38e04c12aa2

Revises: 6e96a59344a4

Create Date: 2019-08-01 14:39:35.616417

INFO  [alembic.runtime.migration] Running upgrade d38e04c12aa2 -> b3b105409875, add root_dag_id to DAG

Revision ID: b3b105409875

Revises: d38e04c12aa2

Create Date: 2019-09-28 23:20:01.744775

INFO  [alembic.runtime.migration] Running upgrade 6e96a59344a4 -> 74effc47d867, change datetime to datetime2(6) on MSSQL tables

INFO  [alembic.runtime.migration] Running upgrade 939bb1e647c8 -> 004c1210f153, increase queue name size limit

INFO  [alembic.runtime.migration] Running upgrade c8ffec048a3b -> a56c9515abdc, Remove dag_stat table

INFO  [alembic.runtime.migration] Running upgrade a56c9515abdc, 004c1210f153, 74effc47d867, b3b105409875 -> 08364691d074, Merge the four heads back together

INFO  [alembic.runtime.migration] Running upgrade 08364691d074 -> fe461863935f, increase_length_for_connection_password

INFO  [alembic.runtime.migration] Running upgrade fe461863935f -> 7939bcff74ba, Add DagTags table

INFO  [alembic.runtime.migration] Running upgrade 7939bcff74ba -> a4c2fd67d16b, add pool_slots field to task_instance

INFO  [alembic.runtime.migration] Running upgrade a4c2fd67d16b -> 852ae6c715af, Add RenderedTaskInstanceFields table

INFO  [alembic.runtime.migration] Running upgrade 852ae6c715af -> 952da73b5eff, add dag_code table

Done.

 

********************* Step 8 ****************************

Start the webserver by typing airflow webserver

 

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % airflow webserver

____________       _____________

____    |__( )_________  __/__  /________      __

____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /

___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /

_/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/

[2021-04-18 21:16:15,198] {__init__.py:51} INFO – Using executor SequentialExecutor

[2021-04-18 21:16:15,199] {dagbag.py:396} INFO – Filling up the DagBag from /Users/saurabhkumar/Documents/airflow-tutorial/dags

Running the Gunicorn Server with:

Workers: 4 sync

Host: 0.0.0.0:8080

Timeout: 120

Logfiles: – –

=================================================================

[2021-04-18 21:16:17 +0530] [58546] [INFO] Starting gunicorn 19.10.0

[2021-04-18 21:16:17 +0530] [58546] [INFO] Listening at: http://0.0.0.0:8080 (58546)

[2021-04-18 21:16:17 +0530] [58546] [INFO] Using worker: sync

[2021-04-18 21:16:17 +0530] [58552] [INFO] Booting worker with pid: 58552

[2021-04-18 21:16:17 +0530] [58553] [INFO] Booting worker with pid: 58553

[2021-04-18 21:16:17 +0530] [58554] [INFO] Booting worker with pid: 58554

[2021-04-18 21:16:17 +0530] [58555] [INFO] Booting worker with pid: 58555

[2021-04-18 21:16:17,371] {__init__.py:51} INFO – Using executor SequentialExecu

 

 

 

********************** Step 9 ***********************************

Open another tab and start the Airflow scheduler

 

(base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % export AIRFLOW_HOME=/Users/saurabhkumar/Documents/airflow-tutorial

(base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % pwd

/Users/saurabhkumar/Documents/airflow-tutorial

(base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % conda activate airflow-tutorial1

(airflow-tutorial1) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % airflow scheduler

 

*********************** create user and password in airflow *********************

airflow create_user -r Admin -u saukumar -e saurabhmcakiet@gmail.com -f Saurabh -l Singh -p saukumar

(Note: create_user is the Airflow 1.10.x subcommand; from Airflow 2.0 onwards it becomes airflow users create.)
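With the webserver and scheduler running, you can verify the setup by dropping a minimal DAG into $AIRFLOW_HOME/dags. The sketch below is an illustration, not part of the original walkthrough: the dag_id, schedule and bash command are arbitrary examples, and the imports use the Airflow 1.10.x paths.

# $AIRFLOW_HOME/dags/hello_airflow.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

default_args = {
    "owner": "airflow",
    "start_date": datetime(2021, 4, 1),
}

with DAG(
    dag_id="hello_airflow",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Airflow is up and running'",
    )

After a scheduler refresh the DAG should appear in the UI at http://localhost:8080, where you can trigger it manually.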



Google Container Registry (GCR) with Minikube or K8s

When you use Google Container Registry (GCR) and see the dreaded ImagePullBackOff status on your pods in Minikube/K8s, then this article can help you solve that error.

Error :

(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl describe pod airflow-postgres-694899d6fd-lqp2c -n airflow

Events:
Type Reason Age From Message
—- —— —- —- ——-
Normal Scheduled 56s default-scheduler Successfully assigned airflow/airflow-postgres-694899d6fd-lqp2c to minikube
Warning Failed 29s (x2 over 48s) kubelet Failed to pull image “gcr.io/<gcp_project>/postgres-airflow”: rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don’t have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
Warning Failed 29s (x2 over 48s) kubelet Error: ErrImagePull
Normal BackOff 13s (x2 over 48s) kubelet Back-off pulling image “gcr.io/<gcp_project>/postgres-airflow”
Warning Failed 13s (x2 over 48s) kubelet Error: ImagePullBackOff
Normal Pulling 1s (x3 over 55s) kubelet Pulling image “gcr.io/<gcp_project>/postgres-airflow”

 

Solution :

************************ this is to create a secret for image pull *****************************************
Step 1: Create the Kubernetes secret using the kubectl create secret docker-registry command. The docker-registry secret subcommand makes it easy to create secrets that work with ImagePullSecrets.

kubectl -n airflow create secret docker-registry spark-pullimage --docker-server=https://gcr.io --docker-username=_json_key --docker-email=airflow-gcsfuse@<gcp_project>.iam.gserviceaccount.com --docker-password="$(cat /Users/saurabhkumar/Downloads/<gcp_project>-edae8efc166c.json)"

Step 2: Now update the default service account for the namespace with ImagePullSecrets (an equivalent call with the Kubernetes Python client is sketched after the command).

kubectl --namespace=airflow patch serviceaccount default -p '{"imagePullSecrets": [{"name": "spark-pullimage"}]}'
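If you prefer to script this step, here is a minimal sketch using the official kubernetes Python client. This is an illustration and not part of the original post; it assumes pip install kubernetes and a working kubeconfig for the same cluster.

from kubernetes import client, config

# Load credentials from the local kubeconfig (same context that kubectl uses).
config.load_kube_config()

v1 = client.CoreV1Api()

# Attach the pull secret to the default service account in the airflow namespace.
v1.patch_namespaced_service_account(
    name="default",
    namespace="airflow",
    body={"imagePullSecrets": [{"name": "spark-pullimage"}]},
)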

Step 3: Deploy your pod (.yaml) file.

(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl get pods -n airflow
No resources found in airflow namespace.
(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl --namespace=airflow apply -f /Users/saurabhkumar/Downloads/postgres.yaml
service/airflow-postgres-svc created
deployment.apps/airflow-postgres created
(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl get pods -n airflow
NAME READY STATUS RESTARTS AGE
airflow-postgres-6d85bf7599-s6bdv 0/1 ContainerCreating 0 3s

(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl get pods -n airflow
NAME READY STATUS RESTARTS AGE
airflow-postgres-6d85bf7599-s6bdv 1/1 Running 0 72s

Step 4: Log in to the pod and test it.

 

(base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl -n airflow exec -ti airflow-postgres-6d85bf7599-s6bdv /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] — [COMMAND] instead.
# psql -U postgres
psql (13.2 (Debian 13.2-1.pgdg100+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
———–+———-+———-+————+————+———————–
airflow | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/postgres +
| | | | | postgres=CTc/postgres+
| | | | | airflow=CTc/postgres
celery | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/postgres +
| | | | | postgres=CTc/postgres+
| | | | | celery=CTc/postgres
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)

postgres=# \c airflow
You are now connected to database “airflow” as user “postgres”.
airflow=# \dt
Did not find any relations.
airflow=# \q
# exit

***************************************************************************************************************************



Insert overwrite query Failed with exception Unable to move source

If you have explicitly set hive.exec.stagingdir to a location like /tmp/ or some other location, then whenever you run an INSERT OVERWRITE statement you will get the following error.

ERROR exec.Task (SessionState.java:printError(989)) – Failed with exception Unable to move source hdfs://clustername/apps/finance/nest/nest_audit_log_final/
.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000 to
destination hdfs://clustername/apps/finance/nest/nest_audit_log_final

Example: 

INSERT OVERWRITE TABLE nest.nest_audit_log_final
SELECT
project_name
, application
, module_seq_num
, module_name
, script_seq_num
, script_name
, run_session_id
, load_ts
, max_posted_date
, currency
, processor
, load_date
FROM nest.nest_audit_log_final;

Then you will get the following error:

INFO common.FileUtils (FileUtils.java:mkdir(519)) - 
Creating directory if it doesn't exist: 
hdfs://clustername/apps/finance/nest/nest_audit_log_final
2017-12-12 19:21:42,508 ERROR hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) 
- Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
2017-12-12 19:21:42,525 ERROR exec.Task (SessionState.java:printError(989)) 
- Failed with exception Unable to move source 
hdfs://clustername/apps/finance/nest/nest_audit_log_final/
.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000 to 
destination hdfs://clustername/apps/finance/nest/nest_audit_log_final
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://clustername/apps/finance/nest/nest_audit_log_final/
.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000 
to destination hdfs://clustername/apps/finance/nest/nest_audit_log_final
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2900)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3140)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1727)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:353)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1745)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1491)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1289)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1146)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:217)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:169)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:380)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:315)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:429)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:718)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:685)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.io.IOException: rename for src path: 
hdfs://clustername/apps/finance/nest/nest_audit_log_final/
.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000/000000_0 
to dest:hdfs://clustername/apps/finance/nest/nest_audit_log_final returned false
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2849)

Root Cause: It is because of a bug (HIVE-17063). 

Workaround:

set hive.exec.stagingdir=.hive-staging;


last access time of a table is showing zero

If you have many hundreds or thousands of tables and you want to know when a Hive table was last accessed, you can run the following MySQL query against the hive metastore database.

mysql> use hive;

mysql> select TBL_NAME,LAST_ACCESS_TIME from TBLS where DB_ID=<db_id>;
+—————————————————————————————————-+——————+
| TBL_NAME | LAST_ACCESS_TIME |
+—————————————————————————————————-+——————+
| df_nov_4 | 0 |
| google_feed_20151111 | 0 |
| null_recos | 0 |
| taxonomy_20160107 | 0 |

Note: unfortunately, due to the bug https://issues.apache.org/jira/browse/HIVE-2526, you cannot do that without making some configuration changes like the following.

1. From Ambari > Hive > Advanced > Custom hive-site, edit (if it exists) or add a new property:
hive.security.authorization.sqlstd.confwhitelist=hive\.exec\.pre\.hooks

2. From Ambari > Hive > Advanced > General, edit hive.exec.pre.hooks property and append the following to the end of the field value (comma separated):
org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec

3. Restart the affected Hive services after saving the config changes in Ambari.

4. After the restart, try running the query (select TBL_NAME,LAST_ACCESS_TIME from TBLS where DB_ID=<db_id>;) again and this time you should be able to see the last access time.

 

I hope it will help you to get your work done, feel free to give your valuable feedback.



kill hive query where application id was not created


Category : Hive

Sometimes when you run a Hive query, it does not launch an application or gets hung due to a lack of resources or some other reason.

In this case you have to kill the query in order to resubmit it. So, please use the following steps to kill the Hive query itself.

 

  • hive> select * from table1;
    Query ID = mapr_201804547_2ad87f0f5627
    Total jobs = 1
    Launching Job 1 out of 1
  • Use the "kill query" command, which is available from HDP 2.6.3, passing the Query ID printed above:
    KILL QUERY <queryid1>

 

Please feel free to give your feedback.



Purging history/old data in oozie database

After some period of time your Oozie DB will become big, and it may start throwing space issues or showing slowness when the Oozie UI loads. There are some properties which will help you purge your Oozie data, but sometimes the Oozie purge service does not function as expected. This results in a huge Oozie database, which slows down your Oozie UI.

To reduce the size of the tables, you can run the queries below to delete old historical records (a scripted version of the same clean-up is sketched after this list):

  • Backup the database (highly recommend).
    mysqldump -u root -p oozie > /tmp/oozie.sql 
  • Login to the oozie database.
    mysql -u root -p <password> 
  • Run the below queries to clean up the historical records older than a specific date (please adjust the date accordingly):
DELETE FROM OOZIE.WF_ACTIONS where WF_ID IN (SELECT ID from OOZIE.WF_JOBS where end_time < timestamp('2015-09-01 00:00:00'));
DELETE FROM oozie.wf_jobs where end_time < timestamp('2015-09-01 00:00:00');

DELETE from oozie.coord_actions where JOB_ID in (select ID from oozie.coord_jobs where END_TIME < timestamp('2015-09-01 00:00:00'));

DELETE from oozie.coord_jobs where END_TIME < timestamp('2015-09-01 00:00:00'); 
  • If you are using mysql then run the following command to reduce the database size:
    mysqlcheck -u root -p<password> -o oozie
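If you have to repeat this clean-up regularly, here is a minimal Python sketch of the same DELETE statements using mysql-connector-python. This is an illustration, not part of the original post: the connection details and cutoff date are placeholders, table-name case may differ on your installation, and you should still take the backup first.

import mysql.connector  # pip install mysql-connector-python

CUTOFF = "2015-09-01 00:00:00"  # placeholder: adjust the cutoff date accordingly

conn = mysql.connector.connect(
    host="localhost", user="root", password="<password>", database="oozie"
)
cur = conn.cursor()

# Same order as the manual queries: child tables first, then their parent jobs.
statements = [
    "DELETE FROM wf_actions WHERE wf_id IN (SELECT id FROM wf_jobs WHERE end_time < %s)",
    "DELETE FROM wf_jobs WHERE end_time < %s",
    "DELETE FROM coord_actions WHERE job_id IN (SELECT id FROM coord_jobs WHERE end_time < %s)",
    "DELETE FROM coord_jobs WHERE end_time < %s",
]

for stmt in statements:
    cur.execute(stmt, (CUTOFF,))
    print(f"{cur.rowcount} rows deleted: {stmt}")

conn.commit()
cur.close()
conn.close()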

And now you can be happy, as you have purged the old data in the Oozie DB.

You can also update the Oozie configuration to auto-purge:

oozie.service.PurgeService.coord.older.than = 7
oozie.service.PurgeService.bundle.older.than = 7
oozie.service.PurgeService.purge.limit = 100
oozie.service.PurgeService.older.than = 7
oozie.service.PurgeService.purge.interval = 3600
oozie.service.PurgeService.purge.old.coord.action = true

I hope this article helped you, please feel free to give your valuable feedback.



Attempt to add *.jar multiple times to the distributed cache

When we submit a Spark2 action via Oozie, we may see the following exception in the logs and the job will fail:

exception: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache.

java.lang.IllegalArgumentException: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache.

The above error occurs because the same jar files exist in both locations (/user/oozie/share/lib/lib_20171129113304/oozie/ and /user/oozie/share/lib/lib_20171129113304/spark2/).

Solution:

You need to delete the duplicate jars from the spark2 directory so that only one copy is left, in the oozie directory.

  1. Identify the Oozie sharelib by running the command:
    hdfs dfs -ls /user/oozie/share/lib/
  2. Use the following command to list all jar files in the oozie directory:
    hdfs dfs -ls /user/oozie/share/lib/lib_<timestamp>/oozie | awk -F \/ '{print $8}' > /tmp/list
  3. Use the following command to delete the jar files in the spark2 directory which match those in the oozie directory:
    for f in $(cat /tmp/list);do echo $f; hdfs dfs -rm -skipTrash /user/oozie/share/lib/lib_<timestamp>/spark2/$f;done
  4. Restart Oozie Service.

Thanks for visiting this blog, please feel free to give your valuable feedback.



hive jdbc in zeppelin throwing permission error to anonymous user

When users run a Hive query in Zeppelin via the JDBC interpreter, it runs as an anonymous user rather than the actual user.

INFO [2017-11-02 03:18:20,405] ({pool-2-thread-2} RemoteInterpreter.java[pushAngularObjectRegistryToRemote]:546) – Push local angular object registry from ZeppelinServer to remote interpreter group 2CNQZ1ES5:shared_process
WARN [2017-11-02 03:18:21,825] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2058) – Job 20171031-075630_2029577092 is finished, status: ERROR, exception: null, result: %text org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Unable to fetch table ushi_gl. org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=EXECUTE, inode=”/apps/hive/warehouse/adodb.db/ushi_gl”:hive:hdfs:drwxr-x— 
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnforcer(RangerHdfsAuthorizer.java:381)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:338)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1955)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:109)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4111)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:854)

 

Root cause: it is a bug in Zeppelin 0.7.0.2 and is fixed in newer versions of Zeppelin.

Resolution: add your username and password in the Credentials option in Zeppelin, so the JDBC interpreter authenticates as your user.