Hello everyone,
I want to share my recent experience deploying DataHub, in the hope that it helps someone. First, some context on our use case.
Data Stack:
- Data warehouse on Google BigQuery.
- Metabase as the BI platform.
- Airflow as our main ETL tool.
- Everything deployed on GKE.
- DataHub deployed on the cluster with Helm.
The volume:
- More than 1k database tables.
- About 300 views.
- About 500 dashboards (including personal dashboards).
- About 7k charts.
The problem:
- With the default installation of the DataHub chart, we ran into a lot of errors.
- We could not get the standalone consumers configured properly.
- The frontend experience was awful because of the volume of errors (HTTP 500 responses).
Given that, here is what we decided to do:
- Implement our own Helm chart with a different structure and configuration options (plus CI/CD workflows).
- Drop the standalone consumers, run ingestion from Airflow via the KubernetesPodOperator (see the DAG sketch after this list), and disable ingestion from the frontend.
- Change the Elasticsearch defaults (a 2 GB heap wasn't enough for our volume): the JVM heap is now 4 GB, and we run 3 replicas with 1 master node.
- Increase all timeout variables and the Elasticsearch thread counts, reduce GMS max connections to 8, and raise the Play memory buffer size to 50 MB, since our POST requests are quite large (see the values sketch below).
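For reference, here is a minimal sketch of the Airflow side. The image tag, namespace, recipe path, and schedule are illustrative assumptions, not our exact setup; the pod just runs the DataHub CLI (`datahub ingest -c <recipe>`) against a recipe that is baked into the image or mounted in.

```python
# Minimal sketch: one DataHub ingestion run as an isolated pod.
# Image, namespace, recipe path, and schedule are hypothetical values.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="datahub_bigquery_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_bigquery = KubernetesPodOperator(
        task_id="ingest_bigquery",
        name="datahub-ingest-bigquery",
        namespace="datahub",                       # assumed namespace
        image="acryldata/datahub-ingestion:head",  # pin a concrete tag in practice
        cmds=["datahub"],
        arguments=["ingest", "-c", "/recipes/bigquery.yaml"],
        get_logs=True,
        is_delete_operator_pod=True,               # clean up finished pods
    )
```

Running each source in its own pod keeps ingestion failures isolated from the Airflow workers and lets you pin the DataHub CLI version per source.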
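On the configuration side, the overrides look roughly like the sketch below. The key and env var names here are assumptions (they have changed across chart versions, so check them against your chart); only the numbers (4 GB heap, 3 replicas, 1 master, 8 connections, 50 MB buffer) come from our setup. Note that in the official charts Elasticsearch lives in the prerequisites chart, not the DataHub chart itself.

```yaml
# Sketch of Helm values overrides -- key and env var names are illustrative
# and must be checked against your chart version.
elasticsearch:                      # prerequisites chart (elastic/elasticsearch)
  replicas: 3
  minimumMasterNodes: 1
  esJavaOpts: "-Xms4g -Xmx4g"       # 4 GB JVM heap instead of the 2 GB default
  resources:
    requests:
      memory: "5Gi"                 # leave headroom above the heap
    limits:
      memory: "5Gi"

datahub-gms:
  extraEnvs:
    - name: EBEAN_MAX_CONNECTIONS   # hypothetical name: cap GMS DB connections at 8
      value: "8"

datahub-frontend:
  extraEnvs:
    - name: JAVA_OPTS               # raise the Play memory buffer for large POST requests
      value: "-Dplay.server.http.parser.maxMemoryBuffer=50MB"
```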
With the actions above, we now have a stable platform. We are figuring out the next steps and hope to keep helping the community.
Feel free to ask me anything. I will try to help!