# getting-started
Hello guys, I want to share my recent experience deploying DataHub, in the hope that it helps someone. First, some characteristics of our use case.

Data stack:
- Data warehouse on Google BigQuery.
- Metabase as the BI platform.
- Airflow as our principal ETL tool.
- Everything deployed on GKE.
- DataHub deployed on the cluster with Helm.

The volume:
- More than 1k database tables.
- About 300 views.
- About 500 dashboards (including personal dashboards).
- About 7k charts.

The problems:
- With the default installation of the DataHub chart, a lot of errors occurred.
- We could not get the standalone consumers configured properly.
- The frontend experience was awful because of the volume of errors (500 status code messages).

With that, what we decided to do:
- Implement our own Helm chart, with a different structure and configuration options (and with CI/CD workflows).
- Drop the standalone consumers, run ingestion from Airflow via KubernetesPodOperator, and disable frontend ingestion (see the sketches after this list).
- Change the Elasticsearch defaults, since the 2 GB JVM heap wasn't enough for our volume: the heap is now 4 GB, and we run 3 replicas with 1 master node.
- Increase all the timeout variables and the Elasticsearch thread count, reduce GMS max connections to 8, and raise the Play memory buffer size to 50 MB (the POST requests are quite large).

With the actions above, we deployed a stable platform. Now we are figuring out the next steps and hoping to give back to the community. Feel free to ask me anything, I will try to help!
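To make the Airflow piece concrete, here is a minimal sketch of an ingestion DAG using KubernetesPodOperator, assuming Airflow 2.x with the cncf.kubernetes provider installed. The image tag, namespace, schedule, and recipe path are placeholders rather than our exact values, and mounting the recipe file into the pod (e.g. from a ConfigMap) is omitted:

```python
# Sketch: run DataHub CLI ingestion in its own pod, replacing the
# standalone consumers and frontend-triggered ingestion.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="datahub_bigquery_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Running ingestion in a separate pod keeps it from competing with
    # GMS or the frontend for memory on the cluster.
    ingest_bigquery = KubernetesPodOperator(
        task_id="ingest_bigquery",
        name="datahub-ingest-bigquery",
        namespace="datahub",
        image="acryldata/datahub-ingestion:head",
        cmds=["datahub"],
        arguments=["ingest", "-c", "/recipes/bigquery.yml"],
        get_logs=True,
        is_delete_operator_pod=True,
    )
```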
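The recipe the pod executes can also be expressed through DataHub's programmatic Pipeline API instead of a YAML file. This sketch uses the bigquery source and datahub-rest sink; the project id and GMS address are placeholders for whatever your Helm release exposes:

```python
# Sketch: programmatic equivalent of `datahub ingest -c bigquery.yml`.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            "config": {"project_id": "my-gcp-project"},  # placeholder
        },
        "sink": {
            "type": "datahub-rest",
            # In-cluster GMS service name; depends on your release name.
            "config": {"server": "http://datahub-datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail the task if ingestion reported errors
```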
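And a quick way to sanity-check that the Elasticsearch resizing took effect, using the official elasticsearch Python client; the service name is a placeholder for the one your chart creates:

```python
# Sketch: confirm the node count and per-node JVM heap after retuning.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-master:9200")  # placeholder host

health = es.cluster.health()
print(health["status"], health["number_of_nodes"])

# Each node should report roughly the 4 GB heap set via the JVM options.
for node in es.nodes.stats(metric="jvm")["nodes"].values():
    heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024**3
    print(node["name"], f"{heap_max_gb:.1f} GB heap")
```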
cc @orange-night-91387 @incalculable-ocean-74010 anything actionable we can take on the Helm side from Patrick?