What follows is only a cost estimate. Note that we are still in the process of rolling out DataHub internally.
Let's get to the numbers:
• Only 1 engineer is responsible for the infrastructure (me)
• Approximately 20 active users every month (beta testers)
• Time spent maintaining DataHub is close to 0h (I spend about 10 minutes every day checking in Grafana that everything is OK. Extra: we use Cloud Monitoring to send alerts based on log metrics, and Grafana to send alerts based on resource usage)
• We ran a POC with the quickstart for about 2 months, and we are now 1 month into the definitive version.
About the infra:
• It's all deployed on Google Cloud.
• We use MySQL on Cloud SQL
• We developed our own Helm chart and deployed it on GKE.
About infrastructure costs:
For the analysis, I used the Cloud SQL pricing table, the number of replicas of each Deployment, and the resources declared in values.yaml.
We currently use approximately one e2-standard-4 instance in GKE. Adding the cost of the Cloud SQL instance, our monthly spend is approximately US$ 149.61.
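
A rough sketch of how an estimate like this can be put together: sum the requests of every replica and compare the total with the shape of an e2-standard-4 (4 vCPUs, 16 GB of memory). The replica counts follow the production setup listed under Extras (reading "Kafka + Zookeeper" as two replicas each). The per-pod requests, the workload names, and the function names are hypothetical placeholders, not the contents of our values.yaml. Prices are deliberately left as inputs, to be taken from the pricing pages in the references.

```python
from dataclasses import dataclass

# Shape of one e2-standard-4 instance: 4 vCPUs and 16 GB of memory.
E2_STANDARD_4_VCPU = 4.0
E2_STANDARD_4_MEM_GB = 16.0


@dataclass
class Workload:
    name: str
    replicas: int
    cpu_request: float     # vCPUs requested per pod (hypothetical value)
    mem_request_gb: float  # GB of memory requested per pod (hypothetical value)


# Replica counts come from the post; the requests are illustrative placeholders.
WORKLOADS = [
    Workload("gms", 3, 0.3, 1.0),
    Workload("elasticsearch", 3, 0.3, 1.5),
    Workload("frontend", 2, 0.2, 0.5),
    Workload("kafka", 2, 0.2, 1.0),
    Workload("zookeeper", 2, 0.1, 0.5),
]


def total_requests(workloads):
    """Return (total vCPUs, total GB of memory) requested across all replicas."""
    cpu = sum(w.replicas * w.cpu_request for w in workloads)
    mem = sum(w.replicas * w.mem_request_gb for w in workloads)
    return cpu, mem


def instance_equivalents(cpu, mem_gb):
    """Fraction of an e2-standard-4 needed, driven by the tighter of CPU and memory."""
    return max(cpu / E2_STANDARD_4_VCPU, mem_gb / E2_STANDARD_4_MEM_GB)


def monthly_cost(equivalents, instance_price, cloud_sql_price):
    """Monthly estimate: compute share plus the Cloud SQL instance (prices supplied by the caller)."""
    return equivalents * instance_price + cloud_sql_price


if __name__ == "__main__":
    cpu, mem = total_requests(WORKLOADS)
    print(f"requested: {cpu:.2f} vCPU, {mem:.2f} GB")
    print(f"e2-standard-4 equivalents: {instance_equivalents(cpu, mem):.2f}")
```

To turn the result into a monthly figure, pass the current e2-standard-4 and Cloud SQL prices for your region to `monthly_cost`. This only counts declared requests, so it ignores system pods and any headroom the cluster needs.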
About time cost:
We spend more time learning about the tool and how to use and improve the ingestion process than actually doing maintenance or other tasks.
Extras:
• We are working only with Elasticsearch and no standalone consumers. In the production environment, we run three replicas each of GMS and Elasticsearch, and two replicas each of Frontend and Kafka + Zookeeper.
• We do not use environment isolation in the ingestion process. Instead, we run two DataHub deployments in separate Kubernetes namespaces. This can increase infrastructure costs, but we chose it to keep our metadata safer (avoiding test failures and metadata loss).
References:
https://cloud.google.com/compute/vm-instance-pricing
https://cloud.google.com/compute/docs/general-purpose-machines
https://cloud.google.com/sql/pricing