:question: Question(s) of the Week: What are your ...
# random
l
Question(s) of the Week: What are your monthly infrastructure costs to run DataHub? How many Engineering Hours do you spend per month to maintain DataHub? @calm-autumn-56629 & I are excited to hear what factors come into play here; this is also a SUPER common question from new Community Members, so the more context you can provide, the better! Please leave your comments in the 🧵 and we’ll pick a random DataHub Swag winner next Monday, July 18th 🦠
plus1 2
👀 10
l
My team might be a bit of a special snowflake since we deployed DataHub onto our own private cloud, but I'll try to chime in here.

Monthly Cost:
• $0 (private cloud FTW)

DataHub Stack:
• PostgreSQL (3x VMs, with Patroni to help with high availability)
• Consul (3x VMs) for service discovery
• Nomad (3x masters and 3x clients) for container orchestration (the DataHub frontend & GMS containers are deployed here, as well as our Schema Registry for Kafka & HAProxy for load balancing)
• Kafka (3x brokers and 3x ZooKeeper VMs)
• Elasticsearch (3x VMs) for graph and search
• Self-hosted GitHub Runners (deployed on Nomad) for metadata ingestion (example recipe below)

Monthly Engineering Hours:
• <10 to maintain, plus some extra for onboarding sources and setting up pipelines (variable)

We found the most expensive thing to be the initial standup (it took about a month to build something that's highly available and durable), but we already had a lot of maturity running all the other moving parts (Kafka, Elasticsearch, etc.), which helped shave off some time. Hope this helps! Feel free to reach out to me if you want to hear a bit more.
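To make the ingestion piece concrete, here is a minimal sketch of the kind of recipe those runners execute; the source type, hostnames, and credentials are placeholders, not our actual config:

```yaml
# Minimal DataHub ingestion recipe (sketch) -- all values are placeholders.
# A self-hosted runner would execute this with: datahub ingest -c recipe.yaml
source:
  type: postgres                            # assumes the postgres source plugin is installed
  config:
    host_port: "db.internal.example:5432"   # placeholder host
    database: analytics                     # placeholder database
    username: datahub_reader                # placeholder user
    password: "${POSTGRES_PASSWORD}"        # injected from the runner's secrets
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.internal.example:8080"  # placeholder GMS endpoint
```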
👍 1
teamwork 1
g
What I will write below is just a cost estimate. It is worth mentioning that we are in the process of implementing DataHub internally. Let's get to the numbers:
• Only 1 engineer is responsible for the infrastructure (me)
• Approximately 20 active users every month (beta testers)
• Time spent maintaining DataHub is ~0h (I spend about 10 minutes every day checking in Grafana that everything is OK. Extra: we use Cloud Monitoring to send alerts based on log metrics and Grafana to send alerts based on resource usage)
• We had about 2 months of POC using the quickstart, and we are now 1 month into a definitive version

About the infra:
• It's all deployed on Google Cloud
• We use MySQL on Cloud SQL
• We developed our own Helm chart and deployed it on GKE

About infrastructure costs: To do the analysis, I used the Cloud SQL price table, the number of replicas of the Deployments, and the resources declared in values.yaml (see the sketch below). We currently use approximately one e2-standard-4 instance in GKE. Adding that to the cost of the Cloud SQL instance, we spend approximately US$149.61 per month.

About time cost: We spend more time learning about the tool and how to use and improve the ingestion process than actually doing maintenance or other tasks.

Extras:
• We are only working with Elasticsearch and no standalone consumers. In the production environment, we have three replicas of GMS and Elasticsearch, and two replicas of the frontend and of Kafka + ZooKeeper.
• We do not use environment isolation in the ingestion process; instead, we have two DataHub deployments in different namespaces in Kubernetes. This can increase infrastructure costs, but we chose it to keep our metadata safer (avoiding test failures and metadata loss).

References:
• https://cloud.google.com/compute/vm-instance-pricing
• https://cloud.google.com/compute/docs/general-purpose-machines
• https://cloud.google.com/sql/pricing
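As a rough illustration of how the replicas and resource requests in values.yaml drive that estimate, a sketch like the following (the keys follow the datahub-helm subchart layout; the numbers are illustrative, not our exact requests) adds up to a footprint of roughly one e2-standard-4 node (4 vCPU / 16 GB):

```yaml
# Illustrative values.yaml resource requests -- not the exact numbers.
# Summed across replicas, these stay within one e2-standard-4 (4 vCPU / 16 GB),
# which runs roughly US$100/month on-demand (region-dependent); the Cloud SQL
# instance accounts for most of the rest of the ~US$149.61.
datahub-gms:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m          # 3 x 500m  = 1.5 vCPU
      memory: 2Gi        # 3 x 2Gi   = 6 GB
datahub-frontend:
  replicaCount: 2
  resources:
    requests:
      cpu: 250m          # 2 x 250m  = 0.5 vCPU
      memory: 512Mi      # 2 x 512Mi = 1 GB
# The Elasticsearch and Kafka + ZooKeeper replicas fill the remaining capacity.
```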
👍 1
a
Thanks Patrick, really great to have this info. Any chance you could open source your deployment script(s)?
g
Of course, @astonishing-lizard-90580! We are planning to open our Helm chart to the community. But, as I said, we are still testing the chart in our environment. As soon as possible, we will open it to the DataHub community as another option for deploying DataHub on K8s.
plus1 1
i
I’m curious, @gentle-camera-33498, what are you doing differently from the standard Helm chart? Anything we should incorporate? I know the Helm repo doesn't get as much love, but I'm looking to change that!
g
I didn't make that many changes compared to the original chart. I just refactored the repository structure (the organization into sub-templates was confusing and made it difficult to do automated tests the way I did for the Apache Airflow charts) and added a few things like:
- The possibility to change the home logo with a values parameter.
- Optionally, all services (frontend, gms, ..) can use a global version tag for their images, so I only need to change one place to update all services to a new version (see the values sketch below).
- I documented all the configurations that I could, based on the documentation found on GitHub (including the containers' environment variables).
Also a few things like:
- GMS configuration environment variables that did not exist in the chart were added and can be configured via the values file.
- An improvement to the values to make configuring OIDC access easier.
- The option to create a restore-indices job at upgrade time (a post-upgrade hook for when new versions are released).
- All variables in the values file now follow the Helm best practices guide.
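To make a couple of those concrete, here is a minimal sketch of what such a values file could look like for the global image tag and the OIDC settings. The key names are assumptions about this chart's layout rather than upstream datahub-helm keys, though the AUTH_OIDC_* variables they would render into are the standard DataHub frontend ones:

```yaml
# Sketch of the customized values.yaml -- key names are illustrative.
global:
  datahub:
    version: v0.8.40          # placeholder tag applied to every service image

datahub-frontend:
  customLogo:
    enabled: true
    url: "https://example.com/our-logo.png"   # placeholder logo URL
  oidc:                       # rendered into the standard AUTH_OIDC_* env vars
    enabled: true
    clientId: "datahub"       # placeholder client ID
    discoveryUri: "https://sso.example.com/.well-known/openid-configuration"
    # the client secret should come from a Kubernetes Secret, not plain values
```

The restore-indices piece would presumably be a Job template guarded by a `helm.sh/hook: post-upgrade` annotation, so it runs automatically after each chart upgrade.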
i
I think some of those are actually open PRs in datahub-helm; if you see anything that makes sense to put in the Helm charts, let me know!
thank you 1
teamwork 1
l
Hey folks! I’m going to hold off on the swag randomizer this time around since @lemon-hydrogen-83671 & @gentle-camera-33498 have been our recent winners for Question of the Week. Nonetheless, THANK YOUUU for your amazing feedback!!
(really, this just means we need more swag options… i’ll see what i can do teamwork )
l
I nominate @better-orange-49102 if they haven't gotten a swag bundle yet. Very active community member :)
b
I got mine last year, it's fine 😅
🎉 1
l
I feel like @better-orange-49102 was the first person we sent swag to. DataHub Community Member of the centuryyyy
teamwork 2