# getting-started
s
Hi! I'm looking to modernize data ingestion, transformation, and metadata management. I'm curious and looking to quickly connect with anyone who has evaluated or adopted the stack of NiFi (Processing), Airflow (Orchestration), and DataHub (Governance)?
g
We use Airflow + DataHub + a custom-built compute platform on K8s. Happy to share learnings where applicable. NiFi does something totally different, so I can't speak to that project as much. It's a bit stale, but here's a quick overview from a year+ ago of our data platform: https://includedhealth.com/blog/tech/our-journey-to-a-democratized-data-platform/ We've evolved quite a bit since then, so an update to that blog post is much overdue. I'd also recommend looking into projects like dbt or Meltano, which are great open-source platforms for ingestion and transformation. And perhaps a paid platform like Ascend for your compute + orchestration?
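For flavor, here's a minimal sketch of how a job gets kicked off on the Airflow side; the DAG id, schedule, and submit command are placeholders, not our actual setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical sketch: one nightly DAG that submits a Spark job.
# dag_id, schedule, and the command below are illustrative placeholders.
with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit --master k8s://https://k8s-api:443 /jobs/ingest.py",
    )
```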
s
Thank you @gorgeous-dinner-4055! Question about the ingestion and projection tasks: I see you're using Glue as the table registry. Was Glue also considered for those tasks?
g
We didn't consider Glue as a way to process data in our architecture, because we already had K8s for our service orchestration, so Spark on K8s was the natural next step for our data computation. Glue for us simply acts as the registry for connecting to tools like Athena. Spark + K8s has let us write Scala & PySpark jobs, update our Spark version independently of our compute store, and vary the compute instance types. We use Karpenter and can launch pretty massive jobs (hundreds of cores per job & TB+ of memory) and vary instance types (e.g. Graviton + m-series, network-optimized vs memory-optimized) for cost benefits. I think Glue has limitations around instance types & sizes, IIRC.
Spark + K8s is non-trivial to implement, though, so that may not be the right solution for your team.
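To give a sense of the knobs involved, here's a minimal, hypothetical PySpark-on-K8s sketch; the API server URL, image, sizes, and S3 path are placeholders, and the node selector is just one way to steer pods onto Graviton nodes:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch; in practice most of this is passed via spark-submit.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")   # assumed API server URL
    .appName("big-batch-job")
    .config("spark.kubernetes.container.image", "myrepo/spark-py:3.5")  # placeholder image
    .config("spark.executor.instances", "200")   # hundreds of cores per job
    .config("spark.executor.memory", "16g")      # TB+ of memory across executors
    # Steer pods onto Karpenter-provisioned Graviton (arm64) nodes for cost
    .config("spark.kubernetes.node.selector.kubernetes.io/arch", "arm64")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path
df.groupBy("event_type").count().show()
spark.stop()
```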
s
Yeah, we write most of our jobs in Java today, which has its drawbacks, when really many of the tasks could be done with your garden-variety visual ETL tool, since we're not doing complex transformations or aggregations.
90% is field mapping, with the rest being simple value translation.
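To make that concrete, in PySpark terms (our actual jobs are Java; column names invented), a typical job is little more than:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("field-mapping").getOrCreate()

src = spark.read.parquet("s3a://raw/claims/")  # hypothetical source path

out = src.select(
    F.col("clm_id").alias("claim_id"),    # the 90%: rename/select fields
    F.col("mbr_id").alias("member_id"),
    F.when(F.col("sts_cd") == "A", "APPROVED")   # the rest: value translation
     .when(F.col("sts_cd") == "D", "DENIED")
     .otherwise("UNKNOWN")
     .alias("status"),
)

out.write.mode("overwrite").parquet("s3a://curated/claims/")  # hypothetical target
```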
g
Interesting. If it's simple jobs, the above (Spark + K8s + Airflow + DataHub) definitely feels like overkill. You could always create a simple DSL for your users and, in the backend, swap out the different tools in the future; see the sketch below. Plus, if you ever needed to, you could autogen code from it to migrate to new platforms as your org scales. A managed service like Ascend (or equivalent) might be the simplest way to make progress, unless there's appetite to do a lot more with the platform.
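E.g., a rough sketch of what such a DSL could look like: a declarative spec plus a tiny PySpark backend (all names here are hypothetical):

```python
from pyspark.sql import DataFrame, functions as F

# Hypothetical user-facing spec; could just as easily live in YAML/JSON.
MAPPING = {
    "fields": {          # target column -> source column
        "claim_id": "clm_id",
        "member_id": "mbr_id",
    },
}

def compile_mapping(df: DataFrame, spec: dict) -> DataFrame:
    """Compile the declarative spec into a PySpark select.

    The spec stays stable while this backend can later be swapped for
    Glue, a visual ETL tool, or autogenerated code on another platform.
    """
    cols = [F.col(src).alias(tgt) for tgt, src in spec["fields"].items()]
    return df.select(*cols)
```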
Airflow is a bit finicky in our experience; there are always more things to manage than you'd want with that project.