# getting-started
s
Hi! I'm looking to modernize data ingestion, transformation, and metadata management. I'm curious and looking to quickly connect with anyone who has evaluated or adopted the stack of NiFi (Processing), Airflow (Orchestration), and DataHub (Governance)?
g
We use Airflow + DataHub + a custom-built compute platform on K8s. Happy to share learnings where applicable. NiFi does something totally different, so I can't speak to that project as much. It's a bit stale, but here's a quick overview from a year+ ago of our data platform: https://includedhealth.com/blog/tech/our-journey-to-a-democratized-data-platform/ We've evolved quite a bit since then, so an update to that blog post is much overdue. I'd also recommend looking into projects like dbt or Meltano, which are great open-source platforms for ingestion and transformation. And perhaps a paid platform like Ascend for your compute + orchestration?
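For flavor, here's a minimal sketch of how a job gets kicked off on the Airflow side; the DAG id, schedule, and submit command are placeholders, not our actual setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical sketch: one nightly DAG that submits a Spark job.
# dag_id, schedule, and the command below are illustrative placeholders.
with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit --master k8s://https://k8s-api:443 /jobs/ingest.py",
    )
```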
s
Thank you @gorgeous-dinner-4055! Question about the ingestion and projection tasks: I see you're using Glue as the table registry. Was Glue also considered for those tasks?
g
We didn't consider Glue as a way to process data in our architecture, because we already had K8s for our service orchestration, so Spark on K8s was the natural next step for our data computation. Glue for us simply acts as the registry for connecting to tools like Athena. Spark + K8s has let us write Scala & PySpark jobs, update our Spark version independently of our compute store, and vary the compute instance types. We use Karpenter and can launch pretty massive jobs (hundreds of cores per job & TB+ of memory) and vary instance types (e.g. Graviton + m-series, network-optimized vs memory-optimized) for cost benefits. I think Glue has limitations around instance types & sizes, IIRC.
Spark + K8s is non-trivial to implement, though, so that may not be the right solution for your team.
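To give a sense of the knobs involved, here's a minimal, hypothetical PySpark-on-K8s sketch; the API server URL, image, sizes, and S3 path are placeholders, and the node selector is just one way to steer pods onto Graviton nodes:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch; in practice most of this is passed via spark-submit.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")   # assumed API server URL
    .appName("big-batch-job")
    .config("spark.kubernetes.container.image", "myrepo/spark-py:3.5")  # placeholder image
    .config("spark.executor.instances", "200")   # hundreds of cores per job
    .config("spark.executor.memory", "16g")      # TB+ of memory across executors
    # Steer pods onto Karpenter-provisioned Graviton (arm64) nodes for cost
    .config("spark.kubernetes.node.selector.kubernetes.io/arch", "arm64")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path
df.groupBy("event_type").count().show()
spark.stop()
```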
s
Yeah, we write most of our jobs in Java today, which has its drawbacks, when really many of the tasks could be done with your garden-variety visual ETL tool, since we're not doing complex transformations or aggregations.
90% is field mapping, with the rest being simple value translation.
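To make that concrete, in PySpark terms (our actual jobs are Java; column names invented), a typical job is little more than:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("field-mapping").getOrCreate()

src = spark.read.parquet("s3a://raw/claims/")  # hypothetical source path

out = src.select(
    F.col("clm_id").alias("claim_id"),    # the 90%: rename/select fields
    F.col("mbr_id").alias("member_id"),
    F.when(F.col("sts_cd") == "A", "APPROVED")   # the rest: value translation
     .when(F.col("sts_cd") == "D", "DENIED")
     .otherwise("UNKNOWN")
     .alias("status"),
)

out.write.mode("overwrite").parquet("s3a://curated/claims/")  # hypothetical target
```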
g
Interesting. If it's simple jobs, the above (Spark + K8s + Airflow + DataHub) definitely feels like overkill. You could always create a simple DSL for your users and, in the backend, swap out the different tools in the future; see the sketch below. Plus, if you ever needed to, you could autogen code from it to migrate to new platforms as your org scales. A managed service like Ascend (or equivalent) might be the simplest way to make progress, unless there's appetite to do a lot more with the platform.
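E.g., a rough sketch of what such a DSL could look like: a declarative spec plus a tiny PySpark backend (all names here are hypothetical):

```python
from pyspark.sql import DataFrame, functions as F

# Hypothetical user-facing spec; could just as easily live in YAML/JSON.
MAPPING = {
    "fields": {          # target column -> source column
        "claim_id": "clm_id",
        "member_id": "mbr_id",
    },
}

def compile_mapping(df: DataFrame, spec: dict) -> DataFrame:
    """Compile the declarative spec into a PySpark select.

    The spec stays stable while this backend can later be swapped for
    Glue, a visual ETL tool, or autogenerated code on another platform.
    """
    cols = [F.col(src).alias(tgt) for tgt, src in spec["fields"].items()]
    return df.select(*cols)
```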
Airflow is a bit finicky in our experience; there are always more things to manage than you'd want with that project.