# ingestion
r
Hello. I would like to ask for your opinions. I have ingested tables from Glue for the first time, and after that initial ingestion I want to ingest new Glue tables into DataHub in real time. I was thinking about using an Airflow DAG to re-ingest all Glue tables once a day, but I would prefer a way to ingest only the new tables immediately after they are created in Glue. Currently, I am considering an AWS Lambda function that retrieves the new tables, passes them into a recipe as a variable, and then calls "datahub ingest -c". What do you think is the best approach? I would also like to hear from anyone who has experience with this. The main question is how best to add new tables after the initial ingestion. Thanks in advance!
a
Hmm, you could also use the GraphQL API to start a new ingestion from Lambda. Is the CLI not working well for this?
How do you plan to get the notification from Glue that a new table has arrived?
Does Glue publish some sort of EventBridge or SNS notification out of the box?
CC: @delightful-ram-75848
We’d be really excited to accept a contribution here if you figure this out!
r
@astonishing-answer-96712 I am considering using EventBridge to get notifications from Glue. I haven't tried Lambda yet because of another task 😂. I have another question, too: can I use the DataHub CLI (instead of GraphQL) from Lambda? I prefer the DataHub CLI over GraphQL. I deploy DataHub on EKS.
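For the EventBridge idea, here is a minimal sketch of a Lambda handler that pulls the newly created tables out of a Glue event. It assumes the event shape that, as far as I know, Glue Data Catalog sends to EventBridge when a table is created: `source` set to `aws.glue`, and a `detail` object with `databaseName`, `typeOfChange` set to `CreateTable`, and a `changedTables` list. The helper name `new_tables_from_event` is made up for illustration, so please check the field names against a real event from your account before relying on this.

```python
from typing import Any, Dict, List, Tuple


def new_tables_from_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (database, table) pairs for tables created by this event.

    Assumes the Glue Data Catalog event layout described above; any
    other event yields an empty list.
    """
    if event.get("source") != "aws.glue":
        return []
    detail = event.get("detail", {})
    if detail.get("typeOfChange") != "CreateTable":
        return []
    database = detail.get("databaseName")
    if not database:
        return []
    return [(database, table) for table in detail.get("changedTables", [])]


def lambda_handler(event, context):
    """Lambda entry point: log the new tables and return their names."""
    tables = new_tables_from_event(event)
    for database, table in tables:
        # This is where a per-table DataHub ingestion would be triggered.
        print(f"New Glue table: {database}.{table}")
    return {"new_tables": [f"{db}.{t}" for db, t in tables]}
```

An EventBridge rule that matches `aws.glue` events and targets this Lambda would then invoke the handler once per event, so only the tables named in the event need to be ingested.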
a
Of course. Pro tip: you can execute DataHub CLI commands via Python, using datahub.cli. Example docs: https://datahubproject.io/docs/api/tutorials/datasets#delete-dataset
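As a rough sketch of running an ingestion from Python for a single new table: the example below does not go through `datahub.cli`. It swaps in the programmatic `Pipeline` class from `datahub.ingestion.run.pipeline`, which accepts the same recipe structure as `datahub ingest -c`. The function names, the region, and the GMS URL are placeholders for illustration. One assumption to verify: depending on the DataHub version, the Glue source's `table_pattern` may be matched against either the bare table name or `database.table`, so the regex here accepts both.

```python
import re
from typing import Any, Dict


def build_glue_recipe(
    database: str, table: str, aws_region: str, gms_url: str
) -> Dict[str, Any]:
    """Build a recipe dict that restricts Glue ingestion to one table."""
    db = re.escape(database)
    tbl = re.escape(table)
    return {
        "source": {
            "type": "glue",
            "config": {
                "aws_region": aws_region,
                "database_pattern": {"allow": [f"^{db}$"]},
                # Assumption: accept both "table" and "database.table".
                "table_pattern": {"allow": [f"^({db}\\.)?{tbl}$"]},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": gms_url}},
    }


def ingest_table(database: str, table: str, aws_region: str, gms_url: str) -> None:
    """Run the recipe in-process; needs acryl-datahub[glue] installed."""
    # Imported lazily so the recipe builder works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(build_glue_recipe(database, table, aws_region, gms_url))
    pipeline.run()
    pipeline.raise_from_status()  # raise if the run reported failures
```

Inside Lambda this would need the `acryl-datahub[glue]` package in the deployment bundle or a layer, plus network access from the function to the GMS endpoint on EKS.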