# ingestion
Hi all 👋, one of my customers has just started to import datasets into their brand new lakehouse. They'd like to prioritise their ingestion roadmap by request/popularity. The current plan is to connect DataHub up to SwaggerHub and ingest "tables" from the API GET methods, rather than ingesting the operational database schemas. Long-term we plan to stream operational data in via Kafka rather than cloning the DBs each day, relying on the designed API/topic schemas rather than the ORM-controlled ones, so this feels like the best place to start. The aim is to give the data community visibility of the available data without having to ingest everything first. Has anyone else done something similar? I spotted a few Swagger-related comments, but it wasn't clear they were quite the same thing.
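To make the idea concrete, here's a rough sketch of the extraction step: fetch the OpenAPI/Swagger definition and treat each GET endpoint as a candidate "table"/dataset. The spec URL is hypothetical, and emitting the results into DataHub is left out; this is just a sketch, not how DataHub's ingestion framework does it.

```python
# Minimal sketch: pull an OpenAPI/Swagger definition and list the GET
# endpoints as candidate "tables". Follows the standard OpenAPI layout
# (paths -> method -> summary/description); nothing DataHub-specific yet.
import requests

SPEC_URL = "https://example.com/api-docs/openapi.json"  # hypothetical spec location


def list_get_datasets(spec_url: str) -> list[dict]:
    spec = requests.get(spec_url, timeout=30).json()
    candidates = []
    for path, methods in spec.get("paths", {}).items():
        get_op = methods.get("get")
        if not get_op:
            continue
        candidates.append(
            {
                # Treat each GET endpoint as a dataset-like entity.
                "name": get_op.get("operationId") or path.strip("/").replace("/", "."),
                "description": get_op.get("summary") or get_op.get("description", ""),
                "path": path,
            }
        )
    return candidates


if __name__ == "__main__":
    for ds in list_get_datasets(SPEC_URL):
        print(f"{ds['name']}: {ds['description']} ({ds['path']})")
```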
This is a great idea @mammoth-sugar-1353. Happy to brainstorm on this
How do you want to collate ideas?
Right now I'm not sure we need to go as far as adding an API entity, as we'll be treating them as feature-equivalent to datasets.
Regarding push vs pull: either would be possible. Some of their APIs are connected to the hub (auto-generated docs via CI/CD), others are manually updated (😱). I guess it could make sense to attach a webhook to the SwaggerHub publish event and push the MCE, but then we'd have to provide an endpoint/app. SwaggerHub also has an API for reading the API docs, which includes the entity descriptions, so a pull version may be simpler.
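For the pull side, something like the sketch below could run on a schedule instead of us standing up a webhook endpoint. The registry URL pattern, the APIs.json-style response shape, and the auth header are assumptions to verify against the SwaggerHub Registry API docs.

```python
# Pull-based sketch: list an owner's published APIs in the SwaggerHub
# registry, then fetch each raw definition. Endpoint paths and response
# fields are assumptions, not confirmed SwaggerHub behaviour.
import requests

REGISTRY = "https://api.swaggerhub.com"
OWNER = "my-org"   # hypothetical SwaggerHub owner/organisation
API_KEY = "..."    # SwaggerHub API key, if the registry is private


def fetch_definitions(owner: str) -> list[dict]:
    headers = {"Authorization": API_KEY}
    # Assumed: GET /apis/{owner} returns an APIs.json-style listing.
    listing = requests.get(f"{REGISTRY}/apis/{owner}", headers=headers, timeout=30).json()
    specs = []
    for api in listing.get("apis", []):
        # Assumed: each entry carries property links, one pointing at the raw definition.
        props = {p.get("type"): p.get("url") for p in api.get("properties", [])}
        spec_url = props.get("Swagger")
        if spec_url:
            specs.append(requests.get(spec_url, headers=headers, timeout=30).json())
    return specs
```

Each fetched definition could then go through the GET-endpoint extraction sketched earlier before any MCEs are emitted.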
Right.. you could look at the open PR here (https://github.com/linkedin/datahub/pull/2706) and provide a modified PR that does what you are thinking of.
This would be a pull-based ingestion to begin with.