# ingestion
m
In two threads here https://datahubspace.slack.com/archives/CUMUWQU66/p1629897623476100 and https://datahubspace.slack.com/archives/CUMUWQU66/p1620136722311400 I understand that DataHub supports OpenAPI-based metadata ingestion. Could you please guide me on how this is supported and how the API metadata can be ingested? Also, is OData supported?
s
Hi Thomas, I'm the one who developed it. There is actually a non-merged branch where this feature is available. What's missing is a few formalities, but the feature is fully working. You can give it a try; I'm happy to help if you need. The branch is at https://github.com/CIPEO/datahub/tree/adding_openapi_ingestion and the instructions are at https://github.com/CIPEO/datahub/blob/adding_openapi_ingestion/metadata-ingestion/source_docs/openapi.md
b
@microscopic-musician-99632 Currently we have no integration for OData. What we do expose is a public GraphQL API, for which I will actually be releasing better documentation very soon 😛
m
Thank you @stale-jewelry-2440. I am currently working on a POC with the quickstart Docker images. I am assuming that to test it I would need to git clone your branch and build the images? Could you kindly give me the steps? Also, one question I had about the implementation: if swagger.json is not provided, you depend on the forced-example I believe, in which case aren't you highly dependent on the implementation logic, so your metadata (as a single source of truth) might not be correct?
s
sure, in order:
• after your docker quickstart is up, you just have to clone the branch `adding_openapi_ingestion`, and the new module will be installed with `python3 -m pip install --upgrade datahub/metadata_ingestion`; now you can ingest the yml file with the information on your OpenAPI endpoint (see the sketch below)
• not really: the forced-examples option is optional, and it's only used when the tool does not find suitable information in the swagger file. If the swagger file contains the metadata, then you don't need that option. I'm updating the openapi.md file to make this clearer
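For reference, a minimal sketch of what running that ingestion could look like programmatically, assuming the branch registers a source type named `openapi` and accepts config keys like `name`, `url`, `swagger_file`, and `forced_examples` (check the branch's openapi.md for the authoritative options; the values below are placeholders):
```python
# Sketch: run the OpenAPI source via DataHub's Python pipeline API.
# The source type name and config keys are assumptions based on the
# branch's openapi.md -- verify against that file.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "openapi",
            "config": {
                "name": "my_endpoint",              # hypothetical platform name
                "url": "https://api.example.com/",  # base URL of the service
                "swagger_file": "swagger.json",     # path to the spec, relative to url
                # "forced_examples": {...},         # only needed when the spec
                #                                   # lacks usable examples
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```
The same config can equally be written as a yml recipe file and run with the `datahub ingest` CLI; the dict above mirrors that recipe structure.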
b
Guys, curious to understand: why is OpenAPI especially important to you? I'm trying to understand if this is something DataHub should support natively at some point
s
here we have two use cases:
• there are teams which need access to data, and our architect decided to provide it via API endpoints instead of giving access to the whole data warehouse
• there are external, third-party services which do the same. For example JAMF, a tool we use to manage the deployed machines. Here too, you don't have access to the DB, and you can only get data via their API
b
So we are planning to evolve our GraphQL API into a general-purpose interface for programmatic interaction with the Metadata Graph. Assuming the capabilities exposed support your needs, would this work for you? Read more here: https://datahubproject.io/docs/api/graphql/overview
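As a rough illustration, a read against that GraphQL API might look like the sketch below. The endpoint path and queried fields are assumptions that may vary by DataHub version; the dataset URN is purely hypothetical (see the linked docs for the real schema):
```python
# Sketch: query DataHub's GraphQL API for a dataset by URN.
# Assumes a quickstart GMS at localhost:8080; endpoint path and
# schema fields may differ by version.
import requests

query = """
{
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,my_table,PROD)") {
    urn
  }
}
"""

resp = requests.post(
    "http://localhost:8080/api/graphql",  # assumed quickstart endpoint
    json={"query": query},
)
resp.raise_for_status()
print(resp.json())
```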
s
Not sure I understood your point. I (we?) need to perform the standard operation of metadata ingestion, so the easiest way is to use the machinery already in place. Or maybe I missed something?
b
Today that's the case. Three months from now, you will likely be able to create metadata through the GraphQL API as well. Are you intending to ingest large batches of metadata from one of the preexisting sources?
Or do you want to ingest custom metadata on a one-off basis?
s
I don't expect those services to change their metadata very often; let's say once a month or so. And each service's metadata is relatively small, between 20 and 100 datasets