# troubleshoot
breezy-portugal-43538
Hello, I hope everyone is having a great day! I wanted to ask - do you have some code examples on GitHub showing how to use your SDK for ingesting data? Currently I am ingesting via the curl command, and it looks like this can be achieved in a much simpler way with the SDK. I found some code in your repository for integration tests, e.g.: https://github.com/datahub-project/datahub/blob/85a55ffac7b4cfa4594bb93cc960656886bbc440/metadata-ingestion/tests/integration/kafka/test_kafka.py That example uses the mce_helpers, and I am looking for an example that is not part of a test framework - do you have that somewhere? Thanks a lot for a reply in advance! :)
hundreds-photographer-13496
Hi @breezy-portugal-43538, you are right, it is simpler to ingest using the SDK. Please refer to the Python Emitter docs on the DataHub docs site; they contain an example of how to ingest dataset properties using the Python SDK. You'll find more examples here in the DataHub GitHub repo.
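A minimal sketch of what that looks like with the REST emitter, assuming a local DataHub GMS at http://localhost:8080; the dataset name and properties below are placeholders, not anything from this thread:

```python
# Minimal sketch: emit dataset properties via the DataHub Python REST emitter.
# Assumes a GMS reachable at http://localhost:8080; the dataset name and
# properties are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")

properties = DatasetPropertiesClass(
    description="Example dataset ingested via the Python SDK",
    customProperties={"owner_team": "data-platform"},
)

# Wrap the aspect in a MetadataChangeProposal and send it to DataHub.
mcp = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
emitter.emit(mcp)
```

Running this against a live GMS upserts the datasetProperties aspect for that urn, which is the same effect as posting the equivalent payload with curl.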
breezy-portugal-43538
Hi @hundreds-photographer-13496, thanks! From the DataHub perspective, is it better to ingest via curl or via the SDK? And do you have an example somewhere of how to use the emitter with S3 as a source?
hundreds-photographer-13496
> From the DataHub perspective, is it better to ingest via curl or via the SDK?

There isn't much difference from DataHub's point of view between calling the REST API with curl and using the Python REST emitter SDK. It's mainly a matter of the sender's convenience and familiarity.
> And do you have an example somewhere of how to use the emitter with S3 as a source?

I'm not entirely sure I understand the question, but let me share some information that may help. First, please have a look at the S3 data lake source, if that is what you are looking for. DataHub already has integrations with sources such as S3, and their metadata can be ingested using a metadata ingestion recipe as described here. However, if any customizations are needed, you can write your own transformer to supplement the existing ingestion source.
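For the recipe route, a rough sketch of running the S3 source programmatically with the Python library is below; the config keys shown (path_specs, aws_config) follow my reading of the S3 source docs and may need adjusting for your version, and the bucket path and server address are placeholders:

```python
# Sketch: run the built-in S3 data lake source programmatically and send its
# output to a DataHub REST sink. Bucket path, region, and GMS address are
# placeholders; check the S3 source docs for the full set of config options.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/*.csv"}],
                "aws_config": {"aws_region": "us-east-1"},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

The same config can also be written as a YAML recipe and run with `datahub ingest -c recipe.yaml`; the programmatic form is just convenient when you want to trigger ingestion from your own Python code.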
breezy-portugal-43538
Thanks @hundreds-photographer-13496! I was actually looking for an example like the Kafka one on your webpage, but with S3 as the source instead of Kafka: https://datahubproject.io/docs/metadata-ingestion/as-a-library#kafka-emitter

One more question regarding urns and file ingestion. We are now trying to ingest some CSV files from the filesystem (not a DB, S3, or anything database related, just a plain PC), so we are parsing the CSV columns to build a JSON that can be ingested into DataHub using the file source: https://datahubproject.io/docs/generated/ingestion/sources/file In our case we need to define the urn type; in many examples the urns look like urn:li:dataPlatform:s3, urn:li:dataPlatform:hdfs, urn:li:dataPlatform:mongoDB. Is it safe, from the DataHub point of view, to define my own dataPlatform, for example urn:li:dataPlatform:MySuperGibberishPlatform? Additionally, is there an easier way to ingest the CSV from the filesystem instead of manually parsing it and constructing the JSON?
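For reference, one way the CSV case could look with the Python emitter instead of hand-written JSON is sketched below; the custom platform id myfs, the file path, and the all-string field types are assumptions for illustration, not anything DataHub prescribes:

```python
# Sketch only: build a SchemaMetadata aspect from a CSV header and emit it
# under a custom platform urn. The platform id "myfs" and the file path are
# made up for illustration; all columns are typed as strings for simplicity.
import csv

from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

CSV_PATH = "/data/my_table.csv"  # placeholder path on the local filesystem

# Read only the header row to get the column names.
with open(CSV_PATH, newline="") as f:
    header = next(csv.reader(f))

fields = [
    SchemaFieldClass(
        fieldPath=column,
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType="string",
    )
    for column in header
]

schema = SchemaMetadataClass(
    schemaName="my_table",
    platform=make_data_platform_urn("myfs"),  # custom platform id (assumption)
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    fields=fields,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="myfs", name="my_table", env="PROD"),
        aspect=schema,
    )
)
```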