How can I add a "hello world" dataset to a custom data platform?
# getting-started
c
How can I add a "hello world" dataset to a custom data platform? I've managed to add a custom data platform, but I can't figure out what to do next. I've tried the CLI approach with file ingestion, but that fails extremely cryptically and I can't find a simple example anywhere. I've spent a lot of time searching through Slack here, as well as GitHub issues and code. Help?
I then attempted to use some examples from here: https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/examples/mce_files and tried this custom-dataset.json file:
```json
[
  {
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)",
    "changeType": "UPSERT",
    "systemMetadata": null
  }
]
```
Using this ingestion recipe:
```yaml
source:
  type: file
  config:
    # Source-type specifics config
    filename: ./custom-dataset.json
```
Getting this result:
```
Source (file) report:
{'events_produced': 0,
 'events_produced_per_sec': 0,
 'entities': {},
 'aspects': {},
 'warnings': {},
 'failures': {'path-0': ['com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket']},
 'total_num_files': 1,
 'num_files_completed': 1,
 'files_completed': ['custom-dataset.json'],
 'percentage_completion': '0%',
 'estimated_time_to_completion_in_minutes': -1,
 'total_bytes_read_completed_files': 209,
 'current_file_size': 209,
 'total_parse_time_in_seconds': 0.0,
 'total_count_time_in_seconds': 0.0,
 'total_deserialize_time_in_seconds': 0,
 'aspect_counts': {},
 'entity_type_counts': {},
 'start_time': '2023-06-13 14:37:07.777117 (now)',
 'running_time': '0.77 seconds'}
Pipeline finished with at least 1 failures; produced 0 events in 0.77 seconds.
```
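For reference, that entry only has an entityUrn and a changeType but no aspect payload, which is probably why the file source can't match it against any of the record shapes it knows (the UsageAggregation message is just the last format it tried). A minimal entry modeled on bootstrap_mce.json in the linked examples directory looks roughly like this - the description is an illustrative placeholder, not something from this thread:
```json
[
  {
    "auditHeader": null,
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)",
        "aspects": [
          {
            "com.linkedin.pegasus2avro.dataset.DatasetProperties": {
              "description": "Hello world dataset",
              "customProperties": {}
            }
          }
        ]
      }
    },
    "proposedDelta": null
  }
]
```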
b
logoUrl is necessary... without a logoUrl it won't show up in the UI. Just set some logoUrl and it will work fine.
c
I actually ran this:
```
datahub put platform --name Zendesk --display_name "Zendesk" --logo "https://assets.website-files.com/5a0242c3d47dd70001e5b2e9/5a054c7012148e00015864fc_zmark%401x.svg"
```
It succeeded, but it doesn't show up in the UI anywhere. I thought it might be because it doesn't have any datasets associated with it.
Wait, do you mean for the dataset or the platform?
b
I meant for the data platform only. After that you need to insert some data, then it will come up in the UI.
c
Yup. So that's where I am - unable to figure out how to add a dataset
c
Thanks, will try that out
Ok, that works.
m
Nice! That dataset is looking slick
c
Just for background - we're going to use DataHub as a data asset management tool in addition to its other functionality: register all our (the whole company's) tools that contain any data, create some lineage, assign owners, etc. Step 2 would be to tag/categorize/assign domains to these tools according to which type of customer data they contain, so it will also function as a GDPR helper.
With the Python approach we'll be able to put all of this into a file (or multiple files) in a repository and have CI/CD push it to DataHub via the CLI every time it's updated.
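For reference, a minimal sketch of that Python approach using the acryl-datahub SDK's REST emitter - the server URL, dataset name, and description below are placeholders, not values confirmed in this thread:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Placeholder GMS endpoint - point this at your own DataHub instance
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Dataset on the custom platform created earlier with `datahub put platform`
dataset_urn = make_dataset_urn(platform="MyCustomDataPlatform", name="test", env="PROD")

# Emit a datasetProperties aspect; this is enough for the dataset to appear in the UI
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(description="Hello world dataset"),
    )
)
```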
m
that sounds very doable
I guess having a yaml based spec for a dataset (similar to what we've done for Data Product and Glossary) would make it even easier for you?
c
Sure - less code, more content. As long as the custom platforms and datasets can both be created with the same approach. And lineage as well, but I believe that's already available. Then again, we'll probably stick with Python anyway, as we'll be creating lineage between MSSQL and Databricks (Unity Catalog) tables manually - we have a custom naming scheme:
```
server.database.schema.table -> raw_server.database_schema.table
```
It's based on our own config files for our own data loader, which is written in Python. The idea is to run the lineage part every time a table is loaded. As long as we're doing it in Python in one place, it also makes sense to keep it in Python elsewhere. Anyway, this has gone far enough off topic now. Thank you for your support 🙂
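For what it's worth, a sketch of what that manual lineage step could look like with the same SDK, assuming the REST emitter and the upstreamLineage aspect - the platform names, table names, and lineage type below are illustrative stand-ins for the custom naming scheme:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder URL

# Source table in MSSQL and its counterpart in Databricks, following the
# server.database.schema.table -> raw_server.database_schema.table convention
upstream_urn = make_dataset_urn("mssql", "server.database.schema.table", "PROD")
downstream_urn = make_dataset_urn("databricks", "raw_server.database_schema.table", "PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.COPY)]
)

# Lineage is stored as an aspect of the downstream dataset
emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage))
```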
I have a follow-up question 🙂 How are containers created? I understand how to create a custom data platform and a custom dataset within that platform. But ingested sources have containers as well, like a database schema or some such. How can I do that with custom datasets?
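For anyone who finds this later, one way to do it with the same Python emitter is to create a container entity by hand and attach the dataset to it via the container aspect. The container id, names, and subtype below are made up, and the built-in sources derive the container urn from a platform/database/schema key (I believe via helpers in datahub.emitter.mcp_builder), so treat this as a rough sketch rather than the canonical recipe:
```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ContainerClass,
    ContainerPropertiesClass,
    SubTypesClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder URL

# Hand-rolled container id - any stable string works for a custom platform
container_urn = "urn:li:container:mycustomdataplatform_my_schema"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)"

# 1. Create the container itself
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=ContainerPropertiesClass(name="my_schema", description="Hello world schema"),
    )
)

# 2. Give it a subtype so the UI labels it (e.g. "Schema" or "Database")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=SubTypesClass(typeNames=["Schema"]),
    )
)

# 3. Attach the dataset to the container
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=ContainerClass(container=container_urn),
    )
)
```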