How can I add a "hello world" dataset to a custom data platform?
# getting-started
c
How can I add a "hello world" dataset to a custom data platform? I've managed to add a custom data platform, but I can't figure out what to do next. I've tried the CLI approach with file ingestion, but that fails extremely cryptically and I can't find a simple example anywhere. I've spent a lot of time searching through Slack here, as well as GitHub issues and code. Help?
I then attempted to use some examples from here: https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/examples/mce_files and tried this custom-dataset.json file:
```json
[
  {
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)",
    "changeType": "UPSERT",
    "systemMetadata": null
  }
]
```
Using this ingestion recipe:
```yaml
source:
  type: file
  config:
    # Source-type specifics config
    filename: ./custom-dataset.json
```
Getting this result:
```
Source (file) report:
{'events_produced': 0,
 'events_produced_per_sec': 0,
 'entities': {},
 'aspects': {},
 'warnings': {},
 'failures': {'path-0': ['com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket']},
 'total_num_files': 1,
 'num_files_completed': 1,
 'files_completed': ['custom-dataset.json'],
 'percentage_completion': '0%',
 'estimated_time_to_completion_in_minutes': -1,
 'total_bytes_read_completed_files': 209,
 'current_file_size': 209,
 'total_parse_time_in_seconds': 0.0,
 'total_count_time_in_seconds': 0.0,
 'total_deserialize_time_in_seconds': 0,
 'aspect_counts': {},
 'entity_type_counts': {},
 'start_time': '2023-06-13 14:37:07.777117 (now)',
 'running_time': '0.77 seconds'}
Pipeline finished with at least 1 failures; produced 0 events in 0.77 seconds.
```
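For reference, that entry only has an entityUrn and a changeType but no aspect payload, which is probably why the file source can't match it against any of the record shapes it knows (the UsageAggregation message is just the last format it tried). A minimal entry modeled on bootstrap_mce.json in the linked examples directory looks roughly like this - the description is an illustrative placeholder, not something from this thread:
```json
[
  {
    "auditHeader": null,
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)",
        "aspects": [
          {
            "com.linkedin.pegasus2avro.dataset.DatasetProperties": {
              "description": "Hello world dataset",
              "customProperties": {}
            }
          }
        ]
      }
    },
    "proposedDelta": null
  }
]
```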
b
logoUrl is necessary... without a logoUrl it won't show up in the UI. Just set some logoUrl and it will work fine.
c
I actually ran this:
```
datahub put platform --name Zendesk --display_name "Zendesk" --logo "https://assets.website-files.com/5a0242c3d47dd70001e5b2e9/5a054c7012148e00015864fc_zmark%401x.svg"
```
It succeeded, but it doesn't show up in the UI anywhere. I thought it might be because it doesn't have any datasets associated with it.
Wait, do you mean for the dataset or the platform?
b
I meant for the data platform only. After that you need to insert some data, then it will come up in the UI.
c
Yup. So that's where I am - unable to figure out how to add a dataset
c
Thanks, will try that out
Ok, that works.
m
Nice! That dataset is looking slick
c
Just for background - we're going to use DataHub as a data asset management tool in addition to its other functionality: register all our (the whole company's) tools that contain any data, create some lineage, assign owners, etc. Step 2 would be to tag/categorize/assign domains to these tools according to which type of customer data they contain, so it will also function as a GDPR helper.
With the Python approach we'll be able to put all of this into a file (or multiple files) in a repository and have CI/CD push it to DataHub via the CLI every time it's updated.
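For reference, a minimal sketch of that Python approach using the acryl-datahub SDK's REST emitter - the server URL, dataset name, and description below are placeholders, not values confirmed in this thread:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Placeholder GMS endpoint - point this at your own DataHub instance
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Dataset on the custom platform created earlier with `datahub put platform`
dataset_urn = make_dataset_urn(platform="MyCustomDataPlatform", name="test", env="PROD")

# Emit a datasetProperties aspect; this is enough for the dataset to appear in the UI
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(description="Hello world dataset"),
    )
)
```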
m
that sounds very doable
I guess having a yaml based spec for a dataset (similar to what we've done for Data Product and Glossary) would make it even easier for you?
c
Sure - less code, more content. As long as the custom platforms and datasets can both be created with the same approach. And lineage as well, but I believe that's already available. Then again, we'll probably stick with Python anyway, as we'll be creating lineage between MSSQL and Databricks (Unity Catalog) tables manually - we have a custom naming scheme:
```
server.database.schema.table -> raw_server.database_schema.table
```
It's based on our own config files for our own data loader, which is written in Python. The idea is to run the lineage part every time a table is loaded. As long as we're doing it in Python in one place, it also makes sense to keep it in Python elsewhere. Anyway, this has gone far enough off topic now. Thank you for your support 🙂
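For what it's worth, a sketch of what that manual lineage step could look like with the same SDK, assuming the REST emitter and the upstreamLineage aspect - the platform names, table names, and lineage type below are illustrative stand-ins for the custom naming scheme:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder URL

# Source table in MSSQL and its counterpart in Databricks, following the
# server.database.schema.table -> raw_server.database_schema.table convention
upstream_urn = make_dataset_urn("mssql", "server.database.schema.table", "PROD")
downstream_urn = make_dataset_urn("databricks", "raw_server.database_schema.table", "PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.COPY)]
)

# Lineage is stored as an aspect of the downstream dataset
emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage))
```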
I have a follow-up question 🙂 How are containers created? I understand how to create a custom data platform and a custom dataset within that platform. But ingested sources have containers as well, like a database schema or some such. How can I do that with custom datasets?
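For anyone who finds this later, one way to do it with the same Python emitter is to create a container entity by hand and attach the dataset to it via the container aspect. The container id, names, and subtype below are made up, and the built-in sources derive the container urn from a platform/database/schema key (I believe via helpers in datahub.emitter.mcp_builder), so treat this as a rough sketch rather than the canonical recipe:
```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ContainerClass,
    ContainerPropertiesClass,
    SubTypesClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder URL

# Hand-rolled container id - any stable string works for a custom platform
container_urn = "urn:li:container:mycustomdataplatform_my_schema"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:MyCustomDataPlatform,test,PROD)"

# 1. Create the container itself
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=ContainerPropertiesClass(name="my_schema", description="Hello world schema"),
    )
)

# 2. Give it a subtype so the UI labels it (e.g. "Schema" or "Database")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=SubTypesClass(typeNames=["Schema"]),
    )
)

# 3. Attach the dataset to the container
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=ContainerClass(container=container_urn),
    )
)
```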