# getting-started
h
Hi, I was interested in using DataHub without connecting directly to the raw data, and instead providing files that contain the metadata itself: for instance, a JSON file listing all the tables in a database, along with the schema for each table. Is there documentation on the recommended way of doing this? I was thinking of using JSON Schema as the data source, but wasn’t sure if that was the best/recommended way.
This is what it looks like right now using the two scripts below.
test_schema.json
test_ingest.py
m
You probably have to use both the JSON Schema source and the csv-enricher if you want tags, terms, owners, etc. Why don't you want DataHub to connect to the source?
h
Oh okay, thanks. Is there a good guide on how this is done? The main reason I don’t want to connect it to the raw data itself is to avoid dealing with permissions-related issues. At some point in the future we might switch to connecting to the data itself, but for now it would be easier to adopt if we could provide only the metadata directly, since that is already being maintained.
a
h
@astonishing-answer-96712 yeah, but it looks like it’s limited in its functionality (it can’t extract tags/ownership)
m
You have to do 2 passes. In the first pass, use the JSON Schema source to create the dataset; in the second, use csv-enricher to populate tags, owners, etc. So your metadata needs to be split into 2 files... At least that is how I see it.
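A rough sketch of what the two passes could look like from Python, in case it helps. This is not an official recipe: the config keys (`path`, `platform`, `filename`, `write_semantics`) and the CSV column names are what I understand the `json-schema` and `csv-enricher` sources to expect, so check them against the docs for your installed version. The function names, the server URL, and the platform value are placeholders I made up for illustration.

```python
import csv

# Column names for the csv-enricher input file (assumed from the source's docs;
# verify against your DataHub version).
ENRICHER_COLUMNS = [
    "resource", "subresource", "glossary_terms", "tags",
    "owners", "ownership_type", "description", "domain",
]


def build_recipes(schema_path, csv_path, platform, server="http://localhost:8080"):
    """Return the two ingestion recipes, in the order they should run."""
    # Assumed sink: a DataHub REST endpoint; the default URL is a placeholder.
    sink = {"type": "datahub-rest", "config": {"server": server}}
    # Pass 1: create the datasets and their schemas from the JSON Schema file(s).
    schema_pass = {
        "source": {
            "type": "json-schema",
            "config": {"path": schema_path, "platform": platform},
        },
        "sink": sink,
    }
    # Pass 2: attach tags, terms, and owners to datasets that already exist.
    enrich_pass = {
        "source": {
            "type": "csv-enricher",
            "config": {"filename": csv_path, "write_semantics": "PATCH"},
        },
        "sink": sink,
    }
    return [schema_pass, enrich_pass]


def write_enricher_csv(csv_path, rows):
    """Write enrichment rows; columns missing from a row are left empty."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=ENRICHER_COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)


def run_recipes(recipes):
    """Run each recipe in order; requires the acryl-datahub package."""
    # Imported lazily so the helpers above work without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    for recipe in recipes:
        pipeline = Pipeline.create(recipe)
        pipeline.run()
        pipeline.raise_from_status()
```

To use it, you would write one CSV row per dataset with `write_enricher_csv`, using the dataset URNs that the first pass actually produced as the `resource` values (copy them from the UI rather than guessing), then call `run_recipes(build_recipes(...))`. Running the schema pass first matters because the enricher only decorates entities; it does not create the schema.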