I'm trying to generate a file to be used in File m...
# getting-started
s
I'm trying to generate a file to be used in File metadata ingestion. For this I used the following yml:
source:
  type: file
  config:
    filename: ./data.json
sink:
  type: file
  config:
    filename: ./output.json
m
@swift-lion-29806, how have you generated the data.json input file?
s
@miniature-tiger-96062 data.json contains raw data from Kaggle here
I was expecting the .yml file to convert this into a format that I can then sink to DataHub.
@miniature-tiger-96062 Or should I have the data in a source like a database and then extract the metadata automatically from there?
m
Hi @swift-lion-29806, the input file should be in a specific format that DataHub can understand. Check out the capabilities section here: https://datahubproject.io/docs/metadata-ingestion/source_docs/file
s
@miniature-tiger-96062 It is a bit unclear whether the input file is the raw data itself or the metadata.
Can you clarify?
m
@swift-lion-29806: the input file is the metadata, not the raw data.
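To make the distinction concrete, here is a minimal Python sketch, assuming the raw data is a list of JSON records. The `summarize` helper and the sample records are hypothetical and not part of DataHub; the point is only that metadata describes the data (field names, inferred types, row count) rather than containing it.

```python
import json


def summarize(name, records):
    """Derive simple descriptive metadata from raw records (hypothetical helper)."""
    fields = {}
    for record in records:
        for key, value in record.items():
            # Remember the first Python type name seen for each field.
            fields.setdefault(key, type(value).__name__)
    return {
        "name": name,
        "num_rows": len(records),
        "num_fields": len(fields),
        "fields": fields,
    }


# Made-up raw records standing in for the contents of data.json.
raw = [
    {"id": 1, "title": "a", "score": 0.5},
    {"id": 2, "title": "b", "score": 0.7},
]

metadata = summarize("kaggle_sample", raw)
print(json.dumps(metadata, indent=2))
```

The summary is still not in the format the file source expects; it only shows what kind of information counts as metadata.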
s
@mammoth-bear-12532 is there any documentation/template for this input file? The examples here seem to differ a lot
m
Hi @swift-lion-29806, what kinds of metadata about the dataset are you trying to capture?
Something like dataset name, schema, column-level samples?
s
@mammoth-bear-12532 Yeah, exactly. Plus data policy, size, number of fields in the data, etc.
m
If you are comfortable with Python, programmatically generating this metadata should be pretty easy; you shouldn't need to write the file by hand by inspecting the checked-in example files. However, not everything you are asking for is modeled in the metadata yet.
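A rough sketch of what that could look like, using only the standard library. It assumes the file source reads a JSON list of MetadataChangeEvent-style objects shaped like the checked-in example files; the aspect and field names below are modeled on those examples and may differ between DataHub versions, so treat this as a starting point and not a verified format. The `make_dataset_mce` helper, the `file` platform name, the dataset name, and the property values are all made up for illustration.

```python
import json
import os
import tempfile


def make_dataset_mce(platform, name, description, custom_properties, env="PROD"):
    """Build one MCE-style dict for a dataset (hypothetical helper, assumed shape)."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
    return {
        "auditHeader": None,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": urn,
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.dataset.DatasetProperties": {
                            "description": description,
                            # Assumption: custom properties are string-to-string.
                            "customProperties": custom_properties,
                        }
                    }
                ],
            }
        },
        "proposedDelta": None,
    }


mce = make_dataset_mce(
    platform="file",  # illustrative platform name
    name="kaggle_sample",  # illustrative dataset name
    description="Raw data downloaded from Kaggle",
    custom_properties={"num_fields": "3", "size_bytes": "1024"},  # made-up values
)

# The file holds a JSON list of events; the output path is illustrative.
out_path = os.path.join(tempfile.gettempdir(), "metadata.json")
with open(out_path, "w") as f:
    json.dump([mce], f, indent=2)
```

The resulting file, not the raw data, is what the `filename` under `source` in the recipe would point at. Things like size and field count are stored here as free-form custom properties because, as noted, not all of them have a dedicated place in the model.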
s
@mammoth-bear-12532 Thank you. I will check it out.