# contribute-code
b
Hi everyone! I am currently scoping the work needed to support ingestion of virtual datasets in the Superset ingestion source. After looking at the source, there are a few things I can tell need to be done:
• I need to create another function similar to `emit_chart_mces` which will grab all of the datasets in Superset and then filter out the non-virtual datasets
• Looker `explore` (which seems to be equivalent to Superset virtual datasets) seems to be my best bet in terms of what to follow when creating support for ingesting virtual datasets
I can also think of some blockers/pain points:
• Physical datasets have a single underlying database id and table name; virtual datasets, however, can have multiple (or none at all). Would the `DatasetSnapshot` class allow me to create a dataset with multiple underlying tables/databases or with no underlying tables/databases?
• I'm not too sure how the lineage will work for virtual datasets. Is it created automatically when you create a `MetadataChangeEvent`? If not, can someone point me to an example of an ingestion source/line which does it?
• What is the minimum set of parameters I need to create an object of the `DatasetSnapshot` class? I tried looking at other sources as examples, but I couldn't find anything useful about the minimum/default set of parameters to send it (and/or is there a list which states all of the possible parameters?)
Are there any assumptions I made which are not correct? What do you think about the blockers/pain points? Is there anything else you think I should know before I start making modifications to that ingestion source?
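A minimal sketch of the "grab all datasets, then keep only the virtual ones" step from the first bullet. It works on plain dictionaries and makes no Superset API calls. The `kind` and `sql` field names are assumptions about what a Superset dataset record contains, and `is_virtual_dataset` and `filter_virtual_datasets` are hypothetical helper names, not existing functions in the ingestion source.

```python
from typing import Any, Dict, Iterable, List


def is_virtual_dataset(dataset: Dict[str, Any]) -> bool:
    """Return True when a dataset record looks like a virtual (SQL-defined) dataset."""
    # Assumption: the record carries a `kind` field set to "physical" or "virtual".
    kind = dataset.get("kind")
    if kind is not None:
        return kind == "virtual"
    # Fallback assumption: a non-empty `sql` field marks a virtual dataset.
    return bool((dataset.get("sql") or "").strip())


def filter_virtual_datasets(datasets: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop physical datasets and keep only the virtual ones."""
    return [d for d in datasets if is_virtual_dataset(d)]


# Made-up records standing in for whatever the dataset listing returns.
sample = [
    {"id": 1, "table_name": "orders", "kind": "physical", "sql": None},
    {"id": 2, "table_name": "orders_by_day", "kind": "virtual", "sql": "SELECT ..."},
    {"id": 3, "table_name": "legacy_view", "sql": "SELECT 1"},
]
virtual = filter_virtual_datasets(sample)
```

Keeping the predicate separate from the fetching code makes it easy to adjust once the real shape of the dataset records is confirmed.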
This is related to a previous post that can be found here: https://datahubspace.slack.com/archives/CUMUWQU66/p1642001412121100
a
@orange-night-91387 can provide some context here!
o
Hey Dustin! We appreciate you looking into contributing to our ingestion sources!
• What is the minimum set of parameters I need to create an object of the `DatasetSnapshot` class. I tried looking at other sources as examples but I couldn't find anything useful in regards to what the minimum/default set of parameters to send it is (and/or is there a list which states all of the possible parameters?)
We've generally moved away from the Snapshot-based approach towards MCPs (MetadataChangeProposals), which are aspect-oriented. Each aspect has its own set of required properties. To create these, you can use the MetadataChangeProposalWrapper class, like this example in the Redshift connector.
• Physical datasets have a single underlying database id and table name, however, virtual datasets can have multiple (or none at all). Would the `DatasetSnapshot` class allow me to create a dataset with multiple underlying tables/databases or with no underlying tables/databases?
There are a few different choices for modeling here. From looking at what a virtual dataset is in Superset, it seems like it would be modeled as a dataset with lineage to the underlying physical datasets, but containers might also make sense if virtual datasets are designed more as a logical collection of data assets within Superset. This might warrant further discussion with our ingestion team, though.
• I'm not too sure how the lineage will work for virtual datasets, is it created automatically when you create a `MetadataChangeEvent`? If not, can someone point me to an example of an ingestion source/line which does it?
Lineage is not created automatically; it is created through the UpstreamLineage aspect. Examples of how this gets constructed are available in the ingestion code.
b
Hi @orange-night-91387! Thank you very much for the reply. In terms of which model to choose, would you be able to tag someone from the ingestion team in this thread so that they can provide their opinion on it? The only thing I want to make sure of is that it uses a model that can support 0+ (e.g., 0, 1, 2, 3, etc.) underlying physical datasets. Also, you mentioned that you've generally moved away from the snapshot based approach. So would it be better to refactor the Superset ingestion source to not use snapshots before the implementation of the virtual dataset feature begins? Or can we leave it as is, and implement this new feature using the snapshot method?
g
Some other thoughts:
1. For virtual datasets, they’re actually native to superset and hence their platform should be set to superset. They should have an UpstreamLineage relationship to the actual physical datasets that they rely on. That way, they can have an arbitrary number of underlying physical datasets
2. To make sure that these virtual datasets are marked properly in the UI, you should emit a SubTypeClass aspect with subtypes = [“Virtual Dataset”, “View”] (we have a class with constants somewhere, and you can add virtual dataset to that list)
3. Ideally we’d also emit a ViewProperties aspect with the SQL definition of the virtual dataset
4. I’d prefer that any new code you write uses MCPWs instead of snapshots, but it’s definitely not required that you refactor the existing code to move from snapshots -> MCPs. It’s fine to mix the two
5. All of our classes are documented here https://datahubproject.io/docs/python-sdk/models and here https://datahubproject.io/docs/python-sdk/builder, and the refs that Ryan put in are great too
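To make the modeling in points 1 to 3 concrete, here is a rough plain-Python sketch of what would be emitted for one virtual dataset. It deliberately does not use the DataHub SDK: `dataset_urn`, `VirtualDatasetPlan`, and `plan_virtual_dataset` are hypothetical stand-ins, and the table names are invented. In real ingestion code each field would presumably become its own aspect (UpstreamLineage, subtypes, ViewProperties) wrapped in a MetadataChangeProposalWrapper. How the upstream physical tables are discovered from the SQL is assumed to be solved elsewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in DataHub's urn:li:dataset format."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class VirtualDatasetPlan:
    """Plain-Python stand-in for the aspects to emit; not a DataHub SDK class."""

    urn: str
    # Would map to the UpstreamLineage aspect; may hold zero or more entries.
    upstream_urns: List[str] = field(default_factory=list)
    # Would map to the subtypes aspect.
    subtypes: List[str] = field(default_factory=lambda: ["Virtual Dataset", "View"])
    # Would map to the ViewProperties aspect (SQL definition).
    view_sql: Optional[str] = None


def plan_virtual_dataset(
    name: str, sql: str, upstream_tables: List[Tuple[str, str]]
) -> VirtualDatasetPlan:
    """Describe a Superset-native dataset with lineage to its physical tables."""
    return VirtualDatasetPlan(
        urn=dataset_urn("superset", name),  # platform is superset, per point 1
        upstream_urns=[dataset_urn(platform, table) for platform, table in upstream_tables],
        view_sql=sql,
    )


# Zero upstreams and several upstreams are both handled by the same shape.
no_upstreams = plan_virtual_dataset("constants", "SELECT 1 AS one", [])
two_upstreams = plan_virtual_dataset(
    "orders_enriched",
    "SELECT ... FROM orders JOIN customers ...",
    [("postgres", "shop.public.orders"), ("postgres", "shop.public.customers")],
)
```

Because the upstreams are just a list, the 0+ underlying physical datasets requirement raised earlier in the thread falls out of the lineage-based model without special cases.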
b
Thank you @gray-shoe-75895 & @orange-night-91387 for your replies! If I have any questions when it comes to implementing the feature I will be sure to post them here 🙂