1. I can add dataplatforms + dataprocess instances...
# getting-started
l
2. I can add dataplatforms + dataprocess instances, but they are NOT being displayed (in the catalogue/UI; I can see them in the db). Also, when I associate 2 datasets with a dataprocess instance (as input + output), these appear to have lineage (i.e. the tab is enabled), but if you go to see what is there... it is empty. Is that to be expected?
b
hey Harry! gotcha, so you're seeing a dataPlatformInstance aspect set on the entities you expect in the db? would you mind sharing what that aspect looks like for me?
and that's interesting on the dataprocess instance part - I can look into that real quick
l
hi there
I did not understand your question though. What the aspect looks like - you mean in the db?
or my code?
actually I can send you a pic of both
but give me 20' as I am in panic mode now
b
both would be best! i meant in the db but both would be good
yeah take your time
l
thanks much!
b
also regarding the second piece - I don't think we actually show dataProcessInstances in lineage but instead show the dataProcess entities. However it's weird that we would allow you to go into lineage when we don't show anything in it - sounds like a bug and something we can create a github issue about!
l
about to populate the db + show you
nuking docker + then quickstarting. Should be done in 3'
have to redo nuke, quickstart, sorrrrry
b
haha no worries
l
Here is what I see in the UI for a particular dataset (do not freak out that it is a .csv file, this is just an example)
db contents coming soon
b
okay nice. yeah I don't see a platform instance there as you said
l
but I want to show it to you in the db 1 sec
b
totally
l
b
okay cool so that looks like data process instances, do you have data platform instances as well?
l
wait, let me show you the data process instance properties
because I only showed you the outputs
this is the data platform
b
okay so just realizing that at the current moment we don't actually do anything with data process instances in the UI. so sorry about this and I totally understand that that is confusing and unintuitive. In the meantime I would suggest you move to using data jobs instead of data process instances!
and now i'm looking at the platform stuff
okay is that data platform associated with the entity you care about above?
l
yes sir this is the one
platform -> data process instance -> input/output file(s)
but... it is ok I can work with data flows + datajobs instead
we are building a platform to help scientists. Not sysadmins
b
okay thank you for understanding! definitely understood why you would have done that in the first place
l
so they have some inputs, they run some simulations/machine learning jobs/...., they produce outputs
b
okay makes sense. I think data jobs would be your best bet for that then!
l
and I would like to be able to show them how the output happened. I.e. this file came from this process, which was part of this experiment, which used this code, which used this input
one last question: what is the deepest hierarchy I can have?
dataflow->datajob->file?
In the UI I get: orchestrator (part of dataflow but appears separately) -> dataflow -> datajob -> dataset
is that the deepest hierarchy I can get? I will have to somehow match the "orchestrator" context to my world, and the dataflow one too (it could be the "sim engine"); the datajob would be the particular run, and the dataset (file) would be the result. Can I do anything "deeper" (i.e. more steps in the hierarchy)?
hmm just realized the question may not be appropriate for this place...
b
looking into this super quick
l
thaaank you a lot!!
b
of course! so the main thing is that a DataFlow contains DataJobs. DataJobs can depend on each other and ultimately produce datasets. so you could have a dataflow with many different datajobs, and you could have datajob -> datajob
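to make that shape concrete, here is a rough sketch in plain Python (no DataHub client involved). The URN strings follow the layout I understand DataHub to use for dataFlow, dataJob and dataset; the helper names, the "sim-engine" orchestrator, the "file" platform and the job/file names are all made up for illustration. Jobs are linked here only through the datasets they share, so direct datajob -> datajob dependencies are not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical helpers: they only format strings in the URN layout assumed above.
def dataflow_urn(orchestrator: str, flow_id: str, cluster: str = "prod") -> str:
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def datajob_urn(flow_urn: str, job_id: str) -> str:
    # A DataJob URN embeds the URN of the DataFlow that contains it.
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Job:
    urn: str
    inputs: list[str] = field(default_factory=list)   # dataset URNs the job reads
    outputs: list[str] = field(default_factory=list)  # dataset URNs the job writes


def trace_upstream(dataset: str, jobs: list[Job]) -> list[str]:
    """Walk backwards from a dataset: producing job, its inputs, their producers, ..."""
    trail: list[str] = []
    seen: set[str] = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop(0)
        for job in jobs:
            if current in job.outputs and job.urn not in seen:
                seen.add(job.urn)
                trail.append(job.urn)
                for inp in job.inputs:
                    if inp not in seen:
                        seen.add(inp)
                        trail.append(inp)
                        frontier.append(inp)
    return trail


# Illustrative mapping: dataflow = experiment on the sim engine, datajob = one run step.
flow = dataflow_urn("sim-engine", "experiment-42")
raw = dataset_urn("file", "input.csv")
clean = dataset_urn("file", "clean.csv")
result = dataset_urn("file", "result.csv")

jobs = [
    Job(datajob_urn(flow, "preprocess"), inputs=[raw], outputs=[clean]),
    Job(datajob_urn(flow, "simulate"), inputs=[clean], outputs=[result]),
]

trail = trace_upstream(result, jobs)
for step in trail:
    print(step)
```

the printed trail goes from the result file back through each job and intermediate file to the original input, which is roughly the "how did this output happen" story.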
l
but datajob is a "running" process or a process that "ran-and-terminated" right?
ie something with a job id when you do %ps at the command line prompt. Correct?
or no?
b
these docs would help give more info as well: https://datahubproject.io/docs/metadata-modeling/metadata-model (then you can look specifically at DataFlow and DataJob)
and I believe you are correct
l
I have read them. Many times. But I will look again (maybe I am jaded; my viewpoint, that is, and it needs to change). And it is up to me to map my world to the datahub entity world
last LAST question. If I create my own entity types
those will NOT be shown in the UI, correct?
b
hmm well if you go through adding a new entity to the entity-registry, and all the work described in this piece of documentation then you should be able to! however, if you just ingest new entity types they will not show up in the UI
l
aha. Interesting. I need to try!!!!!
you are super useful!!!
b
glad to help!