# advice-metadata-modeling
m
Question: We are exploring DataHub and are new to it, but we are having difficulty figuring out how to model the following. Our goal is to use DataHub in the Azure ecosystem. Modeling datasets from Azure Data Lake works smoothly and is easy to express in the DataHub metadata model. We also want to model data lineage with DataJobs/DataFlows and present Azure Data Factory pipelines similarly to how Airflow is presented. From our understanding, Airflow is modeled as an orchestrator and shown as a DataPlatform in the UI. We want to make an Azure Data Factory (ADF) entity to represent the pipelines from ADF and model them as is done for Airflow (DataJob). We have not succeeded with that. Can anyone help?
s
What problem are you facing, exactly?
Adding the ADF logo?
This is an example of sending an orchestrator, flow, and job to DataHub:
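(A minimal sketch of what this looks like, in the style of DataHub's lineage_job_dataflow.py example; the GMS address and the flow/job IDs are illustrative:)
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AzkabanJobTypeClass,
    ChangeTypeClass,
    DataFlowInfoClass,
    DataJobInfoClass,
)

# The flow (pipeline) the job belongs to; "airflow" is the orchestrator name.
flow_urn = builder.make_data_flow_urn(
    orchestrator="airflow", flow_id="flow1", cluster="PROD"
)

flow_mcp = MetadataChangeProposalWrapper(
    entityType="dataFlow",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=flow_urn,
    aspectName="dataFlowInfo",
    aspect=DataFlowInfoClass(name="flow1"),
)

# A job (task) inside that flow.
job_mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_data_job_urn(
        orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"
    ),
    aspectName="dataJobInfo",
    aspect=DataJobInfoClass(
        name="job1", type=AzkabanJobTypeClass.COMMAND, flowUrn=flow_urn
    ),
)

# Emit both proposals to a local GMS.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(flow_mcp)
emitter.emit_mcp(job_mcp)
```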
If getting the lineage information itself is the challenge, then that depends on whether ADF exports lineage information by default and whether you can find a way to hook that into DataHub, similar to how it is done for Airflow.
m
Yes, adding a logo is one thing. Getting a DataPlatform like Airflow is another. We will provide the information from ADF; we just want to model it. When we add a URN with, say, ADF, it has no logo and is not visible in the UI under Data Platforms the way Airflow is.
s
Have you added a DataPlatform entity for ADF?
m
Yes, I have added a DataPlatform for ADF,
but it is not visible, and I do not know how to link the DataJob to it (as is done with Airflow).
s
It is case-sensitive. It has to be exactly the same as in your URN. What is your URN, and what did you add for your data platform exactly?
m
The URN for ADF is: `urn:li:dataPlatform:ADL`
But how do I connect it? Say:
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import ChangeTypeClass

# datajob_info is a DataJobInfoClass built earlier
chart_info_mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_data_job_urn(
        orchestrator="ADF", flow_id="flow2", job_id="job1", cluster="PROD"
    ),
    aspectName="dataJobInfo",
    aspect=datajob_info,
)
```
This does not work.
s
Because they are different: `urn:li:dataPlatform:ADL`
Notice the `L` at the end.
m
Okay what to do?
I tried that
s
Your data platform is named `ADL`, not `ADF`, which is what you are sending in as the orchestrator.
Those need to be the same for the logo to work.
m
That was a typo, I copied the wrong one. It is ADF.
Sorry about that. It is ADF for Azure Data Factory (ADL is for Azure Data Lake).
s
Can you please share which example you used to write your code, after cross-checking the spelling for any other typos or case problems?
I see other examples have `dataJob`. Can you please try with `datajob`? It seems there are some contradictory examples; I will cross-check them.
m
Yes, I will try.
No, I get the same. In `DataJobInfoClass`, do you know how the `type` arg is used? It says it is an `AzkabanJobTypeClass`.
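(For reference, a hedged sketch of how the `type` argument is usually supplied, using one of the string constants that the generated `AzkabanJobTypeClass` provides; the job name is illustrative:)
```python
from datahub.metadata.schema_classes import AzkabanJobTypeClass, DataJobInfoClass

# AzkabanJobTypeClass is a bag of string constants (e.g. COMMAND);
# the `type` field simply records what kind of job this is.
datajob_info = DataJobInfoClass(
    name="job1",
    type=AzkabanJobTypeClass.COMMAND,
)
```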
s
Are you seeing any errors in the GMS pod?
m
No, I see INFO lines like:
```
09:59:48.374 [qtp544724190-15] INFO  c.l.m.r.entity.EntityResource:111 - GET urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14)
```
s
When you ingest things, do you see any errors?
m
Just a sec, I see one here.
s
Can you check your database table to see if the URN has any data?
m
No, that was a search error. If I resend the same ingest, it writes:
```
10:18:27.897 [qtp544724190-14] INFO  c.l.m.r.entity.AspectResource:125 - INGEST PROPOSAL proposal: {aspectName=dataJobInfo, entityUrn=urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14), entityType=datajob, aspect={contentType=application/json, value=ByteString(length=153,bytes=7b226375...6429227d)}, changeType=UPSERT}
10:18:27.915 [pool-8-thread-1] INFO  c.l.m.filter.RestliLoggingFilter:56 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 18ms
```
Not sure where to do this: "Can you check your database table to see if the URN has any data?"
s
In your MySQL database there is going to be a table whose name has v2 in it.
If you look at the keys, they are URN and version.
Look for your URN with version = 0.
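(A minimal sketch of that lookup, assuming the default quickstart table name `metadata_aspect_v2` and one of the URNs from the logs above:)
```sql
-- The latest state of each aspect is stored at version = 0.
SELECT urn, aspect, version, metadata
FROM metadata_aspect_v2
WHERE urn = 'urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14)'
  AND version = 0;
```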
m
Okay, I am not really good at this (sorry). I open a CLI from Docker and get a terminal. Then I start mysql, but it says "Access denied for user ...". I need a username and password, I guess?
s
Can you try `root` as the user and `datahub` as the password?
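(With the docker quickstart, that would look something like the following; the container name `mysql` and database name `datahub` are assumptions based on the standard quickstart compose file:)
```bash
# Open a mysql shell inside the quickstart MySQL container.
docker exec -it mysql mysql -u root -pdatahub datahub
```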
m
Are we looking for something like this?
```
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow12,PROD),job12) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow12,PROD),job12) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow12,PROD),job12) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow12,PROD),job12) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow14,PROD),job14) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow3,prod),job3)   |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow3,prod),job3)   |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow3,prod),job3)   |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADF,flow3,prod),job3)   |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADL,flow11,PROD),job11) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADL,flow11,PROD),job11) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADL,flow11,PROD),job11) |
| urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADL,flow11,PROD),job11) |
```
s
Yes, the version should be 0. The third column contains the value.
m
version 0 - yes
s
Can you check whether your DataPlatform is in there or not, by looking for the DataPlatform URN that you shared earlier?
m
Yes, they are both there: `urn:li:dataPlatform:ADF` and `urn:li:dataPlatform:ADL`
s
And the image URLs open up in the browser?
Can you share the image URLs?
m
https://orangeman.dk/wp-content/uploads/2019/06/DataFactory.jpg
for the ADF. The ADL one is working fine.
ADL is working for datasets and is all good. The ADF is not showing up like Airflow, and is not getting a correct name in the DataHub UI.
o
Hey Rune! Sorry for leaving you hanging here. What do you get in your front page results under the platform section? Does ADF show up at all (even without the icon)? Also, are you saying that Airflow does not show up as a platform either?
If you do:
```
curl http://localhost:8080/entities/urn%3Ali%3AdataPlatform%3AADF
```
Do you get anything back? (Pointed at wherever your GMS is located; this example assumes it's localhost.)
m
Hi Ryan, no worries. The entities with ADF show up without icons and with the URN instead of the display name. The Airflow ones work as expected with icon and display name. ADF also does not show up in the Data Platform view (the Airflow ones do). The curl gives:
```json
{
  "value": {
    "com.linkedin.metadata.snapshot.DataPlatformSnapshot": {
      "urn": "urn:li:dataPlatform:ADF",
      "aspects": [
        {
          "com.linkedin.metadata.key.DataPlatformKey": {
            "platformName": "ADF"
          }
        },
        {
          "com.linkedin.dataplatform.DataPlatformInfo": {
            "name": "Azure Data Factory",
            "datasetNameDelimiter": "/",
            "type": "OTHERS",
            "displayName": "ADF",
            "logoUrl": "https://orangeman.dk/wp-content/uploads/2019/06/DataFactory.jpg"
          }
        }
      ]
    }
  }
}
```
o
Hmm, can you look in your browser inspector's network tab and see if a request goes out to that image URL at all and possibly errors? Or if there are any errors in the console tab?
So it looks like your DataFlows are set up with `orchestrator=DataPlatformUrn(ADF)` rather than just `orchestrator=ADF`. Are the ADL ones that are working set up this same way? I see some results above that have ADL with the same setup, where it looks like:
`urn:li:dataJob:(urn:li:dataFlow:(urn:li:dataPlatform:ADL,flow11,PROD),job11)`
Can you confirm that the logo shows up on that URN? The issue is that we have logic to map the name -> platformUrn, but you have the fully formed platform URN as the orchestrator, so it tries to find a platform with the name `urn:li:dataPlatform:ADF` rather than the name `ADF`, and since you don't have a platform URN set as `urn:li:dataPlatform:urn:li:dataPlatform:ADF`, it does not find the result. See the logic here: https://github.com/linkedin/datahub/blob/master/datahub-web-react/src/app/entity/dataFlow/DataFlowEntity.tsx#L99-L110
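(Concretely, with the Python builder that means passing the bare platform name as the orchestrator; a sketch of the difference, with placeholder flow/job IDs:)
```python
import datahub.emitter.mce_builder as builder

# Works with the UI's name -> platformUrn mapping: the orchestrator is the
# bare name, which the frontend expands to urn:li:dataPlatform:ADF.
job_urn = builder.make_data_job_urn(
    orchestrator="ADF", flow_id="flow11", job_id="job11", cluster="PROD"
)

# What the URNs above suggest was happening: the fully formed platform URN
# is embedded verbatim, so the frontend looks for a platform literally
# named "urn:li:dataPlatform:ADF" and finds nothing.
bad_job_urn = builder.make_data_job_urn(
    orchestrator="urn:li:dataPlatform:ADF",
    flow_id="flow11",
    job_id="job11",
    cluster="PROD",
)
```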
m
Hi Ryan, if I understand correctly, that was my starting point. I started with the following two examples (after adding ADF):
1st one: https://github.com/linkedin/datahub/blob/f1045f817cb4300962dea6b41a6254f1f515fa57/metadata-ingestion/examples/library/lineage_job_dataflow.py
I changed `orchestrator="airflow"` to `orchestrator="ADF"` (in both places), and also the `type="AIRFLOW"` (it is a bit unclear to me what that does). Then your colleague suggested using the full URN of ADF.
2nd: https://github.com/linkedin/datahub/blob/f1045f817cb4300962dea6b41a6254f1f515fa57/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py
Here I made the same modifications.
When using "airflow" it works as expected:
• I get the Airflow icon
• I get an Airflow DataPlatform
When using "ADF" or URN(ADF):
• I get no icon
• I get no ADF DataPlatform
• It is as if it was not created correctly.
o
Are you talking about:
"It is case-sensitive. It has to be exactly the same as in your URN. What is your URN, and what did you add for your data platform exactly?"
This was meant as: it has to be "ADF" rather than "adf" or other variations, not that it is supposed to be exactly the full URN. Sorry for any misunderstanding there. I looked at the implementation of the getLogoFromPlatform method, and it looks like there is a static list of supported platforms in the frontend rather than it executing a query: https://github.com/linkedin/datahub/blob/master/datahub-web-react/src/app/shared/getLogoFromPlatform.tsx Since your platform is not in this list, it doesn't work. Syncing with the team for a solution.
m
Thank you Ryan, I appreciate all the help. If I understand you correctly, it is not possible as-is, but you will ask your team and get back.
o
Correct! I wrote up a fix for this yesterday; it's now up for review: https://github.com/linkedin/datahub/pull/3968 Once it has been merged in and you update to latest, this should work with formats like:
`urn:li:dataJob:(urn:li:dataFlow:(ADF,flow11,PROD),job11)`
assuming you have a platform that matches `urn:li:dataPlatform:ADF` with a proper logo URL.
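(For completeness, a hedged sketch of how such a platform entity could be emitted from Python; the field values mirror the curl output earlier in the thread, and the old-style MetadataChangeProposalWrapper arguments match the snippets above:)
```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DataPlatformInfoClass,
    PlatformTypeClass,
)

# Register the custom ADF platform with a display name and logo.
platform_mcp = MetadataChangeProposalWrapper(
    entityType="dataPlatform",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:dataPlatform:ADF",
    aspectName="dataPlatformInfo",
    aspect=DataPlatformInfoClass(
        name="Azure Data Factory",
        displayName="ADF",
        type=PlatformTypeClass.OTHERS,
        datasetNameDelimiter="/",
        logoUrl="https://orangeman.dk/wp-content/uploads/2019/06/DataFactory.jpg",
    ),
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(platform_mcp)
```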
m
Hi Ryan, first of all, thank you for your big effort. I restarted DataHub locally today with `datahub docker quickstart`. Then I got some errors in the DataHub web interface. I decided to nuke and start all over, but both before and after I re-ingested our data, the web interface gives errors. I don't know if it is related to the recent changes. It shows the following message in the web interface:
```
Validation error of type FieldUndefined: Field 'platform' in type 'Dashboard' is undefined @ 'browse/entities/platform'
Validation error of type FieldUndefined: Field 'platform' in type 'Chart' is undefined @ 'browse/entities/platform'
Validation error of type FieldUndefined: Field 'platform' in type 'DataFlow' is undefined @ 'browse/entities/platform'
Validation error of type FieldsConflict: platform: fields have different nullability shapes @ 'browse/entities'
```
(with the last line repeated 12 times)
o
Hmm, let me check it out and see if I can reproduce.
Looks like the GMS image has not pulled in the changes yet, but the frontend has. Working through why that happened.
m
Thank you Ryan. It seems to work now. The only difference I see between using Airflow and our own (ADF) is that the Airflow ones also appear under the Platform list (not sure we care, but maybe it is supposed to?). Just another question for my understanding: there is both a `flow_id` and a `job_id`. How are they to be used? My guess would be: `flow_id` identifies the "workflow" and `job_id` is an actual run of that workflow. I am not sure that is how it should be used? If so, I get new entities for each `job_id`.
o
Airflow is ingested as a platform with the sample data; if you don't want it to be in the platform list, you would need to delete the platform entity related to it and not use it. That is the correct understanding of the relationship between jobs and flows; full documentation about the models can be found here: https://datahubproject.io/docs/graphql/objects/#datajob This RFC outlines how DataJobs and DataFlows are envisioned with Azkaban and may help with understanding the design as well: https://datahubproject.io/docs/rfc/active/1820-azkaban-flow-job
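(If you do want to remove it, something like the following with the datahub CLI should work, assuming a CLI version that supports deleting by URN:)
```bash
# Soft-delete the sample Airflow platform entity so it stops
# appearing in the platform list.
datahub delete --urn "urn:li:dataPlatform:airflow"
```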
m
Thank you Ryan, you have been of great help.