# troubleshoot
Good morning DataHub people, I have a question regarding the integration of Great Expectations with DataHub. As far as I understood, the official way to upload Great Expectations results to a DataHub urn is to run the great_expectations command. But is it possible to update an already existing urn when I already have the results stored in a file from Great Expectations? I'm talking about some nice curl command which I could run, passing the JSON with the contents of the Great Expectations results; would that be possible? As always, thank you a lot for the help : )
hey Pawel! so, just so I understand you a bit better, you're saying you have data on DataHub from Great Expectations, but now you want to manually update it? would this perhaps be what you're looking for? https://datahubproject.io/docs/how/add-user-data/#using-file-based-ingestion-recipe
^that example is specific to adding a user, but the idea is the same: use file-based ingestion to update existing data.
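just as a rough sketch (the filename and server address below are placeholders, not anything from your setup), a file-based recipe would look something like this, and you'd run it with datahub ingest -c recipe.yml:
source:
  type: file
  config:
    # placeholder path to a local file of metadata change events to ingest
    filename: ./great_expectations_events.json
sink:
  type: datahub-rest
  config:
    # placeholder address of your DataHub backend
    server: http://localhost:8080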
Hi @bulky-soccer-26729, this is exactly what I needed. Do you have an example somewhere that uses Great Expectations? I am not entirely sure what I should pass in the aspects in order to update the stats tab.
@bulky-soccer-26729 one more thing: do you perhaps have an example of how to update an already existing urn with statistics stored in a Postgres database? Let's say I have already created a urn with a schema etc., but I would like to additionally fetch the Great Expectations stats from my Postgres DB and upload those stats to the urn. Does DataHub provide some example of how to proceed with this use case?
@bulky-soccer-26729 lastly, please correct me if I am wrong, but DataHub still does not support PandasExecutionEngine for Great Expectations, right?
hey Pawel! first, that is correct on Pandas: right now we only support SqlAlchemyExecutionEngine with Great Expectations.
then for the other two questions: if you ingest data for an existing urn, we simply perform an upsert, so it works the same as ingesting data for the first time for that urn! ingesting for an existing urn should be no different.
and I'm not sure if I have any specific examples for Great Expectations and your specific data, but let me look
unfortunately I'm not finding any examples right now, but I do know the stats tab is related to the datasetProfile aspect, which is documented here! https://datahubproject.io/docs/generated/metamodel/entities/dataset#datasetprofile-timeseries
@bulky-soccer-26729 I think I found an example using that datasetProfile in the DataHub repo: https://github.com/datahub-project/datahub/blob/108b492ed1b111e74d588789b60c476bcf[…]tadata-ingestion/tests/integration/trino/trino_mces_golden.json So basically this whole validation tab can be updated manually via a curl command; I think I could reuse the JSON example and pass it like below:
{
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:trino,library_catalog.librarydb.book,PROD)",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "datasetProfile",
    "aspect": {
        "value": "{\"timestampMillis\": 1632398400000, \"partitionSpec\": {\"type\": \"FULL_TABLE\", \"partition\": \"FULL_TABLE_SNAPSHOT\"}, \"rowCount\": 3, \"columnCount\": 6, \"fieldProfiles\": [{\"fieldPath\": \"id\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"1\", \"2\", \"3\"]}, {\"fieldPath\": \"name\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"Book 1\", \"Book 2\", \"Book 3\"]}, {\"fieldPath\": \"author\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"ABC\", \"PQR\", \"XYZ\"]}, {\"fieldPath\": \"publisher\", \"uniqueCount\": 0, \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}, {\"fieldPath\": \"tags\", \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}, {\"fieldPath\": \"genre_ids\", \"uniqueCount\": 0, \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}]}",
        "contentType": "application/json"
    },
    "systemMetadata": {
        "lastObserved": 1632398400000,
        "runId": "trino-test",
        "registryName": null,
        "registryVersion": null,
        "properties": null
    }
}
The backstory is: we run Great Expectations before the DataHub upload of data, and in order to avoid doing the same work twice, we want to just fetch the already existing Great Expectations results from the database, parse them, and upload them to the corresponding urns. I hope this curl command will be sufficient for us for now. Do you plan to add Pandas as an execution engine in the near future?
@bulky-soccer-26729 I tried to create a really trivial JSON file that could be used to update the validation tab for a given urn, and I came up with something like this:
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test.csv,DEV)",
        "aspects": [
          {
            "com.linkedin.dataset.DatasetFieldProfile": {
              "value": {
                "timestampMillis": 1632398400000
              }
            }
          }
        ]
      }
    }
  }
}
Unfortunately I probably missed something, because when running the curl command I receive a 400 error. Could you help out with creating the aspect structure for the update of the validation tab?
gotcha gotcha - yup I can help out! I have to finish something up first so I'll take a look at this soon
Thanks a lot @bulky-soccer-26729! Also one more small question: how does the Great Expectations stage know which urn the uploaded validation results should go to? Do you know where the urn name is fetched from?
okay, sorry about that, things have just been a bit crazy on my end here. First, can you send me any errors you have so I can check them out? Also, did you grab that urn from an existing dataset that you have on your instance? And for your second question, I'm a little confused about what you mean. Are you unsure how we get the different pieces of the urn?
Hi @bulky-soccer-26729, no worries, I am really thankful for all the help you provide : ) Okay, so I tried to run the curl command with the contents of the JSON file that I had specified above, like this:
curl 'http://localhost:8080/entities?action=ingest' -X POST --data-raw "$(<testing_curl.json)"
where testing_curl.json was:
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test.csv,DEV)",
        "aspects": [
          {
            "com.linkedin.dataset.DatasetFieldProfile": {
              "value": {
                "timestampMillis": 1632398400000
              }
            }
          }
        ]
      }
    }
  }
}
and the following error was received (I attached it as a file, because it was really long). This urn is just an example; unfortunately I cannot share the real urn name, but I don't suppose the urn name is the issue here, as other updates for the urn work fine. Most likely the DatasetFieldProfile is just constructed wrongly by me. And for the second question: as far as I understood, we should add the DataHub action to the Great Expectations YAML file in order to upload the Great Expectations results to DataHub. The question is: how does Great Expectations know which urn the update should go to? Let's say I have two urns:
urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test1.csv,DEV)
urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test2.csv,DEV)
How are those urn names fetched? Or in other words, how does Great Expectations/DataHub know which urn the Great Expectations results should be uploaded to?
okay, gotcha! yeah, so for your first issue, I believe you're missing some required fields in your field profile aspect. Also, I believe it's actually recommended to use the ingest aspects endpoint over the ingest entities endpoint when trying to just add or update an aspect. Check out the docs here, and those docs also point to a file with examples of ingesting different aspects. The example file is here, and around line 3108 you should see an example ingesting datasetProfile, and in its value it's ingesting fieldProfiles as well!
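roughly, the call could look something like the below; this is just a sketch that reuses the trino example urn and a trimmed-down datasetProfile value, so swap in your own urn and stats:
curl 'http://localhost:8080/aspects?action=ingestProposal' -X POST \
  -H 'X-RestLi-Protocol-Version: 2.0.0' \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "proposal": {
      "entityType": "dataset",
      "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:trino,library_catalog.librarydb.book,PROD)",
      "changeType": "UPSERT",
      "aspectName": "datasetProfile",
      "aspect": {
        "value": "{\"timestampMillis\": 1632398400000, \"rowCount\": 3, \"columnCount\": 6, \"fieldProfiles\": [{\"fieldPath\": \"id\", \"uniqueCount\": 3, \"nullCount\": 0, \"sampleValues\": [\"1\", \"2\", \"3\"]}]}",
        "contentType": "application/json"
      }
    }
  }'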
For your second piece, tbh I'm less familiar with the specific details of ingestion myself. I can point you first to some docs around urns, which talk about how urns are generated and what the different parts mean. I believe we gather metadata from Great Expectations, or whatever data source, and generate the urn we need based on that data to know what to update on DataHub. However, if you want more fine-grained info, or if this doesn't answer your question, I would suggest asking in #ingestion and someone will be able to help you out!
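and if you want to sanity-check which urn the results landed on, I believe you can fetch the entity back with a plain GET (url-encoded urn; again this uses the trino example urn rather than your real one):
curl 'http://localhost:8080/entities/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Atrino,library_catalog.librarydb.book,PROD)' \
  -H 'X-RestLi-Protocol-Version: 2.0.0'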
@bulky-soccer-26729 Thanks a lot for the help; for now I have managed to work around the issue 🙂