breezy-portugal-43538
06/17/2022, 11:07 AM
great_expectations command. But is it possible to update the already existing urn when I already have the results stored in a file from great_expectations? I'm talking about a nice curl command which I could run, passing a JSON with the contents of the great_expectations results. Would that be possible?
As always, thank you a lot for the help :)

bulky-soccer-26729
06/17/2022, 2:32 PM

bulky-soccer-26729
06/17/2022, 2:33 PM

bulky-soccer-26729
06/17/2022, 2:34 PM

breezy-portugal-43538
06/20/2022, 6:32 AM

breezy-portugal-43538
06/20/2022, 11:45 AM

breezy-portugal-43538
06/20/2022, 11:51 AM

bulky-soccer-26729
06/21/2022, 2:19 PM

bulky-soccer-26729
06/21/2022, 2:21 PM

bulky-soccer-26729
06/21/2022, 2:21 PM

bulky-soccer-26729
06/21/2022, 2:25 PM
datasetProfile aspect, which is documented here! https://datahubproject.io/docs/generated/metamodel/entities/dataset#datasetprofile-timeseries

breezy-portugal-43538
06/22/2022, 12:25 PM
{
  "auditHeader": null,
  "entityType": "dataset",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:trino,library_catalog.librarydb.book,PROD)",
  "entityKeyAspect": null,
  "changeType": "UPSERT",
  "aspectName": "datasetProfile",
  "aspect": {
    "value": "{\"timestampMillis\": 1632398400000, \"partitionSpec\": {\"type\": \"FULL_TABLE\", \"partition\": \"FULL_TABLE_SNAPSHOT\"}, \"rowCount\": 3, \"columnCount\": 6, \"fieldProfiles\": [{\"fieldPath\": \"id\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"1\", \"2\", \"3\"]}, {\"fieldPath\": \"name\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"Book 1\", \"Book 2\", \"Book 3\"]}, {\"fieldPath\": \"author\", \"uniqueCount\": 3, \"uniqueProportion\": 1.0, \"nullCount\": 0, \"nullProportion\": 0.0, \"sampleValues\": [\"ABC\", \"PQR\", \"XYZ\"]}, {\"fieldPath\": \"publisher\", \"uniqueCount\": 0, \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}, {\"fieldPath\": \"tags\", \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}, {\"fieldPath\": \"genre_ids\", \"uniqueCount\": 0, \"nullCount\": 3, \"nullProportion\": 1.0, \"sampleValues\": []}]}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1632398400000,
    "runId": "trino-test",
    "registryName": null,
    "registryVersion": null,
    "properties": null
  }
}
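The payload above is a JSON document whose aspect "value" is itself a serialized JSON string, which is easy to get wrong by hand. A minimal Python sketch of building such a body (the outer "proposal" wrapper and the /aspects?action=ingestProposal endpoint named in the usage note are assumptions based on DataHub's REST ingestion conventions; verify against your GMS version):

```python
import json

# The datasetProfile aspect payload, trimmed to a couple of fields for brevity.
profile = {
    "timestampMillis": 1632398400000,
    "partitionSpec": {"type": "FULL_TABLE", "partition": "FULL_TABLE_SNAPSHOT"},
    "rowCount": 3,
    "columnCount": 6,
}

proposal = {
    "proposal": {
        "entityType": "dataset",
        "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:trino,library_catalog.librarydb.book,PROD)",
        "changeType": "UPSERT",
        "aspectName": "datasetProfile",
        "aspect": {
            # The aspect value must be serialized a second time, which is why
            # the curl payload above contains an escaped JSON string.
            "value": json.dumps(profile),
            "contentType": "application/json",
        },
    }
}

print(json.dumps(proposal, indent=2))
```

The printed body could then be sent with something like `curl 'http://localhost:8080/aspects?action=ingestProposal' -X POST -H 'Content-Type: application/json' --data @proposal.json` (endpoint assumed, as noted above).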
Backstory: we run great_expectations before the DataHub upload of the data, and to avoid doing the same work twice we want to just fetch the already existing great_expectations results from the database, parse them, and upload the corresponding urns with them. I hope this curl command will be sufficient for us for now. Do you plan to add pandas as an execution engine in the near future?

breezy-portugal-43538
06/24/2022, 2:00 PM
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test.csv,DEV)",
        "aspects": [
          {
            "com.linkedin.dataset.DatasetFieldProfile": {
              "value": {
                "timestampMillis": 1632398400000
              }
            }
          }
        ]
      }
    }
  }
}
Unfortunately I probably missed something, because when running the curl command I received a 400 error. Could you help out with creating the aspect structure for the update of the Validation tab?

bulky-soccer-26729
06/24/2022, 2:01 PM

breezy-portugal-43538
06/27/2022, 6:42 AM

bulky-soccer-26729
06/27/2022, 1:22 PM

breezy-portugal-43538
06/27/2022, 1:47 PM
curl 'http://localhost:8080/entities?action=ingest' -X POST --data-raw "$(<testing_curl.json)"
where testing_curl.json was:
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test.csv,DEV)",
        "aspects": [
          {
            "com.linkedin.dataset.DatasetFieldProfile": {
              "value": {
                "timestampMillis": 1632398400000
              }
            }
          }
        ]
      }
    }
  }
}
and the following error was received (I attached it as a file, because it was really long).
This urn is just an example; unfortunately I cannot share the real urn name, but I don't suppose the urn name is the issue here, as all other updates for the urn work fine. Most likely the DatasetFieldProfile is just wrongly constructed by me.
And for the second question:
As far as I understood, we should add the DataHub action to the great_expectations yml file in order to upload the great_expectations results to DataHub. The question is: how does great_expectations know which urn the update should go to?
Let's say I have two urns:
urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test1.csv,DEV)
urn:li:dataset:(urn:li:dataPlatform:s3,testing7/testing6/testing5/testing4/testing3/testing2/testing1/test2.csv,DEV)
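(For reference, urns of this shape can be assembled mechanically. The helper below is a hypothetical illustration mirroring the urn:li:dataset:(urn:li:dataPlatform:&lt;platform&gt;,&lt;name&gt;,&lt;env&gt;) pattern visible in the two urns above, similar in spirit to the urn-builder helpers in the DataHub Python emitter; it is not the library's actual implementation.)

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Illustrative only: formats a dataset urn following the convention
    # seen in this thread's examples (platform, dataset name, environment).
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

print(make_dataset_urn(
    "s3",
    "testing7/testing6/testing5/testing4/testing3/testing2/testing1/test1.csv",
    "DEV",
))
```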
How are those urn names fetched? In other words, how does great_expectations/DataHub know where the results from great_expectations should be uploaded (i.e. to which urn)?

bulky-soccer-26729
06/27/2022, 2:08 PM
datasetProfile, and in its value it's ingesting fieldProfiles as well!

bulky-soccer-26729
06/27/2022, 2:25 PM

breezy-portugal-43538
07/04/2022, 10:04 AM
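To summarize the resolution in this thread: DatasetFieldProfile is not a standalone aspect, which is the likely cause of the 400 from the snapshot payload; per-field statistics travel inside the serialized datasetProfile value under fieldProfiles, as in the working curl payload earlier. A minimal sketch of a well-formed value (field names follow the datasetProfile schema linked above; treat this as an illustration, not a complete payload):

```python
import json

# Per-field stats nest under "fieldProfiles" inside the datasetProfile value;
# they are not sent as a separate DatasetFieldProfile aspect.
profile_value = {
    "timestampMillis": 1632398400000,
    "rowCount": 3,
    "columnCount": 6,
    "fieldProfiles": [
        {
            "fieldPath": "id",
            "uniqueCount": 3,
            "uniqueProportion": 1.0,
            "nullCount": 0,
            "nullProportion": 0.0,
            "sampleValues": ["1", "2", "3"],
        }
    ],
}

# Serialize once more before placing it in the aspect's "value" field,
# matching the escaped string in the working payload above.
aspect = {"value": json.dumps(profile_value), "contentType": "application/json"}
print(aspect["value"])
```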