# troubleshoot
b
Hello, I have a question regarding a properties update. I recently ingested datasets into DataHub using S3 as the origin, and I can see that my datasets were uploaded correctly. Now I would like to update an urn by adding some custom properties to it. Unfortunately, running a curl command gives me an error. I think I did everything correctly, yet I get:
message:"No root resource defined for path '/datasets'","status":404}
Is it possible to update properties on datasets ingested from S3, and if so, how? My curl command:
curl --location --request POST 'http://localhost:8080/datasets?action=ingest' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
  "snapshot": {
    "aspects": [
      {
        "com.linkedin.dataset.DatasetProperties": {
          "customProperties": {
            "SuperProperty": "over 9000"
          }
        }
      }
    ],
    "urn": "urn:li:dataset:(urn:li:dataset:(urn:li:dataPlatform:s3,origin_file_src%2Fdata%2Ftest%2Fother_timeZ%2Ftime%2other_folder%2Fsome_folder%2Fexample.csv,DEV)"
  }
}'
The issue might be that my urn is incorrect - I copied it from the webpage URL. I tried to find the correct urn at http://localhost:9200/datasetindex_v2/_search?=pretty but for some reason dataPlatform:s3 is not visible there. Do you know how I can get my s3 urn, to be sure I have it set up correctly? Thanks in advance for the help! *EDIT: changing the urn to use . instead of %2F did not help
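A minimal sketch of searching that Elasticsearch index for s3 datasets to recover the urn (the "platform" field name is an assumption about the datasetindex_v2 mapping - inspect the index mapping first if it does not match):
import requests

# Query the datasetindex_v2 index directly for datasets on the s3 platform
resp = requests.get(
    "http://localhost:9200/datasetindex_v2/_search",
    json={"query": {"term": {"platform": "urn:li:dataPlatform:s3"}}},
)
# Print the urn stored in each matching document
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("urn"))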
d
There is a copy urn button on the UI which you can use to get the urn properly:
h
Hello @breezy-portugal-43538, just out of curiosity, did you use the s3 source to ingest datasets from s3, or something else? About your curl command: it looks like you are not using the correct endpoint. Since you need to update only a single aspect and not the entire entity, you should use the
/aspects?action=ingestProposal
endpoint with the appropriate payload - https://datahubproject.io/docs/metadata-service/#ingesting-aspects Alternatively, to save the trouble of creating a serialized JSON string, you can use the Python emitter to create a DatasetProperties aspect and emit it. https://datahubproject.io/docs/metadata-ingestion/as-a-library/#example-usage
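A minimal sketch of that emitter approach (the gms address and dataset path are placeholders, and the exact MetadataChangeProposalWrapper signature may differ slightly by version - see the linked docs):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

# Point the emitter at your GMS server
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Build the DatasetProperties aspect carrying the custom properties to upsert
properties = DatasetPropertiesClass(customProperties={"SuperProperty": "over 9000"})

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    # Placeholder urn - substitute the real dataset path
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:s3,path/to/example.csv,DEV)",
    aspectName="datasetProperties",
    aspect=properties,
)
emitter.emit_mcp(mcp)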
b
Hi! Thanks for the tips 🙂 After clicking the button to copy the urn, nothing is copied - it looks like the button does not take any action, and my clipboard stays empty. Can I retrieve the urn from somewhere else? As for the ingestion - yes, I used the S3 beta source https://datahubproject.io/docs/metadata-ingestion/source_docs/s3_data_lake/ Just to be clear - using ingestProposal, should the curl look like this?
curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
  "proposal" : {
    "entityType": "dataset",
    "entityUrn" : "urn:li:dataset:(urn:li:dataset:(urn:li:dataPlatform:s3,origin_file_src%2Fdata%2Ftest%2Fother_timeZ%2Ftime%2other_folder%2Fsome_folder%2Fexample.csv,DEV)",
    "changeType" : "UPSERT",
    "aspectName" : "DatasetProperties",
    "aspect" : {
      "customProperties" : "{{
        "SuperProperty": "over 9000"
      }",
      "contentType": "application/json"
    }
  }
}'
h
You need to pass the JSON-serialized aspect in aspect.value, like this:
curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
  "proposal" : {
    "entityType": "dataset",
    "entityUrn" : "<dataset urn>",
    "changeType" : "UPSERT",
    "aspectName" : "datasetProperties",
    "aspect" : {
      "value":"{\"customProperties\": {\"SuperProperty\": \"over 9000\"}}",
      "contentType": "application/json"
    }
  }
}'
For the URN, replace %2F with /. When you copy to the clipboard, do you see the tick mark indicating that the urn has been copied? Not sure why copy urn does not work for you. What browser/machine are you using?
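One way to avoid hand-escaping that serialized value string: build the payload in Python and let json.dumps handle the quoting (a sketch; the urn is a placeholder to substitute):
import json

# The aspect to upsert, as a plain Python dict
aspect = {"customProperties": {"SuperProperty": "over 9000"}}

payload = {
    "proposal": {
        "entityType": "dataset",
        "entityUrn": "<dataset urn>",  # placeholder - substitute your real urn
        "changeType": "UPSERT",
        "aspectName": "datasetProperties",
        "aspect": {
            # json.dumps produces the escaped string GMS expects in "value"
            "value": json.dumps(aspect),
            "contentType": "application/json",
        },
    }
}
# Prints the exact string you would pass to curl's --data-raw
print(json.dumps(payload))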
b
It's the Opera browser, and I don't see a tick. I tried to narrow down the problem, and here is what I did: first I ran an ingestion using S3 as the source, specifying my s3 path, and it succeeded - I can see my datasets, and the logs point to the correct urn:
[2022-04-29 12:51:54,146] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit s3://testing/folder1/test/iwinskiTest1/iwinskiTest2/iwinskiTest3/results2575/somestats.csv
[2022-04-29 12:51:54,169] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit container-urn:li:container:3ca95115310858747c3e3993be56c861-to-urn:li:dataset:(urn:li:dataPlatform:s3,testing/folder1/test/iwinskiTest1/iwinskiTest2/iwinskiTest3/results2575/somestats.csv,DEV)
[2022-04-29 12:51:54,170] INFO     {datahub.cli.ingest_cli:106} - Finished metadata ingestion
After trying to run the command:
$ datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:s3,testing/folder1/test/iwinskiTest1/iwinskiTest2/iwinskiTest3/results2575/somestats.csv,DEV)"
I receive the following error:
..................................................
     entity_urn = 'urn:li:dataset:(urn:li:dataPlatform:s3,testing/folder1/test/iwinskiTest1/iwinskiTest2/iwinskiTest3/results2575/somestats.csv,DEV)'
     aspects = ()
     List = typing.List
     typed = False
     cached_session_host = None
     Optional = typing.Optional
     Tuple = typing.Tuple
     Session = <class 'requests.sessions.Session'>
     Dict = typing.Dict
     Union = typing.Union
     DictWrapper = <class 'avrogen.dict_wrapper.DictWrapper'>
     entity_response = {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                        'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:404]\n\tat com.linkedin.restli.server.Res
                        tLiServiceException.fromThrowable(RestLiServiceException.java:315)\n\tat com.linkedin.restli.server.BaseRestLiServer.bui
                        ldPreRoutingError(BaseRestLiServer.java:202)\n\tat com.linkedin.restli.server.RestRestLiServer.buildPreRoutingRestExcept
                        ion(RestRestLiServer.java:254)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.
                        java:228)\n\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)\n\tat com.linkedi
                        n.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)\n\tat com.linkedin.restli.server.RestLiServer.
                        handleRequest(RestLiServer.java:130)\n\tat com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(De
                        legatingTransportDispatcher.java:70)\n\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(Dispat
                        cherRequestFilter.java:70)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)\n\tat com
                        .linkedin.r2.filter.FilterChainIterator$FilterCh...
     non_timeseries_aspects = []
    ..................................................

---- (full traceback above) ----
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 138, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/src/datahub/telemetry/telemetry.py", line 304, in wrapper
    raise e
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/src/datahub/telemetry/telemetry.py", line 256, in wrapper
    res = func(*args, **kwargs)
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/src/datahub/cli/get_cli.py", line 38, in get
    get_aspects_for_entity(entity_urn=urn, aspects=aspect, typed=False),
File "/sharedvolume/datahub_tbd/datahub/metadata-ingestion/src/datahub/cli/cli_utils.py", line 673, in get_aspects_for_entity
    aspect_list: Dict[str, dict] = entity_response["aspects"]
For simplicity I pasted only the last part of the log - if required, I can paste the full output from the get command. I'm not sure, but it looks like the urn is somehow incorrect... could you advise on further steps to resolve the issue?
*update: I was able to update customProperties using the curl provided, thank you so much! 🙂 But the issue of getting the urn still persists - is it a known issue?
h
For the datahub get --urn, can you please share the complete output?
b
I have attached the output.
h
Hi @breezy-portugal-43538, the URN looks alright to me. From the logs it's clear that the error was returned by the datahub-gms server. To understand why the gms server failed, is it possible for you to share the datahub-gms container logs from the time of this request? Also, which versions of the datahub cli and datahub-gms are you using? Does the corresponding dataset look okay on the DataHub UI?
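As a quick check of whether gms can resolve that urn at all, a direct entity read may help (a sketch; the urn is taken from the ingestion log above):
import urllib.parse
import requests

# URL-encode the full urn, then fetch the entity straight from GMS
urn = "urn:li:dataset:(urn:li:dataPlatform:s3,testing/folder1/test/iwinskiTest1/iwinskiTest2/iwinskiTest3/results2575/somestats.csv,DEV)"
resp = requests.get(
    "http://localhost:8080/entities/" + urllib.parse.quote(urn, safe=""),
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
)
print(resp.status_code)
print(resp.json())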
b
Hi @hundreds-photographer-13496, I can share it, just send me some instructions on how to do that and where to find those logs 😄 As for the gms and cli: I pulled the latest changes from master and additionally did the steps regarding the s3 datalake source: https://datahubproject.io/docs/metadata-ingestion/source_docs/s3_data_lake Of course this step also required:
../gradlew :metadata-ingestion:installDev
source venv/bin/activate
Here is the output from datahub version:
$ datahub version
/home/mluser/.local/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/x509.py:14: CryptographyDeprecationWarning: This version of cryptography contains a temporary pyOpenSSL fallback path. Upgrade pyOpenSSL now.
  warnings.warn(
DataHub CLI version: 0.8.31.6
Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
h
Well, simply use
docker logs datahub-gms
to view datahub-gms container logs. (More details here - https://datahubproject.io/docs/how/extract-container-logs/)
b
Hi @hundreds-photographer-13496, I pasted the gms logs as an attachment. If anything else is required, please let me know.
h
I don't see any logs related to your error - probably these are older logs (from before the error occurred)? Can you share the datahub-gms logs from around the time
[2022-04-29 14:31:17,749] ERROR
when the error occurred? Alternatively, just re-execute the command and share the most recent logs.
Hey @breezy-portugal-43538, do you mind creating a new thread for the datahub cli error you face when using
datahub get --urn
? Please include the cli command, the cli response, and the relevant datahub-gms log in that thread. The original issue on the thread (adding custom properties) is already resolved.
b
@hundreds-photographer-13496 sure, would you like me to create it here or on github as an open issue?
h
I was thinking of a slack thread to start with.
b
Sure, I will gather fresh logs in my free time and do it. Thanks again for all the help you provided : )