# troubleshoot
m
Hi guys, I ran into a problem ingesting metadata from S3. The recipe I use is attached. At first it worked well and I got datasets at table granularity. But after I enabled profiling and then set it back, I only get containers, which sit one level above the tables, and no more table-level datasets. Any help will be appreciated. Thank you!
source:
  type: s3
  config:
    platform: s3
    profiling:
      enabled: false
      profile_table_level_only: false
    path_specs:
      - include: "s3://path/cluster=dev/datatype={table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet"
    aws_config:
      aws_region: us-east-1

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
h
Hi @microscopic-room-90690, can you upload the output of the file sink here in this thread? Also mention which dataset you expected to see but is missing, and the corresponding S3 path.
m
Hi @hundreds-photographer-13496, the output is attached. The datasets are expected to be named after the {table} part of the URL s3://path/cluster=dev/datatype={table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet
"fieldPath": "p95",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "p96",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "p97",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "p98",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "p99",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "region",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "rowKey",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "ts",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            },
                            {
                                "fieldPath": "year",
                                "jsonPath": null,
                                "nullable": false,
                                "description": null,
                                "created": null,
                                "lastModified": null,
                                "type": {
                                    "type": {
                                        "com.linkedin.pegasus2avro.schema.StringType": {}
                                    }
                                },
                                "nativeDataType": "string",
                                "recursive": false,
                                "globalTags": null,
                                "glossaryTerms": null,
                                "isPartOfKey": false,
                                "isPartitioningKey": null,
                                "jsonProps": null
                            }
                        ],
                        "primaryKeys": null,
                        "foreignKeysSpecs": null,
                        "foreignKeys": null
                    }
                }
            ]
        }
    },
    "proposedDelta": null,
    "systemMetadata": {
        "lastObserved": 1667285121057,
        "runId": "s3-2022_11_01-06_44_04",
        "registryName": null,
        "registryVersion": null,
        "properties": null
    }
},
{
    "auditHeader": null,
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:s3,path/cluster=dev/joinvoip,PROD)",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "container",
    "aspect": {
        "value": "{\"container\": \"urn:li:container:55b31a9bf2521237914e6ad52ccb5f4f\"}",
        "contentType": "application/json"
    },
    "systemMetadata": {
        "lastObserved": 1667285121089,
        "runId": "s3-2022_11_01-06_44_04",
        "registryName": null,
        "registryVersion": null,
        "properties": null
    }
}
@hundreds-photographer-13496 I found something interesting. If I use a recipe with a specific path_spec, it works well. After that, if I remove the previously ingested metadata and run the ingestion again, the metadata does not appear until I use a different one.
h
@microscopic-room-90690 this is not the complete ingestion file. Can you upload the entire file?
Also, I am slightly uncertain about specifying part of a folder name as the table instead of the entire folder name, i.e.
"s3://path/cluster=dev/datatype={table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet"
versus
"s3://path/cluster=dev/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet"
Can you provide the absolute S3 path for which this dataset is created: urn:li:dataset:(urn:li:dataPlatform:s3,path/cluster=dev/joinvoip ?
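To make the difference between the two path_spec forms above concrete, here is a minimal Python sketch of how a {table} placeholder could be resolved against an object path. It is only an illustration under stated assumptions: the helper names path_spec_to_regex and table_name are hypothetical, the sample object path is made up, and this is not the s3 source's actual matching code.

```python
import re

def path_spec_to_regex(include: str) -> "re.Pattern[str]":
    """Turn a path_spec-style template into a regex (illustrative only)."""
    pattern = ""
    pos = 0
    # Placeholders handled here: {table}, {partition[N]} and the '*' wildcard.
    for m in re.finditer(r"\{table\}|\{partition\[\d+\]\}|\*", include):
        # Everything between placeholders is matched literally.
        pattern += re.escape(include[pos:m.start()])
        if m.group(0) == "{table}":
            pattern += r"(?P<table>[^/]+)"  # capture one path segment (or part of one)
        else:
            pattern += r"[^/]+"             # any text within a single path segment
        pos = m.end()
    pattern += re.escape(include[pos:])
    return re.compile(pattern)

def table_name(include: str, path: str):
    """Return the text matched by {table}, or None if the path does not match."""
    m = path_spec_to_regex(include).fullmatch(path)
    return m.group("table") if m else None

# Hypothetical object path, used only for this example.
sample = "s3://path/cluster=dev/datatype=joinvoip/year=2022/month=11/day=01/part-0.parquet"

partial_folder = "s3://path/cluster=dev/datatype={table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet"
whole_folder = "s3://path/cluster=dev/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/*.parquet"

print(table_name(partial_folder, sample))  # joinvoip
print(table_name(whole_folder, sample))    # datatype=joinvoip
```

In this toy model both forms match the same files; the only difference is whether the key prefix (datatype=) ends up in the table name.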
m
@hundreds-photographer-13496 Sorry, this is the entire file. As I see it, part of a folder name can be specified as the table name. However, if I remove the metadata and run the ingestion again, it doesn't work, so I'm wondering whether some default setting causes it. This is the URN: urn:li:dataset:(urn:li:dataPlatform:s3,companydev-watch2/data-lake-hudi/ASYNCMQ_QOS/cluster=dev/DOWNLINK_NETWORK,PROD)
h
Got it. So on a fresh instance everything works fine, and the datasets stop showing up after you delete the metadata. Is that right? Regarding https://datahubspace.slack.com/archives/C029A3M079U/p1667294786258309?thread_ts=1667284071.274729&cid=C029A3M079U : how did you remove the previously ingested metadata?
g
If you’re using datahub delete without --hard, the “soft-deleted” status might be sticking even though the ingestion was successful. A workaround for now would be to use the --hard flag so that everything is re-ingested completely fresh.
m
@hundreds-photographer-13496 Yes, I removed the metadata using datahub delete --env PROD --entity_type container --platform s3. And @gray-shoe-75895, it works with --hard! I have another question: how do I recover metadata after a soft delete?
Also, I'm confused: in the recipe I set every profiling option to false, yet the metadata still shows null-value and distinct-value metrics.
g
That looks like a bug. It should be fixed by this PR: https://github.com/datahub-project/datahub/pull/6354
If you want to speed up ingestion, it’d likely be easier to set profile_table_level_only to true or disable profiling altogether.
m
@hundreds-photographer-13496 Got it! And if I use a soft delete, how do I make the metadata show again?
h
@microscopic-room-90690 a soft delete simply sets the status aspect to removed=True. So updating the status aspect for all S3 entities to removed=False should make the metadata show again. Ideally, that would happen automatically when you run the S3 ingestion again. However, the s3 source currently does not emit the status aspect. It should be easy to fix the source; let me come back on this.
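The behaviour described above can be sketched with a small in-memory toy model. This is not DataHub code: the ToyMetadataStore class and its methods are hypothetical and exist only to show why re-ingesting without a status aspect leaves a soft-deleted entity hidden, while a hard delete or an explicit removed=False makes it visible again.

```python
from dataclasses import dataclass, field

@dataclass
class ToyMetadataStore:
    """Hypothetical stand-in for a metadata store: urn -> {aspect name: value}."""
    aspects: dict = field(default_factory=dict)

    def ingest(self, urn: str, new_aspects: dict) -> None:
        # Upsert only the aspects that were emitted; others stay untouched.
        self.aspects.setdefault(urn, {}).update(new_aspects)

    def soft_delete(self, urn: str) -> None:
        # A soft delete only flips the status aspect.
        self.ingest(urn, {"status": {"removed": True}})

    def hard_delete(self, urn: str) -> None:
        # A hard delete drops every aspect of the entity.
        self.aspects.pop(urn, None)

    def is_visible(self, urn: str) -> bool:
        entity = self.aspects.get(urn)
        if entity is None:
            return False
        return not entity.get("status", {}).get("removed", False)

store = ToyMetadataStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:s3,path/cluster=dev/joinvoip,PROD)"
schema_only = {"schemaMetadata": {"fields": ["region", "rowKey", "ts", "year"]}}

store.ingest(urn, schema_only)   # first ingestion, no status aspect emitted
print(store.is_visible(urn))     # True

store.soft_delete(urn)
store.ingest(urn, schema_only)   # re-ingestion still emits no status aspect
print(store.is_visible(urn))     # False: removed=True is still in place

store.ingest(urn, {"status": {"removed": False}})
print(store.is_visible(urn))     # True: status explicitly reset

store.hard_delete(urn)
store.ingest(urn, schema_only)   # fresh entity after a hard delete
print(store.is_visible(urn))     # True
```

In this model, the source-side fix mentioned above corresponds to including a status aspect with removed=False in every ingestion run.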