Hi All, i am trying to ingest AWS s3 by following...
# troubleshoot
c
Hi all, I am trying to ingest AWS S3 by following this document https://datahubproject.io/docs/metadata-ingestion/source_docs/s3 but it isn't working. I get: UnboundLocalError: local variable 'node_urn' referenced before assignment. Below is the recipe that I am using:
source:
    type: glue
    config:
        aws_region: us-east-2
        aws_access_key_id: AKIA226GV
        aws_secret_access_key: j4EzEH12YEQLw0p4+K
        aws_session_token: null
        database_pattern:
            allow:
                - "billing"
        table_pattern:
            allow:
                - "billingtable"
sink:
    type: datahub-rest
    config:
        server: 'http://localhost:8080'
e
Hey, is this different from the other issue you posted in the ingestion channel above? We will have someone take a look soon.
h
Hey @curved-crayon-1929, can you share the metadata ingestion logs? Do you see any log lines starting with
Unrecognized Glue data object type
just before this error? I suspect the error is due to the presence of a connector that the DataHub Glue source does not recognize. I have created a small PR to fix this: https://github.com/datahub-project/datahub/pull/4667
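For readers unfamiliar with this error class, here is a minimal sketch of how an UnboundLocalError like the one above can arise in general. It is not the actual DataHub Glue source code: the function urn_for_node, the node fields, and the URN strings are all hypothetical and only illustrate the pattern of a variable that is assigned in some branches and then used unconditionally.

```python
import logging

logger = logging.getLogger(__name__)


def urn_for_node(node: dict) -> str:
    """Hypothetical helper: 'node_urn' is bound only for recognized node types."""
    if node["NodeType"] == "DataSource":
        node_urn = f"urn:example:source:{node['Id']}"
    elif node["NodeType"] == "DataSink":
        node_urn = f"urn:example:sink:{node['Id']}"
    else:
        # Only a warning is logged here; nothing is assigned to node_urn.
        logger.warning("Unrecognized Glue data object type: %s", node)
    # For an unrecognized type this line raises UnboundLocalError,
    # because node_urn was never assigned on that code path.
    return node_urn
```

Under that assumption, a warning about an unrecognized object appearing right before the traceback is what one would expect to see in the logs.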
c
Hi @hundreds-photographer-13496, yes, I got the error Unrecognized Glue data object type: {'catalog_connection': 'RedShiftCluster', 'connection_options': {'dbtable': 'y
h
Well, then the above PR should fix the error you are seeing and let the remaining ingestion continue. It still won't be able to extract lineage for the unrecognized data object, and you'll continue to see this message in the logs.
c
@hundreds-photographer-13496 thanks for the suggested changes, it worked and I am able to ingest. Change suggested (I updated the script in my local env): this might be due to the presence of custom connector nodes that DataHub does not recognize. A quick fix could be to update lines 348-349
source_node = nodes[edge["Source"]]
target_node = nodes[edge["Target"]]
with
source_node = nodes.get(edge["Source"])
target_node = nodes.get(edge["Target"])

if source_node is None or target_node is None:
    continue
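
To see the quick fix in isolation, here is a small self-contained sketch of the same guard. It assumes that nodes is a dict keyed by node id and that each edge carries "Source" and "Target" ids, as in the snippet above. The function name collect_lineage_edges, the "name" field, and the sample data are made up for illustration and are not part of the DataHub code.

```python
from typing import Dict, List, Tuple


def collect_lineage_edges(
    nodes: Dict[str, dict], edges: List[dict]
) -> Tuple[List[Tuple[str, str]], List[dict]]:
    """Resolve edges to (source, target) name pairs, skipping unknown nodes."""
    resolved = []
    skipped = []
    for edge in edges:
        # dict.get returns None instead of raising KeyError for unknown ids.
        source_node = nodes.get(edge["Source"])
        target_node = nodes.get(edge["Target"])

        if source_node is None or target_node is None:
            # Unrecognized node: no lineage for this edge, but keep going.
            skipped.append(edge)
            continue

        resolved.append((source_node["name"], target_node["name"]))
    return resolved, skipped


# Hypothetical sample data: node "n3" stands in for an unrecognized connector.
nodes = {"n1": {"name": "billing.billingtable"}, "n2": {"name": "transform"}}
edges = [{"Source": "n1", "Target": "n2"}, {"Source": "n2", "Target": "n3"}]
resolved, skipped = collect_lineage_edges(nodes, edges)
```

With this guard, the edge that touches the unknown node is dropped and the rest of the job is still processed, which matches the behaviour described earlier in the thread: ingestion continues, but lineage for the unrecognized object is not extracted.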