# troubleshoot
h
We stumbled on a strange issue with a single, very central dataset (so a lot of dependencies). Error log in thread.
So we hit some sort of `DataFetchingException`. Any idea what could be the cause?
There are also some errors like this:
```
10:30:10.120 [Thread-1180287] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to retrieve entities of type DataJob
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to retrieve entities of type DataJob
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$153(GmsGraphQLEngine.java:1376)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	... 1 common frames omitted
Caused by: java.lang.RuntimeException: Failed to batch load Data Jobs
	at com.linkedin.datahub.graphql.types.datajob.DataJobType.batchLoad(DataJobType.java:118)
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$153(GmsGraphQLEngine.java:1373)
	... 2 common frames omitted
Caused by: com.datahub.util.exception.ModelConversionException: Failed to deserialize DataMap: {"inputDatajobs":[],"inputDatasets":["urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)",
```
which continues as a long list of elements and ends as `,"urn:li:dataset`. I'm not sure if this is an artefact of the error logging or if the actual payload is cut short.
I can reproduce the issues when querying GMS directly. The response contains a `data` field and an `errors` field. The `data` field contains stuff that looks correct (as far as I can tell). The `errors` field contains a long list of errors like this:
```
{
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 185,
                    "column": 3
                }
            ],
            "path": [
                "dataset",
                "schemaMetadata"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
        {
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 189,
                    "column": 3
                }
            ],
            "path": [
                "dataset",
                "previousSchemaMetadata"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
        {
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 432,
                    "column": 5
                }
            ],
            "path": [
                "dataset",
                "incoming",
                "relationships",
                0,
                "entity"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
```
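For reference, querying GMS directly can be sketched roughly like this (a minimal sketch only: the endpoint address and dataset URN are placeholders, metadata-service authentication is assumed to be off, and the query fields mirror the ones used elsewhere in this thread):

```python
# Rough sketch of reproducing the errors by querying GMS's GraphQL endpoint directly.
# Assumptions: GMS is reachable at http://localhost:8080/api/graphql (adjust host/port),
# metadata-service authentication is disabled, and DATASET_URN is a placeholder.
import requests

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed default GMS address
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)"  # placeholder

QUERY = """
query ($urn: String!) {
  dataset(urn: $urn) {
    schemaMetadata(version: 0) { name }
    incoming: relationships(
      input: {types: ["DownstreamOf", "Consumes", "Produces"], direction: INCOMING, start: 0, count: 100}
    ) {
      start
      count
      total
      relationships { type direction entity { urn } }
    }
  }
}
"""

body = requests.post(GMS_GRAPHQL, json={"query": QUERY, "variables": {"urn": DATASET_URN}}).json()

# The response carries both a "data" and an "errors" field; print the failing paths.
for err in body.get("errors", []):
    print(err.get("path"), "->", err.get("message"))
```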
@big-carpet-38439 would you have any idea what could be going wrong here?
Looking closer at this path:
```
"path": [
                "dataset",
                "incoming",
                "relationships",
                0,
                "entity"
            ],
```
And checking the `data` field at that path, I see:
```
"incoming": {
                "start": 0,
                "count": 100,
                "total": 288,
                "relationships": [
                    {
                        "type": "Produces",
                        "direction": "INCOMING",
                        "entity": null,
                        "__typename": "EntityRelationship"
                    },
                    ...
```
The `"entity": null` part is of course quite interesting 🤔
When looking closer at the list of errors, all but the first two (copied above) are related to incoming relationships. To be exact, all relationship fetching fails in the same way (entity is null). Outgoing relationships show up fine.
So getting the schemaMetadata, previousSchemaMetadata, and the entity of the relationships fails for some reason.
When I drop the max number of items to return, for example for the incoming relationships, the entities that do get returned are no longer null, but I hit similar errors at new paths, for example:
```
"path": [
                "dataset",
                "incoming",
                "relationships",
                9,
                "entity",
                "inputOutput",
                "outputDatasets",
                0,
                "schemaMetadata"
            ],
```
So is there some limitation on Elasticsearch's side that causes problems (we're actually using OpenSearch)? Or is the problem on DataHub's side?
I can't, however, figure out how to get the schemaMetadata and previousSchemaMetadata to load. Maybe our indices are outdated?
b
I am trying to understand it
It could be that there are bad links between DataJobs and the Datasets that they produce
We've not encountered this in scaling DataHub (yet), so I'm hesitant to call it a scaling issue. It could be that the indices are somehow not up to date, returning a relationship to a node which does not exist / has been deleted. One way to rule this out is to run the restore indices job.
h
Alright, I'll try that tomorrow. Any theory about the schemaMetadata not loading?
Did a quick test with the following:
```
incoming: relationships(
      input: {types: ["DownstreamOf", "Consumes", "Produces"], direction: INCOMING, start: 0, count: 100}
```
I noticed that if I set the count to ~80 it went through, but not with 100. I then changed the start to 80 and the count to 30, and it failed. Setting the start to 100 and the count to 30 worked fine. So this seems to support your theory that it's not a size problem, but rather that there might be a bad entry between 80 and 100 in the list.
b
I think it must be how bad data is being handled
My guess is there is some edge that is ghosting
I.e. returning null, and the GraphQL query is not properly accounting for the null case. It'd be useful to narrow down to one entry with one relationship type; then we can follow the trail through GMS directly once we know the URN.
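A possible way to narrow it down to a single entry, per the suggestion above (same placeholder endpoint and URN as in the earlier sketch; the 80..100 window comes from the count experiments above):

```python
# Sketch: fetch incoming relationships one at a time (count: 1) to locate the bad edge.
# Assumptions: placeholder GMS endpoint and dataset URN; only the "Produces" type is checked.
import requests

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)"  # placeholder

QUERY = """
query ($urn: String!, $start: Int!) {
  dataset(urn: $urn) {
    incoming: relationships(
      input: {types: ["Produces"], direction: INCOMING, start: $start, count: 1}
    ) {
      relationships { type entity { urn } }
    }
  }
}
"""

for start in range(80, 100):
    body = requests.post(
        GMS_GRAPHQL,
        json={"query": QUERY, "variables": {"urn": DATASET_URN, "start": start}},
    ).json()
    dataset = (body.get("data") or {}).get("dataset") or {}
    rels = (dataset.get("incoming") or {}).get("relationships") or []
    # A GraphQL error or a null entity at this offset points at the ghosting edge.
    if body.get("errors") or (rels and rels[0]["entity"] is None):
        print("suspect offset:", start)
```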
h
Alright, I'll look into this tomorrow as well.
But the schemaMetadata is still a mystery.
b
One thought is that the schemaMetadata aspect itself is super duper large.
h
Very likely, it's a very wide table
(I'm not a GraphQL expert, so forgive me) But something like this:
```
schemaMetadata(version: 0) {
      name
      __typename
    }
```
also throws an error.
b
Aha
For a specific URN?
h
yes
@big-carpet-38439 Small update on the schemaMetadata. I checked the aspect in the GMS DB, and it seems like the aspect value (i.e. the `metadata` field) is not valid JSON. It ends with something like `..., "type":{"type":{"com.linkedin.schema.NumberType":{}}},"`. Like I said, it's a wide table, but I checked and the content is around 16380 characters long, which should fit into the `TEXT` field (Postgres).
Invalid JSON would explain the inability to parse the content. But how it got there is a good question.
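A rough way to scan the database for other aspects with the same problem (a sketch only: it assumes DataHub's default metadata_aspect_v2 table, Postgres reachable via psycopg2, a placeholder DSN, and aspect names that may need adjusting to match your rows):

```python
# Sketch: look for stored aspect values that are not valid JSON (e.g. truncated rows).
# Assumptions: default metadata_aspect_v2 table layout, placeholder Postgres DSN,
# psycopg2 installed, and aspect names that may need adjusting.
import json
import psycopg2

conn = psycopg2.connect("host=localhost dbname=datahub user=datahub password=changeme")  # placeholder
cur = conn.cursor()
cur.execute(
    """
    SELECT urn, aspect, version, length(metadata), metadata
    FROM metadata_aspect_v2
    WHERE aspect IN ('schemaMetadata', 'dataJobInputOutput')
    """
)
for urn, aspect, version, length, metadata in cur:
    try:
        json.loads(metadata)
    except (TypeError, ValueError):
        # Suspiciously round lengths (like 16384) hint at truncation somewhere along the way.
        print(f"invalid JSON: {urn} {aspect} v{version} ({length} chars)")
```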
I also identified the bad apple among the incoming relationships. The issue is similar: invalid JSON. The root cause is that our custom Airflow lineage backend has added the exact same URN (representing a bucket in S3, containing multiple files) multiple times to the input of the same DataJob. This results in the `outputDatasets` being very long, and the stored string actually has the exact same length as the problematic schemaMetadata above, 16384 characters.
@big-carpet-38439, IMO there are multiple bugs here:
• For some reason, somewhere along the pipeline, the aspect gets cut to 16384 characters (Postgres DB, Kafka ingestion). False alarm, see below.
• GMS sends invalid JSON from the `metadata` field onwards. (How does the MAE not capture this, btw?)
• The `datajobInputOutput` aspect does not clean up duplicates in the input (it can be debated whether this is a bug).
• GraphQL does not handle erroneous data/nulls gracefully.
Do you want me to make GitHub issues out of these?
@big-carpet-38439 A small update on this again. We took a look with our DBA and there was some truncation that took place when we migrated the DB before an update. So the invalid JSON was caused by the migration tool (and us not checking the logs properly). I would still argue that GraphQL's error handling, and sanity-checking the validity of the JSON before sending it onwards from GMS, would be things to consider in the future.
b
Hm
I mean we do validation when data comes in
I'm not sure if it really should be part of the system contract to deal with the expectation that some weird stuff occurred in the underlying document store. That being said, it's debatable; better to have something to tell you what's wrong than nothing, I suppose.
How are you ingesting `datajobInputOutput`? If you are using "ingest" then the contract is basically: what you put in is what you get out. The GMS layer does not have a strong opinion on the semantic meaning of the fields and thus cannot do the de-dupe.
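Since GMS passes ingested aspects through as-is, the de-duplication would have to happen on the producer side. A minimal sketch of what a custom lineage backend could do before emitting its inputs/outputs (the helper below is hypothetical, not part of DataHub or Airflow):

```python
# Hypothetical helper for a custom Airflow lineage backend: drop exact duplicate URNs
# (while preserving order) before they end up in the datajobInputOutput aspect.
from typing import Iterable, List


def dedupe_urns(urns: Iterable[str]) -> List[str]:
    """Return the URNs in their original order with exact duplicates removed."""
    seen = set()
    result = []
    for urn in urns:
        if urn not in seen:
            seen.add(urn)
            result.append(urn)
    return result


# Example: the same S3 bucket URN added once per file collapses to a single entry.
inputs = dedupe_urns([
    "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)",
])
assert inputs == ["urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)"]
```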