# troubleshoot
h
We stumbled on a strange issue with a single, very central dataset (so a lot of dependencies). Error log in thread.
So we hit some sort of `DataFetchingException`. Any idea what could be the cause?
There are also some errors like this:
```
10:30:10.120 [Thread-1180287] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:21 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to retrieve entities of type DataJob
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to retrieve entities of type DataJob
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$153(GmsGraphQLEngine.java:1376)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	... 1 common frames omitted
Caused by: java.lang.RuntimeException: Failed to batch load Data Jobs
	at com.linkedin.datahub.graphql.types.datajob.DataJobType.batchLoad(DataJobType.java:118)
	at com.linkedin.datahub.graphql.GmsGraphQLEngine.lambda$null$153(GmsGraphQLEngine.java:1373)
	... 2 common frames omitted
Caused by: com.datahub.util.exception.ModelConversionException: Failed to deserialize DataMap: {"inputDatajobs":[],"inputDatasets":["urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)",
```
which continues as a long list of elements and ends as `,"urn:li:dataset`. I'm not sure if this is an artefact of the error logging or if the actual payload is cut short.
I can reproduce the issues when querying GMS directly. The response contains a `data` field and an `errors` field. The `data` field contains stuff that looks correct (as far as I can tell). The `errors` field contains a long list of errors like this:
```
{
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 185,
                    "column": 3
                }
            ],
            "path": [
                "dataset",
                "schemaMetadata"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
        {
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 189,
                    "column": 3
                }
            ],
            "path": [
                "dataset",
                "previousSchemaMetadata"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
        {
            "message": "An unknown error occurred.",
            "locations": [
                {
                    "line": 432,
                    "column": 5
                }
            ],
            "path": [
                "dataset",
                "incoming",
                "relationships",
                0,
                "entity"
            ],
            "extensions": {
                "code": 500,
                "type": "SERVER_ERROR",
                "classification": "DataFetchingException"
            }
        },
```
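For reference, querying GMS directly can be sketched roughly like this (a minimal sketch only: the endpoint address and dataset URN are placeholders, metadata-service authentication is assumed to be off, and the query fields mirror the ones used elsewhere in this thread):

```python
# Rough sketch of reproducing the errors by querying GMS's GraphQL endpoint directly.
# Assumptions: GMS is reachable at http://localhost:8080/api/graphql (adjust host/port),
# metadata-service authentication is disabled, and DATASET_URN is a placeholder.
import requests

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed default GMS address
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)"  # placeholder

QUERY = """
query ($urn: String!) {
  dataset(urn: $urn) {
    schemaMetadata(version: 0) { name }
    incoming: relationships(
      input: {types: ["DownstreamOf", "Consumes", "Produces"], direction: INCOMING, start: 0, count: 100}
    ) {
      start
      count
      total
      relationships { type direction entity { urn } }
    }
  }
}
"""

body = requests.post(GMS_GRAPHQL, json={"query": QUERY, "variables": {"urn": DATASET_URN}}).json()

# The response carries both a "data" and an "errors" field; print the failing paths.
for err in body.get("errors", []):
    print(err.get("path"), "->", err.get("message"))
```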
@big-carpet-38439 would you have any idea what could be going wrong here?
Looking closer at this path:
```
"path": [
                "dataset",
                "incoming",
                "relationships",
                0,
                "entity"
            ],
```
And checking the `data` field at that path, I see:
```
"incoming": {
                "start": 0,
                "count": 100,
                "total": 288,
                "relationships": [
                    {
                        "type": "Produces",
                        "direction": "INCOMING",
                        "entity": null,
                        "__typename": "EntityRelationship"
                    },
                    ...
```
The `"entity": null` part is of course quite interesting 🤔
When looking closer at the list of errors, all but the first two (copied above) are related to incoming relationships. To be exact, all relationship fetching fails in the same way (entity is null). Outgoing relationships show up fine.
So getting the schemaMetadata, previousSchemaMetadata, and the entity of the relationships fails for some reason.
When I drop the max number of items to return, for example for the incoming relationships, the entities that do get returned are no longer null, but I hit similar errors at new paths, for example:
```
"path": [
                "dataset",
                "incoming",
                "relationships",
                9,
                "entity",
                "inputOutput",
                "outputDatasets",
                0,
                "schemaMetadata"
            ],
```
So is there some limitation on Elasticsearch's side that causes problems (we're actually using OpenSearch)? Or is the problem on DataHub's side?
I can't, however, figure out how to get the schemaMetadata and previousSchemaMetadata to load. Maybe our indices are outdated?
b
I am trying to understand it
It could be that there are bad links between DataJobs and the Datasets that they produce
We've not encountered this in scaling DataHub (yet), so I'm hesitant to call it a scaling issue. It could be that the indices are somehow not up to date, returning a relationship to a node which does not exist / has been deleted. One way to rule this out is to run the restore indices job.
h
Alright, I'll try that tomorrow. Any theory about the schemaMetadata not loading?
Did a quick test with the following:
```
incoming: relationships(
      input: {types: ["DownstreamOf", "Consumes", "Produces"], direction: INCOMING, start: 0, count: 100}
```
I noticed that if I set the count to ~80 it went through, but not with 100. I then changed the start to 80 and the count to 30, and it failed. Setting the start to 100 and the count to 30 worked fine. So this seems to support your theory that it's not a size problem, but rather that there might be a bad entry between 80 and 100 in the list.
b
I think it must be how bad data is being handled
My guess is there is some edge that is ghosting
I.e. returning null, and the GraphQL query is not properly accounting for the null case. It'd be useful to narrow down to one entry with one relationship type; then we can follow the trail through GMS directly once we know the URN.
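A possible way to narrow it down to a single entry, per the suggestion above (same placeholder endpoint and URN as in the earlier sketch; the 80..100 window comes from the count experiments above):

```python
# Sketch: fetch incoming relationships one at a time (count: 1) to locate the bad edge.
# Assumptions: placeholder GMS endpoint and dataset URN; only the "Produces" type is checked.
import requests

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,MYTABLE,PROD)"  # placeholder

QUERY = """
query ($urn: String!, $start: Int!) {
  dataset(urn: $urn) {
    incoming: relationships(
      input: {types: ["Produces"], direction: INCOMING, start: $start, count: 1}
    ) {
      relationships { type entity { urn } }
    }
  }
}
"""

for start in range(80, 100):
    body = requests.post(
        GMS_GRAPHQL,
        json={"query": QUERY, "variables": {"urn": DATASET_URN, "start": start}},
    ).json()
    dataset = (body.get("data") or {}).get("dataset") or {}
    rels = (dataset.get("incoming") or {}).get("relationships") or []
    # A GraphQL error or a null entity at this offset points at the ghosting edge.
    if body.get("errors") or (rels and rels[0]["entity"] is None):
        print("suspect offset:", start)
```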
h
Alright, I'll look into this tomorrow as well.
But the schemaMetadata is still a mystery.
b
One thought is that the schemaMetadata aspect itself is super duper large.
h
Very likely, it's a very wide table
(I'm not a GraphQL expert, so forgive me) But something like this:
```
schemaMetadata(version: 0) {
      name
      __typename
    }
```
also throws an error.
b
Aha
For a specific URN?
h
yes
@big-carpet-38439 Small update on the schemaMetadata. I checked the aspect in the GMS DB, and it seems like the aspect value (i.e. the `metadata` field) is not valid JSON. It ends with something like `..., "type":{"type":{"com.linkedin.schema.NumberType":{}}},"`. Like I said, it's a wide table, but I checked and the content is around 16380 characters long, which should fit into the `TEXT` field (Postgres).
Invalid JSON would explain the inability to parse the content. But how it got there is a good question.
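A rough way to scan the database for other aspects with the same problem (a sketch only: it assumes DataHub's default metadata_aspect_v2 table, Postgres reachable via psycopg2, a placeholder DSN, and aspect names that may need adjusting to match your rows):

```python
# Sketch: look for stored aspect values that are not valid JSON (e.g. truncated rows).
# Assumptions: default metadata_aspect_v2 table layout, placeholder Postgres DSN,
# psycopg2 installed, and aspect names that may need adjusting.
import json
import psycopg2

conn = psycopg2.connect("host=localhost dbname=datahub user=datahub password=changeme")  # placeholder
cur = conn.cursor()
cur.execute(
    """
    SELECT urn, aspect, version, length(metadata), metadata
    FROM metadata_aspect_v2
    WHERE aspect IN ('schemaMetadata', 'dataJobInputOutput')
    """
)
for urn, aspect, version, length, metadata in cur:
    try:
        json.loads(metadata)
    except (TypeError, ValueError):
        # Suspiciously round lengths (like 16384) hint at truncation somewhere along the way.
        print(f"invalid JSON: {urn} {aspect} v{version} ({length} chars)")
```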
I also identified the bad apple among the incoming relationships. The issue is similar: invalid JSON. The root cause is that our custom Airflow lineage backend has added the exact same URN (representing a bucket in S3, containing multiple files) multiple times to the input of the same DataJob. This results in the `outputDatasets` being very long, and the stored string actually has the exact same length as the problematic schemaMetadata above, 16384 characters.
@big-carpet-38439, IMO there are multiple bugs here:
• For some reason, somewhere along the pipeline, the aspect gets cut to 16384 characters (Postgres DB, Kafka ingestion). False alarm, see below.
• GMS sends invalid JSON from the `metadata` field onwards. (How does the MAE not capture this, btw?)
• The `datajobInputOutput` aspect does not clean up duplicates in the input (it can be debated whether this is a bug).
• GraphQL does not handle erroneous data/nulls gracefully.
Do you want me to make GitHub issues out of these?
@big-carpet-38439 A small update on this again. We took a look with our DBA and there was some truncation that took place when we migrated the DB before an update. So the invalid JSON was caused by the migration tool (and us not checking the logs properly). I would still argue that GraphQL's error handling, and sanity-checking the validity of the JSON before sending it onwards from GMS, would be things to consider in the future.
b
Hm
I mean we do validation when data comes in
I'm not sure if it really should be part of the system contract to deal with the expectation that some weird stuff occurred in the underlying document store. That being said, it's debatable; better to have something to tell you what's wrong than nothing, I suppose.
How are you ingesting `datajobInputOutput`? If you are using "ingest" then the contract is basically: what you put in is what you get out. The GMS layer does not have a strong opinion on the semantic meaning of the fields and thus cannot do the de-dupe.
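Since GMS passes ingested aspects through as-is, the de-duplication would have to happen on the producer side. A minimal sketch of what a custom lineage backend could do before emitting its inputs/outputs (the helper below is hypothetical, not part of DataHub or Airflow):

```python
# Hypothetical helper for a custom Airflow lineage backend: drop exact duplicate URNs
# (while preserving order) before they end up in the datajobInputOutput aspect.
from typing import Iterable, List


def dedupe_urns(urns: Iterable[str]) -> List[str]:
    """Return the URNs in their original order with exact duplicates removed."""
    seen = set()
    result = []
    for urn in urns:
        if urn not in seen:
            seen.add(urn)
            result.append(urn)
    return result


# Example: the same S3 bucket URN added once per file collapses to a single entry.
inputs = dedupe_urns([
    "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)",
])
assert inputs == ["urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket,PROD)"]
```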