# troubleshoot
m
Intermittent Authentication errors: Hi, we run ingestions as part of our CI pipelines using the DataHub REST API. When ingesting, we receive intermittent authentication errors; for example, our client reports:
Copy code
requests.exceptions.JSONDecodeError: [Errno Expecting value] <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 401 Unauthorized to perform this action.</title>
</head>
<body><h2>HTTP ERROR 401 Unauthorized to perform this action.</h2>
<table>
<tr><th>URI:</th><td>/entities</td></tr>
<tr><th>STATUS:</th><td>401</td></tr>
<tr><th>MESSAGE:</th><td>Unauthorized to perform this action.</td></tr>
<tr><th>SERVLET:</th><td>restliRequestHandler</td></tr>
</table>
<hr/><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.46.v20220331</a><hr/>
</body>
</html>
The server reports a missing authentication token:
Copy code
10:36:57.676 [qtp1830908236-57260] WARN c.d.a.a.AuthenticatorChain:70 - Authentication chain failed to resolve a valid authentication. Errors: [(com.datahub.authentication.authenticator.DataHubSystemAuthenticator,Failed to authenticate inbound request: Authorization header is missing 'Basic' prefix.), (com.datahub.authentication.authenticator.DataHubTokenAuthenticator,Failed to authenticate inbound request: Unable to verify the provided token.)]
This behaviour happens intermittently: some jobs succeed and others fail. We haven't changed our client or token between jobs, so I don't understand why the token is missing. We host our deployment on EKS and use MySQL as our datastore. I have checked:
• RDS database connections and system resources
• Kafka system resources
• ES system resources
None of these are under contention. I also checked the node where the frontend and gms containers are running; both have plenty of free memory and CPU time. I am wondering if this could be a bug, does anyone have any suggestions?
We were using datahub v0.8.38 but have upgraded to v0.8.40. The problem persists in v0.8.40.
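One way to take the ingestion pipeline out of the picture is to replay a request against GMS directly with the same token. Below is a minimal sketch (not from the thread) using the restli /entities resource that appears in the 401 above; the host, token, and URN are placeholders, and the exact endpoint shape can vary between DataHub versions.
Copy code
import requests
from urllib.parse import quote

GMS_HOST = "http://datahub-gms:8080"   # placeholder: your GMS endpoint
TOKEN = "<personal-access-token>"      # placeholder: the same token the CI jobs use
URN = "urn:li:corpuser:datahub"        # placeholder: any URN known to exist

# GET /entities/<url-encoded-urn> goes through the same restli handler and
# auth chain as the failing ingest calls, so a 200 means the token verified
# and a 401 reproduces the failure outside the pipeline.
resp = requests.get(
    f"{GMS_HOST}/entities/{quote(URN, safe='')}",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "X-RestLi-Protocol-Version": "2.0.0",
    },
    timeout=10,
)
print(resp.status_code, resp.reason)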
i
Dominic, is the token valid or has it expired? The error you've shown is not necessarily an error. It just means that one of the authenticators in our chain failed. Usually that means that a given request was not made by the datahub system user. We are looking into improving these auth errors. cc @big-carpet-38439
Could you share the recipes you run? Do you have metadata service authentication enabled?
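On the expiry question: DataHub personal access tokens are JWTs, so the expiry claim can be inspected locally without calling the server. A minimal stdlib-only sketch, assuming gms_token holds the same token the recipes use:
Copy code
import base64
import json
from datetime import datetime, timezone

def jwt_claims(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

claims = jwt_claims(gms_token)             # gms_token: the token the recipes use
exp = claims.get("exp")
if exp is None:
    print("token has no expiry claim")
else:
    print("token expires at", datetime.fromtimestamp(exp, tz=timezone.utc).isoformat())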
m
Hi
• GMS Authentication is enabled:
◦ METADATA_SERVICE_AUTH_ENABLED = true for both gms and frontend containers
• The token is valid
• The token is set (I assert its value)
A pipeline recipe would look like:
Copy code
config_dict = {
    "source": {
        "type": source,
        "config": config,
        **extra_config,
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": settings.DATAHUB_GMS_API_HOST,
            "token": gms_token,
        },
    },
}
The pipeline is created (which validates the config) and run:
Copy code
try:
    pipeline = Pipeline.create(config_dict=config_dict)
except ValidationError as e:
    click.echo(e, err=True)
    raise
pipeline.run()
pipeline.raise_from_status(raise_warnings=strict_warnings)
return pipeline.pretty_print_summary(warnings_as_failure=strict_warnings)
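Since the failures are intermittent and other runs succeed with the same token, one possible CI-side stopgap (a sketch, not a fix suggested in the thread) is to retry the whole create/run cycle a few times before failing the job, reusing config_dict and strict_warnings from above. DataHub ingestion is upsert-based, so re-running a recipe is safe; it just repeats work.
Copy code
import time

from datahub.ingestion.run.pipeline import Pipeline

MAX_ATTEMPTS = 3                        # arbitrary; tune for your CI

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        pipeline = Pipeline.create(config_dict=config_dict)
        pipeline.run()
        pipeline.raise_from_status(raise_warnings=strict_warnings)
        pipeline.pretty_print_summary(warnings_as_failure=strict_warnings)
        break                           # clean run, stop retrying
    except Exception:
        if attempt == MAX_ATTEMPTS:
            raise                       # out of retries; let CI see the failure
        time.sleep(30 * attempt)        # simple backoff before the next attempt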
b
Thank you! We’ve heard similar reports and are looking into this with urgency
👍 1
i
I’m actively trying to reproduce this locally as we speak
👍 1
m
We replaced our token and reran our CI; during the sample pipeline, several ingests succeeded before failing, all using the same token.
👍 1
i
Hello Dominic, out of curiosity, have you recorded when those unauthorized errors happen? Do they happen to occur at 5-minute intervals (inconsistently)?
m
Hi, it's hard to say. Initially we were able to run two successful ingests; these ran within one minute of one another. A third ingest ran about 2 minutes later and failed. We are using a token which should last 3 months.
i
How long did these ingest processes take?
m
Around 9 minutes (1:56 + 7:14 = 9:10), then a third job failed.
Same errors.
r
Curious if this got resolved @most-nightfall-36645