Hello there, anyone has experience with using a p...
# ingestion
w
Hello there, does anyone have experience using a proxy in front of DataHub with `metadata_service_authentication` enabled? I’m trying to get my recipes to use an extra header for authorization purposes. I have already confirmed against the GraphQL endpoint that my headers containing the Google IAP token and the DataHub personal access token work. Example:
curl --location --request POST 'https://example.com/api/graphql' \
  --header 'Authorization: Bearer <personal access token>' \
  --header 'Proxy-Authorization: Bearer <IAP token>' \
  --header 'Content-Type: application/json' \
  --data-raw '{"query": "{\n  me {\n    corpUser {\n        username\n    }\n  }\n}"}'
However, when trying to ingest using recipes, it seems like the emitter ignores the `extra_headers` field containing the proxy token. Example:
sink:
  type: "datahub-rest"
  config:
    server: "https://example.com:443"
    token: "<personal access token>"
    extra_headers:
      Proxy-Authorization: "Bearer <IAP token>"
Looking at the source code, it should be possible to set a custom header: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/rest_emitter.py#L82
Interestingly, the `extra_headers` field seems to work when no second (personal access) token is required and the proxy token is set as `Authorization` instead of `Proxy-Authorization`:
sink:
  type: "datahub-rest"
  config:
    server: "https://example.com:443"
    extra_headers:
      Authorization: "Bearer <IAP token>"
Of course, just setting the proxy token as `token` directly works too. I’m on v0.8.40.2. Any help is greatly appreciated! Cheers
b
@big-carpet-38439 maybe you can help us with this request? 🙂
b
Hi there. So I'm assuming you have a proxy sitting in front of DataHub and that's what the Proxy-Auth header is for?
w
Hi, yes. The Proxy-Auth header is exactly for that. It seems to work with curl, as the first example shows. However, setting the additional header for the proxy in the recipe does not.
b
And you're definitely pointing ingestion toward the proxy?
I don't see anything obvious in code
b
The `config.server` value points to our proxy server, which then redirects to the DataHub REST API. Or what do you mean? `config.token` is set to the DataHub API token and `config.extra_headers.Proxy-Authorization` to our proxy token. Or is the format of our `extra_headers` definition not correct?
b
This looks correct
In your example, you provided an extra header called `Authorization`. Just to confirm, you've replaced that with `Proxy-Authorization`?
b
Yes. The example we used and hoped would work is the first one, with `Proxy-Authorization` for our proxy, but it does not seem to get picked up correctly.
The second example was just a test to see whether `extra_headers` works at all, and it does if we do not need two tokens (but that is not our case). Would it even be possible to set `Authorization` and `Proxy-Authorization` together in `extra_headers`? What would the format look like?
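On the format question, here is a rough Python sketch of how the two headers could coexist. It is only a simplified model, not the actual emitter code: `build_headers` is a hypothetical helper, and the merge order (token first, then `extra_headers`) is an assumption. It may also explain why putting the proxy token under `Authorization` works when no personal access token is set.

```python
from typing import Dict, Optional


def build_headers(
    token: Optional[str], extra_headers: Optional[Dict[str, str]]
) -> Dict[str, str]:
    """Hypothetical helper: model merging a token with extra headers."""
    headers: Dict[str, str] = {"Content-Type": "application/json"}
    if token:
        # The personal access token goes into the standard Authorization header.
        headers["Authorization"] = f"Bearer {token}"
    if extra_headers:
        # Assumed order: extra headers are applied last, so an "Authorization"
        # key here would replace the token header set above.
        headers.update(extra_headers)
    return headers


# Two tokens: the header names differ, so both survive the merge.
both = build_headers(
    "<personal access token>", {"Proxy-Authorization": "Bearer <IAP token>"}
)

# One token: the proxy token is supplied as Authorization via extra headers.
proxy_only = build_headers(None, {"Authorization": "Bearer <IAP token>"})

print(both)
print(proxy_only)
```

Under that assumption, `Authorization` and `Proxy-Authorization` are just two separate keys, which matches the recipe format from the first example.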
s
Can you try running with the `DATAHUB_DEBUG=true` env variable @wooden-arm-26381 to see if that gives you some information about what might be going wrong here? I don't have datahub behind IAP to test this out but that env variable should print out the curl commands being used. That might help in debugging this problem further.
w
Running `datahub --debug ingest -c recipe.yaml` gives the following output:
[2022-07-27 09:09:21,073] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.40.2
[2022-07-27 09:09:21,076] DEBUG    {datahub.cli.ingest_cli:105} - Using config: {'source': {'type': 'bigquery', 'config': {'project_id': 'my-gcp-project_id', 'env': 'DEV', 'include_views': False, 'table_pattern': {'deny': ['.*']}, 'include_table_lineage': False}}, 'sink': {'type': 'datahub-rest', 'config': {'server': '<https://example.com:443>', 'token': '<personal access token>', 'extra_headers': {'Proxy-Authorization': 'Bearer <IAP token>'}}}}
Looks like the recipe got rendered correctly but I’m getting this error:
[2022-07-27 09:09:21,515] ERROR    {datahub.entrypoints:165} - Unable to connect to <https://example.com:443/config> with status_code: 401. Maybe you need to set up authentication? Please check your configuration and make sure you are talking to the DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms).
I’m using the same URL as without metadata authentication. But accessing GraphQL with activated authentication via curl still worked.
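To narrow down where the 401 comes from, the `/config` check from the error message could be reproduced outside the CLI. This is a minimal sketch using only the Python standard library. `build_config_request` and `probe` are hypothetical helper names, and the server and tokens are the placeholders from the recipe.

```python
import urllib.error
import urllib.request


def build_config_request(
    server: str, pat: str, iap_token: str
) -> urllib.request.Request:
    """Hypothetical helper: build a GET to <server>/config with both tokens."""
    url = server.rstrip("/") + "/config"
    headers = {
        "Authorization": f"Bearer {pat}",  # DataHub personal access token
        "Proxy-Authorization": f"Bearer {iap_token}",  # proxy (IAP) token
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def probe(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code is still what we want.
        return err.code


request = build_config_request(
    "https://example.com:443", "<personal access token>", "<IAP token>"
)
print(request.full_url)
# probe(request) would perform the actual network call and return e.g. 200 or 401.
```

Comparing the status returned for the recipe's `server` value with the status for the URL used in the working curl call might show whether the two point at the same endpoint behind the proxy.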
s
When you are using curl for graphql (which is working) are you still going through the proxy or directly connecting to datahub without proxy?
w
In this case, there is no way to access the DataHub instance without going/authenticating through the proxy first, even when using curl.
@square-activity-64562 There was a configuration mistake with our proxy and the URL endpoint used in the recipes. All seems to work just fine right now. Sorry for the inconvenience.
s
Can you please share the fix so other people in the community can solve the same problem if they encounter it with Google IAP?