magnificent-kangaroo-91705
07/14/2022, 8:35 AM
steep-vr-39297
07/14/2022, 9:00 AM
gifted-knife-16120
07/14/2022, 10:36 AM
careful-pilot-86309
07/14/2022, 11:16 AM
wooden-arm-26381
07/14/2022, 1:45 PM
metadata_service_authentication?
I’m trying to get my recipes to use an extra header for authorization purposes. I have already confirmed against the GraphQL endpoint that my headers containing the Google IAP token and the DataHub personal access token work. Example:
curl --location --request POST 'https://example.com/api/graphql' \
  --header 'Authorization: Bearer <personal access token>' \
  --header 'Proxy-Authorization: Bearer <IAP token>' \
  --header 'Content-Type: application/json' \
  --data-raw '{"query": "{\n me {\n corpUser {\n username\n }\n }\n}"}'
However, when trying to ingest using recipes, it seems like the emitter ignores the extra_headers field containing the proxy token. Example:
sink:
  type: "datahub-rest"
  config:
    server: "https://example.com:443"
    token: "<personal access token>"
    extra_headers:
      Proxy-Authorization: "Bearer <IAP token>"
Looking at the source code, it should be possible to set a custom header: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/rest_emitter.py#L82
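A minimal sketch, assuming the v0.8.x Python emitter API, that builds the REST emitter directly with both tokens; it can help confirm whether extra_headers actually reach the underlying HTTP session:

from datahub.emitter.rest_emitter import DatahubRestEmitter

# The personal access token becomes the Authorization header; the IAP token
# is added on top via extra_headers.
emitter = DatahubRestEmitter(
    gms_server="https://example.com:443",
    token="<personal access token>",
    extra_headers={"Proxy-Authorization": "Bearer <IAP token>"},
)
emitter.test_connection()  # should fail fast if either header is missing or rejected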
Interestingly, the extra_headers field seems to work when no second (personal access) token is required and the proxy token is set as Authorization instead of Proxy-Authorization:
sink:
  type: "datahub-rest"
  config:
    server: "https://example.com:443"
    extra_headers:
      Authorization: "Bearer <IAP token>"
Of course, just setting the proxy token directly as token works too.
I’m on v0.8.40.2.
Any help greatly appreciated!
Cheers
kind-whale-32412
07/14/2022, 3:51 PM
Failed to configure source (superset) due to 'access_token'
My config looks like this:
source:
  type: superset
  config:
    # Coordinates
    connect_uri: http://localhost:8188
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
The question is: where do I put access_token? If I try to add it as source.config.access_token, it errors out with:
Failed to configure source (superset) due to 1 validation error for SupersetConfig
access_token
extra fields not permitted (type=value_error.extra)
I could not see any access_token field in the schema either.
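One hedged way to see which fields the superset source actually accepts in a given CLI version is to print the pydantic schema of its config model (assuming the module path below matches the installed version):

# Lists the allowed config fields for the superset source in the installed
# CLI version; field names can differ between releases.
from datahub.ingestion.source.superset import SupersetConfig

print(SupersetConfig.schema_json(indent=2))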
faint-television-78785
07/15/2022, 2:44 AM
postgres database, could output the URN (urn:li:dataPlatform:postgres,main.public.customers,PROD). How do I handle this overlap in DataHub so I can remember which datasets come from which postgres cluster?
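One hedged option, assuming a CLI version with platform_instance support: give each cluster its own platform instance (in the recipe or when building URNs), so the two customers tables no longer map to the same URN. The instance names below are hypothetical:

import datahub.emitter.mce_builder as builder

# Hypothetical instance names for the two clusters; each cluster's
# main.public.customers now gets a distinct dataset URN.
urn_a = builder.make_dataset_urn_with_platform_instance(
    platform="postgres",
    name="main.public.customers",
    platform_instance="cluster-a",
    env="PROD",
)
urn_b = builder.make_dataset_urn_with_platform_instance(
    platform="postgres",
    name="main.public.customers",
    platform_instance="cluster-b",
    env="PROD",
)
print(urn_a)
print(urn_b)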
straight-policeman-77814
07/15/2022, 7:02 AM
better-orange-49102
07/15/2022, 10:25 AM
sticky-twilight-17476
07/15/2022, 11:12 AM
sticky-twilight-17476
07/15/2022, 11:13 AM
{
  "entity": {
    "value": {
      "com.linkedin.domain": {
        "aspects": [
          {
            "domainProperties": {
              "name": "Facilities",
              "description": "The facilities domain"
            }
          }
        ],
        "urn": "urn:li:domain:Facilities"
      }
    }
  }
}
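For comparison, a hedged Python sketch that upserts the same domainProperties aspect through the REST emitter instead of the raw /entities endpoint (assumes the v0.8.x emitter and schema classes, and a GMS at the placeholder address below):

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass

# Same name and description as the JSON payload above.
mcp = MetadataChangeProposalWrapper(
    entityType="domain",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:domain:Facilities",
    aspectName="domainProperties",
    aspect=DomainPropertiesClass(name="Facilities", description="The facilities domain"),
)
DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)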
sticky-twilight-17476
07/15/2022, 11:14 AM
silly-ice-4153
07/15/2022, 3:13 PM
File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'xxx:8080/entities?action=ingest'
I'm using the following code for the connection. In the Connection UI I put my hostname and :8080 for the host; the port is open (see also the sketch after the DAG below).
from datetime import timedelta

from airflow import DAG

try:
    from airflow.operators.bash import BashOperator
except ModuleNotFoundError:
    from airflow.operators.bash_operator import BashOperator

from airflow.utils.dates import days_ago

import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email": ["jdoe@example.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=120),
}

with DAG(
    "datahub_lineage_emission_example",
    default_args=default_args,
    description="An example DAG demonstrating lineage emission within an Airflow DAG.",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    catchup=False,
) as dag:
    # This example shows a BashOperator followed by a lineage emission. However, the
    # same DatahubEmitterOperator can be used to emit lineage in any context.
    transformation_task = BashOperator(
        task_id="bash_test",
        dag=dag,
        bash_command="echo 'This is where you might run your data tooling.'",
    )

    emit_lineage_task = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[
            builder.make_lineage_mce(
                upstream_urns=[
                    builder.make_dataset_urn("postgres", "postgres.zoom.events"),
                ],
                downstream_urn=builder.make_dataset_urn(
                    "postgres", "postgres.zoom.events"
                ),
            )
        ],
    )

    transformation_task >> emit_lineage_task
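A hedged note on the InvalidSchema error above: requests saw 'xxx:8080/entities?action=ingest' with no scheme, which usually means the connection host was entered without http:// or https://. One way to check what the operator's connection resolves to (connection id datahub_rest_default, as in the DAG):

# The host on the datahub_rest_default connection should carry a scheme,
# e.g. http://xxx, not just the bare hostname.
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("datahub_rest_default")
print(conn.host, conn.port)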
colossal-needle-73093
07/16/2022, 9:15 AM
lemon-zoo-63387
07/16/2022, 10:51 AM
mysterious-nail-70388
07/18/2022, 2:58 AM
lemon-zoo-63387
07/18/2022, 3:23 AM
wonderful-egg-79350
07/18/2022, 6:39 AM
stocky-midnight-78204
07/18/2022, 9:32 AM
stocky-midnight-78204
07/18/2022, 9:32 AM
stocky-midnight-78204
07/18/2022, 9:32 AM
late-bear-87552
07/18/2022, 9:44 AM
Source (mysql) report:
{'workunits_produced': 13,
'workunit_ids': ['container-info-none-urn:li:container:572998b031769da2cb678f19608a921f',
'container-platforminstance-none-urn:li:container:572998b031769da2cb678f19608a921f',
'container-subtypes-none-urn:li:container:572998b031769da2cb678f19608a921f',
'container-info-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
'container-platforminstance-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
'container-subtypes-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
'container-parent-container-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98-urn:li:container:572998b031769da2cb678f19608a921f',
'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.tableA,PROD)',
'audience_manager.tableA',
'audience_manager.tableA-subtypes',
'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.task_events,PROD)',
'audience_manager.task_events',
'audience_manager.task_events-subtypes'],
'warnings': {},
'failures': {},
'cli_version': '0.8.38',
'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
'py_version': '3.9.13 (main, May 24 2022, 21:28:44) \n[Clang 13.0.0 (clang-1300.0.29.30)]',
'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
'os_details': 'macOS-11.6.2-x86_64-i386-64bit',
'tables_scanned': 2,
'views_scanned': 0,
'entities_profiled': 0,
'filtered': ['information_schema.*', 'datahub.*', 'mysql.*', 'performance_schema.*', 'sys.*'],
'soft_deleted_stale_entities': [],
'query_combiner': None}
Sink (datahub-kafka) report:
{'records_written': 13,
'warnings': [],
'failures': [],
'downstream_start_time': None,
'downstream_end_time': None,
'downstream_total_latency_in_seconds': None}
Pipeline finished successfully producing 13 workunits
[2022-07-18 15:11:21,048] WARNING {urllib3.connectionpool:810} - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b5b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
[2022-07-18 15:11:25,049] WARNING {urllib3.connectionpool:810} - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b6a0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
[2022-07-18 15:11:33,055] WARNING {urllib3.connectionpool:810} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b8b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
square-hair-99480
07/18/2022, 4:17 PM
sparse-barista-40860
07/18/2022, 6:16 PM
sparse-barista-40860
07/18/2022, 6:16 PM
sparse-barista-40860
07/18/2022, 6:17 PM
refined-lizard-83096
07/18/2022, 6:46 PM
bigquery_project_map, which will just have project mappings between the symbolic link and the shared datasets. As an example:
bigquery_project_map:
  looker-1: prd-1
  looker-2: prd-2
  looker-3: prd-3
Is this something that you would be open to us adding? cc: @plain-farmer-27314
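A hedged sketch of what the proposed mapping could look like in ingestion code; bigquery_project_map is the proposal above, not an existing DataHub config option, and the project ids are examples:

import datahub.emitter.mce_builder as builder

# Translate the symbolic (linked) project id to the real shared project
# before building the dataset URN; fall back to the original id otherwise.
bigquery_project_map = {
    "looker-1": "prd-1",
    "looker-2": "prd-2",
    "looker-3": "prd-3",
}

def resolve_project(project_id: str) -> str:
    return bigquery_project_map.get(project_id, project_id)

urn = builder.make_dataset_urn("bigquery", f"{resolve_project('looker-1')}.my_dataset.my_table")
print(urn)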
sparse-barista-40860
07/18/2022, 7:52 PM
sparse-barista-40860
07/19/2022, 1:28 AM
sparse-barista-40860
07/19/2022, 1:28 AM