# ingestion
  • m

    magnificent-kangaroo-91705

    07/14/2022, 8:35 AM
    Hi! I am new here, testing possibilities with DataHub. With ingestion and profiling I have a problem. My recipe looks like this:
    Copy code
    source:
      type: mssql
      config:
        env: dev
        username: datahubproject
        password: supersecret
        database: ShopfloorMgmt
        host_port: 'host:1433'
        profiling:
          enabled: true
    The ingestion works, but I don't see any stats. In the logs there is the following - it says profiling is done for 99 tables, but there is a bunch of error messages right before:
    Copy code
    AttributeError: 'CreateColumn' object has no attribute 'name'
    [2022-07-14 07:42:42,478] ERROR {datahub.utilities.sqlalchemy_query_combiner:250} - Failed to execute query normally, using fallback: INSERT INTO [#ge_temp_95eb8c63] (condition) SELECT CASE WHEN (1 = 1 AND [BillingDocumentCategory] IS NOT NULL) THEN %(param_1)s ELSE %(param_2)s END AS condition
    FROM dbo.[SalesDocumentItems]
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 111, in get_query_columns
        inner_columns = list(query.inner_columns)
    AttributeError: 'Insert' object has no attribute 'inner_columns'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 246, in _sa_execute_fake
        handled, result = self._handle_execute(conn, query, args, kwargs)
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 211, in _handle_execute
        if not self.is_single_row_query_method(query):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 220, in _is_single_row_query_method
        query_columns = get_query_columns(query)
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 114, in get_query_columns
        return list(query.columns)
    AttributeError: 'Insert' object has no attribute 'columns'
    [2022-07-14 07:43:00,878] INFO {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling WMITShopfloorMgmt.dbo.SalesDocumentItems; took 71.096 seconds
    [2022-07-14 07:43:00,971] INFO {datahub.ingestion.source.ge_data_profiler:776} - Profiling 99 table(s) finished in 117.176 seconds
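    The stack traces above come from the profiler's SQLAlchemy query combiner falling back to per-query execution; profiling can still finish, but if the stats never show up it may be worth ruling the combiner out. A minimal sketch of the same recipe run programmatically with the combiner disabled - this assumes the profiling option query_combiner_enabled exists in the installed version, and the host/credentials are the placeholders from the recipe above:
    Copy code
    # Hedged sketch: same mssql recipe, run via the programmatic Pipeline API,
    # with the profiler's query combiner switched off as a possible workaround.
    # `query_combiner_enabled` is assumed to exist in this datahub version.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mssql",
                "config": {
                    "env": "dev",
                    "username": "datahubproject",
                    "password": "supersecret",
                    "database": "ShopfloorMgmt",
                    "host_port": "host:1433",
                    "profiling": {
                        "enabled": True,
                        "query_combiner_enabled": False,  # assumed option
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()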
  • s

    steep-vr-39297

    07/14/2022, 9:00 AM
    I have a question. Is there any way for JDBC connection in the hive recipe setting?
  • g

    gifted-knife-16120

    07/14/2022, 10:36 AM
    Hi there, may I know whether we can read the information_schema (postgres platform) to generate lineage? For now, I need to manually create a script to generate lineage at the table-to-table (same dataset) level.
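    A minimal sketch of the kind of manual script mentioned above: it reads view-to-table dependencies from Postgres information_schema.view_table_usage and emits them as lineage with the Python REST emitter. The connection string, GMS address and the "<db>.<schema>.<table>" naming convention are placeholder assumptions.
    Copy code
    # Hedged sketch: derive table-to-table (view -> base table) lineage from
    # information_schema and push it to DataHub. Connection details and the
    # dataset naming are assumptions.
    import psycopg2

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    conn = psycopg2.connect("host=localhost dbname=shop user=datahub password=secret")

    with conn.cursor() as cur:
        # Each row says: this view selects from this base table.
        cur.execute(
            "SELECT view_schema, view_name, table_schema, table_name "
            "FROM information_schema.view_table_usage"
        )
        for view_schema, view_name, table_schema, table_name in cur.fetchall():
            upstream = builder.make_dataset_urn(
                "postgres", f"shop.{table_schema}.{table_name}", "PROD"
            )
            downstream = builder.make_dataset_urn(
                "postgres", f"shop.{view_schema}.{view_name}", "PROD"
            )
            emitter.emit_mce(builder.make_lineage_mce([upstream], downstream))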
  • c

    careful-pilot-86309

    07/14/2022, 11:16 AM
    If you know the urn of the dataset and have access to the CLI, you can use https://datahubproject.io/docs/cli#get
  • w

    wooden-arm-26381

    07/14/2022, 1:45 PM
    Hello there, does anyone have experience with using a proxy in front of DataHub together with enabled
    metadata_service_authentication
    ? I’m trying to get my recipes to use an extra header for authorization purposes. I could already confirm with the GraphQL endpoint that my headers containing the Google IAP token and the DataHub personal access token work. Example:
    Copy code
    curl --location --request POST 'https://example.com/api/graphql' \
      --header 'Authorization: Bearer <personal access token>' \
      --header 'Proxy-Authorization: Bearer <IAP token>' \
      --header 'Content-Type: application/json' \
      --data-raw '{"query": "{\n  me {\n    corpUser {\n        username\n    }\n  }\n}"}'
    However, when trying to ingest using recipes, it seems like the emitter ignores the
    extra_headers
    field containing the proxy token. Example:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "https://example.com:443"
        token: "<personal access token>"
        extra_headers:
          Proxy-Authorization: "Bearer <IAP token>"
    Looking at the source code, it should be possible to set a custom header: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/rest_emitter.py#L82 Interestingly, the
    extra_headers
    field seems to work when no second (personal access) token is required and the proxy token is set as
    Authorization
    instead of `Proxy-Authorization`:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "https://example.com:443"
        extra_headers:
          Authorization: "Bearer <IAP token>"
    Of course, just setting the proxy token as
    token
    directly works too. I’m on v0.8.40.2. Any help greatly appreciated! Cheers
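    To narrow down whether the header is dropped by the sink or by the proxy, it may help to exercise the underlying REST emitter directly, since the datahub-rest sink wraps it. A minimal sketch, assuming the v0.8.40 emitter keyword arguments; the URL and tokens are the placeholders from above:
    Copy code
    # Hedged sketch: call the REST emitter directly with both tokens and see
    # whether the Proxy-Authorization header makes it through the proxy.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="https://example.com:443",
        token="<personal access token>",
        extra_headers={"Proxy-Authorization": "Bearer <IAP token>"},
    )
    emitter.test_connection()  # fetches /config through the proxy, raises on failure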
  • k

    kind-whale-32412

    07/14/2022, 3:51 PM
    Hi all, I am trying to test datahub with superset ingestion and I'm getting this error:
    Failed to configure source (superset) due to 'access_token'
    My config looks like this:
    Copy code
    source:
      type: superset
      config:
        # Coordinates
        connect_uri: http://localhost:8188
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    The question is, where do I put access_token? If I try to add it to source.config.access_token then it's erroring out as:
    Failed to configure source (superset) due to 1 validation error for SupersetConfig
    access_token
    extra fields not permitted (type=value_error.extra)
    I could not see any access_token field in the schema either.
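    For reference, the superset source appears to fetch the access token itself by logging in, which is why there is no access_token field; the recipe carries credentials instead. A hedged sketch run programmatically - the username/password/provider field names are assumptions based on the source's config model, and the values are placeholders:
    Copy code
    # Hedged sketch: superset source configured with login credentials; the
    # source obtains the access token on its own. Field names are assumptions.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "superset",
                "config": {
                    "connect_uri": "http://localhost:8188",
                    "username": "<superset user>",      # assumed field
                    "password": "<superset password>",  # assumed field
                    "provider": "db",                   # assumed field
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()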
  • f

    faint-television-78785

    07/15/2022, 2:44 AM
    During ingestion, how would I go about iterating through the Datasets that all came from one Postgres cluster? I.e. the Datasets reflect the Tables, but I want to be able to update all necessary tables during ingestion with one script. The main issue I see is that another postgres cluster, with its
    postgres
    database, could output the same URN (urn:li:dataPlatform:postgres,main.public.customers,PROD). How do I handle this overlap in DataHub so I can remember which Datasets come from which postgres cluster?
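    One way to keep the two clusters apart is a per-cluster platform instance, which then becomes part of each dataset URN. A minimal sketch, assuming the platform_instance recipe option and the matching URN helper are available in the installed version; the instance names are placeholders:
    Copy code
    # Hedged sketch: the same table name on two clusters yields two distinct
    # URNs once a platform instance is attached. Instance names are placeholders.
    import datahub.emitter.mce_builder as builder

    urn_cluster_a = builder.make_dataset_urn_with_platform_instance(
        platform="postgres",
        name="main.public.customers",
        platform_instance="cluster-a",
        env="PROD",
    )
    urn_cluster_b = builder.make_dataset_urn_with_platform_instance(
        platform="postgres",
        name="main.public.customers",
        platform_instance="cluster-b",
        env="PROD",
    )
    print(urn_cluster_a)
    print(urn_cluster_b)
    Setting platform_instance in each cluster's recipe should give the ingested tables the same kind of distinct URNs.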
  • s

    straight-policeman-77814

    07/15/2022, 7:02 AM
    Can you help me? While ingesting metadata in DataHub I am facing an issue.
  • b

    better-orange-49102

    07/15/2022, 10:25 AM
    I'm wondering if there's any interest in the elasticsearch connector being able to read from index templates as well; my org's daily ES indices are generated using these templates, and we're not keen to show thousands of indices in DataHub that are similar to each other and (in my case) only differ by date. I can put up a PR for this.
  • s

    sticky-twilight-17476

    07/15/2022, 11:12 AM
    hi all, I'm trying to ingest new entities using the REST.li API. I've tried the examples included in the Guides section of the DataHub docs and I managed to ingest a new custom Data Platform, but I am now trying to add other entities such as a new Domain. I've checked the schema of a Domain in the docs but I don't know how to fill some fields of the JSON payload for the POST endpoint (<server>:8080/entities?action=ingest). See below my current version of the payload:
  • s

    sticky-twilight-17476

    07/15/2022, 11:13 AM
    Copy code
    {
      "entity": {
        "value": {
          "com.linkedin.domain": {
            "aspects": [
              {
                "domainProperties": {
                  "name": "Facilities",
                  "description": "The facilities domain"
                }
              }
            ],
            "urn": "urn:li:domain:Facilities"
          }
        }
      }
    }
  • s

    sticky-twilight-17476

    07/15/2022, 11:14 AM
    As you can see above, I'm trying with com.linkedin.domain and domainProperties for the value and the aspects, but it doesn't work. Are there any examples of the payloads required to create new entities through the REST.li API? Thx!
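    One alternative that sidesteps hand-crafting the REST.li snapshot payload is emitting the domainProperties aspect as a metadata change proposal from Python. A minimal sketch, assuming the aspect class and REST emitter shipped with the installed datahub package; the server address is a placeholder:
    Copy code
    # Hedged sketch: create/update a Domain by emitting its domainProperties
    # aspect as a MetadataChangeProposal instead of a REST.li snapshot payload.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="domain",
        entityUrn="urn:li:domain:Facilities",
        changeType=ChangeTypeClass.UPSERT,
        aspectName="domainProperties",
        aspect=DomainPropertiesClass(
            name="Facilities",
            description="The facilities domain",
        ),
    )
    emitter.emit_mcp(mcp)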
  • s

    silly-ice-4153

    07/15/2022, 3:13 PM
    Hello all again, I'm debugging an Airflow connection issue with DataHub. I get the following error:
    Copy code
    File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 742, in get_adapter
        raise InvalidSchema("No connection adapters were found for {!r}".format(url))
    requests.exceptions.InvalidSchema: No connection adapters were found for 'xxx:8080/entities?action=ingest'
    I'm using the following code for the connection - in the Connection UI I put my hostname and :8080 for the host, and the port is open
    Copy code
    from datetime import timedelta
    
    from airflow import DAG
    try:
      from airflow.operators.bash import BashOperator
    except ModuleNotFoundError:
      from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago
    
    import datahub.emitter.mce_builder as builder
    from datahub_provider.operators.datahub import DatahubEmitterOperator
    
    default_args = {
      "owner": "airflow",
      "depends_on_past": False,
      "email": ["<mailto:jdoe@example.com|jdoe@example.com>"],
      "email_on_failure": False,
      "email_on_retry": False,
      "retries": 1,
      "retry_delay": timedelta(minutes=5),
      "execution_timeout": timedelta(minutes=120),
    }
    with DAG(
      "datahub_lineage_emission_example",
      default_args=default_args,
      description="An example DAG demonstrating lineage emission within an Airflow DAG.",
      schedule_interval=timedelta(days=1),
      start_date=days_ago(2),
      catchup=False,
    ) as dag:
      # This example shows a SnowflakeOperator followed by a lineage emission. However, the
      # same DatahubEmitterOperator can be used to emit lineage in any context.
    
      transformation_task = BashOperator(
        task_id="bash_test",
        dag=dag,
        bash_command="echo 'This is where you might run your data tooling.'",
      )
    
      emit_lineage_task = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[
          builder.make_lineage_mce(
            upstream_urns=[
              builder.make_dataset_urn("postgres", "postgres.zoom.events"),
            ],
            downstream_urn=builder.make_dataset_urn(
              "postgres", "postgres.zoom.events"
            ),
          )
        ],
      )
    
      transformation_task >> emit_lineage_task
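    The InvalidSchema error usually means the host stored in the Airflow connection has no http:// or https:// scheme, so requests sees 'xxx:8080/...' instead of a full URL. A minimal sketch of registering the connection with the scheme included - the connection id matches the DAG above, the host is a placeholder, and the datahub_rest connection type is assumed to come from the installed provider:
    Copy code
    # Hedged sketch: store the DataHub REST connection with a full URL as the
    # host, so requests gets a scheme. Host name is a placeholder.
    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id="datahub_rest_default",
        conn_type="datahub_rest",   # assumed type registered by the provider
        host="http://xxx:8080",     # include http://, not just "xxx:8080"
    )

    session = settings.Session()
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
        session.commit()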
  • c

    colossal-needle-73093

    07/16/2022, 9:15 AM
    Hello, how do I fetch sample data in a customized transformer?
  • l

    lemon-zoo-63387

    07/16/2022, 10:51 AM
    Hello everyone, in addition to the following two methods, are there any other ways to create lineage, such as reading stored procedures when ingesting metadata? Thanks in advance for your help. https://datahubproject.io/docs/lineage/sample_code https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/
  • m

    mysterious-nail-70388

    07/18/2022, 2:58 AM
    Hello, I have a question. I once ingested metadata but had accidentally shut down the schema-registry container without noticing, and then continued running the DataHub client to ingest metadata, which resulted in an exception. However, when I started the schema-registry container again and ran the same YML file, there was no exception in the DataHub client or in GMS, but no data source information appeared in DataHub. When I modify the instance name, the ingested data source information does appear. I want to know how I can ingest metadata normally without modifying the YML file.
  • l

    lemon-zoo-63387

    07/18/2022, 3:23 AM
    Hello everyone, how do I create these lineages from the demonstration case? Postgres does not have this config (include_table_lineage: true). I want to create lineages automatically when ingesting metadata. Thanks for your help https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:postgres,jaffle_shop.dbt_jaffle.orders,PROD)/?is_lineage_mode=true https://datahubproject.io/docs/generated/ingestion/sources/postgres
  • w

    wonderful-egg-79350

    07/18/2022, 6:39 AM
    Hi all! I have a question about the 'csv-enricher' module. When I started to ingest a specific glossary term using the 'csv-enricher' module, it didn't work. In the picture below, you can see the error messages.
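    In case it helps narrow the error down, a hedged sketch of a minimal csv-enricher run; the filename and write_semantics field names are assumptions based on the source's config model, and the CSV path is a placeholder:
    Copy code
    # Hedged sketch: minimal csv-enricher run. Config field names are
    # assumptions; the CSV path is a placeholder.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "csv-enricher",
                "config": {
                    "filename": "./glossary_terms.csv",  # placeholder path
                    "write_semantics": "PATCH",          # assumed: PATCH or OVERRIDE
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()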
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    I faced this issue:
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    Caused by: org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    Do you know what the root cause is and how to fix it?
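    The org.apache.kafka stack frame suggests the exception is raised by a JVM component (GMS or a consumer job), where the bootstrap address typically comes from the KAFKA_BOOTSTRAP_SERVER environment variable in the Docker setup and must resolve from inside that container. If it instead shows up on the ingestion side, the datahub-kafka sink's connection block is the place to check; a minimal sketch with assumed field names and placeholder hosts:
    Copy code
    # Hedged sketch: datahub-kafka sink with an explicit, resolvable bootstrap
    # address and schema registry. Hosts are placeholders; field names are
    # assumptions based on the Kafka sink's config model.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {"type": "mysql", "config": {"host_port": "localhost:3306"}},
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    },
                },
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()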
  • l

    late-bear-87552

    07/18/2022, 9:44 AM
    Copy code
    Source (mysql) report:
    {'workunits_produced': 13,
     'workunit_ids': ['container-info-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-platforminstance-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-subtypes-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-info-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-platforminstance-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-subtypes-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-parent-container-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.tableA,PROD)',
                      'audience_manager.tableA',
                      'audience_manager.tableA-subtypes',
                      'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.task_events,PROD)',
                      'audience_manager.task_events',
                      'audience_manager.task_events-subtypes'],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.38',
     'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
     'py_version': '3.9.13 (main, May 24 2022, 21:28:44) \n[Clang 13.0.0 (clang-1300.0.29.30)]',
     'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
     'os_details': 'macOS-11.6.2-x86_64-i386-64bit',
     'tables_scanned': 2,
     'views_scanned': 0,
     'entities_profiled': 0,
     'filtered': ['information_schema.*', 'datahub.*', 'mysql.*', 'performance_schema.*', 'sys.*'],
     'soft_deleted_stale_entities': [],
     'query_combiner': None}
    Sink (datahub-kafka) report:
    {'records_written': 13,
     'warnings': [],
     'failures': [],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None}
    
    Pipeline finished successfully producing 13 workunits
    [2022-07-18 15:11:21,048] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b5b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
    [2022-07-18 15:11:25,049] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b6a0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
    [2022-07-18 15:11:33,055] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b8b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
  • s

    square-hair-99480

    07/18/2022, 4:17 PM
    Is there a way to see the logs for an ingestion executed from the UI during the ingestion process? Currently it seems I can only access them once the ingestion has succeeded or failed.
  • s

    sparse-barista-40860

    07/18/2022, 6:16 PM
    How can I add a new folder?
  • s

    sparse-barista-40860

    07/18/2022, 6:16 PM
    for example a new one: "dataudea"
  • s

    sparse-barista-40860

    07/18/2022, 6:17 PM
    or something like that
  • r

    refined-lizard-83096

    07/18/2022, 6:46 PM
    Hey DataHub team. We're looking to possibly introduce a new optional config to the Looker ingestion plugin to allow us to handle "symbolic links" in BigQuery. For further context, for the purposes of Looker consumption, we recently introduced linked datasets in our BigQuery prod instances, which are basically read-only datasets that serve as a symbolic link to a shared dataset. Thus, right now, the tables in our shared datasets aren't displaying their downstream Looker dashboards properly. We were thinking of introducing an optional config called, say
    bigquery_project_map
    , which will just have project mappings between the symbolic link and the shared datasets. As an example:
    Copy code
    bigquery_project_map:
        looker-1: prd-1
        looker-2: prd-2
        looker-3: prd-3
    Is this something that you would be open to us adding? cc: @plain-farmer-27314
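    To make the proposal concrete, a rough sketch of the remapping described above; the mapping, the helper name and the table naming are hypothetical, and this is not existing connector behavior:
    Copy code
    # Hypothetical sketch of the proposed bigquery_project_map: rewrite the
    # project id of a table referenced by Looker to the shared (prod) project
    # before building the dataset URN. Not existing connector behavior.
    import datahub.emitter.mce_builder as builder

    bigquery_project_map = {
        "looker-1": "prd-1",
        "looker-2": "prd-2",
        "looker-3": "prd-3",
    }

    def remap_table(table: str) -> str:
        """Map 'looker-1.dataset.table' -> 'prd-1.dataset.table'."""
        project, rest = table.split(".", 1)
        return f"{bigquery_project_map.get(project, project)}.{rest}"

    print(builder.make_dataset_urn("bigquery", remap_table("looker-1.sales.orders"), "PROD"))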
  • s

    sparse-barista-40860

    07/18/2022, 7:52 PM
    Second question today: how can I ingest HDFS in Hadoop?
  • s

    sparse-barista-40860

    07/19/2022, 1:28 AM
    Then HDFS is not compatible yet
  • s

    sparse-barista-40860

    07/19/2022, 1:28 AM
    Ok