# ingestion
  • m

    magnificent-kangaroo-91705

    07/14/2022, 8:35 AM
    Hi! I am new here, testing possibilities with DataHub. With ingestion and profiling I have a problem. My recipe looks like this:
    Copy code
    source:
      type: mssql
      config:
        env: dev
        username: datahubproject
        password: supersecret
        database: ShopfloorMgmt
        host_port: 'host:1433'
        profiling:
          enabled: true
    The ingestion works, but I don't see any stats. In the logs there is the following - it says profiling is done for 99 tables, but there is a bunch of error messages right before:
    Copy code
    AttributeError: 'CreateColumn' object has no attribute 'name'
    [2022-07-14 07:42:42,478] ERROR {datahub.utilities.sqlalchemy_query_combiner:250} - Failed to execute query normally, using fallback: INSERT INTO [#ge_temp_95eb8c63] (condition) SELECT CASE WHEN (1 = 1 AND [BillingDocumentCategory] IS NOT NULL) THEN %(param_1)s ELSE %(param_2)s END AS condition
    FROM dbo.[SalesDocumentItems]
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 111, in get_query_columns
        inner_columns = list(query.inner_columns)
    AttributeError: 'Insert' object has no attribute 'inner_columns'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 246, in _sa_execute_fake
        handled, result = self._handle_execute(conn, query, args, kwargs)
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 211, in _handle_execute
        if not self.is_single_row_query_method(query):
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 220, in _is_single_row_query_method
        query_columns = get_query_columns(query)
      File "/tmp/datahub/ingest/venv-a4978ec6-eceb-460e-94e9-7068140d0b35/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 114, in get_query_columns
        return list(query.columns)
    AttributeError: 'Insert' object has no attribute 'columns'
    [2022-07-14 07:43:00,878] INFO {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling WMITShopfloorMgmt.dbo.SalesDocumentItems; took 71.096 seconds
    [2022-07-14 07:43:00,971] INFO {datahub.ingestion.source.ge_data_profiler:776} - Profiling 99 table(s) finished in 117.176 seconds
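    The stack traces above come from the profiler's SQLAlchemy query combiner falling back to per-query execution; profiling can still finish, but if the stats never show up it may be worth ruling the combiner out. A minimal sketch of the same recipe run programmatically with the combiner disabled - this assumes the profiling option query_combiner_enabled exists in the installed version, and the host/credentials are the placeholders from the recipe above:
    Copy code
    # Hedged sketch: same mssql recipe, run via the programmatic Pipeline API,
    # with the profiler's query combiner switched off as a possible workaround.
    # `query_combiner_enabled` is assumed to exist in this datahub version.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mssql",
                "config": {
                    "env": "dev",
                    "username": "datahubproject",
                    "password": "supersecret",
                    "database": "ShopfloorMgmt",
                    "host_port": "host:1433",
                    "profiling": {
                        "enabled": True,
                        "query_combiner_enabled": False,  # assumed option
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()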
  • s

    steep-vr-39297

    07/14/2022, 9:00 AM
    I have a question. Is there any way for JDBC connection in the hive recipe setting?
  • g

    gifted-knife-16120

    07/14/2022, 10:36 AM
    Hi there, may I know whether we can read the information_schema (postgres platform) to generate lineage? For now, I need to manually create a script to generate lineage at the table-to-table (same dataset) level.
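    A minimal sketch of the kind of manual script mentioned above: it reads view-to-table dependencies from Postgres information_schema.view_table_usage and emits them as lineage with the Python REST emitter. The connection string, GMS address and the "<db>.<schema>.<table>" naming convention are placeholder assumptions.
    Copy code
    # Hedged sketch: derive table-to-table (view -> base table) lineage from
    # information_schema and push it to DataHub. Connection details and the
    # dataset naming are assumptions.
    import psycopg2

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    conn = psycopg2.connect("host=localhost dbname=shop user=datahub password=secret")

    with conn.cursor() as cur:
        # Each row says: this view selects from this base table.
        cur.execute(
            "SELECT view_schema, view_name, table_schema, table_name "
            "FROM information_schema.view_table_usage"
        )
        for view_schema, view_name, table_schema, table_name in cur.fetchall():
            upstream = builder.make_dataset_urn(
                "postgres", f"shop.{table_schema}.{table_name}", "PROD"
            )
            downstream = builder.make_dataset_urn(
                "postgres", f"shop.{view_schema}.{view_name}", "PROD"
            )
            emitter.emit_mce(builder.make_lineage_mce([upstream], downstream))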
  • c

    careful-pilot-86309

    07/14/2022, 11:16 AM
    If you know the urn of the dataset and have access to the CLI, you can use https://datahubproject.io/docs/cli#get
  • w

    wooden-arm-26381

    07/14/2022, 1:45 PM
    Hello there, does anyone have experience with using a proxy in front of DataHub together with enabled
    metadata_service_authentication
    ? I’m trying to get my recipes to use an extra header for authorization purposes. I could already confirm with the GraphQL endpoint that my headers containing the Google IAP token and the DataHub personal access token work. Example:
    Copy code
    curl --location --request POST 'https://example.com/api/graphql' \
      --header 'Authorization: Bearer <personal access token>' \
      --header 'Proxy-Authorization: Bearer <IAP token>' \
      --header 'Content-Type: application/json' \
      --data-raw '{"query": "{\n  me {\n    corpUser {\n        username\n    }\n  }\n}"}'
    However, when trying to ingest using recipes, it seems like the emitter ignores the
    extra_headers
    field containing the proxy token. Example:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "https://example.com:443"
        token: "<personal access token>"
        extra_headers:
          Proxy-Authorization: "Bearer <IAP token>"
    Looking at the source code, it should be possible to set a custom header: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/rest_emitter.py#L82 Interestingly, the
    extra_headers
    field seems to work when no second (personal access) token is required and the proxy token is set as
    Authorization
    instead of `Proxy-Authorization`:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "https://example.com:443"
        extra_headers:
          Authorization: "Bearer <IAP token>"
    Of course, just setting the proxy token as
    token
    directly works too. I’m on v0.8.40.2. Any help greatly appreciated! Cheers
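    To narrow down whether the header is dropped by the sink or by the proxy, it may help to exercise the underlying REST emitter directly, since the datahub-rest sink wraps it. A minimal sketch, assuming the v0.8.40 emitter keyword arguments; the URL and tokens are the placeholders from above:
    Copy code
    # Hedged sketch: call the REST emitter directly with both tokens and see
    # whether the Proxy-Authorization header makes it through the proxy.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="https://example.com:443",
        token="<personal access token>",
        extra_headers={"Proxy-Authorization": "Bearer <IAP token>"},
    )
    emitter.test_connection()  # fetches /config through the proxy, raises on failure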
  • k

    kind-whale-32412

    07/14/2022, 3:51 PM
    Hi all, I am trying to test datahub with superset ingestion and I'm getting this error:
    Failed to configure source (superset) due to 'access_token'
    My config looks like this:
    Copy code
    source:
      type: superset
      config:
        # Coordinates
        connect_uri: http://localhost:8188
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    The question is, where do I put access_token? If I try to add it to source.config.access_token then it's erroring out as:
    Failed to configure source (superset) due to 1 validation error for SupersetConfig
    access_token
    extra fields not permitted (type=value_error.extra)
    I could not see any access_token field in the schema either.
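    For reference, the superset source appears to fetch the access token itself by logging in, which is why there is no access_token field; the recipe carries credentials instead. A hedged sketch run programmatically - the username/password/provider field names are assumptions based on the source's config model, and the values are placeholders:
    Copy code
    # Hedged sketch: superset source configured with login credentials; the
    # source obtains the access token on its own. Field names are assumptions.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "superset",
                "config": {
                    "connect_uri": "http://localhost:8188",
                    "username": "<superset user>",      # assumed field
                    "password": "<superset password>",  # assumed field
                    "provider": "db",                   # assumed field
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()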
  • f

    faint-television-78785

    07/15/2022, 2:44 AM
    During ingestion, how would I go about iterating through the Datasets that all came from one Postgres cluster? I.e. the Datasets reflect the Tables, but I want to be able to update all necessary tables during ingestion with one script. The main issue I see is that another postgres cluster, with its
    postgres
    database, could output the same URN (urn:li:dataPlatform:postgres,main.public.customers,PROD). How do I handle this overlap in DataHub so I can remember which Datasets come from which postgres cluster?
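    One way to keep the two clusters apart is a per-cluster platform instance, which then becomes part of each dataset URN. A minimal sketch, assuming the platform_instance recipe option and the matching URN helper are available in the installed version; the instance names are placeholders:
    Copy code
    # Hedged sketch: the same table name on two clusters yields two distinct
    # URNs once a platform instance is attached. Instance names are placeholders.
    import datahub.emitter.mce_builder as builder

    urn_cluster_a = builder.make_dataset_urn_with_platform_instance(
        platform="postgres",
        name="main.public.customers",
        platform_instance="cluster-a",
        env="PROD",
    )
    urn_cluster_b = builder.make_dataset_urn_with_platform_instance(
        platform="postgres",
        name="main.public.customers",
        platform_instance="cluster-b",
        env="PROD",
    )
    print(urn_cluster_a)
    print(urn_cluster_b)
    Setting platform_instance in each cluster's recipe should give the ingested tables the same kind of distinct URNs.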
  • s

    straight-policeman-77814

    07/15/2022, 7:02 AM
    Can you help me? While ingesting metadata in DataHub I am facing an issue.
  • b

    better-orange-49102

    07/15/2022, 10:25 AM
    I'm wondering if there's any interest in the elasticsearch connector being able to read from index templates as well; my org's daily ES indices are generated using these templates, and we're not keen to show thousands of indices in DataHub that are similar to each other and (in my case) only differ by date. I can put up a PR for this.
  • s

    sticky-twilight-17476

    07/15/2022, 11:12 AM
    hi all, I'm trying to ingest new entities using the REST.li API. I've tried the examples included in the Guides section of the DataHub docs and I managed to ingest a new custom Data Platform, but I am now trying to add other entities such as a new Domain. I've checked the schema of a Domain in the docs but I don't know how to fill some fields of the JSON payload for the POST endpoint (<server>:8080/entities?action=ingest). See below my current version of the payload:
  • s

    sticky-twilight-17476

    07/15/2022, 11:13 AM
    Copy code
    {
      "entity": {
        "value": {
          "com.linkedin.domain": {
            "aspects": [
              {
                "domainProperties": {
                  "name": "Facilities",
                  "description": "The facilities domain"
                }
              }
            ],
            "urn": "urn:li:domain:Facilities"
          }
        }
      }
    }
  • s

    sticky-twilight-17476

    07/15/2022, 11:14 AM
    As you can see above, I'm trying with com.linkedin.domain and domainProperties for the value and the aspects, but it doesn't work. Are there any examples of the payloads required to create new entities through the REST.li API? Thx!
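    One alternative that sidesteps hand-crafting the REST.li snapshot payload is emitting the domainProperties aspect as a metadata change proposal from Python. A minimal sketch, assuming the aspect class and REST emitter shipped with the installed datahub package; the server address is a placeholder:
    Copy code
    # Hedged sketch: create/update a Domain by emitting its domainProperties
    # aspect as a MetadataChangeProposal instead of a REST.li snapshot payload.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="domain",
        entityUrn="urn:li:domain:Facilities",
        changeType=ChangeTypeClass.UPSERT,
        aspectName="domainProperties",
        aspect=DomainPropertiesClass(
            name="Facilities",
            description="The facilities domain",
        ),
    )
    emitter.emit_mcp(mcp)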
  • s

    silly-ice-4153

    07/15/2022, 3:13 PM
    Hello all again, I'm debugging an Airflow connection issue with DataHub. I get the following error:
    Copy code
    File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 742, in get_adapter
        raise InvalidSchema("No connection adapters were found for {!r}".format(url))
    requests.exceptions.InvalidSchema: No connection adapters were found for 'xxx:8080/entities?action=ingest'
    I'm using the following code for the connection - in the Connection UI I put my hostname and :8080 for the host, and the port is open
    Copy code
    from datetime import timedelta
    
    from airflow import DAG
    try:
      from airflow.operators.bash import BashOperator
    except ModuleNotFoundError:
      from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago
    
    import datahub.emitter.mce_builder as builder
    from datahub_provider.operators.datahub import DatahubEmitterOperator
    
    default_args = {
      "owner": "airflow",
      "depends_on_past": False,
      "email": ["<mailto:jdoe@example.com|jdoe@example.com>"],
      "email_on_failure": False,
      "email_on_retry": False,
      "retries": 1,
      "retry_delay": timedelta(minutes=5),
      "execution_timeout": timedelta(minutes=120),
    }
    with DAG(
      "datahub_lineage_emission_example",
      default_args=default_args,
      description="An example DAG demonstrating lineage emission within an Airflow DAG.",
      schedule_interval=timedelta(days=1),
      start_date=days_ago(2),
      catchup=False,
    ) as dag:
      # This example shows a SnowflakeOperator followed by a lineage emission. However, the
      # same DatahubEmitterOperator can be used to emit lineage in any context.
    
      transformation_task = BashOperator(
        task_id="bash_test",
        dag=dag,
        bash_command="echo 'This is where you might run your data tooling.'",
      )
    
      emit_lineage_task = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[
          builder.make_lineage_mce(
            upstream_urns=[
              builder.make_dataset_urn("postgres", "postgres.zoom.events"),
            ],
            downstream_urn=builder.make_dataset_urn(
              "postgres", "postgres.zoom.events"
            ),
          )
        ],
      )
    
      transformation_task >> emit_lineage_task
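    The InvalidSchema error usually means the host stored in the Airflow connection has no http:// or https:// scheme, so requests sees 'xxx:8080/...' instead of a full URL. A minimal sketch of registering the connection with the scheme included - the connection id matches the DAG above, the host is a placeholder, and the datahub_rest connection type is assumed to come from the installed provider:
    Copy code
    # Hedged sketch: store the DataHub REST connection with a full URL as the
    # host, so requests gets a scheme. Host name is a placeholder.
    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id="datahub_rest_default",
        conn_type="datahub_rest",   # assumed type registered by the provider
        host="http://xxx:8080",     # include http://, not just "xxx:8080"
    )

    session = settings.Session()
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
        session.commit()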
  • c

    colossal-needle-73093

    07/16/2022, 9:15 AM
    Hello, how do I fetch sample data in a customized transformer?
  • l

    lemon-zoo-63387

    07/16/2022, 10:51 AM
    Hello everyone, in addition to the following two methods, are there any other ways to create lineage, such as reading stored procedures when ingesting metadata? Thanks in advance for your help. https://datahubproject.io/docs/lineage/sample_code https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/
  • m

    mysterious-nail-70388

    07/18/2022, 2:58 AM
    Hello, I have a question. I once ingested metadata but had accidentally shut down the schema-registry container without noticing, and then continued running the DataHub client to ingest metadata, which resulted in an exception. However, when I started the schema-registry container again and ran the same YML file, there was no exception in the DataHub client or in GMS, but no data source information appeared in DataHub. When I modify the instance name, the ingested data source information does appear. I want to know how I can ingest metadata normally without modifying the YML file.
  • l

    lemon-zoo-63387

    07/18/2022, 3:23 AM
    Hello everyone, how do I create these lineages from the demonstration case? Postgres does not have this config (include_table_lineage: true). I want to create lineages automatically when ingesting metadata. Thanks for your help https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:postgres,jaffle_shop.dbt_jaffle.orders,PROD)/?is_lineage_mode=true https://datahubproject.io/docs/generated/ingestion/sources/postgres
  • w

    wonderful-egg-79350

    07/18/2022, 6:39 AM
    Hi all! I have a question about the 'csv-enricher' module. When I started to ingest a specific glossary term using the 'csv-enricher' module, it didn't work. In the picture below, you can see the error messages.
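    In case it helps narrow the error down, a hedged sketch of a minimal csv-enricher run; the filename and write_semantics field names are assumptions based on the source's config model, and the CSV path is a placeholder:
    Copy code
    # Hedged sketch: minimal csv-enricher run. Config field names are
    # assumptions; the CSV path is a placeholder.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "csv-enricher",
                "config": {
                    "filename": "./glossary_terms.csv",  # placeholder path
                    "write_semantics": "PATCH",          # assumed: PATCH or OVERRIDE
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()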
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    I faced this issue:
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    Caused by: org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
  • s

    stocky-midnight-78204

    07/18/2022, 9:32 AM
    Do you know what the root cause is and how to fix it?
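    The org.apache.kafka stack frame suggests the exception is raised by a JVM component (GMS or a consumer job), where the bootstrap address typically comes from the KAFKA_BOOTSTRAP_SERVER environment variable in the Docker setup and must resolve from inside that container. If it instead shows up on the ingestion side, the datahub-kafka sink's connection block is the place to check; a minimal sketch with assumed field names and placeholder hosts:
    Copy code
    # Hedged sketch: datahub-kafka sink with an explicit, resolvable bootstrap
    # address and schema registry. Hosts are placeholders; field names are
    # assumptions based on the Kafka sink's config model.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {"type": "mysql", "config": {"host_port": "localhost:3306"}},
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    },
                },
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()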
  • l

    late-bear-87552

    07/18/2022, 9:44 AM
    Copy code
    Source (mysql) report:
    {'workunits_produced': 13,
     'workunit_ids': ['container-info-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-platforminstance-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-subtypes-none-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-info-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-platforminstance-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-subtypes-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98',
                      'container-parent-container-audience_manager-urn:li:container:7e15bde4c890869dbd7058de93e81a98-urn:li:container:572998b031769da2cb678f19608a921f',
                      'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.tableA,PROD)',
                      'audience_manager.tableA',
                      'audience_manager.tableA-subtypes',
                      'container-urn:li:container:7e15bde4c890869dbd7058de93e81a98-to-urn:li:dataset:(urn:li:dataPlatform:mysql,audience_manager.task_events,PROD)',
                      'audience_manager.task_events',
                      'audience_manager.task_events-subtypes'],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.38',
     'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
     'py_version': '3.9.13 (main, May 24 2022, 21:28:44) \n[Clang 13.0.0 (clang-1300.0.29.30)]',
     'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
     'os_details': 'macOS-11.6.2-x86_64-i386-64bit',
     'tables_scanned': 2,
     'views_scanned': 0,
     'entities_profiled': 0,
     'filtered': ['information_schema.*', 'datahub.*', 'mysql.*', 'performance_schema.*', 'sys.*'],
     'soft_deleted_stale_entities': [],
     'query_combiner': None}
    Sink (datahub-kafka) report:
    {'records_written': 13,
     'warnings': [],
     'failures': [],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None}
    
    Pipeline finished successfully producing 13 workunits
    [2022-07-18 15:11:21,048] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b5b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
    [2022-07-18 15:11:25,049] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b6a0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
    [2022-07-18 15:11:33,055] WARNING  {urllib3.connectionpool:810} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x13ea2b8b0>: Failed to establish a new connection: [Errno 61] Connection refused')': /config
  • s

    square-hair-99480

    07/18/2022, 4:17 PM
    Is there a way to see the logs for an ingestion executed from the UI during the ingestion process? Currently it seems I can only access them once the ingestion has succeeded or failed.
  • s

    sparse-barista-40860

    07/18/2022, 6:16 PM
    How can I add a new folder?
  • s

    sparse-barista-40860

    07/18/2022, 6:16 PM
    for example a new one: "dataudea"
  • s

    sparse-barista-40860

    07/18/2022, 6:17 PM
    or something like that
  • r

    refined-lizard-83096

    07/18/2022, 6:46 PM
    Hey DataHub team. We're looking to possibly introduce a new optional config to the Looker ingestion plugin to allow us to handle "symbolic links" in BigQuery. For further context, for the purposes of Looker consumption, we recently introduced linked datasets in our BigQuery prod instances, which are basically read-only datasets that serve as a symbolic link to a shared dataset. Thus, right now, the tables in our shared datasets aren't displaying their downstream Looker dashboards properly. We were thinking of introducing an optional config called, say
    bigquery_project_map
    , which will just have project mappings between the symbolic link and the shared datasets. As an example:
    Copy code
    bigquery_project_map:
        looker-1: prd-1
        looker-2: prd-2
        looker-3: prd-3
    Is this something that you would be open to us adding? cc: @plain-farmer-27314
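    To make the proposal concrete, a rough sketch of the remapping described above; the mapping, the helper name and the table naming are hypothetical, and this is not existing connector behavior:
    Copy code
    # Hypothetical sketch of the proposed bigquery_project_map: rewrite the
    # project id of a table referenced by Looker to the shared (prod) project
    # before building the dataset URN. Not existing connector behavior.
    import datahub.emitter.mce_builder as builder

    bigquery_project_map = {
        "looker-1": "prd-1",
        "looker-2": "prd-2",
        "looker-3": "prd-3",
    }

    def remap_table(table: str) -> str:
        """Map 'looker-1.dataset.table' -> 'prd-1.dataset.table'."""
        project, rest = table.split(".", 1)
        return f"{bigquery_project_map.get(project, project)}.{rest}"

    print(builder.make_dataset_urn("bigquery", remap_table("looker-1.sales.orders"), "PROD"))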
  • s

    sparse-barista-40860

    07/18/2022, 7:52 PM
    Second question today: how can I ingest HDFS in Hadoop?
  • s

    sparse-barista-40860

    07/19/2022, 1:28 AM
    Then HDFS is not compatible yet
  • s

    sparse-barista-40860

    07/19/2022, 1:28 AM
    Ok