# ingestion
  • c

    clean-tomato-22549

    09/26/2022, 9:46 AM
    Besides, according to the docs, Account Usage system tables are needed to ingest table lineage. Why? The database INFORMATION_SCHEMA should be enough, right? https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#prerequisites-1
    Copy code
    If you plan to enable extraction of table lineage, via the include_table_lineage config flag or extraction of usage statistics, via the include_usage_stats config, you'll also need to grant access to the Account Usage system tables, using which the DataHub source extracts information. This can be done by granting access to the snowflake database.
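    For reference, a minimal sketch of a Snowflake recipe with the two flags quoted above enabled (the flag names come from the doc excerpt; all other values are placeholders, and the grant on the snowflake database from the linked prerequisites is still required):
    Copy code
    source:
      type: snowflake
      config:
        account_id: my_account            # placeholder
        username: datahub_user            # placeholder
        password: '${SNOWFLAKE_PASSWORD}' # placeholder secret
        role: datahub_role                # placeholder
        warehouse: COMPUTE_WH             # placeholder
        include_table_lineage: true       # needs the Account Usage tables
        include_usage_stats: true         # needs the Account Usage tables
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder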
  • f

    fancy-alligator-33404

    09/26/2022, 10:54 AM
    hello! I have a problem with Hive ingestion. There are tables in the Hive DB, but when I ingest them, I cannot see the tables in DataHub. Instead, if I set the 'include_views: true' option, the table information is brought into DataHub, but as views... I am attaching a screenshot of the ingestion I did. I would be very grateful if you could tell me the solution!
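    For context, a hedged sketch of the recipe shape being discussed, with placeholder connection values (include_tables and include_views are the standard options of SQLAlchemy-based sources such as Hive that control what gets extracted):
    Copy code
    source:
      type: hive
      config:
        host_port: hive-server:10000   # placeholder
        database: my_db                # placeholder
        username: user                 # placeholder
        password: '${HIVE_PASSWORD}'   # placeholder secret
        include_tables: true           # extract tables
        include_views: true            # extract views
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder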
  • f

    few-carpenter-93837

    09/26/2022, 7:02 AM
    Hey, does anyone have any information regarding what the Tableau DataHub ingestion includes under "Tags"? Is it only attribute-level tags that are brought over, or should it also include the workbook-level tags?
  • c

    careful-action-61962

    09/26/2022, 11:14 AM
    Hey folks, I'm trying to ingest Tableau metadata into DataHub. I'm getting the following error repeatedly, and I'm not able to find the same asset in Tableau when I search for it there. Would really appreciate your help. Thanks
    exec-urn_li_dataHubExecutionRequest_04e93dc4-f2b5-4014-ba02-47bcd06e06f7.log
  • f

    fierce-baker-1392

    09/26/2022, 12:21 PM
    Hey folks, I downloaded the DataHub source code, but lots of Python libraries do not install automatically when I compile the metadata-ingestion module. I am not very familiar with Python; how can I solve this problem? Thanks~
  • l

    lemon-engine-23512

    09/26/2022, 1:11 PM
    Hello, I am unable to establish a connection between MWAA and DataHub REST for a scheduled DAG. I get a connection timeout from the DAG. I even tested by opening all ports and IP addresses, but I still get the same error. Am I missing anything? Thank you
  • p

    proud-table-38689

    09/26/2022, 3:58 PM
    We have a database that's exposed via HTTPS. I believe that DataHub uses SQLAlchemy under the hood; is there a way we can add these parameters to our ingestion script?
    Copy code
    from sqlalchemy import create_engine

    # ca_path points at the CA certificate bundle used for the TLS connection
    ssl_args = {'ssl_ca': ca_path}
    engine = create_engine("mysql+mysqlconnector://<user>:<pass>@<addr>/<schema>",
                           connect_args=ssl_args)
  • p

    proud-table-38689

    09/26/2022, 3:58 PM
    see the connect_args=ssl_args one
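    For SQLAlchemy-based sources, a hedged sketch of how such arguments can be passed through the recipe instead of a script, via the options block that is forwarded to create_engine (connection values and the CA path are placeholders):
    Copy code
    source:
      type: mysql
      config:
        host_port: my-db-host:3306   # placeholder
        database: my_schema          # placeholder
        username: user               # placeholder
        password: '${DB_PASSWORD}'   # placeholder secret
        options:
          connect_args:
            ssl_ca: /path/to/ca.pem  # placeholder CA bundle path
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder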
  • c

    creamy-tent-10151

    09/26/2022, 5:18 PM
    Hi all, the documentation for Athena ingestion mentions that you can enable table-level lineage through configuration. How do I enable lineage? Thank you.
  • b

    bland-balloon-48379

    09/26/2022, 7:14 PM
    Hey team, I wanted to start a discussion about how datasets get represented over time. In particular I'm thinking of a situation where a dataset is created, ingested, edited in DataHub, and deleted at some point. Then some time in the future, a new dataset with the same name and in the same database/schema is created. The same URN would be generated in both cases, so internally they would be treated as the same dataset. And if the original dataset was soft-deleted, then the subsequent one would inherit that custom documentation even if it doesn't apply to the new case. Should these be considered two separate datasets? Would the second one be considered a new version of the original? If the original dataset had its status set to removed when it was deleted from the database, would there be some indication that it had been removed and then reingested? I think there is a lot to consider here. I'm interested to hear the perspective of DataHub and the community on this scenario, both from a stateful and stateless ingestion standpoint, and whether there is a particular direction DataHub has in mind. Thanks!
  • c

    clean-tomato-22549

    09/27/2022, 3:06 AM
    I am trying to ingest presto-on-hive, but got the following error; could anyone help check it? Thanks
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (presto-on-hive)
    [2022-09-26 05:58:34,312] ERROR    {datahub.entrypoints:195} - Command failed: 
    	Failed to configure source (presto-on-hive) due to 
    		'TSocket read 0 bytes'.
    	Run with --debug to get full stacktrace.
  • m

    melodic-beach-18239

    09/27/2022, 8:00 AM
    Hello! I have ingested metadata from MySQL and Postgres. Could I create lineage between these two datasets?
  • m

    microscopic-mechanic-13766

    09/27/2022, 10:29 AM
    Hi, I am planning on making a custom connector to enable HDFS ingestion. Could someone walk me through what would need to be created? Just a file like the ones that can be found under /datahub/metadata-ingestion/src/datahub/ingestion/source inside the project?
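    If it helps, a hedged sketch of how a custom source module can be wired into a recipe once written: the source type can usually point at the fully qualified path of a custom Source implementation (check the custom ingestion source guide; all names and config keys below are hypothetical):
    Copy code
    source:
      # hypothetical module path and class name of a custom Source subclass
      type: my_hdfs_package.hdfs_source.HdfsSource
      config:
        path: hdfs://namenode:8020/data   # hypothetical config key of the custom source
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder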
  • b

    bumpy-journalist-41369

    09/27/2022, 11:15 AM
    Hello. I have a problem when ingesting data from Glue. Even though I can see that I have ingested data, the run fails, and these are the exceptions that I can see in the log:
    Copy code
    " 'warnings': {'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py>': ['Error parsing DAG for Glue job. The script '\n"
               '                                                                                         '
               "'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py> '\n"
               "                                                                                         'cannot be processed by Glue (this usually "
               "occurs when it '\n"
               "                                                                                         'has been user-modified): An error occurred '\n"
               "                                                                                         '(InvalidInputException) when calling the "
               "GetDataflowGraph '\n"
               "                                                                                         'operation: line 19:4 no viable alternative at "
               "input '\n"
               '                                                                                         "\'e3g:))\'e:)o.)) #\'"],\n'
               "              '<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py>': ['Error parsing DAG for Glue job. The "
               "script '\n"
               '                                                                                                    '
               "'<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py> '\n"
               "                                                                                                    'cannot be processed by Glue (this "
               "usually '\n"
               "                                                                                                    'occurs when it has been "
               "user-modified): An '\n"
               "                                                                                                    'error occurred "
               "(InvalidInputException) when '\n"
               "                                                                                                    'calling the GetDataflowGraph "
               "operation: line '\n"
               "                                                                                                    '337:12 no viable alternative at "
               "input '\n"
               '
    as well as this :
    Copy code
    exception=NoSuchKey('An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.')
    The recipe I am using is:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-datahub-gms:8080'
    source:
      type: glue
      config:
        aws_region: us-east-1
        env: DEV
        database_pattern:
          allow:
            - cdca
    Does anyone know how to fix the problem?
  • f

    future-hair-23690

    09/27/2022, 2:40 PM
    Hi, this might be a pretty basic question, but I'm trying to ingest MSSQL from the UI, and it requires pyodbc. I have a deployment via the official Helm chart. Where/how can I actually install it? I mean, which subchart actually takes care of ingestion?
  • k

    kind-dawn-17532

    09/27/2022, 11:25 AM
    Hi Team, What am I missing here - I see that platform is needed to make a Feature table urn, but only a table name is needed to make a Feature? Are we thinking the feature table name will be unique enough?
  • f

    fast-potato-13714

    09/27/2022, 3:20 PM
    Hi, I'm loading objects from Trino and Hive. Is there a way to unify the datasets, since some of them appear as Trino and others as Hive? Thanks in advance!
  • g

    green-lion-58215

    09/27/2022, 3:43 PM
    Hello all, is there a way to remove a specific tag from all resources in an environment fabric in datahub?
  • w

    wonderful-notebook-20086

    09/27/2022, 7:18 PM
    I'm running DataHub locally from the getting-started Docker container images, based on the Quickstart guide. I tried setting up a connection to our Redshift cluster and ran into this error:
    Copy code
    '2022-09-27 18:30:45.362431 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-27 18:47:03.670827 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Caught exception EXECUTING '
               'task_id=fec3ab48-c33b-4403-abfc-f61720c609ae, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 182, in execute\n'
               '    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 126, in '
               '_read_output_lines\n'
               '    full_log_file.write(line)\n'
               'OSError: [Errno 28] No space left on device\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'OSError: [Errno 28] No space left on device\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 188, in execute\n'
               '    full_log_file.close()\n'
               'OSError: [Errno 28] No space left on device\n'
    Ingestion recipe yml looks something like this:
    Copy code
    source:
        type: redshift
        config:
            start_time: '2022-09-26 00:00:00Z'
            end_time: '2022-09-26 12:00:00Z'
            table_lineage_mode: mixed
            include_table_lineage: true
            database: insightsetl
            password: '${etl2_test_datahub_creds}'
            profiling:
                enabled: true
            host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
            stateful_ingestion:
                enabled: true
            username: datahub_ingestion
    pipeline_name: 'urn:li:dataHubIngestionSource:93b5640d-8ed3-456e-89f9-0ec3def38733'
    I'm not sure if it's a container issue or config or something else.
  • a

    aloof-leather-92383

    09/27/2022, 9:14 PM
    Hi, I'm doing a Kafka ingestion; is there any way to display the topic name without the cluster ID? All the ingested topics are written as <cluster_id>.<topic-name>. Thank you!
  • p

    proud-table-38689

    09/28/2022, 1:45 AM
    How would I add a package to the ingestion virtual environment? I'm trying to add teradatasqlalchemy. I added it to the Docker image, but I don't see it used in the venv that's used per ingestion.
  • f

    few-sugar-84064

    09/28/2022, 2:36 AM
    Hi, I found that all ingested Glue jobs have a DataFlow URN only and no DataJob URN, so I can't see lineage even for jobs with auto-generated scripts. Does anyone know how to get Glue jobs ingested as DataJobs, not just DataFlows? Below is the YAML I used to ingest the jobs, thanks.
    Copy code
    source:
      type: glue
      config:
        aws_region: "ap-northeast-2"
        extract_transforms: True
        catalog_id: "catalog_id"
    
    sink:
      type: "datahub-rest"
      config:
        server: "gms sever address"
  • c

    chilly-ability-77706

    09/28/2022, 5:33 AM
    Hi, I am looking for a Hive recipe which supports KNOX authentication with a required SSL certificate. Below is the one I tried, but I received the error message "CannotSendHeader = <class 'http.client.CannotSendHeader'>"
  • c

    chilly-ability-77706

    09/28/2022, 5:33 AM
    Copy code
    source:
        type: hive
        config:
            database: <>
            password: <>
            host_port: <>
            stateful_ingestion:
                enabled: true
            username: <>
            options:
                connect_args:
                    auth: NOSASL
                    http_path: "/gateway/default/hive"
                    ssl_cert: "required"
            scheme: hive+https
  • g

    gifted-diamond-19544

    09/28/2022, 8:40 AM
    Hello! What should my IAM permissions be to use the Athena ingestion? Which tables should I be able to run queries on? I don't think this is specified in the documentation (at least I can't seem to find it). Thank you!
  • t

    thankful-ghost-61888

    09/28/2022, 9:02 AM
    Hi all, a question about the LookML ingestion: we have seen recurring errors with the SQL lineage analyzer; it looks like it doesn't recognise some LookML syntax like the ${EXTENDED} keyword (docs here). From the ingestion logs:
    Copy code
    b"2022-09-25 10:05:57,890 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: EXTENDED] instead.' for query: 'SELECT\n"
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b"      date_trunc('week',purchase_date) as purchase_date,\n"
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      user_id,\n'
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      buyer_country,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      count(id) as items_bought_week,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      sum(GMV) as gmv_bought_week,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      min(user_order_sequence_number) as user_order_sequence_number_minweek,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      max(user_order_sequence_number) as user_order_sequence_number_maxweek,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,\n"
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country\n"
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    FROM (EXTENDED)\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    GROUP BY 1,2,3\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get tables\n'
    [2022-09-25 10:05:57,894] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get columns\n'
    The original lookML sql looks like this:
    Copy code
    include: "dt_engine_purchasetransaction.view"
    
    view: dt_buyer_transactions_weekly {
      extends: [dt_engine_purchasetransaction]
      derived_table: {
      sql:
        SELECT
          date_trunc('week',purchase_date) as purchase_date,
          user_id,
          buyer_country,
          count(id) as items_bought_week,
          sum(GMV) as gmv_bought_week,
          min(user_order_sequence_number) as user_order_sequence_number_minweek,
          max(user_order_sequence_number) as user_order_sequence_number_maxweek,
          percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
          percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
        FROM (${EXTENDED})
        GROUP BY 1,2,3
        ;;
      }
    We’re also seeing an issue with parsing a new line after the FROM statement:
    Copy code
    b"2022-09-25 10:06:10,069 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: \n"
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b"] instead.' for query: 'SELECT\n"
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        __d_a_t_e\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , userid\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , query\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , sum(total_searches) AS total_searches\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        FROM\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (SELECT\n'...
    The original syntax looks like this:
    Copy code
    view: product_interaction_searches {
        derived_table: {
          sql_trigger_value: SELECT max(event_date::date) FROM datalake_processed.etl_tracking_searches;;
          distribution: "date"
          sortkeys: ["date"]
          sql:
           SELECT
            date
            , userid
            , query
            , sum(total_searches) AS total_searches
            FROM
                (
                (SELECT
                  cast(date as date) AS date
                  , userid AS userid
                  , query
                  , count(*) AS total_searches
                FROM datalake_compacted.mixpanel_tracking_search_results_query_action...
    Any thoughts about the above?
  • c

    careful-action-61962

    09/28/2022, 10:12 AM
    Hey Folks, has anyone used service principals in databricks instead of their own user to ingest metadata into datahub?
  • f

    flaky-soccer-57765

    09/28/2022, 11:50 AM
    Hello all, I am trying to build an emitter that will attach predefined glossary terms to a column. However, I have noticed that when creating terms in the UI, the URN has a GUID format, e.g. urn:li:glossaryTerm:bfe84277-037a-4c4c-b650-5cf0a4c2002e. Ideally, I would like the URN to be set up with the name of the node / name of the term, so I can build logic inside the emitter. Can you suggest how to do that, please?
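    One possibly relevant option (a hedged sketch; check the datahub-business-glossary source docs for the exact schema): terms defined through the business glossary YAML file source can get URNs derived from the node/term names rather than UI-generated GUIDs. The node and term names below are hypothetical:
    Copy code
    version: 1
    source: DataHub
    owners:
      users:
        - datahub                  # placeholder owner
    nodes:
      - name: Classification       # hypothetical glossary node
        description: Classification-related terms
        terms:
          - name: Sensitive        # hypothetical glossary term
            description: Data that requires restricted handling
    The file would then be ingested with a recipe whose source type is datahub-business-glossary, pointing at this file.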
  • e

    early-airplane-84388

    09/28/2022, 12:19 PM
    Hi all, after I enabled metadata service authentication in DataHub, my DataHub ingestions couldn't access the secrets created for ingestions. This isn't an issue in my localhost deployment of DataHub using the quickstart. Can you please suggest how to fix this?
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': 'efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc',
     'infos': ['2022-09-28 11:44:07.154056 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-28 11:44:07.182384 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Caught exception EXECUTING '
               'task_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 67, in execute\n'
               '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_task_common.py", line 84, in _resolve_recipe\n'
               '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
               'acryl.executor.execution.task.TaskError: Failed to resolve secret with name Dev_DataHub_PRIVATE_KEY_ID. Aborting recipe execution.\n']}
    Execution finished with errors.
  • a

    ancient-policeman-73437

    09/28/2022, 7:09 PM
    Dear all, I am ingesting metadata from Glue and it starts loading the data, but at some point it shows the following error message:
    Copy code
    '/usr/local/bin/run_ingest.sh: line 26:  4036 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )\n',
               "2022-09-28 14:04:23.217735 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Failed to execute 'datahub ingest'",
               '2022-09-28 14:04:23.218122 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Caught exception EXECUTING '
    What could be the reason? Thank you in advance!