# ingestion
  • c

    clean-coat-28016

    03/31/2022, 12:41 AM
Hi All, I am getting an error when using the Postgres ingestion recipe. If I remove "platform_instance" from the recipe, the error message goes away. I am using DataHub version 0.8.31. IIRC, this was working in 0.8.29. The error message and recipe are attached. Any pointers about what is wrong?
rds.err.log, rds.recipe.yaml
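For reference, a minimal Postgres recipe that exercises platform_instance looks roughly like this (host, database, credentials, and the instance name are placeholders, not the poster's actual values):
source:
  type: postgres
  config:
    host_port: my-rds-host.example.com:5432
    database: mydb
    username: datahub
    password: '${POSTGRES_PASSWORD}'
    platform_instance: rds_primary    # the field that triggers the error on 0.8.31
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'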
    m
    b
    • 3
    • 3
  • i

    incalculable-forest-10734

    03/31/2022, 3:15 AM
Hi guys, I have a question about removing metadata from DataHub. I ingested BigQuery tables into DataHub and then deleted one of the ingested tables from BigQuery. When I re-ingested the BigQuery tables into DataHub, I expected the table to be deleted from DataHub, but it was still there. Why is the table still there? Do I have to delete it manually?
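For context, re-running a recipe does not remove entities by default; soft-deleting tables that disappeared from the source requires stateful ingestion. A hedged sketch (assumes a datahub-rest sink and a stable pipeline_name, which stateful ingestion requires; the project and server values are placeholders):
pipeline_name: bigquery_prod_ingestion    # must stay the same across runs
source:
  type: bigquery
  config:
    project_id: my-project
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true    # soft-deletes entities missing from the latest run
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'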
    m
    • 2
    • 1
  • n

    numerous-morning-88512

    03/31/2022, 8:52 AM
Hi guys, I have a problem with Metabase ingestion: it runs successfully, but I can't find the metadata in DataHub.
    c
    • 2
    • 3
  • s

    shy-fireman-88724

    03/31/2022, 8:27 PM
Hello, we're facing some problems with the Spark jobs integration. We're creating a Scala notebook whose job reads from a Hive table using
    spark.sql()
and writes the data into another Hive table. Although the lineage appears, it has the wrong names in the components: the source shows the S3 location and the Spark job shows the method name, as you can see in the image below. We expected
    schema_name.table_name
to appear instead of the S3 location. Is there anything more we can configure? Another question: is the demo source code available somewhere?
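For reference, the Spark integration discussed here is configured through the DataHub Spark listener; a rough sketch of the relevant properties (the version number and GMS URL are placeholders, set e.g. in spark-defaults.conf or the notebook's Spark config):
spark.jars.packages        io.acryl:datahub-spark-lineage:0.8.31
spark.extraListeners       datahub.spark.DatahubSparkListener
spark.datahub.rest.server  http://localhost:8080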
    l
    i
    c
    • 4
    • 13
  • c

    cold-hydrogen-10513

    04/01/2022, 10:13 AM
hello, I posted a question earlier regarding connecting to Snowflake, and @incalculable-ocean-74010 helped me specify the correct endpoint for my recipe: https://datahubspace.slack.com/archives/CUMUWQU66/p1648554749148879 Now I manage to connect to Snowflake, but I have an issue getting the metadata. The ingestion job runs for about 1.5 h and then fails. The stack trace is below (I trimmed it a bit). Could you please give me a hint about what to check here?
    '[2022-03-30 16:24:11,842] ERROR    {datahub.entrypoints:152} - File '
               '"/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/datahub/entrypoints.py", line 138, in main\n'
               '    135  def main(**kwargs):\n'
               '    136      # This wrapper prevents click from suppressing errors.\n'
               '    137      try:\n'
               '--> 138          sys.exit(datahub(standalone_mode=False, **kwargs))\n'
               '    139      except click.exceptions.Abort:\n'
               '    ..................................................\n'
               '     kwargs = {}\n'
               '     datahub = <Group datahub>\n'
               "     click.exceptions.Abort = <class 'click.exceptions.Abort'>\n"
               '    ..................................................\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1130, in __call__\n'
               '    1128  def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:\n'
               ' (...)\n'
               '--> 1130      return self.main(*args, **kwargs)\n'
               '    ..................................................\n'
               '     self = <Group datahub>\n'
               '     args = ()\n'
               '     t.Any = typing.Any\n'
               "     kwargs = {'standalone_mode': False,\n"
               "               'prog_name': 'python3 -m datahub'}\n"
               '    ..................................................\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1055, in main\n'
               '    rv = self.invoke(ctx)\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1657, in invoke\n'
               '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
    ........................
    'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/snowflake/sqlalchemy/snowdialect.py", '
               'line 573, in <listcomp>\n'
               '    return [self.normalize_name(row[1]) for row in cursor]\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/snowflake/sqlalchemy/snowdialect.py", '
               'line 204, in normalize_name\n'
               '    if name.upper() == name and not self.identifier_preparer._requires_quotes(name.lower()):\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 3613, '
               'in _requires_quotes\n'
               '    or value[0] in self.illegal_initial_characters\n'
               '\n'
               'IndexError: string index out of range\n'
               '[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:161} - DataHub CLI version: 0.8.31 at '
               '/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/datahub/__init__.py\n'
               '[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:164} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
               '[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/bin/python3 on '
               'Linux-5.4.176-91.338.amzn2.x86_64-x86_64-with-glibc2.31\n'
               "[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
               "'v0.8.31', 'commit': '2f078c981c86b72145eebf621230ffd445948ef6'}}, 'managedIngestion': {'defaultCliVersion': '0.8.31', 'enabled': True}, "
               "'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
               "'retention': 'true', 'noCode': 'true'}\n",
               "2022-03-30 16:24:14.167925 [exec_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e] INFO: Failed to execute 'datahub ingest'",
               '2022-03-30 16:24:14.168306 [exec_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e] INFO: Caught exception EXECUTING '
               'task_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    s
    • 2
    • 112
  • c

    chilly-oil-22683

    04/01/2022, 11:45 AM
Hi, trying to set up ingestion from an Athena source: https://datahubproject.io/docs/metadata-ingestion/source_docs/athena#config-details But I'm trying to wrap my head around the concept of a schema in the context of Athena. AFAIK Athena doesn't have the concept of a schema (perhaps under the hood), but as users we only deal with the concepts
    data catalog
    ,
    database
    ,
    table
    and
    view
    . So what do you mean by
    schema
    here?
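For context, in the SQLAlchemy-based Athena source the "schema" concept generally maps to what Athena calls a database; a rough recipe sketch (region, work group, staging dir, and database are placeholders, and field names may differ slightly by version):
source:
  type: athena
  config:
    aws_region: eu-west-1
    work_group: primary
    s3_staging_dir: 's3://my-athena-query-results/'
    database: my_database        # the Athena database, i.e. the "schema"
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'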
    d
    • 2
    • 1
  • c

    chilly-oil-22683

    04/02/2022, 9:26 AM
Hi, Business Glossary ingestion doesn't seem to support an S3 URI as the file path? That seems more logical to me than having to push the glossary file to the EKS cluster or the pod's file storage. Mature file libraries should be able to parse common URIs by themselves, I think. Am I overlooking something? What URIs does this ingestor support besides local files?
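For reference, the glossary ingestion recipe points at a file path; a minimal sketch (the path is a placeholder, and whether remote URIs such as s3:// are accepted depends on the version):
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'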
    b
    • 2
    • 1
  • s

    swift-breakfast-25077

    04/02/2022, 11:22 AM
hello everyone, I'm currently testing metadata ingestion and I noticed that it takes all tables. Is there a way to specify only the tables that should be considered by the ingestion? Another question: how can we ingest Power BI reports that are deployed on a remote server (not on my local machine)?
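For context, most SQL-based sources accept allow/deny regex patterns to restrict what gets ingested; a hedged sketch against a generic source (the source type and all names are placeholders):
source:
  type: <your-source-type>
  config:
    # ...connection settings...
    schema_pattern:
      allow:
        - '^sales$'
    table_pattern:
      allow:
        - '^mydb\.sales\.orders$'     # anchored, fully qualified names
      deny:
        - '.*_tmp$'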
    i
    • 2
    • 4
  • h

    handsome-minister-84652

    04/03/2022, 11:19 PM
Hi team - quick question: is there an API to create a chart? I see one to update a chart, but I'd like to create one (our charts are kept in a YAML file, just like our domains; we parse the file and update DataHub during our build).
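For reference, a chart can be created the same way other entities are pushed: by emitting its chartInfo aspect with the Python emitter. A rough sketch, not an official "create chart" endpoint; the URN, field values, and GMS address are assumptions:
# Hypothetical sketch: create a chart by emitting its chartInfo aspect.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeAuditStampsClass,
    ChangeTypeClass,
    ChartInfoClass,
)

# Chart URN and GMS address are placeholders.
chart_urn = "urn:li:chart:(looker,my_build_generated_chart)"
now = AuditStampClass(time=0, actor="urn:li:corpuser:build_bot")

chart_info = ChartInfoClass(
    title="Orders per day",
    description="Created from our charts.yaml during the build",
    lastModified=ChangeAuditStampsClass(created=now, lastModified=now),
)

mcp = MetadataChangeProposalWrapper(
    entityType="chart",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=chart_urn,
    aspectName="chartInfo",
    aspect=chart_info,
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)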
    i
    • 2
    • 3
  • m

    mammoth-fountain-32989

    04/04/2022, 9:43 AM
Hi, I want to load metadata (through the DataHub ingestion UI) from certain schemas, and only for tables with a specific pattern in their names (PostgreSQL source). My YAML looks similar to this:
schema_pattern:
  allow:
    - abc
    - pqr
    - test
table_pattern:
  allow:
    - test_base_tbl
    - check_validations
    - user_info
I assumed it would apply a logical AND of these schema and table patterns, but I see that all objects from the given schemas (irrespective of the object name pattern) are being ingested. Also, is there a way to restrict which views we ingest? (I am using include_views: True, which is pulling in all the views.) Any sample of how to provide schema and table regex patterns that can be used in conjunction would help. Thanks
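For reference, the patterns are regular expressions, and table_pattern is typically matched against the fully qualified name (database.schema.table), so short, unanchored entries match more than intended. A sketch combining both patterns (the schema and table names come from the question; the anchoring is the suggestion, and view_pattern availability may vary by version):
schema_pattern:
  allow:
    - '^abc$'
    - '^pqr$'
    - '^test$'
table_pattern:
  allow:
    - '^.*\.test_base_tbl$'
    - '^.*\.check_validations$'
    - '^.*\.user_info$'
include_views: false      # or keep views and restrict them with view_pattern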
    d
    • 2
    • 10
  • m

    most-waiter-95820

    04/04/2022, 10:56 AM
Hi all, a BigQuery lineage question. Say we have this flow of tables and views:
TABLE_A --> VIEW_A --> VIEW_B
VIEW_B is based on VIEW_A, which in turn is built on top of TABLE_A. I've tried to run an ingestion with lineage from BigQuery exported audit logs and it returned a graph like:
TABLE_A --> VIEW_A
        ┕--> VIEW_B
Is there a way to build the lineage in a more "chained" way, to be able to track the order of views built on top of each other?
    s
    • 2
    • 9
  • f

    few-grass-66826

    04/04/2022, 3:03 PM
Hi guys, I am trying to ingest Snowflake usage from the UI with this config, but it returns empty results. Is there something wrong, or is my config incorrect? Thanks :)
source:
  type: snowflake-usage
  config:
    host_port: ######.eu-west-1
    warehouse: COMPUTE_WH
    username: ########
    password: ##########
    role: ACCOUNTADMIN
    env: prod
    top_n_queries: 10
    email_domain: #########.com
    schema_pattern:
      deny:
        - 'information_schema.*'
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
    d
    • 2
    • 8
  • q

    quaint-window-7517

    04/05/2022, 6:12 AM
Hello guys, I am having a problem with the ingestion UI. I have deployed DataHub to AWS EKS and switched to AWS RDS instead of the default MySQL pod. After that, I see these errors in the UI and I can't create new sources: Unknown error 500
    d
    e
    +2
    • 5
    • 21
  • b

    brave-market-65632

    04/05/2022, 6:41 AM
Team, I downloaded the quickstart docker image(s) and have been playing around with ingestion to start with. The focus is to scan a Snowflake DB. A Snowflake scan with
    include_table_lineage: False
works fine. However, when it is set to `True`, the ingestion logs report the following error.
    [2022-04-05 11:55:31,989] WARNING  {snowflake.connector.vendored.urllib3.connectionpool:780} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))':
When I inspect the logs in the
    datahub_datahub-actions_1
    container, I see the following error.
    [2022-04-05 06:22:33,752] ERROR    {acryl_action_fwk.source.datahub_streaming:279} - ERROR
    
    Traceback (most recent call last):
    
      File "/usr/local/lib/python3.9/site-packages/acryl_action_fwk/source/datahub_streaming.py", line 268, in _handle_mae
    
        match=a.subscriptions()[0],
    
    IndexError: list index out of range
In the Snowflake query history, I can see that the metadata lineage query ran fine and returned ~495K records. Any help is appreciated. Thanks.
    l
    s
    b
    • 4
    • 15
  • n

    nutritious-bird-45843

    04/05/2022, 6:49 PM
Hello, guys! We intend to upgrade our platform to the newest DataHub version,
    v0.8.32
, so we are doing some local tests before deploying to the stage and production environments. However, we are facing some issues regarding data indexing. For instance, Kafka topics appear under
    Datasets
and also show up if we query using the search bar. However, when clicking on the Kafka connector, we receive
    No results found for ""
as if the query were searching for nothing. The same occurs for the Hive connector. The first image shows the home page and the second shows the issue that happens after clicking on Kafka. One thing worth mentioning is that we currently have the same Kafka and Hive metadata ingested in other envs on DataHub version
    28
    and they retrieve the metadata when we click on the connectors. Thanks in advance!
    l
    b
    • 3
    • 6
  • p

    plain-farmer-27314

    04/05/2022, 6:58 PM
Hey all, just wondering if there is a built-in way to add links to a dataset during ingestion. I assume a transformer would be the correct place to do this?
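For context, if no ready-made transformer covers it, links live in the institutionalMemory aspect, which can also be emitted directly; a hedged Python sketch (the dataset URN, link URL, and GMS address are placeholders):
# Hypothetical sketch: attach a documentation link to a dataset.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
)

dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD")

links = InstitutionalMemoryClass(
    elements=[
        InstitutionalMemoryMetadataClass(
            url="https://wiki.example.com/my-dataset",            # placeholder
            description="Runbook / documentation",
            createStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
        )
    ]
)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=dataset_urn,
    aspectName="institutionalMemory",
    aspect=links,
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)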
    d
    • 2
    • 1
  • p

    plain-baker-30549

    04/06/2022, 6:18 AM
Hi team, I'm currently evaluating DataHub as a data catalog and looking at how to ingest metadata from Snowflake. Pulling metadata using the Snowflake recipe works fine, but I wonder whether it's also possible (and how) to push data (e.g. from the INFORMATION_SCHEMA of a Snowflake DB) to DataHub.
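For context, the pull recipes and the push path use the same underlying APIs: anything that can build aspects can emit them to GMS over REST. A rough Python sketch of pushing a dataset description (the URN, description, and server are placeholders), which could just as well be driven by rows read from INFORMATION_SCHEMA:
# Hypothetical sketch: push metadata to DataHub instead of pulling it.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")

# In a real script, these values would come from INFORMATION_SCHEMA queries.
dataset_urn = make_dataset_urn(platform="snowflake", name="mydb.public.orders", env="PROD")
properties = DatasetPropertiesClass(
    description="Orders table, pushed from a custom script",
    customProperties={"source": "information_schema"},
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="datasetProperties",
        aspect=properties,
    )
)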
    d
    • 2
    • 4
  • n

    nutritious-bird-77396

    04/06/2022, 2:42 PM
    `A Debug Question`: How do you run
    datahub ingest -c <recipe.yaml>
in debug mode locally? Is there an additional option you pass on the command line?
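For reference, the CLI has a global debug flag that turns on verbose logging, which is probably the simplest option here (assuming a recent 0.8.x CLI):
datahub --debug ingest -c <recipe.yaml>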
    s
    p
    • 3
    • 10
  • p

    plain-farmer-27314

    04/06/2022, 2:45 PM
    Hi, wondering what source is being used to ingest the mlmodels here: https://demo.datahubproject.io/browse/mlModels
    l
    • 2
    • 1
  • b

    billowy-flag-4217

    04/06/2022, 3:58 PM
    Hello, I'm currently using
    acryl-datahub=0.8.31.4
    and
    python=3.8
; when attempting to ingest Looker metadata, I get the following error.
    TypeError: You should use `typing_extensions.TypedDict` instead of `typing.TypedDict` with Python < 3.9.2. Without it, there is no way to differentiate required and optional fields when subclassed.
Is it now a requirement to use Python 3.9.2 for Looker ingestion, or is there another workaround?
    • 1
    • 1
  • m

    mysterious-lamp-91034

    04/06/2022, 7:46 PM
Hello, I am running
    ./gradlew :metadata-ingestion:testQuick
on v0.8.32, and I am seeing
    =========================== short test summary info ============================
    FAILED tests/integration/looker/test_looker.py::test_looker_ingest - TypeErro...
    FAILED tests/integration/looker/test_looker.py::test_looker_ingest_allow_pattern
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest - TypeErro...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_offline - ...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_offline_platform_instance
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_api_bigquery
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_api_hive
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_bad_sql_parser - ...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_github_info - Typ...
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_exclude.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_filename.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_glob.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_partition_basic.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_partition_keyval.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[multiple_files.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[single_file.json]
    ==== 17 failed, 325 passed, 52 deselected, 30 warnings in 60.13s (0:01:00) =====
    g
    d
    s
    • 4
    • 7
  • b

    billions-twilight-48559

    04/06/2022, 9:11 PM
Hi! First of all, I'm not new to DataHub. I'm trying a new, empty setup of DataHub for testing on version 0.8.31. We are successfully executing our ingestion recipes for glossary terms using the DataHub CLI; everything is OK at the CLI, and I can see a 200 response status in the GMS logs for each insert. But no content appears on the frontend!
    l
    e
    c
    • 4
    • 17
  • o

    orange-coat-2879

    04/06/2022, 10:01 PM
Hi! I am working on ingestion from MSSQL (SQLEXPRESS). The
    localhost:1433
does not work for me. I have attached my recipe here. Can anyone help? I am not sure whether I should put a real URL (http://........) in host_port. Thanks for helping!
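For reference, host_port expects hostname:port rather than an http:// URL; for a local SQL Express instance, TCP/IP also usually has to be enabled in SQL Server Configuration Manager before port 1433 is reachable. A minimal sketch (database and credentials are placeholders):
source:
  type: mssql
  config:
    host_port: localhost:1433      # hostname:port, no http:// scheme
    database: my_database
    username: datahub_reader
    password: '${MSSQL_PASSWORD}'
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'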
    b
    b
    • 3
    • 6
  • t

    thousands-room-91010

    04/07/2022, 3:10 AM
We’d like to add the ability to add entities via GraphQL on a local deployment, in advance of the feature coming out soon: https://datahubproject.io/docs/api/graphql/querying-entities/ Can anyone provide tips on what would be involved to accomplish this? Is it as simple as creating a new GraphQL mutation that adds a new entry in the internal database? Thanks for your help
    m
    h
    +3
    • 6
    • 21
  • a

    able-rain-74449

    04/07/2022, 2:12 PM
Hi all, I am getting an error while running
datahub ingest -c example_to_datahub_kafka.yml --dry-run
(see thread):
ConfigurationError: datahub-kafka is disabled; try running: pip install 'acryl-datahub[datahub-kafka]'
I tried running
pip install 'acryl-datahub[datahub-kafka]'
and get
    88a6420d827a/src/confluent_kafka/src/confluent_kafka.h:23:10: fatal error: 'librdkafka/rdkafka.h' file not found
          #include <librdkafka/rdkafka.h>
                   ^~~~~~~~~~~~~~~~~~~~~~
          1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> confluent-kafka
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
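For context, this failure is confluent-kafka building from source without the librdkafka C headers; a common fix is to install librdkafka first and then retry (the package manager commands below are assumptions about the build host):
# macOS (Homebrew)
brew install librdkafka
# Debian/Ubuntu
sudo apt-get install -y librdkafka-dev
# then retry
pip install 'acryl-datahub[datahub-kafka]'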
    d
    k
    • 3
    • 12
  • a

    able-rain-74449

    04/07/2022, 4:02 PM
Hi All, does anyone have an example of ingesting data into DataHub on EKS? What's the best approach?
    h
    s
    s
    • 4
    • 11
  • b

    brave-forest-5974

    04/07/2022, 5:13 PM
🤔 In Looker, should I expect to see explores connected to their views (after both LookML and dashboard ingestion)? Or, a better question: what would prevent an explore from connecting to its views? Joins, perhaps?
    l
    m
    • 3
    • 11
  • h

    handsome-football-66174

    04/07/2022, 6:47 PM
Hi team, we want to ingest data coming in real time. How do we get started? I understand these are the steps we need to take: 1. convert the metadata into Pegasus format, 2. create an emitter. Is there anything else I need to consider?
    d
    • 2
    • 1
  • n

    numerous-eve-42142

    04/07/2022, 7:48 PM
Hello! I'm trying to ingest some specific tables from our Redshift DB into DataHub, but there are tables whose names are similar to others. For example, the table "db.schema.carrier" has to be ingested, so we configured:
    table_pattern:
         allow:
            - "db.schema.carrier"
    
    ...
    
    profile_pattern:
          allow:
            - "db.schema.carrier"`
But this example ingests the tables: • carrier • carrier_damage • carrieir_tower_team Is there some way to strictly specify only the tables I want?
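For reference, the allow entries are regular expressions matched against the fully qualified name, and an unanchored pattern like db.schema.carrier also matches longer names that contain it. Anchoring the pattern should restrict ingestion to the exact table (a sketch):
table_pattern:
  allow:
    - '^db\.schema\.carrier$'
profile_pattern:
  allow:
    - '^db\.schema\.carrier$'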
    h
    • 2
    • 8
  • i

    icy-piano-35127

    04/07/2022, 8:05 PM
Hello guys! Is it possible to filter the Redash information to bring in just the queries and dashboards from a specific data source?
    d
    • 2
    • 3