# ingestion
  • f

    few-grass-66826

    08/10/2022, 2:43 PM
    Hello everyone, another one: I have this dataflow: S3 -> DB1.Table1 -> DB2.Table1. Lineage for DB1.Table1 shows that it gets data from S3 into DB1.Table1, but lineage for DB2.Table1 only shows that it gets data from DB1.Table1. Is there any solution so that the lineage for DB2.Table1 also shows that the root is the S3 bucket?
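    One way to make the S3 root visible on DB2.Table1 is to emit an explicit upstream edge for it. A minimal sketch with the Python REST emitter, assuming a GMS at http://localhost:8080 and purely illustrative platforms and dataset names:
    # Sketch only: add the S3 object as a direct upstream of DB2.Table1.
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

    lineage_mce = make_lineage_mce(
        upstream_urns=[make_dataset_urn("s3", "my-bucket/path/to/data", "PROD")],  # hypothetical S3 dataset
        downstream_urn=make_dataset_urn("postgres", "DB2.Table1", "PROD"),         # hypothetical downstream
    )
    emitter.emit_mce(lineage_mce)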
  • j

    jolly-balloon-85466

    08/10/2022, 3:51 PM
    Hello everyone. I'm getting the following errors when trying to ingest data from BigQuery:
    2022-08-10 15:00:55.500831 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Failed to execute 'datahub ingest'
    2022-08-10 15:00:55.506885 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Caught exception EXECUTING task_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task
        self.event_loop.run_until_complete(task_future)
      File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
        return f.result()
      File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
        raise self._exception
      File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
        result = coro.send(None)
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
        raise TaskError("Failed to execute 'datahub ingest'")
    acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
  • j

    jolly-balloon-85466

    08/10/2022, 3:53 PM
    'failures': {'lineage-gcp-logs': ["Error was 'datasetId'"]},
  • j

    jolly-balloon-85466

    08/10/2022, 3:53 PM
    This has been marked as the failure.
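    Since the only reported failure is lineage-gcp-logs, one way to narrow this down is to re-run the ingestion with audit-log lineage turned off and check whether everything else succeeds. A hedged sketch using the programmatic pipeline API; the project id and server are placeholders, and include_table_lineage should be verified against the bigquery source docs for your CLI version:
    from datahub.ingestion.run.pipeline import Pipeline

    # Sketch only: skip lineage extracted from GCP audit logs while the 'datasetId'
    # failure is investigated; every value is a placeholder.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-gcp-project",
                    "include_table_lineage": False,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()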
  • r

    rapid-house-76230

    08/10/2022, 10:22 PM
    Hi team, I’m trying to ingest from Hive using a recipe that I’ve used before without a problem. Now I’m just getting a successful report with no schema ingested? 🧵
  • m

    microscopic-mechanic-13766

    08/11/2022, 7:49 AM
    Hi everyone, since I updated DataHub to v0.8.42 my recipes change from what they were initially. For example, initially a recipe would look like this:
    source:
        type: postgres
        config:
            host_port: 'postgresql:5432'
            database: <db>
            username: <usr>
            password: <psswd>
            include_tables: true
            include_views: true
            profiling:
                enabled: True
                max_workers: 20
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    But after the source is created, it looks like this:
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    source:
        type: postgres
        config:
            include_tables: true
            database: <db>
            profiling:
                max_workers: 20
                enabled: true
            host_port: 'postgresql:5432'
            include_views: true
            username: <usr>
            password: <psswd>
    pipeline_name: 'urn:li:dataHubIngestionSource:7c70f090-79cc-432a-a757-7c01b8c091b9'
    Is this intended? If so, may I know why the change of structure? Thanks in advance!!
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    Hi Team, I am working on creating a DAG with Apache Airflow to run my task with datahub ingest, but I am having an issue with the source.
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    Can anyone please advise on this?
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    raise KeyError(f"Did not find a registered class for {key}")
    KeyError: 'Did not find a registered class for s3'
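    This KeyError usually means the s3 source plugin isn't installed in the environment that runs the task (pip install 'acryl-datahub[s3]'). A hedged sketch of driving the same ingestion from an Airflow DAG with a PythonOperator and the programmatic pipeline API; the bucket, path_spec layout, region, and GMS address are placeholders, and the path_spec key naming has changed across CLI versions, so check the s3 source docs for yours:
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from datahub.ingestion.run.pipeline import Pipeline


    def run_s3_ingest() -> None:
        # Programmatic equivalent of a recipe file; assumes the s3 plugin extra is
        # installed in the same environment as the Airflow worker.
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        "path_spec": {"include": "s3://my-bucket/foo/tests/{table}/*.avro"},
                        "aws_config": {"aws_region": "us-east-1"},
                    },
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
            }
        )
        pipeline.run()
        pipeline.raise_from_status()


    with DAG(
        dag_id="datahub_s3_ingest",
        start_date=datetime(2022, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="ingest_s3", python_callable=run_s3_ingest)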
  • f

    famous-florist-7218

    08/11/2022, 9:18 AM
    Hi team! Does anyone know why the s3 ingestion job ran successfully but the UI doesn't load the s3 dataset?
  • e

    echoing-farmer-38304

    08/11/2022, 1:10 PM
    Hello everyone, I have a question about delta lake ingestion. There is an option to ingest from AWS S3, for example:
    source:
      type: "delta-lake"
      config:
        base_path:  "<s3://my-bucket/my-folder/sales-table>"
        s3:
          aws_config:
    I tried to use it with MinIO credentials but it doesn't see my data (it runs successfully but doesn't load data). In my opinion this happens because of trouble with base_path. Is there any way to use delta lake ingestion with MinIO credentials? And if there is no ready solution and we want this feature, should we implement it as an additional module, or can we just make some changes to the current module so it can use both AWS and MinIO?
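    For the MinIO case, the s3.aws_config block also accepts an aws_endpoint_url (it shows up in a later delta-lake recipe in this channel), so pointing the S3 client at the MinIO endpoint may be all that's needed. An untested sketch where every value is a placeholder:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "delta-lake",
                "config": {
                    "env": "PROD",
                    "base_path": "s3://my-bucket/my-folder/sales-table",
                    "s3": {
                        "aws_config": {
                            "aws_access_key_id": "minio-access-key",      # placeholder
                            "aws_secret_access_key": "minio-secret-key",  # placeholder
                            "aws_region": "us-east-1",                    # often still expected, even if MinIO ignores it
                            "aws_endpoint_url": "http://minio:9000",      # point the S3 client at MinIO
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()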
  • g

    gifted-knife-16120

    08/11/2022, 1:29 PM
    Let's say I have 3 databases (A, B, C) on the athena platform, and I would like to delete 2 of them. How can that be done?
  • d

    damp-queen-61493

    08/11/2022, 8:26 PM
    Hi team! I'm trying to ingest Kafka with Schema Registry, but I'm unable to get schema registry updates for a topic. The schema was ingested correctly the first time, but after that it looks like DataHub doesn't update the schema anymore. DataHub version v0.8.43 (dev env).
    ## Recipe
    source:
        type: kafka
        config:
            platform_instance: poc_cluster_0
            connection:
                bootstrap: 'xxxxxx.gcp.confluent.cloud:9092'
                consumer_config:
                    security.protocol: SASL_SSL
                    sasl.mechanism: PLAIN
                    sasl.username: '${CLUSTER_API_KEY_ID}'
                    sasl.password: '${CLUSTER_API_KEY_SECRET}'
                schema_registry_url: 'https://xxxxx.gcp.confluent.cloud'
                schema_registry_config:
                    basic.auth.user.info: '${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}'
  • c

    colossal-sandwich-50049

    08/11/2022, 8:50 PM
    Hello, I am running datahub ingest -c my-delta-recipe.yml locally and getting the error below; can someone assist? Note, I am running DataHub using datahub quickstart.
    ###### Recipe
    source:
      type: "delta-lake"
      config:
        env: "PROD"
        platform_instance: "my-delta-lake"
        platform: "delta-lake"
        base_path: "<s3://my-bucket/data/v3/>"
        s3:
          aws_config:
            aws_region: "eu-west-1"
            aws_endpoint_url: "http://<local-ip>:4566"
    sink:
      type: "datahub-rest"
      config:
        server: "http://<local-ip>:8080" # local IP
    ###### Logs
    [2022-08-11 16:46:47,941] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43
    [2022-08-11 16:46:48,004] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://10.12.242.238:8080
    [2022-08-11 16:46:49,063] ERROR    {logger:26} - Please set env variable SPARK_VERSION
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:123} - Source (delta-lake) report:
    {'workunits_produced': '0',
     'workunit_ids': [],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.43',
     'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
     'py_version': '3.9.9 (main, Nov 21 2021, 03:23:42) \n[Clang 13.0.0 (clang-1300.0.29.3)]',
     'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
     'os_details': 'macOS-12.4-x86_64-i386-64bit',
     'filtered': []}
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:126} - Sink (datahub-rest) report:
    {'records_written': '0', 'warnings': [], 'failures': [], 'gms_version': 'v0.8.43'}
    [2022-08-11 16:46:50,061] ERROR    {datahub.entrypoints:188} - Command failed with argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'. Run with --debug to get full trace
    [2022-08-11 16:46:50,061] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
  • k

    kind-whale-32412

    08/11/2022, 8:54 PM
    Hey there, I am using this example to tag columns in a table. One issue I noticed is the graph.get_aspect_v2 part, where you always have to make a GET request to the server first to obtain all existing tags, then append if it's a new tag, and then emit it to DataHub. I find this design a little odd in that the client side has to know what all the tags are, while the server side is completely stateless. I attempted to bypass getting the aspect and tried to just construct a MetadataChangeProposalWrapper with GlobalTagsClass(tags=[tag_association_to_add]) no matter what the state is. I noticed that this removes all the other tags. I was expecting that this would append only the tag that I am attempting to add, not remove other tags. Is this intended by design? Is there a way to change this with a flag or any other way to submit? One big issue here is the race condition: if I am submitting these changes through Kafka events (or even in a synchronous parallel way) and there happen to be multiple MCPWs for the same column, other tags could be lost.
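    For reference, the read-modify-write pattern being discussed looks roughly like this at the dataset level (the column-level variant goes through editableSchemaMetadata instead). The server address, dataset URN, and tag are placeholders:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder GMS
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"    # placeholder URN

    # Read the current aspect first: emitting GlobalTags replaces the whole aspect,
    # so the merge has to happen client-side.
    current = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="globalTags",
        aspect_type=GlobalTagsClass,
    ) or GlobalTagsClass(tags=[])

    new_tag = TagAssociationClass(tag="urn:li:tag:NeedsDocumentation")
    if all(t.tag != new_tag.tag for t in current.tags):
        current.tags.append(new_tag)

    graph.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current,
        )
    )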
  • d

    dazzling-insurance-83303

    08/11/2022, 9:55 PM
    Domain assignment using simple_add_dataset_domain: Hello. I am trying to associate domains with datasets using the transformers *simple_add_dataset_domain* specification, but I am getting the following error:
    KeyError: 'Did not find a registered class for simple_add_dataset_domain'
    I am able to associate a domain with database tables via the domains section under the source configuration section, but I am trying to group all the ownerships and associations in the same place, i.e., transformers. Per the documentation, the code under the transformers section should look like this:
    transformers:
      - type: "simple_add_dataset_domain"
        config:
          semantics: OVERWRITE
          domains:
            - urn:li:domain:engineering
    My construct, which throws the error, is as follows:
    transformers:
      - type: "simple_add_dataset_ownership"
      ...
      - type: "simple_add_dataset_domain"
        config:
          domains:
            - "urn:li:domain:xxxxxxxx-xxxx-xxxx-xxxx-bffdfb8b977d" # Finance
    Thoughts? Thanks!
  • a

    alert-fall-82501

    08/12/2022, 6:19 AM
    Hi Team - I have partitioned data on S3 for various partners. I am able to get that data onto the appropriate server. Right now I gave a hardcoded base path in the config file. Can anybody tell me how I can pick up just the table and schema without a hardcoded path?
  • a

    alert-fall-82501

    08/12/2022, 6:19 AM
    s3://my-bucket/foo/tests/bar.avro  # single file table
    s3://my-bucket/foo/tests/*.*  # multiple file level tables
    s3://my-bucket/foo/tests/{table}/*.avro  # table without partition
    s3://my-bucket/foo/tests/{table}/*/*.avro  # table where partitions are not specified
    s3://my-bucket/foo/tests/{table}/*.*  # table where neither partitions nor data type are specified
    s3://my-bucket/{dept}/tests/{table}/*.avro  # specifying keywords to be used in display name
    s3://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro  # specify partition key and value format
    s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro  # specify partition value only format
    s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # for all extensions
    s3://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # table is present at 2 levels down in bucket
    s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # table is present at 3 levels down in bucket
  • a

    alert-fall-82501

    08/12/2022, 6:23 AM
    I am following this documentation but need to get more clarification.
  • f

    full-chef-85630

    08/12/2022, 6:49 AM
    Hi all, I'm using DataHub's Airflow lineage plugin, and DataHub has token verification enabled. How do I set the token in Airflow?
  • b

    bright-cpu-56427

    08/12/2022, 7:15 AM
    Hi all, what value should I put in the database field of the YAML recipe when ingesting MySQL? If I put a database name that does not exist in MySQL, an error occurs; if I put the name of an existing database, it is displayed in DataHub as dbname.dbname.table, which is confusing. Should I configure it to ingest only one database per recipe?
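    If the goal is one recipe that covers several databases without hard-coding database, the MySQL source's database_pattern filter may help. A hedged sketch with placeholder credentials and database names (verify the option names against the MySQL source docs for your version):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql:3306",
                    "username": "datahub",  # placeholder
                    "password": "datahub",  # placeholder
                    "database_pattern": {"allow": ["^sales$", "^marketing$"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()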
  • a

    average-rocket-98592

    08/12/2022, 11:56 AM
    Hi, I’m trying to ingest metadata from PowerBI report server on premise. Did someone already try to do it? Thanks in advance for your help!!
  • f

    fancy-thailand-73281

    08/12/2022, 2:00 PM
    Hi All, we deployed DataHub v0.8.36 in an AWS EKS cluster with helm charts. We are using AWS MSK (Kafka) with SASL/SCRAM bootstrap servers and ZooKeeper TLS. Everything works fine up to this point, but we are not able to ingest data (Snowflake) from the UI: I see 'N/A' when I try to run ingestion. The DataHub docs (https://datahubproject.io/docs/ui-ingestion/) say that we need to enable datahub-actions, so we deployed the public.ecr.aws/datahub/acryl-datahub-actions pods. The pod is not running (CrashLoopBackOff) and we see the error logs below:
    [2022-08-11 19:56:59,004] ERROR    {datahub.entrypoints:138} - File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 77, in run
        67   def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:
    KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Failed to create consumer: No provider for SASL mechanism GSSAPI: recompile librdkafka with libsasl2 or openssl support. Current build options: PLAIN SASL_SCRAM OAUTHBEARER"}
    2022/08/11 19:56:59 Command exited with error: exit status 1
    Could someone please help us? Thanks in advance.
  • r

    rapid-fall-7147

    08/12/2022, 4:29 PM
    Hi All, we are ingesting Redshift metadata, and even though we have enabled profiling, the field descriptions (which are part of the Redshift DDL statement) are not getting populated in DataHub. Any suggestions?
  • d

    delightful-zebra-4875

    08/15/2022, 11:46 AM
    Hi, I want to display Flink catalog information when extracting metadata using Hive. The Flink catalog metadata is shown in Properties, but after I modified the column logic in hive.py and sql_common.py and replaced the corresponding files in the datahub Python package inside the acryldata/datahub-actions Docker image and re-ran it, the front end doesn't show the data I want.
  • c

    chilly-sundown-93656

    08/15/2022, 1:18 PM
    Hello, I'm trying to build some automation around DataHub ingestion sources. In our company we have lots of different data stores and numerous clusters; for example, we have 40 different Kafka clusters, and I'm not sure if my approach is the correct one. Initially, I created a Python script that creates sources with the GraphQL API and triggers the executions:
    query = """
       mutation
        {
          createIngestionSource(input: {
            name: "$name",
            type: "kafka",
            description: "$name",
            config: {
              recipe: "$recipe",
              executorId: "default"
            }
          })
    
        }
    
    """
    The rationale is to let users from different teams add their own data stores and see what we are already ingesting. I'm not sure if this is a legit method, because the service experiences a lot of trouble pulling topics from several clusters in parallel: there are a lot of MySQL, Kafka, and Schema Registry exceptions, and sometimes I need to rerun the execution to make it pass. I'd appreciate any advice on this matter. Thanks.
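    One thing that tends to make scripts like this sturdier is passing the recipe through GraphQL variables instead of string substitution, so quotes and newlines in the recipe don't need manual escaping. A hedged sketch against the /api/graphql endpoint with a personal access token; the input type name is from memory, so confirm it against your server's GraphQL schema, and every value is a placeholder:
    import json

    import requests

    DATAHUB_GRAPHQL = "http://datahub-gms:8080/api/graphql"  # or <frontend>/api/v2/graphql
    TOKEN = "<personal-access-token>"

    MUTATION = """
    mutation createSource($input: UpdateIngestionSourceInput!) {
      createIngestionSource(input: $input)
    }
    """

    recipe = {
        "source": {
            "type": "kafka",
            "config": {"connection": {"bootstrap": "broker:9092"}},
        },
    }

    resp = requests.post(
        DATAHUB_GRAPHQL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "query": MUTATION,
            "variables": {
                "input": {
                    "name": "kafka-cluster-1",
                    "type": "kafka",
                    "description": "kafka-cluster-1",
                    # The recipe is stored as a JSON string, matching the "$recipe" field above.
                    "config": {"recipe": json.dumps(recipe), "executorId": "default"},
                }
            },
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())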
  • b

    bright-receptionist-94235

    08/15/2022, 4:49 PM
    Hi All, general question: who writes the data source plugins, DataHub or the data source team?
  • b

    boundless-mechanic-19488

    08/15/2022, 6:18 PM
    Hey guys, I am trying to ingest my Glue catalog with the following recipe, deployed via docker-compose, but it always returns 0 assets:
    source:
        type: glue
        config:
            aws_region: eu-central-1
            aws_access_key_id: YYY
            aws_secret_access_key: XXX
    I've attached the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "glue:GetDatabases",
            "glue:GetTables"
          ],
          "Resource": [
            "arn:aws:glue:eu-central-1:XXX:catalog",
            "arn:aws:glue:eu-central-1:XXX:database/*",
            "arn:aws:glue:eu-central-1:XXX:table/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "glue:GetDataflowGraph",
            "glue:GetJobs"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::*"
          ]
        }
      ]
    }
    Does anyone have a hint? The pipeline executes successfully in about 2.5 seconds every time, ingesting 0 assets and not reporting any issues apart from a client-server incompatibility warning:
    Your client version 0.8.42 is older than your server version 0.8.43. Upgrading the cli to 0.8.43 is recommended.
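    One quick way to rule out permissions or region issues is to list the catalog with boto3 using the exact same credentials and region as the recipe; if this prints nothing, the recipe also has nothing to ingest. A sketch reusing the placeholder credentials from the recipe above:
    import boto3

    # Sanity check outside DataHub: confirm these credentials/region can actually see
    # Glue databases and tables.
    glue = boto3.client(
        "glue",
        region_name="eu-central-1",
        aws_access_key_id="YYY",
        aws_secret_access_key="XXX",
    )

    for db in glue.get_databases()["DatabaseList"]:
        tables = glue.get_tables(DatabaseName=db["Name"])["TableList"]
        print(db["Name"], len(tables))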
  • q

    quick-megabyte-61846

    08/12/2022, 1:17 PM
    Hello, while trying to do some housekeeping inside my demo DataHub I found a weird bug:
    ❯ datahub get --urn "urn:li:dataPlatform:dbt"
    {
      "dataPlatformInfo": {
        "datasetNameDelimiter": ".",
        "displayName": "dbt",
        "logoUrl": "/assets/platforms/dbtlogo.png",
        "name": "dbt",
        "type": "OTHERS"
      },
      "dataPlatformKey": {
        "platformName": "dbt"
      }
    }
  • a

    alert-fall-82501

    08/16/2022, 6:21 AM
    Hi Team - what if I need to ingest multiple schemas and tables from an S3 delta lake into DataHub using a single config file? Please suggest; I will have data on different S3 paths and want to use a single config file.