# ingestion
  • c

    cool-painting-92220

    02/02/2022, 1:27 AM
    Hey everyone! I had a question about Snowflake query usage stat ingestion: the user account I created in Snowflake for DataHub ingestion is not an `accountadmin`; instead, I've applied a lower-level role (it has restricted access to tables and is masked from seeing particular sensitive rows/columns) and have granted the user access to the account_usage Snowflake schema. Will the queries pulled in this user's ingestion job also include queries that have been made by users with higher-level access (e.g. an account admin)? As an example of this scenario:
    Copy code
    Tables:
    Table A
    Table B
    
    Users:
    User 1: can only access Table A
    User 2: can access Table A and Table B
    
    User 2 has made a query before of the following: 
    SELECT a.uid FROM "Table A" a JOIN "Table B" b ON a.uid = b.uid
    
    Let's say I used User 1's credentials for my ingestion job of Table A: would the query usage stats pull User 2's query above?
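    As far as I understand, the snowflake-usage source reads query history from the shared snowflake.account_usage views rather than from the tables themselves, so what it returns depends on the grants on that schema rather than on the role's table-level access. A minimal sketch of such a job driven through datahub.ingestion.run.pipeline, with hypothetical account, role, and credential values:
    Copy code
    # Hypothetical sketch: usage ingestion with a restricted (non-accountadmin) role.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake-usage",
                "config": {
                    "host_port": "my_account",          # placeholder account identifier
                    "warehouse": "my_warehouse",
                    "username": "datahub_user",
                    "password": "...",
                    "role": "datahub_restricted_role",  # the lower-level role described above
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()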
    o
    • 2
    • 1
  • r

    rich-winter-40155

    02/02/2022, 1:31 AM
    Hi, I am new to DataHub. We are looking to set up Google-based SSO for our DataHub instance; how will the metadata REST connector work if we enable security via Google sign-in? I tried to look into the docs and I see there is a token option, but I am not sure how that will work. Appreciate any help here. Thanks
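    For what it's worth, the datahub-rest sink and the Python REST emitter both accept a token, so ingestion can authenticate with a personal access token even when the frontend sits behind Google sign-in; this assumes metadata-service token authentication is enabled in your deployment. A hedged sketch with placeholder server and token values:
    Copy code
    # Sketch: authenticating the REST emitter with a personal access token (placeholders).
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="http://datahub-gms:8080",  # placeholder GMS address
        token="<personal-access-token>",       # token generated from the DataHub UI
    )
    emitter.test_connection()
    In a YAML recipe, the equivalent would be a token field under the datahub-rest sink config.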
    b
    • 2
    • 8
  • r

    rhythmic-kitchen-64860

    02/02/2022, 2:31 AM
    Hi, I want to know: can we ingest only one table from the database? Thanks in advance!!
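    If it helps, most SQL-based sources take allow/deny regex patterns, so ingestion can be restricted to a single table. A rough sketch via datahub.ingestion.run.pipeline (the MySQL source and all names here are placeholders):
    Copy code
    # Sketch: restrict ingestion to one table via table_pattern.allow (placeholder names).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "...",
                    # regex anchored to exactly one <database>.<table>
                    "table_pattern": {"allow": ["^mydb\\.customers$"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()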
    b
    • 2
    • 7
  • c

    curved-truck-53235

    02/02/2022, 6:56 AM
    Hi everyone! Can we use environment variables in YAML? I know about datahub.ingestion.run.pipeline, but YAML is preferable for us.
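    If I remember correctly, the CLI expands ${VAR}-style environment variable references inside recipe YAML, which covers the common case of keeping secrets out of the file; failing that, the same recipe can be built in Python and read values from os.environ. A small sketch with made-up variable names:
    Copy code
    # Sketch: pulling connection details from environment variables (hypothetical names).
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": os.environ["DATAHUB_PG_HOST_PORT"],
                    "username": os.environ["DATAHUB_PG_USER"],
                    "password": os.environ["DATAHUB_PG_PASSWORD"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": os.environ.get("DATAHUB_GMS", "http://localhost:8080")},
            },
        }
    )
    pipeline.run()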
    b
    • 2
    • 2
  • m

    modern-monitor-81461

    02/02/2022, 12:22 PM
    How to disable Airflow lineage for some DAGs: Hi all, I am using Airflow as a job scheduler and I have been enjoying the lineage backend with DataHub. I have looked at the code and did not see any hint of this, so I'll ask here: is there a way to configure a DAG or an Operator to prevent Airflow from emitting task and pipeline lineage to DataHub? By default, once you install and configure the backend, any task and DAG that runs in Airflow will emit to DataHub. That's all cool, but we have jobs running in Airflow that are unrelated to data (infrastructure maintenance, housekeeping, etc.) and it makes no sense to see those in DataHub. It would be nice if there were a flag I could set on a DAG and/or Operator indicating whether Airflow should emit to DataHub, with a default for that flag settable in the lineage backend config so you can overwrite the current default behavior (emit by default or not). Does this make sense?
    o
    s
    • 3
    • 4
  • d

    dazzling-cat-48477

    02/02/2022, 5:36 PM
    Hello everyone. I have a pipeline from which I would like to extract the lineage. The pipeline consists of the following components: AWS S3 buckets, AWS Glue jobs (PySpark), and AWS Redshift, all orchestrated by AWS MWAA (Airflow). So far I have managed to visualize the lineage of the S3, Redshift, and Glue jobs (although the latter was a bit difficult), and I wanted to try to get the lineage from Airflow, taking into account that the Airflow tasks are all of type AwsGlueJobOperator. Since our Airflow is operated by AWS, we are not able to use the lineage backend due to version incompatibility, so I plan to try to get it with the help of the DatahubEmitterOperator. My questions are: 1. Is it possible to tell the lineage emitter that both its upstream and downstream tasks are AwsGlueJobOperator-type tasks? 2. If this is not possible, could it be done with the Spline spark-agent to extract the data from the Glue jobs?
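    A rough sketch of the DatahubEmitterOperator approach, under stated assumptions: the dataset URNs, platform names, and the Airflow connection id below are made up, and the operator simply emits a hand-built lineage MCE from inside the DAG, independent of what operator type the surrounding tasks use:
    Copy code
    # Hypothetical sketch: emit S3 -> Redshift lineage from an MWAA DAG (inside the DAG definition).
    import datahub.emitter.mce_builder as builder
    from datahub_provider.operators.datahub import DatahubEmitterOperator

    emit_lineage = DatahubEmitterOperator(
        task_id="emit_glue_job_lineage",
        datahub_conn_id="datahub_rest_default",  # Airflow connection pointing at GMS
        mces=[
            builder.make_lineage_mce(
                upstream_urns=[builder.make_dataset_urn("s3", "my-bucket/raw/events", "PROD")],
                downstream_urn=builder.make_dataset_urn("redshift", "dev.stg.version_detail", "PROD"),
            )
        ],
    )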
    o
    m
    b
    • 4
    • 5
  • h

    handsome-football-66174

    02/02/2022, 6:02 PM
    Hi everyone, I see that there is ingestion via the Web UI. Wanted to understand whether there is governance on who can execute the ingestions, etc.
    o
    • 2
    • 2
  • l

    late-bear-87552

    02/02/2022, 6:04 PM
    Copy code
    source:
      type: "bigquery"
      config:
        ## Coordinates
        project_id: adf-adfa-240416
        credential:
          project_id: adf-adfa-240416
          private_key_id: ""
          private_key: "-----BEGIN PRIVATE KEY"
          client_email: ""
          client_id: ""
        table_pattern:
          deny:
          - 
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    o
    • 2
    • 2
  • a

    ancient-apartment-23316

    02/02/2022, 8:07 PM
    Hi, I'm trying to ingest from Snowflake to DataHub using my local machine, and I get an HTTP 500 error:
    Copy code
    [2022-02-02 19:57:29,766] ERROR    {datahub.ingestion.run.pipeline:87} - failed to write record with workunit corp_data_forge_ods_dev.ada.curr_ada_permissions with ('Unable to emit metadata to DataHub GMS'
    'status': 500
    Errors in the GMS pod:
    Copy code
    16:31:33.065 [qtp544724190-11] INFO  c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 500 - 0ms
    16:31:33.066 [qtp544724190-11] ERROR c.l.m.filter.RestliLoggingFilter:38 - java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    This is my recipe:
    Copy code
    source:
      type: snowflake
      config:
        env: POC
        host_port: "myacc"
        warehouse: "wh-name"
        database_pattern:
          allow:
            - "db-name"
        username: "username"
        password: "pass"
        role: "myrole"
    sink:
      type: "datahub-rest"
      config:
        server: "<http://123123-123123.us-east-1.elb.amazonaws.com:8080>"
    GMS is up; I can send an API request and receive a response:
    Copy code
    curl --location --request POST 'http://123123-12312123.us-east-1.elb.amazonaws.com:8080/entities?action=search' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "input": "*",
        "entity": "dataset",
        "start": 0,
        "count": 1000
    }'
    But I can set the sink to JSON, and it works. Then I'm able to set source = json, sink = datahub, and it works! I don't know how that happens.
    o
    • 2
    • 12
  • g

    glamorous-microphone-33484

    02/03/2022, 1:12 AM
    In our org we will use Spark to read from Kafka and write to Kafka/Hive/files. Can DataHub extract this lineage info from Spark streaming jobs using DatahubSparkListener?
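    For reference, the batch-job setup I'm aware of looks like the sketch below (the package version and GMS address are placeholders); whether the listener also captures structured-streaming writes is exactly the open question here.
    Copy code
    # Sketch: attaching DatahubSparkListener to a Spark session (placeholder version and server).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("kafka_to_hive_job")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")  # placeholder version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://datahub-gms:8080")
        .getOrCreate()
    )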
    o
    l
    c
    • 4
    • 7
  • l

    late-bear-87552

    02/03/2022, 5:42 AM
    Facing an issue while trying to ingest via the UI; it works through the datahub ingest command. Any idea what is missing?
    Copy code
    source:
        type: bigquery
        config:
            project_id: re-240416
            credential:
                private_key_id: 134143qefqafa12341
                private_key: "-----BEGIN PRIVATE KEY-----\n\n-----END PRIVATE KEY-----\n"
                client_email: test-query@re.gserviceaccount.com
                client_id: '4512451451341341'
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'
    d
    • 2
    • 3
  • f

    few-air-56117

    02/03/2022, 7:13 AM
    Hi guys, I tried to ingest bigquery-usage for 2 projects. It started, but after 2-3 minutes I get this error:
    Copy code
    Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'logging.googleapis.com' for consumer 'project_number:491986273194'. [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'RATE_LIMIT_EXCEEDED', 'domain': 'googleapis.com', 'metadata': {'consumer': 'projects/491986273194', 'quota_metric': 'logging.googleapis.com/read_requests', 'quota_limit': 'ReadRequestsPerMinutePerProject', 'service': 'logging.googleapis.com'}}]
    This is the recipe:
    Copy code
    source:
      type: bigquery-usage
      config:
        # Coordinates
        projects:
          - <project1>
          - <project2>
        max_query_duration: 5
    
    sink:
      type: "datahub-rest"
      config:
        server: <ip>
    I use a k8s cronjob and this image
    Copy code
    linkedin/datahub-ingestion:v0.8.24
    with this command
    Copy code
    args: ["ingest", "-c", "file"]
    Thx 😄.
    ✅ 1
    d
    b
    • 3
    • 17
  • s

    sparse-planet-56664

    02/03/2022, 12:27 PM
    Hi, just testing out the meta_mapping with the dbt ingestion. What if we have a meta key that contains different values that should map to different terms? Let's say we can have this in different models:
    Copy code
    meta:
      some_key: S1
    meta:
      some_key: S2
    Is this possible? Currently we are doing the mapping ourselves, but I wanted to test this out so we don't have to add our own logic/complexity. I can't see anywhere in the documentation that we can reuse the actual value from the meta key. Or is it possible to use a regexp match in the "match" field?
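    For anyone following along, this is the meta_mapping shape I have seen documented, where each meta key maps to a single match pattern and operation (term names and paths below are made up); whether the matched value itself can be reused in the term name is the open question here.
    Copy code
    # Sketch of the documented meta_mapping shape (hypothetical term/key names).
    dbt_source_config = {
        "manifest_path": "/path/to/manifest.json",
        "catalog_path": "/path/to/catalog.json",
        "target_platform": "snowflake",
        "meta_mapping": {
            "some_key": {
                "match": "S1",  # pattern matched against the meta value
                "operation": "add_term",
                "config": {"term": "Term_S1"},
            },
        },
    }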
    o
    m
    m
    • 4
    • 6
  • b

    bland-orange-13353

    02/03/2022, 1:59 PM
    This message was deleted.
    l
    • 2
    • 1
  • h

    high-family-71209

    02/03/2022, 2:08 PM
    Hi, what is the status of the Kafka metadata connector? I would like to ingest some Avro schemas that are propagated via Kafka.
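    For what it's worth, the kafka source pulls topic schemas from a Confluent-compatible schema registry, which is where the Avro comes from; a minimal sketch with placeholder broker and registry addresses:
    Copy code
    # Sketch: kafka metadata source reading Avro schemas from a schema registry (placeholders).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    }
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()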
    l
    l
    • 3
    • 2
  • m

    millions-waiter-49836

    02/03/2022, 10:27 PM
    Hi everyone, about the data profiling feature: I noticed we use Great Expectations for SQL data stores and Deequ for the data lake. Can I ask what the considerations were behind this (besides Deequ lacking SQLAlchemy support)? If possible, I would also like to hear your comparison of those two tools, such as which tool is better for which scenarios, etc.
    👍 1
    s
    • 2
    • 1
  • g

    glamorous-microphone-33484

    02/04/2022, 9:17 AM
    Hi all, I have a few questions regarding the Kafka connector. 1. Does it support Cloudera-distributed Kafka or just Confluent Kafka? 2. Regarding security options, will the connector work on a Kafka cluster that uses Kerberos (i.e. sasl.mechanism=GSSAPI)? I tried to connect to my cluster by defining the mandatory parameters for Kerberos such as sasl.kerberos.service.name, sasl.kerberos.principal, sasl.kerberos.keytab, etc. However, it failed with the following exception: "No provider for SASL mechanism GSSAPI: recompile librdkafka with libsasl2 or openssl support. Current build options: PLAIN SASL_SCRAM OAUTHBEARER". Can I assume GSSAPI is not supported at the moment?
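    The Kerberos settings mentioned above would go through connection.consumer_config, which as far as I know is passed straight to the underlying librdkafka-based consumer, so GSSAPI availability comes down to how that library was built. A sketch of that block with hypothetical values:
    Copy code
    # Sketch: SASL/GSSAPI settings passed through connection.consumer_config (hypothetical values).
    kafka_connection = {
        "bootstrap": "broker:9092",
        "schema_registry_url": "http://schema-registry:8081",
        "consumer_config": {
            "security.protocol": "SASL_PLAINTEXT",
            "sasl.mechanism": "GSSAPI",
            "sasl.kerberos.service.name": "kafka",
            "sasl.kerberos.principal": "datahub@EXAMPLE.COM",
            "sasl.kerberos.keytab": "/etc/security/keytabs/datahub.keytab",
        },
    }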
    plus1 1
    r
    • 2
    • 1
  • r

    rich-policeman-92383

    02/04/2022, 11:26 AM
    In v0.8.20, setting env: "QA" in hive_source.yaml results in an "unknown Fabric Type" exception. Can we use all the FabricTypes defined here for all data sources?
    h
    • 2
    • 1
  • g

    gray-table-56299

    02/04/2022, 1:52 PM
    👋 I'm running into
    ValueError: source produced an invalid metadata work unit:
    when I am trying to write a custom ingestion script using the Python library. Is it possible to get a more specific exception message that provides info on which part of the MCP is invalid?
    m
    o
    • 3
    • 4
  • b

    bulky-arm-32887

    02/04/2022, 3:44 PM
    Hi everyone, I have a question about the BigQuery connector. It seems like external tables are ignored during the ingestion process. Is this a known limitation?
    o
    • 2
    • 1
  • b

    broad-battery-31188

    02/04/2022, 5:13 PM
    I am experiencing the error
    duplicate key value violates unique constraint "pk_metadata_aspect_v2"
    for DBT ingestion. Recipe:
    Copy code
    source:
      type: "dbt"
      config:
        manifest_path: "home/user/manifest.json"
        catalog_path: "/home/user/catalog.json"
        target_platform: "snowflake" 
        load_schemas: False
    q
    i
    • 3
    • 8
  • d

    dazzling-cat-48477

    02/04/2022, 10:10 PM
    Hi again everyone. Has anyone been able to visualize the lineage between AWS Glue and AWS Redshift? I think my annotation for the DataSink in the job is the problem, as it shows the Redshift dataset as a Glue dataset (see the red circle in the first image); it should look like the second image. I generated the Glue annotation manually because when I try to generate it through Glue Studio I get an error in the DataSink:
    [gluestudio-service.us[MASK].amazonaws.com] createScript: InvalidInputException: Invalid DataSink: DataSink(name=Amazon Redshift, classification=DataSink, type=Redshift, inputs=[node-2], isSinkInStreamingDAG=false)
    Am I missing something? I attach the Glue annotation below. Thank you!
    Copy code
    ## @type: DataSink
    ## @args: [database = "redshift_test", table_name = "dev_stg_stg_version_detail", transformation_ctx = "df3"]
    ## @return: df3
    ## @inputs: []
    b
    a
    b
    • 4
    • 5
  • n

    nutritious-egg-28432

    02/06/2022, 9:06 PM
    Hello all, is it possible to integrate Dataiku with DataHub?
    plus1 2
    l
    • 2
    • 2
  • g

    glamorous-microphone-33484

    02/07/2022, 5:12 AM
    Hi all, do you have any connector ready to ingest from MinIO?
    l
    • 2
    • 1
  • h

    high-hospital-85984

    02/07/2022, 1:11 PM
    I'm implementing a custom SQL parser (Snowflake dialect) for use with, for example, the LookML integration. I'm looking at the get_columns function and can't really figure out what the output should be. Is it supposed to return the "schema" of the LookML view, or the source columns from a lineage point of view? Based on the tests it's only the column names, without any possible source table prefix?
    d
    • 2
    • 5
  • c

    cool-gpu-73611

    02/07/2022, 2:48 PM
    Hi! Where can I see examples of creating metadata directly using the API? Better documentation would help. I see the ability to add data using plugins, but that is not enough for us, and it is too hard for me to create new plugins. Marquez, for example, supports a user-friendly API; it is easy to manipulate data using that API. But I don't see any user-friendly way with DataHub.
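    For what it's worth, the Python REST emitter can be used directly without writing a source plugin. A small sketch (the dataset name, description, and GMS address are placeholders) that upserts a single aspect:
    Copy code
    # Sketch: emit one aspect straight to GMS with the Python REST emitter (placeholder values).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("hive", "mydb.mytable", "PROD"),
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description="Created directly via the REST API"),
    )
    emitter.emit_mcp(mcp)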
    i
    l
    • 3
    • 24
  • s

    some-crayon-90964

    02/07/2022, 5:08 PM
    Hey Acryl team, we are wondering if it is possible to have GMS accept metadata but not actually store it in the database. We would like a pipeline that we developed to test through GMS when deployed, to make sure it works with our other systems. Thanks in advance!
    i
    • 2
    • 1
  • b

    busy-sandwich-94034

    02/08/2022, 4:19 AM
    Hi everyone, we are looking to ingest a Kafka schema registry, and we have customized the schema-registry authentication mechanism to only allow JWT. But I only find the basic-auth option in the Kafka metadata recipe.yaml. Do you have any idea how to use JWT as authentication? Thank you!
    i
    • 2
    • 2
  • g

    gray-table-56299

    02/08/2022, 12:25 PM
    👋 since only
    UPSERT
    is supported for MCPs, what's the recommended way to delete an aspect…?
    i
    • 2
    • 1
  • b

    bland-salesmen-77140

    02/08/2022, 1:15 PM
    Hi, some time ago we did a PoC at our company and we spotted that metadata ingestion from Snowflake has a constraint that all DBs, schemas, and table names should be in upper case. Is that still the case? Is there some kind of workaround? Unfortunately we use a case-sensitive naming convention and we cannot change that at this point.
    d
    • 2
    • 5