# ingestion
  • f

    few-air-56117

    12/17/2021, 7:57 AM
Hi all, I tried to ingest data from BigQuery and generate lineage automatically; this is the config
    Copy code
    source:
      type: bigquery
      config:
        project_id: <project_id>
        include_table_lineage: True
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    The table/views are in datahub but the lineage button si not available. Am i missing something? Thx a lot 😄
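A quick way to verify whether lineage was actually written is to read the upstreamLineage aspect straight from GMS; a minimal sketch, assuming the quickstart GMS on localhost:8080 and a placeholder dataset URN:

# Sketch: read the upstreamLineage aspect for one dataset from GMS
# (the URN below is a placeholder; substitute a real BigQuery dataset URN).
import urllib.parse

import requests

GMS = "http://localhost:8080"
urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my_project.my_dataset.my_table,PROD)"

resp = requests.get(
    f"{GMS}/aspects/{urllib.parse.quote(urn, safe='')}",
    params={"aspect": "upstreamLineage", "version": 0},
)
# A 404 here usually means no upstreamLineage aspect was written for that dataset.
print(resp.status_code, resp.text)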
  • n

    nice-planet-17111

    12/17/2021, 8:29 AM
Hi all, does DataHub support ingestion of BigQuery UDFs? I tried to do it, but it returns nothing (even if I set include_views: true)
  • f

    few-air-56117

    12/17/2021, 8:58 AM
Hi everyone, I have a question about BigQuery lineage: if I create a table (C) based on a view (B) that is based on another table (A), the lineage is not A -> B -> C but A -> C; the view is excluded. Is this normal?
  • g

    green-football-48146

    12/17/2021, 10:03 AM
Hi all, when we ingest metadata from hive, if it encounters an abnormality in some tables, the ingestion is interrupted. Is there any way to skip these abnormal tables when errors occur?
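One common workaround is to deny-list the known-problematic tables in the recipe so the rest of the run completes; a minimal sketch using Pipeline.create, where the host, credentials and regexes are placeholders:

# Sketch: skip known-bad Hive tables via table_pattern.deny (placeholders throughout).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "localhost:10000",
                "table_pattern": {
                    # Regexes are matched against "<schema>.<table>".
                    "deny": ["my_db\\.broken_table_1", "my_db\\.broken_table_2"],
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()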
  • p

    proud-accountant-49377

    12/17/2021, 11:42 AM
    Hi everyone!😊 When I add a term to a field in my dataset’s schema, this term only appears in the editableSchema object ... is there any way for it to directly modify the schemaMetadata object and appear there? Thanks!
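If the goal is for the term to live in schemaMetadata rather than editableSchemaMetadata, one option is to emit it from the ingestion side; a minimal sketch with the Python emitter, assuming a reasonably recent acryl-datahub version, where the dataset, field and term names are all placeholders:

# Sketch: emit schemaMetadata with a glossary term attached to one field
# (all URNs, field names and types below are placeholders).
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

schema = SchemaMetadataClass(
    schemaName="customers",
    platform="urn:li:dataPlatform:postgres",
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    created=now,
    lastModified=now,
    fields=[
        SchemaFieldClass(
            fieldPath="customer_email",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="VARCHAR",
            glossaryTerms=GlossaryTermsClass(
                terms=[GlossaryTermAssociationClass(urn=make_term_urn("Classification.PII"))],
                auditStamp=now,
            ),
        )
    ],
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=make_dataset_urn("postgres", "public.customers", "PROD"),
        aspectName="schemaMetadata",
        aspect=schema,
    )
)

Note that re-emitting schemaMetadata from outside the ingestion source will be overwritten on the next ingestion run, so in practice this usually belongs in a transformer or in the source itself.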
  • b

    best-planet-6756

    12/17/2021, 7:32 PM
    Hi all, I have ingested an Oracle DB and added a database alias to the recipe. Is there a way to query on the alias in graphql?
  • m

    millions-fall-80793

    12/20/2021, 4:17 AM
Hey guys, I am using this business glossary to ingest into DataHub (v0.8.18) via this recipe. All works fine. My question is: how do I purge/delete the glossary terms?
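If the CLI version in use includes the datahub delete command, that is probably the simplest route; otherwise, a minimal sketch of soft-deleting a single term by writing a status aspect against its URN. The term name is a placeholder, and it is an assumption that the glossaryTerm entity honours the status aspect in this version:

# Sketch: soft-delete one glossary term by writing Status(removed=True)
# (the term name in the URN below is a placeholder).
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

term_urn = "urn:li:glossaryTerm:Classification.Sensitive"

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="glossaryTerm",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=term_urn,
        aspectName="status",
        aspect=StatusClass(removed=True),
    )
)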
  • b

    busy-zebra-64439

    12/20/2021, 9:01 AM
Hi team, I have an issue while setting up the ingestor. I have prepared the connection yml and I ran the ingestor command datahub ingest -c mysql_ingestor.yml, but it throws the error: mysql is disabled; try running: pip install 'acryl-datahub[mysql]'. When I run pip install 'acryl-datahub[mysql]', the error shows as: ERROR: Could not find a version that satisfies the requirement acryl-datahub[mysql] (from versions: none) ERROR: No matching distribution found for acryl-datahub[mysql]. How do I activate the mysql source for ingestion?
  • w

    witty-butcher-82399

    12/20/2021, 9:28 AM
Is there any chance that acryldata's pyhive fork could include the SparkSQL dialect? We have been testing the DataHub Hive connector with Spark Thrift Server, and pyhive requires those updates. From what I've been reading, the authors of pyhive (Dropbox) don't want to include the new dialect; instead they want Spark to be Hive compatible, as it claims to be. This can be noted here or here. Thanks! 🧵
  • f

    few-air-56117

    12/20/2021, 2:37 PM
Hi all, did anyone have speed problems when trying to ingest data from BigQuery with a one- to two-year gap between start date and end date?
  • r

    red-pizza-28006

    12/20/2021, 2:51 PM
I am noticing an issue with the Snowflake lineage. Here is an example - I created a temp table with a CTE to build the actual dataset, like this
    Copy code
    CREATE TEMP TABLE temp.temp_stone AS
    WITH upd_stone_response_messages AS
              (
                 SELECT DISTINCT srm.id AS stone_response_message_id
                 FROM src_payment.stone_response_messages srm
                 WHERE updated_at BETWEEN $date_start AND $date_end
                   AND request_type = 'authorize'
                   AND transaction_id IS NOT NULL
              )
Here you can see I have a dependency on src_payment.stone_response_messages, but when I look at the lineage UI, I only see that the dataset is built using temp_stone and nothing more than that. The SQL inside the CTE is not captured in the lineage.
  • m

    modern-monitor-81461

    12/20/2021, 3:09 PM
Hi all, I am writing a custom source (Iceberg in this case. I know it's on the roadmap, but I need it now and I'm using this to understand the DataHub internals) and I am having problems adding a MetadataChangeEvent with a SchemaMetadata aspect. It looks like something is rejected by the Avro validator, but it doesn't tell me what. Is there a trick to figure out what exactly is incompatible with the schema?
    Copy code
    File "/datahub/metadata-ingestion/src/datahub/cli/ingest_cli.py", line 82, in run
        pipeline.run()
    File "/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 157, in run
        for record_envelope in self.transform(record_envelopes):
    File "/datahub/metadata-ingestion/src/datahub/ingestion/extractor/mce_extractor.py", line 46, in get_records
        raise ValueError(
    
    ValueError: source produced an invalid metadata work unit: MetadataChangeEventClass(...
  • m

    microscopic-elephant-47912

    12/20/2021, 8:40 PM
Hi all, I'm trying to ingest LookML files but I get an error. I looked around but could not find a solution or a bug report. Could you please check?
    looker-dwh-master.zip
  • m

    mysterious-lamp-91034

    12/23/2021, 5:07 AM
Hi, I have ingested 116,554 tables into DataHub; the web UI began to crash (waiting forever) once 10k tables were ingested. I am not sure what is going on. datahub docker check shows no issues. For context, I am running docker-compose.quickstart.yml on my dev machine.
  • a

    abundant-photographer-45796

    12/24/2021, 6:25 AM
I performed a Superset ingestion; the YAML is shown below
    Copy code
    source:
      type: superset
      config:
        # Coordinates
    connect_uri: http://localhost:8088
    
        # Credentials
        username: xxx
        password: xxx
        provider: db
    
    sink:
      # sink configs
      type: "datahub-rest"
      config:
        server: "<http://192.168.229.4:8080>"
Then I carried out the ingestion command:
Copy code
datahub ingest -c superset.yml
I get this hint, but on my DataHub homepage I can't see the charts. Can someone tell me why? Thank you
  • b

    busy-zebra-64439

    12/27/2021, 11:26 AM
Hi team, I am facing the below issue while trying to ingest Oracle data using the docker image. Kindly provide some help to resolve this issue.
Docker command: docker run 3d271c19a693 ingest --config /data/oracle_ingestor.yml
Error: DatabaseError: (cx_Oracle.DatabaseError) DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html for help (Background on this error at: http://sqlalche.me/e/13/4xp6)
  • r

    rich-policeman-92383

    12/27/2021, 12:32 PM
Hello, can we specify a Hive queue name while ingesting metadata from a Hive source? In beeline we can do something like:
    Copy code
beeline -e "set tez.queue.name='myqueue'; describe formatted myschem.mytable;"
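For the SQLAlchemy-based Hive source, one possible route is to pass session configuration through connect_args, which PyHive forwards to the HiveServer2 session; a minimal, untested sketch where the host, user and queue name are placeholders, and the pass-through of options/connect_args is an assumption:

# Sketch: set a Tez queue for the Hive connection via SQLAlchemy connect_args
# (host, username and queue name below are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",
                "username": "etl_user",
                "options": {
                    "connect_args": {
                        "configuration": {"tez.queue.name": "myqueue"},
                    },
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()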
  • a

    agreeable-river-32119

    12/28/2021, 6:49 AM
Hello team, we currently run scheduler tasks with Apache DolphinScheduler. I found that you provide acryl-datahub[airflow] as a lineage component. How can I develop acryl-datahub[dolphinscheduler] for us? As a contributor to Apache DolphinScheduler, I want to participate in DataHub. 😊
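A minimal sketch of the kind of metadata such an integration could emit, mirroring how the Airflow plugin models a workflow as a DataFlow and each task as a DataJob; all names, URNs and datasets below are placeholders:

# Sketch: emit a DataFlow, a DataJob and its input/output lineage for one
# DolphinScheduler workflow run (placeholder names throughout).
from datahub.emitter.mce_builder import (
    make_data_flow_urn,
    make_data_job_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DataFlowInfoClass,
    DataJobInfoClass,
    DataJobInputOutputClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")

flow_urn = make_data_flow_urn("dolphinscheduler", "daily_etl", "PROD")
job_urn = make_data_job_urn("dolphinscheduler", "daily_etl", "load_orders", "PROD")

mcps = [
    MetadataChangeProposalWrapper(
        entityType="dataFlow",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=flow_urn,
        aspectName="dataFlowInfo",
        aspect=DataFlowInfoClass(name="daily_etl"),
    ),
    MetadataChangeProposalWrapper(
        entityType="dataJob",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=job_urn,
        aspectName="dataJobInfo",
        aspect=DataJobInfoClass(name="load_orders", type="COMMAND"),
    ),
    MetadataChangeProposalWrapper(
        entityType="dataJob",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=job_urn,
        aspectName="dataJobInputOutput",
        aspect=DataJobInputOutputClass(
            inputDatasets=[make_dataset_urn("mysql", "shop.orders", "PROD")],
            outputDatasets=[make_dataset_urn("hive", "dw.orders", "PROD")],
        ),
    ),
]
for mcp in mcps:
    emitter.emit_mcp(mcp)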
  • b

    busy-zebra-64439

    12/28/2021, 1:55 PM
Hi team, I have some queries on the Oracle ingestion.
Issue 1: I faced the error "column at array position 0 fetched with error"; using include_views: False, this issue got fixed. Was this issue fixed in any version of the ingestion? Currently used docker image: linkedin/datahub-ingestion:head
Issue 2: we see that very few tables are cataloged - only about 1% of tables were catalogued during the ingestion process. Could someone please provide an update on this issue?
Sample yaml:
source:
  type: oracle
  config:
    # Coordinates
    host_port: localhost:2115
    # Credentials
    username: sampleuser
    password: sample
    service_name: SAMPLE
    include_views: False
    table_pattern:
      ignoreCase: False
    schema_pattern:
      ignoreCase: False
    view_pattern:
      ignoreCase: False
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
  • n

    nice-autumn-10105

    12/28/2021, 5:35 PM
Is anyone using the mssql ingestor with integrated auth rather than uid and pwd? Our database environment here only allows integrated auth.
  • c

    curved-magazine-23582

    12/28/2021, 8:38 PM
Hello team, where can I find more info about the current implementation of PK/FK support - info such as the UI, supported platforms / data stores, etc.?
  • l

    lemon-cartoon-14299

    12/29/2021, 12:13 AM
Hello all, I am pretty new to the DataHub tool and started off with Docker installed on my laptop. I was able to import some metadata from a Trino data source into DataHub. I have a couple of issues and would appreciate it if someone could help me here. 1. I turned on profiling for the Trino data source but I still don't see any stats around it. The stats and lineage tabs are always disabled. 2. Is there a way to set up lineage manually between data sources?
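On question 2, a minimal sketch of setting lineage by hand with the Python REST emitter; the dataset names below are placeholders:

# Sketch: declare that one dataset is derived from another (placeholder names).
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

lineage_mce = make_lineage_mce(
    upstream_urns=[make_dataset_urn("trino", "warehouse.raw.orders", "PROD")],
    downstream_urn=make_dataset_urn("trino", "warehouse.mart.orders_daily", "PROD"),
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)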
  • b

    better-orange-49102

    12/30/2021, 7:34 AM
For the command datahub docker quickstart, there is an option to build locally (i.e. --build-locally). Why does it still do a docker-compose pull if we specify building locally?
  • n

    nice-country-99675

    12/30/2021, 12:13 PM
👋 Hi team! I have a pretty vague question, and I would like to make it as concise as possible... I have a Redshift ingestion coded as an Airflow DAG, which runs a pipeline that looks like this
    Copy code
    pipeline = Pipeline.create(
            {
                "source": {
                    "type": source,
                    "config": {
                        "username": f"{conn.login}",
                        "password": f"{conn.password}",
                        "database": f"{conn.schema}",
                        "host_port": f"{conn.host}:{conn.port}",
                        "database_alias": alias,
                        "env": "PROD",
                        "schema_pattern": {
                            "deny": deny_schemas
                        },
                    },
                },
                "transformers": transformers,
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": f"{datahub.host}"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()
    The thing is the DAG ends with
    Copy code
    {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
It's the only thing that is actually logged... it seems the task fails as soon as the process starts. At first I thought it was a memory issue; we increased the pod's memory, and now we are pretty far from the memory limit. It even fails when I run a dry_run. Locally it's working fine. Locally I'm using Airflow 2.2.2 while in production I'm using Airflow 2.2.1. I would really appreciate any suggestions...
  • g

    gentle-florist-49869

    12/30/2021, 2:42 PM
Hi team, does anyone here have an introductory tutorial for viewing DataHub logs/data in Elasticsearch or Kibana? Both are already up and working.
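As a starting point, a minimal sketch that lists the indices DataHub has created in the quickstart Elasticsearch; the localhost:9200 address is an assumption:

# Sketch: list Elasticsearch indices to see what DataHub has created.
import requests

resp = requests.get("http://localhost:9200/_cat/indices", params={"v": "true"})
print(resp.text)  # DataHub search indices typically end in "index_v2"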
  • d

    damp-ambulance-34232

    01/03/2022, 4:25 AM
Does DataHub support ingesting Hive tables stored in the Kudu format?
  • r

    red-pizza-28006

    01/03/2022, 12:52 PM
Hi team - I started ingesting data from Confluent Cloud Kafka but it seems to be super slow to ingest the data. Here is an example - you can see it takes about 2 minutes per topic, which is not scalable for us
    Copy code
    [2022-01-03 13:44:47,679] INFO     {datahub.cli.ingest_cli:81} - Starting metadata ingestion
    [2022-01-03 13:47:19,428] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit kafka-<topic1>
    [2022-01-03 13:49:50,867] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit kafka-<topic2>
  • b

    better-orange-49102

    01/03/2022, 1:27 PM
For the access tokens, do they work with the edit policies? I.e., if person A does not have edit rights to a dataset and passes in an MCE/MCP about that dataset to :9002/api/gms with his access token, will he be rejected? Also, if I do not allow users to generate their own tokens, can I still query for users' tokens in the backend (via custom UI code) and use them to ingest metadata on their behalf?
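For reference, a minimal sketch of ingesting through GMS with a personal access token so the request is evaluated as that user; the token value is a placeholder, and token support in the emitter assumes a recent acryl-datahub version:

# Sketch: REST emitter authenticated with a personal access token (placeholder value).
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(
    gms_server="http://localhost:8080",
    token="<personal-access-token>",
)
emitter.test_connection()  # raises if GMS is unreachable or rejects the token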
  • g

    gentle-florist-49869

    01/03/2022, 6:14 PM
Hi Team, happy new year - I'm trying to create a datahub-mae-consumer docker container via yml - https://github.com/linkedin/datahub/blob/master/docker/docker-compose.consumers.yml - but received the error:
***************************
APPLICATION FAILED TO START
***************************
Description:
Field systemAuthentication in com.linkedin.metadata.kafka.config.EntityHydratorConfig required a bean of type 'com.datahub.authentication.Authentication' that could not be found.
The injection point has the following annotations:
- @org.springframework.beans.factory.annotation.Autowired(required=true)
- @org.springframework.beans.factory.annotation.Qualifier(value=systemAuthentication)
Action:
Consider defining a bean of type 'com.datahub.authentication.Authentication' in your configuration.
2022/01/03 163547 Command exited with error: exit status 1
  • a

    adventurous-apple-98365

    01/04/2022, 2:14 AM
Hey all - wondering if anyone has any ideas about creating new tags when they are ingested as part of a dataset. When we ingest a dataset we are adding custom tags that don't yet exist in the GlobalTags aspect. The tag itself isn't ingested (it's not in the Elastic tag index, so we can't search!) but it does properly appear on the dataset and in the list of 'filter checkboxes' when viewing datasets. Is there any way to have the tag also created, other than ingesting the tag separately before ingesting the dataset? Not sure if it makes more sense to solve this in one place (within DataHub itself) versus in each of our ingestion plugins.
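A minimal sketch of the second option, creating the tag entity explicitly so it lands in the tag search index; the tag name and description are placeholders:

# Sketch: emit the tag entity itself, in addition to referencing it from datasets
# (tag name and description below are placeholders).
from datahub.emitter.mce_builder import make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, TagPropertiesClass

tag_urn = make_tag_urn("pii")

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="tag",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=tag_urn,
        aspectName="tagProperties",
        aspect=TagPropertiesClass(name="pii", description="Contains personal data"),
    )
)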