# ingestion
  • b

    boundless-student-48844

    09/08/2022, 3:36 PM
    Hi team, the :metadata-ingestion:lint task failed due to lint errors when running the mypy command (mypy src/ tests/ examples/). There are 72 errors, listed in thread. A suggestion: do you think the lint check could be enforced on PRs to metadata-ingestion, for better QA? 😅
  • c

    clean-tomato-22549

    09/09/2022, 5:33 AM
    Hi team, I have a question about LookML ingestion. It seems it requires specifying base_folder, which is the location of the local LookML git repo. Why is this required, given that we can specify github_info.repo?
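    For context, a hedged sketch of the two settings being discussed, written as a Python recipe dict (paths and repo name are placeholders; other required LookML connection settings are omitted, so treat this as illustrative only, not a working recipe):
    Copy code
    # Hedged sketch of a LookML recipe fragment; base_folder and github_info.repo
    # are the fields discussed above, all values are placeholders.
    lookml_recipe = {
        "source": {
            "type": "lookml",
            "config": {
                "base_folder": "/path/to/local/lookml/checkout",   # local clone of the LookML repo
                "github_info": {"repo": "my-org/my-lookml-repo"},  # used to build source links
                # NOTE: real recipes also need Looker API or connection_to_platform_map
                # settings, omitted here for brevity.
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
        },
    }
    # from datahub.ingestion.run.pipeline import Pipeline
    # Pipeline.create(lookml_recipe).run()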
  • j

    jolly-library-86177

    09/09/2022, 8:56 AM
    Looking through the connectors for DataHub, and wondering if anyone is using DataHub as part of a Logical Data Warehouse architecture? I.e. not connecting directly to data sources, but instead ingesting through an access layer such as Denodo, TIBCO, etc.?
  • s

    silly-finland-62382

    09/09/2022, 9:14 AM
    Hey team, we are using the DataHub Spark lineage listener on Databricks to populate Spark lineage. Lineage is created successfully, but we are facing the following error while running this command:
    Copy code
    df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/nishchay.agarwal@meesho.com/services_classification.csv")
    df.write.mode("overwrite").saveAsTable("new_p")
    
    While I am running this command via a Databricks cluster, the pipeline is created successfully with the name given in the cluster Spark conf (spark.datahub.databricks.cluster = shell_dbx), but while I am running a Delta table command, I am getting this error:
    22/09/09 09:06:56 ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.Project is not supported yet. Please contact datahub team for further support.
    
    Also, I am not able to see the schema of the dataset that I built using spark-lineage, and both the upstream and downstream tables are showing as the same (see screenshot; that's not expected).
    Also, can you help me with how to enable Delta catalog support from Databricks? It is not working on Databricks.
  • f

    fresh-cricket-75926

    09/09/2022, 10:26 AM
    Hi all, is there any way we can ingest Oracle schema and table metadata without the SELECT privilege on the tables?
  • r

    rich-battery-25772

    09/09/2022, 11:05 AM
    Hi all! I found that the ingestion process from Delta Lake can use a lot of memory (in my case more than 8 GB), and reducing that memory usage is critical for me. The DataHub ingestion library uses the deltalake library (in Python), and the deltalake library creates a vector with all the parquet file names for all of the delta table's states. That vector can be big. Very big! Dramatically huge! DataHub only needs the vector to calculate the number of files. The deltalake Python library in turn uses a deltalake library written in Rust, and the Rust library has a special flag (require_files) which controls whether the files vector is created or not. Avoiding the vector should save that memory.
    Copy code
    pub struct DeltaTableLoadOptions {
    	..............
        /// Indicates whether DeltaTable should track files.
        /// This defaults to `true`
        ///
        /// Some append-only applications might have no need of tracking any files.
        /// Hence, DeltaTable will be loaded with significant memory reduction.
        pub require_files: bool,
    }
    The main problem is that the flag cannot be controlled from the Python deltalake library (it would need to be changed to expose the flag). There is also the question of how we could calculate the number of files in an alternative way.
    • DataHub's code (use of the DeltaTable class): https://github.com/datahub-project/datahub/blob/083ab9bc0e7b9d8ba293afcf9fae4ffb71c4f86c/metadata-ingestion/src/datahub/ingestion/source/delta_lake/delta_lake_utils.py#L24
    • Deltalake's Python library:
      - DeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/deltalake/table.py#L72
      - RawDeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/src/lib.rs#L78
    • Deltalake's Rust library:
      - DeltaTableBuilder class (require_files is in the options: DeltaTableLoadOptions field): https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/rust/src/builder.rs#L116
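    For reference, a minimal sketch of the memory-heavy pattern being described, assuming the deltalake Python package: the full file list for the current table state is materialized just to take its length.
    Copy code
    # Sketch only: counting files via deltalake's DeltaTable, which first builds
    # the complete list of parquet file paths in memory.
    from deltalake import DeltaTable
    
    dt = DeltaTable("/path/to/delta-table")  # placeholder path
    num_files = len(dt.files())  # dt.files() returns every file path for the current state
    print(f"delta table tracks {num_files} files")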
  • w

    witty-butcher-82399

    09/09/2022, 2:06 PM
    Is there any mechanism preventing the ingestion events from being sent if there is a failure in the pipeline during ingestion? I'm asking because I noticed the process_commit function in pipeline.py. It checks whether there are errors, and depending on that and the commit policy, it will either commit the checkpoint or not. https://github.com/datahub-project/datahub/blob/23b929ea10daded7447f806f8860447626[…]e573a6/metadata-ingestion/src/datahub/ingestion/run/pipeline.py However, I don't see the same behaviour for the ingestion events themselves, which means the ingestion pipeline could be publishing some events via the sink while not committing the checkpoint. In my opinion, the publishing policy in the sink should be aligned with the commit policy. WDYT?
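    A rough sketch of the commit-policy behaviour described above (the names and CommitPolicy values here are assumptions for illustration, not the actual pipeline.py code):
    Copy code
    # Illustrative only: "commit the checkpoint unless errors occurred and the
    # policy requires a clean run". Not the real DataHub implementation.
    from enum import Enum, auto
    
    
    class CommitPolicy(Enum):
        ALWAYS = auto()
        ON_NO_ERRORS = auto()
        ON_NO_WARNINGS = auto()
    
    
    def should_commit(policy: CommitPolicy, has_errors: bool, has_warnings: bool) -> bool:
        if policy is CommitPolicy.ALWAYS:
            return True
        if policy is CommitPolicy.ON_NO_ERRORS:
            return not has_errors
        # ON_NO_WARNINGS implies no errors either
        return not has_errors and not has_warnings
    
    
    # Example: with ON_NO_ERRORS, a failed run skips the checkpoint commit,
    # even though the sink may already have published some events.
    assert should_commit(CommitPolicy.ON_NO_ERRORS, has_errors=True, has_warnings=False) is False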
  • b

    busy-glass-61431

    09/12/2022, 5:11 AM
    Hi, I have set up DataHub with AWS OpenSearch and managed Postgres. There seems to be some issue with my OpenSearch domain and I need to recreate it. Is there a way to restore the data from Postgres into ES?
  • c

    creamy-controller-55842

    09/12/2022, 8:22 AM
    Hi, I was integrating Hive with DataHub and ingesting metadata from the UI, but I can see that the partition column info is not present. I checked the code, and it is written such that if a row contains partition information, the loop breaks in hive.py. May I know the reason behind this?
  • m

    many-hairdresser-79517

    09/12/2022, 10:03 AM
    Hi, I'm ingesting Redash with DataHub and enabled parse_table_names_from_sql: true as described in the doc https://datahubproject.io/docs/generated/ingestion/sources/redash It works fine for getting the table names into the inputs, but the source platform is still unknown (it is supposed to be a Databricks Hive table; details in the image). Do we have any options that would let us get the data source name as well? Thank you so much.
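    For reference, a minimal sketch of the option being discussed, expressed as a recipe run through the Python Pipeline API (connect_uri and api_key are placeholders; only parse_table_names_from_sql is the setting in question):
    Copy code
    # Hedged sketch of a Redash recipe; credentials and URLs are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "redash",
                "config": {
                    "connect_uri": "http://redash.example.com",  # placeholder
                    "api_key": "REDASH_API_KEY",                 # placeholder
                    "parse_table_names_from_sql": True,          # the option discussed above
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()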
  • f

    famous-florist-7218

    09/12/2022, 10:59 AM
    Hi guys, I got this ERROR when integrating Spark with DataHub. It seems the start event didn't fire, and McpEmitter reports "REST Emitter Configuration is missing". Any thoughts?
    Copy code
    22/09/12 17:54:35 ERROR DatahubSparkListener: Application end event received, but start event missing for appId local-1662980072825
    Spark version: v3.1.1
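    For context, a hedged sketch of the Spark listener configuration that a "REST Emitter Configuration is missing" message usually points at. The property names below follow the DataHub spark-lineage docs as I recall them; verify them against the jar version you use, and treat all values as placeholders.
    Copy code
    # Sketch only: configure the DataHub Spark listener and its REST emitter.
    from pyspark.sql import SparkSession
    
    spark = (
        SparkSession.builder
        .appName("datahub-lineage-demo")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.44")  # placeholder version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")            # placeholder GMS URL
        # .config("spark.datahub.rest.token", "<token>")  # only if GMS auth is enabled
        .getOrCreate()
    )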
  • i

    important-answer-79732

    09/12/2022, 11:04 AM
    Hi team, I'm getting the below error while creating the BigQuery integration in the Kubernetes deployment, whereas a similar integration succeeds with the same configuration on my localhost (with quickstart).
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '7f529d57-21f5-4d39-a8e8-2b92580692ab',
     'infos': ['2022-09-12 10:22:14.801662 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-12 10:22:14.855554 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Caught exception EXECUTING '
               'task_id=7f529d57-21f5-4d39-a8e8-2b92580692ab, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
               '    validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
               'debug_mode\n'
               '  extra fields not permitted (type=value_error.extra)\n']}
    Execution finished with errors.
  • c

    chilly-scientist-91160

    09/12/2022, 11:34 AM
    We're using MongoDB a lot, so I tried to hook up the MongoDB ingestion plugin: https://datahubproject.io/docs/generated/ingestion/sources/mongodb/ While it is able to see the data, it does not seem to extract the validation JSON schema as metadata. Is this correct?
  • b

    busy-glass-61431

    09/12/2022, 11:40 AM
    Hi, I am running into an issue and not sure how to debug it. I've tried ingesting data using both Kafka and the REST emitter. I can see that the entries are created in Postgres, but they are not getting ingested into Elasticsearch. Any leads on how I can debug this? I've checked the GMS logs but don't see any errors.
  • s

    silly-finland-62382

    09/12/2022, 5:18 PM
    Hey, can someone tell me why spark.sql("") in DataHub is treated as an HDFS input dataset instead of a Hive one, given that in spark.sql() I passed a select * from the Hive table name?
  • b

    bland-sundown-49496

    09/12/2022, 10:49 PM
    Hello, I am new to DataHub. I am getting an error when ingesting metadata from an S3 source. Could you please help me with these questions: 1. Can I use a sink of type "file" for an S3 source? I got an error saying that I can't use a file-type sink. 2. When I use GMS as the sink, it is failing. Please help me. Thanks
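    For context, a hedged sketch of an S3 source recipe with a datahub-rest (GMS) sink, written as a Python dict. Field names vary by version (path_spec vs. path_specs), so check the S3 source docs; the bucket path, region and server URL are placeholders.
    Copy code
    # Hedged sketch of an S3 recipe fragment; all values are placeholders.
    s3_recipe = {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/*.csv"}],  # placeholder layout
                "aws_config": {"aws_region": "us-east-1"},                 # placeholder region
            },
        },
        # Sink pointed at GMS; the "file"-sink question above is left to the thread.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
    # from datahub.ingestion.run.pipeline import Pipeline
    # Pipeline.create(s3_recipe).run()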
  • s

    stocky-truck-96371

    09/13/2022, 7:45 AM
    Hi team, we are ingesting metadata from the Hive platform using the SQLAlchemy-based plugin, but it's not picking up the column descriptions of the tables. We are on version v0.8.43. Can anyone help with this?
  • g

    great-branch-515

    09/13/2022, 9:15 AM
    @here does the CLI support ingestion from SSL-only MySQL databases? I am getting this error:
    Copy code
    (pymysql.err.OperationalError) (3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')
    (Background on this error at: http://sqlalche.me/e/13/e3q8) due to
    '(3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')'.
    Any idea?
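    A hedged sketch of one way to pass SSL settings through to the driver, assuming the MySQL source forwards options.connect_args to SQLAlchemy/PyMySQL as documented for SQLAlchemy-based sources; the host, credentials and CA path are placeholders, and field names should be verified for your DataHub version.
    Copy code
    # Sketch only: pass PyMySQL SSL arguments via the source's SQLAlchemy options.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql.example.com:3306",  # placeholder
                    "username": "datahub",                  # placeholder
                    "password": "********",
                    "options": {
                        "connect_args": {
                            "ssl": {"ca": "/etc/ssl/certs/mysql-ca.pem"}  # placeholder CA path
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()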
  • b

    better-orange-49102

    09/13/2022, 2:21 PM
    Is there a way to use the Python SDK to retrieve past versions of an aspect? I mean, I could go to the RDBMS and retrieve the stored string, but that's not very ideal. IIRC I can use curl commands.
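    As a point of reference, a sketch of fetching a specific aspect version over the GMS REST aspects endpoint with plain requests. The URN, aspect name and version are placeholders, and the exact endpoint shape should be checked against the GMS API docs for your version.
    Copy code
    # Sketch only: read a particular version of an aspect straight from GMS.
    import urllib.parse
    
    import requests
    
    gms = "http://localhost:8080"  # placeholder GMS URL
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"  # placeholder URN
    
    resp = requests.get(
        f"{gms}/aspects/{urllib.parse.quote(urn, safe='')}",
        params={"aspect": "datasetProperties", "version": 1},  # version 0 refers to the latest
        # headers={"Authorization": "Bearer <token>"},  # only if metadata auth is enabled
    )
    resp.raise_for_status()
    print(resp.json())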
  • b

    brave-pencil-21289

    09/13/2022, 2:23 PM
    Can we use TNS details to ingest an Oracle source? Any sample recipe showing how to use the TNS details in the recipe?
  • g

    gentle-camera-33498

    09/13/2022, 2:33 PM
    Hello guys! I soft-deleted all entities and forced a new ingestion to make a full update, but the Status aspect did not update after the ingestion. Is this expected?
  • c

    cool-actor-73767

    09/13/2022, 7:19 PM
    Hi everyone, I'm having issues with ingestion from Metabase. I'm receiving the errors printed below. Did the ingestion process stop after this error, so that the other chart/dashboard metadata wasn't loaded? Does anyone know what I need to do to solve this?
  • r

    rhythmic-sundown-12093

    09/13/2022, 6:13 AM
    Copy code
    source:
      type: "dbt"
      config:
        # Coordinates
        # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
        manifest_path: "${DBT_PROJECT_ROOT}/target/manifest.json"
        catalog_path: "${DBT_PROJECT_ROOT}/target/catalog.json"
        test_results_path: "${DBT_PROJECT_ROOT}/target/run_results.json" # optional for recording dbt test results after running dbt test
    
        # Options
        target_platform: "redshift" # e.g. bigquery/postgres/etc.
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
  • m

    many-hairdresser-79517

    09/13/2022, 4:15 AM
    Hello, regarding Redash metadata for dashboards that have the table chart type: is there any way to also ingest the list of columns in the table chart into DataHub?
  • p

    polite-art-12182

    09/14/2022, 5:38 AM
    Hi, is there a way to use NiFi as a source with a self-signed cert? I have a NiFi instance I want to pull from. Right now in dev it's in the default configuration with a self-signed cert and single-user sign-on. When I try to connect DataHub to it, the connection fails with:
    Copy code
    "retries exceeded with url: /nifi-api/access/token (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] "
    Any help resolving this without having to re-configure NiFi certs would be appreciated.
  • b

    blue-boots-43993

    09/14/2022, 5:39 AM
    Hey everyone, could you please assist in understanding dashboards <> containers? As I see here (and in several other places as well), the dashboard entity has a container aspect; however, when looking here I don't see Container as a supported aspect. I am writing a custom ingestion source for Qlik Sense where I map so-called Streams and Apps to containers. Streams are basically collections of Apps, and Apps contain so-called Sheets (mapped as Dashboards), datasets, charts and Load Scripts (mapped as DataFlows). I would like to be able to see all of the entities that are part of one App in the respective container's entity list, which I currently cannot. Thanks in advance for any help provided!
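    For what it's worth, a hedged sketch of emitting a container aspect for a dashboard from a custom source using the Python emitter API. The platform name, dashboard id, container GUID and GMS URL are placeholders; whether the UI then lists the dashboard under the container is exactly the open question above.
    Copy code
    # Sketch only: attach a "container" aspect to a dashboard entity.
    from datahub.emitter.mce_builder import make_container_urn, make_dashboard_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, ContainerClass
    
    dashboard_urn = make_dashboard_urn(platform="qlik-sense", name="sheet-1234")  # placeholder
    container_urn = make_container_urn(guid="qlik-app-5678")                      # placeholder
    
    mcp = MetadataChangeProposalWrapper(
        entityType="dashboard",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dashboard_urn,
        aspectName="container",
        aspect=ContainerClass(container=container_urn),
    )
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL
    emitter.emit(mcp)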
  • b

    bumpy-journalist-41369

    09/14/2022, 7:14 AM
    How do I increase the log level of the Ingestion Run Details (the ones you can see in the UI) to DEBUG when running ingestions from the UI? I have deployed DataHub on a Kubernetes cluster using the Helm charts provided in this repository: https://github.com/acryldata/datahub-helm.
  • b

    bland-orange-13353

    09/14/2022, 7:30 AM
    This message was deleted.
  • m

    microscopic-mechanic-13766

    09/14/2022, 8:23 AM
    Good morning. I was trying to ingest metadata from Kafka using the following recipe:
    Copy code
    source:
        type: kafka
        config:
            platform_instance: <platform_instance>
            connection:
                consumer_config:
                    security.protocol: SASL_PLAINTEXT
                    sasl.username: <user>
                    sasl.mechanism: PLAIN
                    sasl.password: <password>
                bootstrap: 'broker1:9092'
                schema_registry_url: 'http://schema-registry:8081'
    I got the following error:
    Copy code
    File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is not found, and chunk exceed the limit\n']}
    Note that this recipe worked in previous versions (the current version is v0.8.44). Thanks in advance!
  • t

    thankful-vr-12699

    09/14/2022, 8:48 AM
    Hi everyone, since the Browse Paths upgrade of August, we have to remove the table name in our transformer to change the path from platform/db/schema/table to platform/db/schema. In the documentation for the browse paths transformer, the only option we have is to use DATASET_PARTS, which includes the table name. Is there another variable we can use to remove the table name from DATASET_PARTS? Or a way to split DATASET_PARTS to keep only the db name and the schema? Thank you for your help!
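    For context, a sketch of the transformer configuration being discussed, using the set_dataset_browse_path transformer with its documented template variables (written as a Python recipe fragment; whether a finer-grained alternative to DATASET_PARTS exists is the open question).
    Copy code
    # Hedged sketch: the browse-path transformer as documented. DATASET_PARTS
    # expands to the full db.schema.table path, which is the limitation above.
    transformers = [
        {
            "type": "set_dataset_browse_path",
            "config": {
                "replace_existing": True,
                "path_templates": ["/ENV/PLATFORM/DATASET_PARTS/"],  # includes the table name
            },
        }
    ]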