# ingestion

    limited-forest-73733

    12/05/2022, 8:03 AM
    Hey team, which default CLI version should I use for ingestion? Thanks in advance

    fresh-rocket-98009

    12/05/2022, 9:08 AM
    Hi guys, can anyone help with an ingestion error? I deployed DataHub to GKE (Kubernetes) but cannot ingest BigQuery; I'm getting the error below.

    clever-lamp-13963

    12/05/2022, 12:17 PM
    I am ingesting data using the Python API (`datahub.ingestion.run.pipeline`). How do I get the IDs of all entities created during the execution?
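One way to get at the URNs, sketched below under the assumption that re-running the ingestion through a `file` sink is acceptable: every emitted record lands in a local JSON file, and the URNs can be read back out of it. The postgres source config and file name here are placeholders.

```python
import json

from datahub.ingestion.run.pipeline import Pipeline

# Placeholder source config -- substitute the source you are actually using.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "database": "mydb"},
        },
        # The file sink writes every emitted record to a local JSON file.
        "sink": {"type": "file", "config": {"filename": "./emitted_records.json"}},
    }
)
pipeline.run()
pipeline.raise_from_status()

with open("./emitted_records.json") as f:
    records = json.load(f)

# MCP-style records carry a top-level "entityUrn"; MCE-style records nest the
# urn inside "proposedSnapshot".
urns = set()
for record in records:
    if "entityUrn" in record:
        urns.add(record["entityUrn"])
    elif "proposedSnapshot" in record:
        urn = next(iter(record["proposedSnapshot"].values())).get("urn")
        if urn:
            urns.add(urn)

print(sorted(urns))
```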

    billowy-telephone-52349

    12/05/2022, 3:53 PM
    Hi all! I had a question. I am getting the below error

    billowy-telephone-52349

    12/05/2022, 3:54 PM
    Is a connection to Microsoft Azure SQL Database supported in DataHub? I am from Q2 Inc; we recently adopted DataHub as our data catalogue solution and are trying to import data assets from a Microsoft Azure SQL Database. I am able to import successfully from a SQL Server DB, but not from an Azure SQL Database. I am getting the attached error.

    cuddly-dinner-641

    12/05/2022, 6:26 PM
    We are noticing that Tags and Containers ingested as part of a Dataset's aspects are not showing up in Search. The entities are created but don't appear to be indexed in search and don't show as part of the UI autocomplete. Is this a known issue?

    plain-controller-95961

    12/05/2022, 9:07 PM
    Hi, we have an upcoming PR that enhances the Vertica source connector. We would also like to add the changes to the documentation at https://datahubproject.io/docs/generated/ingestion/sources/vertica#install-the-plugin. Could you please let us know where and how we could do this? Trying to edit that page leads to a 404.

    boundless-piano-94348

    12/06/2022, 5:57 AM
    Hi all, I have a problem where I can't see the list of metadata in an environment (see the example screenshot). I have a DEV environment with a `data-staging` BQ project and a `data-master` dataset containing 11 tables. When I click on `data-master` it doesn't show anything. I also want to hard delete everything in the DEV environment using the datahub command, but I still get an error saying `Command failed: Did not delete all entities, try running this command again!`. Please kindly help. Thank you.

    best-wire-59738

    12/06/2022, 7:09 AM
    Hello team, can you let me know the reason for the error below? I see this exception very often in my GMS logs, and we are having some issues in the UI; maybe this is the cause.
    Copy code
    07:07:04.181 [ForkJoinPool.commonPool-worker-7] ERROR c.datahub.telemetry.TrackingService:105 - Failed to send event to Mixpanel
    java.net.SocketTimeoutException: connect timed out
    	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
    	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
    	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
    	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    	at java.base/java.net.Socket.connect(Socket.java:609)
    	at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:305)
    	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
    	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:507)
    	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:602)
    	at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)
    	at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)
    	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:207)
    	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
    	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
    	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:193)
    	at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1367)
    	at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1342)
    	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:246)
    	at com.mixpanel.mixpanelapi.MixpanelAPI.sendData(MixpanelAPI.java:134)
    	at com.mixpanel.mixpanelapi.MixpanelAPI.sendMessages(MixpanelAPI.java:172)
    	at com.mixpanel.mixpanelapi.MixpanelAPI.deliver(MixpanelAPI.java:103)
    	at com.mixpanel.mixpanelapi.MixpanelAPI.deliver(MixpanelAPI.java:83)
    	at com.mixpanel.mixpanelapi.MixpanelAPI.sendMessage(MixpanelAPI.java:71)
    	at com.datahub.telemetry.TrackingService.emitAnalyticsEvent(TrackingService.java:103)
    	at com.datahub.authentication.AuthServiceController.lambda$track$4(AuthServiceController.java:336)
    	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
    	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

    limited-forest-73733

    12/06/2022, 3:46 PM
    Hey team, I am not able to attach the whole database hierarchy to a domain. It attaches the database and then the tables, skipping the schema in the middle.

    late-book-30206

    12/06/2022, 4:07 PM
    Hello everyone, I wanted to know if there is a way to work around a memory problem when ingesting datasets from MongoDB. I have collections with up to 120,000,000 records. I'm adjusting ingestion configurations like schemaSamplingSize, maxSchemaSize and enableSchemaInference: true, but this is not enough, and I can't increase the memory of my server to consume everything. Is there a workaround for this kind of problem? Use the API to ingest in batches of 10,000 documents? Run several ingestion processes in a row to pick up all the documents in the collection? Thanks in advance.
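For reference, the options mentioned above live on the `mongodb` source config. A minimal sketch (the connection string and numbers are placeholders) with the sampling knobs turned down so schema inference reads far fewer documents per collection:

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mongodb",
            "config": {
                "connect_uri": "mongodb://mongo-host:27017",  # placeholder
                "enableSchemaInference": True,
                "schemaSamplingSize": 1000,  # sample 1k docs instead of the whole collection
                "maxSchemaSize": 300,        # cap the number of fields kept per schema
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

Splitting the work into several runs (for example one pipeline per database, or per group of collections via a collection allow/deny pattern, if your version supports it) is another way to keep the memory footprint of a single run down.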

    echoing-thailand-18014

    12/06/2022, 4:23 PM
    @late-book-30206 Did you try to adjust the chunk size within a shard and increase the initial number of chunks upon the creation of a collection?

    limited-forest-73733

    12/06/2022, 6:37 PM
    Hey team, in acryldata/datahub-frontend:v0.9.3 we are getting jackson-databind and log4j vulnerabilities. Any suggestions on how to fix them, or is anyone already working on a vulnerability fix?

    colossal-sandwich-50049

    12/06/2022, 10:13 PM
    Hello, I am trying to create dataset entities via the Java emitter such that they are displayed under a certain folder path in the DataHub UI; can someone let me know how to do this? What I've tried: creating the dataset URN with the folder path encoded in the name, as below (this seems to work in some examples I've tried but not in others).
    Copy code
    DatasetUrn datasetUrn = new DatasetUrn(
            someDataPlatformUrn,               // e.g. new DataPlatformUrn("postgres")
            "some.folder.path." + datasetName, // dot-separated prefix -> expected folder path in the UI
            FabricType.NON_PROD
    );
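As far as I understand, the folder placement in the UI comes from the dataset's browse path, which is derived from the URN name when none is supplied. If the dot-in-name trick is inconsistent, one option is to set the `browsePaths` aspect explicitly. A sketch in Python (the same aspect can be emitted from the Java emitter), with placeholder names and GMS address:

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BrowsePathsClass, ChangeTypeClass

# Placeholder platform/name/env -- mirror whatever the Java emitter is creating.
dataset_urn = builder.make_dataset_urn("postgres", "some.folder.path.my_table", "NON_PROD")

# Pin the UI browse location explicitly instead of relying on dots in the name.
browse_paths = BrowsePathsClass(paths=["/non_prod/postgres/some/folder/path"])

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=dataset_urn,
    aspectName="browsePaths",
    aspect=browse_paths,
)
DatahubRestEmitter("http://localhost:8080").emit(mcp)
```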

    fresh-nest-42426

    12/07/2022, 12:47 AM
    Hi all, has anybody encountered a Redshift ingestion error like this (or maybe it's not limited to Redshift)?
    Copy code
    RUN_INGEST - {'errors': [],
     'exec_id': '651fa19e-403e-43e3-b325-5e8d66447208',
    .....
    .....
              '[2022-12-06 11:01:36,048] WARNING  {datahub.ingestion.source.sql.redshift:526} - parsing-query => Error parsing query \n'
           ......
           .......
               'Error was too many values to unpack (expected 2).\n'
    We are using v0.9.0, and I'll add more ingestion recipe details in the thread. Thanks in advance!

    steep-vr-39297

    12/07/2022, 2:07 AM
    Hi team! I'm trying to load a protobuf schema into DataHub. I ran the command according to the guide, but an error occurred.

    abundant-flag-19546

    12/07/2022, 8:03 AM
    Hello DataHub Team! I'm creating lineage with the Airflow backend, but it seems `inlets` and `outlets` don't render Airflow context variables. I want to build lineage from an Airflow parameter like this:
    Copy code
    inlets=[
        Dataset("bigquery", "{{ params.input_table}}"),
    ],
    outlets=[
        Dataset("bigquery", "{{ params.output_table}}"),
    ],
    Is there any way to render Jinja templates in inlets and outlets, or any other workaround to build lineage from parameters?
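One possible workaround, if templating inlets/outlets turns out not to be supported: emit the lineage yourself from inside a task, where `params` are already resolved and no Jinja rendering is needed. A sketch with placeholder table names and GMS address (this bypasses the lineage backend rather than fixing it):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter


def emit_param_lineage(**context):
    # At execution time params are plain values, so no Jinja rendering is needed.
    params = context["params"]
    upstream = builder.make_dataset_urn("bigquery", params["input_table"], "PROD")
    downstream = builder.make_dataset_urn("bigquery", params["output_table"], "PROD")
    # Table-level lineage only; emitted directly instead of via inlets/outlets.
    DatahubRestEmitter("http://datahub-gms:8080").emit(
        builder.make_lineage_mce([upstream], downstream)
    )


with DAG(
    dag_id="param_lineage_example",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,
    params={"input_table": "project.dataset.input", "output_table": "project.dataset.output"},
) as dag:
    PythonOperator(task_id="emit_param_lineage", python_callable=emit_param_lineage)
```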

    little-spring-72943

    12/07/2022, 8:06 AM
    Does the new Databricks "unity-catalog" source support profiling?

    little-spring-72943

    12/07/2022, 8:07 AM
    Any plans? Or how can we achieve this?

    colossal-smartphone-90274

    12/07/2022, 12:32 PM
    Hi everyone, I would like to add dataset usage for some MSSQL tables; however, from the looks of it, the suggested bigquery-usage recipe cannot be used in my case, as we don't use Google Identity / BigQuery. Are there any plans to add a different dataset-usage component so these requirements are not needed?

    microscopic-mechanic-13766

    12/07/2022, 12:44 PM
    Hi everyone, I am trying to ingest metadata from Ozone S3, but I think it might not be possible just yet, as the recipe may need some properties to indicate things like the host of the Ozone service, the region it uses, etc. Has anyone tried to ingest from it? Thanks in advance!
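In case it helps as a starting point: the generic `s3` source can usually be pointed at any S3-compatible endpoint via its AWS connection settings. The sketch below is assumption-heavy (field names such as `path_specs`, `aws_config` and `aws_endpoint_url` are from memory, and the Ozone S3 gateway address is a placeholder), so verify them against the s3 source docs for your CLI version:

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/*.parquet"}],
                "aws_config": {
                    "aws_access_key_id": "ozone-access-key",
                    "aws_secret_access_key": "ozone-secret-key",
                    "aws_region": "us-east-1",
                    # Point the client at the Ozone S3 gateway instead of AWS.
                    "aws_endpoint_url": "http://ozone-s3g:9878",
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```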

    rhythmic-gpu-99609

    12/07/2022, 3:42 PM
    Hi! We are trying to ingest data from Dremio. For that we are using SQLAlchemy with the Dremio dialect. The documentation says that in order to use a SQLAlchemy connection, you need to install the dialect for it. Our DataHub is running on AKS, so to install the Dremio dialect we ran `pip install sqlalchemy-dremio` inside the datahub-acryl-datahub-actions-xxxxxxx-yyyy pod. However, when we then tried to ingest data from Dremio, we had no success. I believe it's because DataHub creates a separate worker for ingesting data and that worker doesn't have the dialect installed. This is the log output:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '6d5bc7ed-f859-4b43-82fe-1a1474127b4b',
     'infos': ['2022-12-07 15:23:50.976900 [exec_id=6d5bc7ed-f859-4b43-82fe-1a1474127b4b] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-12-07 15:23:55.056420 [exec_id=6d5bc7ed-f859-4b43-82fe-1a1474127b4b] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/6d5bc7ed-f859-4b43-82fe-1a1474127b4b/recipe.yml --report-to '
               '/tmp/datahub/ingest/6d5bc7ed-f859-4b43-82fe-1a1474127b4b/ingestion_report.json\n'
               '[2022-12-07 15:23:53,575] INFO     {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.9.1\n'
               '[2022-12-07 15:23:53,615] INFO     {datahub.ingestion.run.pipeline:175} - Sink configured successfully. DataHubRestEmitter: configured '
               'to talk to <http://datahub-datahub-gms:8080>\n'
               '[2022-12-07 15:23:54,236] ERROR    {datahub.entrypoints:192} - \n'
               'Traceback (most recent call last):\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 196, in '
               '__init__\n'
               '    self.source: Source = source_class.create(\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_generic.py", line 51, in '
               'create\n'
               '    config = SQLAlchemyGenericConfig.parse_obj(config_dict)\n'
               '  File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SQLAlchemyGenericConfig\n'
               'platform\n'
               '  field required (type=value_error.missing)\n'
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run\n'
               '    pipeline = Pipeline.create(\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create\n'
               '    return cls(\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 202, in '
               '__init__\n'
               '    self._record_initialization_failure(\n'
               '  File "/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 129, in '
               '_record_initialization_failure\n'
               '    raise PipelineInitError(msg) from e\n'
               'datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (sqlalchemy)\n'
               '[2022-12-07 15:23:54,237] ERROR    {datahub.entrypoints:195} - Command failed: \n'
               '\tFailed to configure source (sqlalchemy) due to \n'
               "\t\t'1 validation error for SQLAlchemyGenericConfig\n"
               'platform\n'
               "  field required (type=value_error.missing)'.\n"
               '\tRun with --debug to get full stacktrace.\n'
               "\te.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/6d5bc7ed-f859-4b43-82fe-1a1474127b4b/recipe.yml --report-to "
               "/tmp/datahub/ingest/6d5bc7ed-f859-4b43-82fe-1a1474127b4b/ingestion_report.json'\n",
               "2022-12-07 15:23:55.056821 [exec_id=6d5bc7ed-f859-4b43-82fe-1a1474127b4b] INFO: Failed to execute 'datahub ingest'",
               '2022-12-07 15:23:55.057049 [exec_id=6d5bc7ed-f859-4b43-82fe-1a1474127b4b] INFO: Caught exception EXECUTING '
               'task_id=6d5bc7ed-f859-4b43-82fe-1a1474127b4b, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
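The immediate failure in the log is the missing `platform` field on the generic `sqlalchemy` source (`1 validation error for SQLAlchemyGenericConfig ... platform field required`). Below is a sketch of a config that satisfies that validator, with a placeholder Dremio connection string. The missing-dialect question is a separate issue: the UI-triggered run builds its own venv (the log shows `/tmp/datahub/ingest/venv-sqlalchemy-0.9.1/...`), so a dialect pip-installed by hand into the actions pod may still not be visible to it.

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "sqlalchemy",
            "config": {
                # `platform` is the field the validator is complaining about.
                "platform": "dremio",
                # Placeholder URI -- use whatever sqlalchemy-dremio expects in your setup.
                "connect_uri": "dremio+flight://user:password@dremio-host:32010/dremio",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```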

    acoustic-secretary-69712

    12/07/2022, 6:30 PM
    What is the recommended way to ingest metadata for OOTB data sources? Should you use an external scheduler or rely on DataHub's scheduler?

    gifted-knife-16120

    12/08/2022, 5:19 AM
    Hi all, can anyone share how to do manual column-level (field) lineage for Postgres? Right now I am using a Python script.
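A condensed sketch of what such a Python script can look like: emit an `upstreamLineage` aspect whose `fineGrainedLineages` map upstream columns to downstream columns. Table, column and server names below are placeholders.

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DatasetLineageTypeClass,
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Placeholder tables/columns: public.orders.amount -> public.orders_agg.total_amount
upstream_urn = builder.make_dataset_urn("postgres", "mydb.public.orders", "PROD")
downstream_urn = builder.make_dataset_urn("postgres", "mydb.public.orders_agg", "PROD")

# One column-level edge: upstream field "amount" feeds downstream field "total_amount".
fine_grained = FineGrainedLineageClass(
    upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
    upstreams=[builder.make_schema_field_urn(upstream_urn, "amount")],
    downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
    downstreams=[builder.make_schema_field_urn(downstream_urn, "total_amount")],
)

# Table-level upstream plus the fine-grained (column-level) edges.
upstream_lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)],
    fineGrainedLineages=[fine_grained],
)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=downstream_urn,
    aspectName="upstreamLineage",
    aspect=upstream_lineage,
)
DatahubRestEmitter("http://localhost:8080").emit(mcp)
```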

    gifted-knife-16120

    12/08/2022, 6:29 AM
    Hi, I just found that Metabase ingestion does not properly capture cards that use a Model. Based on https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/metabase.py#L447, it treats the card as a data source if `dataset_query == query`. But `source-table = card__55` is a model, not a data source, so I get `'failures': {'metabase-table-card__55': ['Unable to retrieve source table. Reason: 404 Client Error: Not Found for url: "`. For background on Metabase models, see https://www.metabase.com/learn/data-modeling/models. Relevant part of the card JSON:
    Copy code
    ....
    "dataset_query": {
        "type": "query",
        "query": {
            "source-table": "card__55",
    ....

    purple-printer-15193

    12/08/2022, 7:21 AM
    Hi team! If I run a metadata ingestion through the UI, in which pod does the `datahub ingest` command get executed? Is it datahub-gms or datahub-frontend?

    crooked-rose-22807

    12/08/2022, 10:18 AM
    Hi all, how do I include Related Entities in a business glossary recipe? I can't find the keys here. It would be nice if there is a way that I might have missed. I don't want users to do it via the UI, if possible. Thank you.

    quiet-school-18370

    12/08/2022, 1:51 PM
    Hi team, I am working on LookML ingestion through an Airflow DAG, but I am receiving a 401 error (see screenshot), even though I updated the DataHub auth token in the Airflow connection. Can anyone please help me solve this issue?
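For the 401 specifically, one quick sanity check is to build a REST emitter with the exact token the DAG is supposed to use and call `test_connection()`; if that raises, the token (or the GMS address) is the problem rather than the Airflow wiring. Server address and token below are placeholders.

```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Placeholders: point at your GMS endpoint and paste the token the Airflow
# connection is supposed to be using.
emitter = DatahubRestEmitter(
    gms_server="http://datahub-datahub-gms:8080",
    token="<personal-access-token>",
)

# Raises if the server rejects the request (e.g. with a 401), which quickly
# tells you whether the token itself is valid.
emitter.test_connection()
print("Token accepted by GMS")
```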

    cuddly-state-92920

    12/08/2022, 3:59 PM
    Hi everyone, I am new to DataHub and to Slack. I have some questions. Is this the right place to post them, or is there some forum for that?

    cuddly-state-92920

    12/08/2022, 4:01 PM
    One of my many questions about DataHub: how can I remove a table from my dataset? At some point the table was removed from my database, but it is still in DataHub. Regards, Amanda Lima