# ingestion
• creamy-smartphone-10810 (05/12/2022, 11:25 AM)
Hi team! Is there any way to specify scheduling in a `recipe.yaml`? I'm executing it with `datahub ingest -c recipe.yaml`, but it performs a one-shot ingestion, and I would like to schedule it! Thanks in advance!
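A minimal sketch of one way to get scheduling, assuming the CLI runs on a host with cron available (the paths and the schedule are placeholders, not a DataHub feature):
```
# crontab entry: run the ingestion recipe every day at 02:00
0 2 * * * /usr/local/bin/datahub ingest -c /path/to/recipe.yaml >> /var/log/datahub-ingest.log 2>&1
```
DataHub's UI-based ingestion also supports cron-style schedules, which may be preferable if the recipe can live in the UI.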
• agreeable-army-26750 (05/12/2022, 2:18 PM)
Hi everyone! I have built a custom datahub CLI (metadata-ingestion) with
```
pip install -e '.[dev]'
```
and quickstarted the system. Is it possible to configure the UI (or the metadata-service) to call my new custom datahub CLI for ingestion instead of the default one? Maybe I have to configure something in the UI? Thanks for your answers in advance!
• fresh-garage-83780 (05/12/2022, 5:16 PM)
Hi all, this is hopefully a simple one but I can't find the answer in the docs. I'm doing a PoC with DataHub and trying to hook it up to Trino. My test Trino has no password but does require TLS, and I can't find the connection settings to get the Trino source to connect correctly:
```
source:
  type: trino
  config:
    env: CORP
    platform: trino
    host_port: "trino.example.com:443"
    database: dbname
    username: foo
```
The recipe above connects via http and so times out. If I add a `password` value it connects correctly via https, because of this line in dialect.py, but that doesn't help, as Trino then throws a 401 (counter-intuitively):
```
curl -X POST https://foo:bar@trino.example.com:443/v1/statement
401 Password not allowed for insecure authentication
```
I can't seem to find anything in the `options` block that can override this. I tried using `sqlalchemy_uri` too, but likewise couldn't find a way to set http_scheme through the connection string. Hope someone can point me in the right direction?
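A possible workaround, sketched under the assumption that the Trino source forwards the `options` block to SQLAlchemy's `create_engine` (as DataHub's SQLAlchemy-based sources generally do) and that the Trino dialect accepts `http_scheme` in `connect_args`; not verified against this setup:
```
source:
  type: trino
  config:
    host_port: "trino.example.com:443"
    database: dbname
    username: foo
    options:
      connect_args:
        http_scheme: https   # assumption: force TLS without supplying a password
```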
• handsome-football-66174 (05/12/2022, 6:06 PM)
Hi everyone, how do we assign Domains via a recipe? (I see the configuration options present, but I am unable to use them in a recipe.)
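A minimal sketch of how that block is commonly shaped, assuming a source that supports the `domain` configuration (the source type, domain URN, and pattern are placeholders):
```
source:
  type: snowflake
  config:
    # ... connection settings ...
    domain:
      "urn:li:domain:engineering":
        allow:
          - ".*"   # placeholder: assign every matched dataset to this domain
```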
• millions-sundown-65420 (05/12/2022, 7:38 PM)
Hi team. I have deployed Kafka Bridge onto my Kubernetes cluster. Can I use the bridge to send messages to Kafka topics using the AMQP protocol rather than HTTP?
• chilly-gpu-46080 (05/13/2022, 3:38 AM)
    Hello everyone, is lineage supported for SQL Server ingestion?
• rich-policeman-92383 (05/13/2022, 12:15 PM)
Hello, can you share the sample Spark code that was used to create the orders_cleanup_flow Spark pipeline on the DataHub demo site?
• billions-table-9927 (05/13/2022, 1:47 PM)
Hi, UI ingestion works successfully when the job is triggered manually, but the schedule is not being picked up and the jobs are not triggered at the scheduled time. Any help please?
• salmon-midnight-86020 (05/13/2022, 7:39 PM)
    Hi everyone! I'm wondering if it might make sense to forward our debezium/CDC messages through an intermediate kafka streams transformer and produce transformed datahub-compatible messages directly to datahub metadata topics. The idea would be to leverage our existing CDC system instead of having datahub listen directly to mysql. I saw this old thread that mentioned something similar but I'm not sure where it landed. Is this feasible?
• cuddly-arm-8412 (05/15/2022, 2:53 AM)
Hi team, I am learning push-based integrations and trying the Python and Java emitters. The official demos use MCP messages. What I want to confirm is: does push-based integration only support MCP messages at present?
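For reference, a minimal sketch of the MCP push path with the Python REST emitter, assuming a GMS at localhost:8080 (the URN and description are placeholders); the emitter also exposes MCE methods, but MCP is the shape the official demos use:
```
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# one aspect, one entity: an upsert of datasetProperties
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)",
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="pushed from the Python emitter"),
)
emitter.emit_mcp(mcp)
```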
• alert-football-80212 (05/15/2022, 9:49 AM)
Hi all, if my s3 datalake looks like this:
datahubTestBucket:
• folderTest1
  ◦ date=2020-01-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
  ◦ date=2020-02-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
• folderTest2
  ◦ date=2020-01-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
  ◦ date=2020-02-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
• folderTest3
  ◦ date=2020-01-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
  ◦ date=2020-02-01
    ▪︎ part0001.parquet
    ▪︎ part0002.parquet
how should my recipe look?
path_specs:
  include: ???
Thank you!
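A sketch of a matching path spec, assuming the s3 source's `{table}` and `{partition_key[i]}={partition[i]}` placeholders apply to this layout (the exact option name may be `path_spec`, single, or `path_specs`, a list, depending on version):
```
path_specs:
  - include: "s3://datahubTestBucket/{table}/{partition_key[0]}={partition[0]}/*.parquet"
```
With this shape, each folderTestN would surface as one dataset partitioned by date.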
• echoing-farmer-38304 (05/15/2022, 2:26 PM)
Hi everyone, I want to ingest metadata from the Power BI Report Server into DataHub, similar to this one: https://github.com/datahub-project/datahub/blob/a9ad13817290331ac107f00b73c7916107d63376/metadata-ingestion/src/datahub/ingestion/source/powerbi.py I found this mapping ready for ingestion, but the mapping for reports is N/A. Which DataHub concept should I map a Power BI report to?
• best-umbrella-24804 (05/16/2022, 6:15 AM)
Hello, I am trying to push validations to DataHub using Great Expectations. At the moment I am getting the following logs:
```
WARNING - DataHubValidationAction does not recognize this GE data asset type - <class 'great_expectations.validator.validator.Validator'>.
INFO - Metadata not sent to datahub. No datasets found.
```
My code looks like this. I'm not sure how to specify which datasets should be mapped?
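For comparison, the documented way to wire the action is through a checkpoint's `action_list` rather than invoking a validator directly; a sketch, with the server URL as a placeholder:
```
action_list:
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://localhost:8080
```
The action derives dataset identity from SQLAlchemy-backed batches, which may be why a bare `Validator` asset is not recognized.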
• microscopic-mechanic-13766 (05/16/2022, 8:40 AM)
Hi team, quick question: in which versions of datahub is the presto-on-hive module enabled? I am currently using 0.8.32.1 (as there are no newer versions for the actions component, as far as I know), and when I try installing the mentioned plugin I get the following:
```
WARNING: acryl-datahub 0.8.32.1 does not provide the extra 'presto-on-hive'
```
• cuddly-arm-8412 (05/16/2022, 10:13 AM)
Hi team, I defined a new entity aspect in metadata-model-custom. How do I load this class in the ingestion module and assign values to it?
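One way to emit a custom aspect without generated Python classes, sketched under the assumption that the aspect is already registered server-side; the aspect name, fields, and URN here are hypothetical:
```
import json

from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GenericAspectClass,
    MetadataChangeProposalClass,
)

# serialize the custom aspect payload as generic JSON
mcp = MetadataChangeProposalClass(
    entityType="dataset",
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)",
    changeType=ChangeTypeClass.UPSERT,
    aspectName="myCustomAspect",  # hypothetical: must match the custom model's aspect name
    aspect=GenericAspectClass(
        value=json.dumps({"myField": "myValue"}).encode("utf-8"),
        contentType="application/json",
    ),
)
DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)
```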
• bland-morning-36590 (05/16/2022, 10:08 PM)
Hi all, I am trying to ingest metadata from Teradata. I am using the attached sqlalchemy recipe and am hitting “NoSuchModuleError”. Is there something wrong with the connection string? Thanks for the help.
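For what it's worth, SQLAlchemy raises NoSuchModuleError when the dialect named in the URI isn't installed; a sketch assuming the Teradata dialect package and a `teradatasql://` URI scheme (both assumptions, since the attached recipe isn't visible here):
```
pip install teradatasqlalchemy
```
and then a `sqlalchemy_uri` along the lines of `teradatasql://user:password@host`.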
• brave-pager-62740 (05/16/2022, 11:16 PM)
Hello team! I’m trying to ingest data using JSON files as the input, and I’m wondering: is there a doc or example regarding how to define the JSON content so that it can be correctly translated into the metadata I want in DataHub? Thanks!
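For reference, a minimal sketch of the `file` source, which replays serialized MCE/MCP JSON (the filename and server are placeholders); one way to see the expected JSON shape is to run any ingestion with a `file` sink and inspect its output:
```
source:
  type: file
  config:
    filename: ./metadata_events.json
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```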
• cuddly-arm-8412 (05/17/2022, 2:23 AM)
Hi team, I understand the ingestion module handles ingestion as a whole. Are there fine-grained changes, such as MCPs that update a single table separately? I would like to update a table's metadata only when that table changes.
• best-umbrella-24804 (05/17/2022, 5:45 AM)
I took away the action_list argument, and it was still happening. I found that I had to remove the import
```
from datahub.integrations.great_expectations.action import DataHubValidationAction
```
before it would stop hanging; it seems that merely importing this package causes the hang.
• prehistoric-salesclerk-23462 (05/17/2022, 2:54 PM)
Quick question: is it possible to ingest metadata from Snowflake in real time?
• millions-waiter-49836 (05/17/2022, 4:40 PM)
Hey guys, a quick question: is there any demo or documentation of the “Validation” tab next to the “Stats” tab? I'd like to know what that page looks like and how to emit & query data from there.
• alert-football-80212 (05/17/2022, 7:23 PM)
Hi, I want to clear all data in the s3 platform from DataHub. I can't find a command in the DataHub docs to clear a whole platform, only ones for a specific URN or for datasets. Does anyone know how I can completely clear a specific platform? Thank you!!! 🙏
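A sketch of one way to do this with the CLI, assuming the filter flags the `datahub delete` command offered around this release (running the dry run first is prudent):
```
# preview what would be deleted
datahub delete --entity_type dataset --platform s3 --dry-run

# soft-delete everything ingested under the s3 platform
datahub delete --entity_type dataset --platform s3

# or remove it from the backing stores entirely
datahub delete --entity_type dataset --platform s3 --hard
```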
• chilly-gpu-46080 (05/18/2022, 7:38 AM)
Is it possible to ingest from SQL Server using Windows authentication?
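One possible shape, assuming the mssql source's `use_odbc` and `uri_args` options pass extra settings through to pyodbc (the driver name and the Trusted_Connection flag are ODBC-level assumptions, untested here):
```
source:
  type: mssql
  config:
    host_port: "sqlserver.example.com:1433"
    database: mydb
    use_odbc: "True"
    uri_args:
      driver: "ODBC Driver 17 for SQL Server"
      Trusted_Connection: "yes"   # assumption: Windows/Kerberos auth via ODBC
```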
• polite-application-51650 (05/18/2022, 12:25 PM)
Hi, is there any way to ingest Google Cloud Storage data into my local datahub instance?
• powerful-librarian-82760 (05/18/2022, 12:42 PM)
Hi, a question on Oracle-dataset-to-Oracle-dataset file-based lineage ingestion. I first imported the tables I targeted from two Oracle databases, which produced datasets in two different 'oracle' platform instances thanks to the platform instance name ✅ When I tried to add lineage information between two tables in these two different platform instances, the file-based lineage ingestion module seems not to take into account the schema the tables are in. If, in the yml file, I specify the entity name as schema.table_name, nothing is added in DataHub (I also tried schema/table_name). If I specify the entity name as table_name, it creates a new dataset at the root of the platform.
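For reference, a sketch of the lineage file format with fully qualified names and explicit platform instances; whether the `name` must include the schema (and with which separator) is exactly the open question here, so treat those values as assumptions:
```
version: 1
lineage:
  - entity:
      name: myschema.downstream_table
      type: dataset
      env: PROD
      platform: oracle
      platform_instance: instance_b
    upstream:
      - entity:
          name: myschema.upstream_table
          type: dataset
          env: PROD
          platform: oracle
          platform_instance: instance_a
```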
• powerful-librarian-82760 (05/18/2022, 5:20 PM)
#ingestion Oracle metadata ingestion: how should containers be dealt with? By default, with the Oracle ingestion module, I get meaningless containers created.
• cuddly-arm-8412 (05/19/2022, 1:08 AM)
Hi team, I want to know how to deploy the metadata-ingestion module. I didn't see it in the docker-compose file, but there is a datahub-actions container. What is the relationship between metadata-ingestion and datahub-actions?
• polite-application-51650 (05/19/2022, 5:27 AM)
Can anyone guide me on metadata profiling using DataHub?
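A minimal sketch of enabling profiling in a SQL-based recipe, assuming the source supports the `profiling` block (the source type and connection settings are placeholders):
```
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: mydb
    username: user
    password: pass
    profiling:
      enabled: true
      profile_table_level_only: false  # also compute column-level statistics
```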
• best-wolf-3369 (05/19/2022, 9:30 AM)
Hi all, I am trying to create a GlossaryTerm using the Rest.li API and it's almost working. The item is created, but the camelCase notation used in the name is not respected and everything goes to lowercase. This problem is not present when using the yml ingestion method, but we need to use the Rest.li API in our development. As I said, it creates the GlossaryTerm, but with the URN `urn:li:glossaryTerm:camelcaseobject` instead of the good one, `urn:li:glossaryTerm:camelCaseObject`. Could you provide some insight?
```
import requests
import json

url = "http://host:port/entities?action=ingest"

payload = json.dumps({
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.GlossaryTermSnapshot": {
        "urn": "urn:li:glossaryTerm:camelCaseObject",
        "aspects": [
          {
            "com.linkedin.glossary.GlossaryTermInfo": {
              "definition": "Object definition",
              "parentNode": "urn:li:glossaryTerm:camelCaseObjectParent",
              "sourceRef": "DataHub",
              "sourceUrl": "https://github.com/linkedin/datahub/",
              "termSource": "INTERNAL"
            }
          }
        ]
      }
    }
  }
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)
```
Thank you very much.
• great-cpu-72376 (05/19/2022, 10:26 AM)
Hi, I am trying to ingest a dataset through the Rest.li API. I executed a POST against GMS at
```
http://localhost:9090/openapi/entities/v1/
```
and pasted this JSON (copied from what is reported in https://datahubproject.io/docs/how/add-custom-data-platform):
```
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.DataPlatformSnapshot": {
        "aspects": [
          {
            "com.linkedin.dataplatform.DataPlatformInfo": {
              "datasetNameDelimiter": "/",
              "name": "filesystem",
              "type": "FILE_SYSTEM",
              "doc": "local filesystem"
            }
          }
        ],
        "urn": "urn:li:dataPlatform:filesystem"
      }
    }
  }
}
```
In the GMS logs I found this error:
```
WARN  o.s.w.s.m.s.DefaultHandlerExceptionResolver:208 - Resolved [org.springframework.http.converter.HttpMessageNotReadableException: JSON parse error: Cannot deserialize value of type `java.util.ArrayList<io.datahubproject.openapi.dto.UpsertAspectRequest>` from Object value (token `JsonToken.START_OBJECT`); nested exception is com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.ArrayList<io.datahubproject.openapi.dto.UpsertAspectRequest>` from Object value (token `JsonToken.START_OBJECT`)<EOL> at [Source: (org.springframework.util.StreamUtils$NonClosingInputStream); line: 1, column: 1]]
```
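For what it's worth, the error itself points at the mismatch: the snapshot envelope from that doc belongs to the Rest.li `/entities?action=ingest` endpoint, while this OpenAPI endpoint expects a JSON array of UpsertAspectRequest objects. A sketch of the OpenAPI shape; the `__type` discriminator and field placement are assumptions based on the generated OpenAPI docs:
```
[
  {
    "entityType": "dataPlatform",
    "entityUrn": "urn:li:dataPlatform:filesystem",
    "aspect": {
      "__type": "DataPlatformInfo",
      "datasetNameDelimiter": "/",
      "name": "filesystem",
      "type": "FILE_SYSTEM"
    }
  }
]
```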