# ingestion

    thankful-jackal-96705

    06/23/2023, 9:46 AM
    Hello Team, I am adding a Postgres DB as an ingestion source. I have deployed DataHub in a private Kubernetes cluster and I'm getting the error "could not find a version that satisfies the requirement wheel". Since the cluster is private, packages cannot be installed while adding the source. Any idea which packages need to be installed, and into which Docker image?

    rich-restaurant-61261

    06/23/2023, 8:30 PM
    Hi team, I am trying to connect DataHub with Superset. For the connect_uri I used the Superset ingress, but it throws:
    PipelineInitError: Failed to configure the source (superset): Exceeded 30 redirects.
    The username and password I used here are correct. Does anyone know what this error means and how I can solve it?
    Copy code
    source:
        type: superset
        config:
            connect_uri: 'https://di-superset.aoc.xxx.com'
            username: xxx
            password: xxx
            provider: db

    microscopic-room-90690

    06/26/2023, 5:19 AM
    Hi team, I'm wondering how to distinguish between tables and views in DataHub. For Hive, I create a Hive view, but it is shown as a table in DataHub. For Trino, tables and views share the same URN format. This really puzzles me. Can anyone help?

    acceptable-morning-73148

    06/26/2023, 7:12 AM
    Hi, I'd like to search for text in the Logic part of a view definition. Executing a GraphQL query like this:
    Copy code
    query($urn: String!) {
        dataset(urn: $urn) {
            viewProperties {
                logic
            }
        }
    }
    retrieves the viewProperties and the logic attribute:
    Copy code
    {
      "data": {
        "dataset": {
          "viewProperties": {
            "logic": "let source = Sql.Database ........"
          }
        }
      },
      "extensions": {}
    }
    How can I search the contents of the logic attribute? For example, this query doesn't produce any results:
    Copy code
    query ($source: String!) {
        searchAcrossEntities(input: {
            start: 0, 
            count: 100, 
            query: "", 
            orFilters: [{
                and: [
                    {field: "logic", condition: CONTAIN, negated: false, values: ["Sql.Database"]}
                ]}]
            }
        ) {
            searchResults {
                entity {
                    urn,
                    __typename
                }
            }
        }
    }
    Note how I'm trying to match the logic attribute to a value it might contain.

    fast-judge-41877

    06/26/2023, 8:27 AM
    hi team, I am trying to add a new plugin 'couchbase' to ingestion using the command pip install -e '.[couchbase]'. The installation is successful, but when I check plugins using the command datahub check plugins, I get the error below:
    terminate called after throwing an instance of 'std::bad_cast'
      what():  std::bad_cast
    Aborted
    Can anyone help me with this error? Thanks in advance.

    billions-rose-75566

    06/26/2023, 8:50 AM
    Hi DataHub, when we send data through the kafka-sink, which topics will get a new message?

    microscopic-room-90690

    06/26/2023, 9:29 AM
    Hi team, I found it very useful to use a Python script to ingest metadata: https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/examples/library. Even though we can use a different data source by changing the 'platform', I'm wondering how to ingest the source-specific settings of each source, such as the S3 path. Can anyone help?
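    For reference, a minimal, untested sketch of one way to do this with the programmatic pipeline API (the bucket, prefix, and server URL below are placeholders): source-specific options such as the S3 path go in the source's config block, exactly as they would in a YAML recipe.
        # Sketch: run an ingestion recipe from Python so that source-specific
        # config (here the S3 path_specs) can be supplied directly.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        # hypothetical bucket/prefix; adjust to your layout
                        "path_specs": [{"include": "s3://my-bucket/bdp/ingest/*/*.parquet"}],
                        "aws_config": {"aws_region": "us-east-1"},
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()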

    millions-addition-50023

    06/26/2023, 9:52 AM
    What plugins does LinkedIn's DataHub have?

    millions-addition-50023

    06/26/2023, 9:53 AM
    What plugins does LinkedIn's DataHub have? Please answer in Chinese.

    fierce-restaurant-41034

    06/26/2023, 3:06 PM
    Hi all, I have ingested dbt & Snowflake into DataHub and have gotten the "Composed Of" relationships between dbt and Snowflake objects as expected. Now I am trying to delete the dbt platform from DataHub with the "hard" delete option. Although the dbt objects were deleted, I still see them as "Composed Of" on the Snowflake objects. Looking into the database, I found rows with the sibling aspect of Snowflake objects that contained the relation to dbt. When I deleted those rows from the DB, dbt was completely removed from the UI. This looks like a good solution for me, but is it the best way to delete siblings? Why weren't the siblings removed by the hard delete? It looks like the URNs of Snowflake and dbt differ because of capital letters, but I don't know whether this can affect the deletion (as in the pic). Thanks

    rich-restaurant-61261

    06/26/2023, 7:19 PM
    Hi Team, I successfully ingested Trino data into DataHub. When I browse the data, I see there is a Validation tab under the table, and it is greyed out. Does anyone know what it is planned to provide, and how can I enable it?

    ripe-stone-30144

    06/26/2023, 8:28 PM
    Hi guys! Could you please advise if it is possible to ingest Cassandra's metadata?

    ambitious-bird-91607

    06/26/2023, 9:27 PM
    Hi there! I've noticed a discrepancy between the metadata stored in schemaMetadata and editableSchemaMetadata. After making changes directly in my ClickHouse (schemaMetadata), I've observed that the DataHub user interface still displays the metadata stored in editableSchemaMetadata without reflecting the changes made. I would like to better understand how this situation is handled and whether there is any mechanism to automatically synchronize the metadata between both sources. Should I manually update editableSchemaMetadata to reflect the changes made in schemaMetadata? Does DataHub always give priority to editableSchemaMetadata, regardless of any recent updates in schemaMetadata?

    billions-journalist-13819

    06/27/2023, 6:49 AM
    Hi, Team... Compared to other DBs, Databricks Unity Catalog provides insufficient stats information. I hope more stats information for Databricks Unity Catalog can be added. Would this be possible?

    quiet-scientist-40341

    06/27/2023, 8:20 AM
    Hi, Everyone. Has anyone run into this issue? I emit a DataProcessInstance for a DataJob using the Java client, but it does not work.
    public void testDataProcessInstance() throws IOException, ExecutionException, InterruptedException, URISyntaxException {
        KafkaEmitterConfig.KafkaEmitterConfigBuilder builder = KafkaEmitterConfig.builder();
        KafkaEmitterConfig config = builder.build();
        KafkaEmitter emitter = new KafkaEmitter(config);
        if (emitter.testConnection()) {
            DataFlowUrn dataFlowUrn = new DataFlowUrn("test_01", "mannual_001", "PROD");
            String flowId = DigestUtils.md5DigestAsHex(dataFlowUrn.toString().getBytes());
            Urn jobFlowRunUrn = Urn.createFromTuple("dataProcessInstance", flowId);
            DataJobUrn dataJobUrn = new DataJobUrn(dataFlowUrn, "mannual_job_001");
            String jobId = DigestUtils.md5DigestAsHex(dataJobUrn.toString().getBytes());
            Urn jobRunUrn = Urn.createFromTuple("dataProcessInstance", jobId);

            // dataProcessInstanceProperties aspect
            DataProcessInstanceProperties dataProcessInstanceProperties = new DataProcessInstanceProperties();
            dataProcessInstanceProperties.setName(jobId);
            AuditStamp auditStamp = new AuditStamp();
            auditStamp.setTime(System.currentTimeMillis());
            auditStamp.setActor(Urn.createFromString("urn:li:corpuser:datahub"));
            dataProcessInstanceProperties.setCreated(auditStamp);
            dataProcessInstanceProperties.setType(DataProcessType.BATCH_SCHEDULED);
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceProperties)
                .build();
            Future<MetadataWriteResponse> requestFuture = emitter.emit(mcpw, null);
            MetadataWriteResponse metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceProperties=====");

            // dataProcessInstanceRelationships aspect
            DataProcessInstanceRelationships dataProcessInstanceRelationships = new DataProcessInstanceRelationships();
            System.out.println("====jobFlowRunUrn=====" + jobFlowRunUrn.toString());
            System.out.println("====dataFlowUrn=====" + dataFlowUrn.toString());
            dataProcessInstanceRelationships.setParentInstance(jobFlowRunUrn);
            dataProcessInstanceRelationships.setParentTemplate(dataJobUrn);
            mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceRelationships)
                .build();
            requestFuture = emitter.emit(mcpw, null);
            metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceRelationships=====");

            // dataProcessInstanceRunEvent aspect
            DataProcessInstanceRunEvent dataProcessInstanceRunEvent = new DataProcessInstanceRunEvent();
            dataProcessInstanceRunEvent.setStatus(DataProcessRunStatus.STARTED);
            dataProcessInstanceRunEvent.setTimestampMillis(System.currentTimeMillis());
            dataProcessInstanceRunEvent.setAttempt(1);
            mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceRunEvent)
                .build();
            requestFuture = emitter.emit(mcpw, null);
            metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceRunEvent=====");
        }
        System.out.println("====================ab");
    }

    microscopic-room-90690

    06/27/2023, 9:54 AM
    Hi all, I want to use make_dataset_urn in a Python script to ingest metadata from an S3 source. I tried make_dataset_urn(platform="s3", name="bdp/ingest/test/test/account") and got the result below. It shows the path I need, but the name of the dataset is expected to be account. Can anyone help?
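    For reference, a minimal sketch of what make_dataset_urn produces: the URN name is the full path, and a friendlier display name such as account is normally set separately (for example by the S3 source itself or via a datasetProperties aspect).
        from datahub.emitter.mce_builder import make_dataset_urn

        # The URN embeds the full path; the UI display name comes from elsewhere.
        urn = make_dataset_urn(platform="s3", name="bdp/ingest/test/test/account", env="PROD")
        print(urn)
        # -> urn:li:dataset:(urn:li:dataPlatform:s3,bdp/ingest/test/test/account,PROD)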

    ancient-queen-15575

    06/27/2023, 12:20 PM
    I'm trying to use the s3 source, but have paths that look like:
    • bucket_name/db_name/2022/10/03/customers_20221003.csv
    • bucket_name/db_name/2022/10/03/customers_20221003_modified.csv
    • bucket_name/db_name/2022/10/03/account_renewals_03102022.csv
    So the format is something like bucket_name/db_name/yyyy/mm/dd/table_name_[yyyymmdd|ddmmyyyy]_[modified|].csv. The problem I have is that the {table} name is in the final filename, rather than a subdirectory in the path, and that it has variable text after the table name. Is it possible for datahub to read the right table name in these circumstances? If not, could I create a transformer to do this? I'm unclear on where I'd start if I do have to create a transformer.
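    For reference, the documented path_spec style puts {table} at a directory level, roughly as in the hedged sketch below (placeholder names, untested); whether {table} can also be matched inside the file name itself, as in the layout above, is exactly the open question.
        # Hypothetical s3 source config fragment showing the usual shape of a
        # path_spec, where {table} names a directory level.
        source_config = {
            "path_specs": [{"include": "s3://bucket_name/db_name/{table}/*.csv"}],
            "aws_config": {"aws_region": "us-east-1"},  # placeholder region
        }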

    numerous-address-22061

    06/27/2023, 7:14 PM
    Is there any way to bring the XComs from Airflow task runs into DataHub? I already have the runs and the time, run_id, status, etc. But seeing the XComs would be awesome.

    dazzling-rainbow-96194

    06/27/2023, 8:37 PM
    Hi All, I am trying to do an ingest from Snowflake, and I am using a role that has access to only one schema. I am filtering just 5 tables in the initial ingest to test the setup, but I see that DataHub is somehow scanning all schemas available in Snowflake, even the ones that are not accessible to the user and the role. It does say "Skipping operations for table <table name>, as table schema is not accessible", but since it is scanning everything, it is timing out because we have a lot of objects in Snowflake. Is there a way to restrict this so that we can keep the ingest limited to the schema and tables of interest?
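    For what it's worth, a hedged sketch of the recipe-level filters usually used to narrow a Snowflake ingest (all names below are placeholders; whether these patterns also prevent the initial schema scan is part of the question here):
        # Hypothetical snowflake source config restricting the run to one schema
        # and an explicit allow-list of tables (values are regex patterns).
        snowflake_config = {
            "account_id": "my_account",   # placeholder
            "warehouse": "MY_WH",         # placeholder
            "username": "datahub_user",   # placeholder
            "password": "...",
            "database_pattern": {"allow": ["^MY_DB$"]},
            "schema_pattern": {"allow": ["^MY_SCHEMA$"]},
            "table_pattern": {"allow": ["^MY_DB\\.MY_SCHEMA\\.(TABLE_A|TABLE_B)$"]},
            "include_table_lineage": False,
            "profiling": {"enabled": False},
        }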

    blue-honey-61652

    06/28/2023, 6:39 AM
    Hi everyone (I am using the PowerBI ingestion), is there any way to prevent the ingestion from using a cached access token? Right now I have to wait for the access token to expire each time I change something in the Azure app permissions to see whether it has the expected result...

    shy-dog-84302

    06/28/2023, 12:27 PM
    Hi, I see that the BigQuery metadata ingestion plugin adds a nice externalUrl to the containers and datasets, pointing to the GCP console. Similarly, I would like to add an external link to all my Kafka topics ingested from the Kafka plugin; that gives my users more info about a topic. Is there an out-of-the-box config or solution available for that? I tried exploring the GraphQL API to see if there is a mutation I can use to update it offline, but could not find one. I only found the updateDataset(s) mutation, which adds a link in the institutionalMemory. Not sure if this is the best way to go. Any experience/suggestions on this would be greatly appreciated 🙂 Thanks.
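    One possible workaround, as a hedged sketch rather than an out-of-the-box feature (the topic name and URLs are placeholders): emit a datasetProperties aspect with externalUrl set for each topic via the Python emitter. Note that upserting datasetProperties replaces that aspect, so existing fields such as the description may need to be fetched and merged first.
        # Sketch: set externalUrl on a Kafka topic dataset by emitting datasetProperties.
        from datahub.emitter.mce_builder import make_dataset_urn
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetPropertiesClass

        emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL

        topic_urn = make_dataset_urn(platform="kafka", name="my.topic.name", env="PROD")
        props = DatasetPropertiesClass(
            name="my.topic.name",
            externalUrl="https://my-kafka-ui.example.com/topics/my.topic.name",  # placeholder link
        )
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=topic_urn, aspect=props))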

    blue-holiday-20644

    06/29/2023, 9:26 AM
    Hi, we're trying to get dbt integration under MWAA for Datahub ingestion and are hitting dependency issues. For a basic requirements.txt containing:
    Copy code
    acryl-datahub==0.10.4
    acryl-datahub-airflow-plugin==0.10.4
    dbt-core==1.4.0
    dbt-redshift==1.4.0
    We get:
    Copy code
    ERROR: Cannot install acryl-datahub-airflow-plugin and dbt-core because these package versions have conflicting dependencies.
    This error makes it look like a straight clash between the datahub plugin and dbt-core. But it's really a three-way clash. When we install those two in a requirements file locally they are compatible. Even adding apache-airflow==2.5.1 to a local requirements.txt works. It is the combination of the datahub plugin, dbt and MWAA that clashes. We're removing the 0.10.4 version locks next, but if anyone's encountered this or has any ideas that would be great.

    dazzling-rainbow-96194

    06/29/2023, 12:11 PM
    Hi, we are attempting to ingest Snowflake as a source using the Kubernetes setup, and the ingestion succeeded. However:
    1. On the first run I disabled lineage and profiling and kept only "Include tables and views". I see that a couple of tables don't appear in the list of tables.
    2. I did a 2nd run with a filter to limit to only 5 tables using a pattern, while keeping lineage and profiling on. I see that it has picked up only one table and shows no lineage or profiling information.
    Is there something else that needs to be done that I am missing? What's the best way to debug and make sure all tables and views appear?

    quiet-exabyte-77821

    06/29/2023, 8:10 PM
    Hi, is there a way to make the compiled SQL script of dbt models visible within the DataHub UI? Thanks

    steep-vr-39297

    06/30/2023, 1:55 AM
    Hi, Tim. When I debug with the master branch, it runs fine, but when I run with the datahub-protobuf-0.10.5-SNAPSHOT.jar file via build, I get an error. I need your help.

    jolly-airline-17196

    06/30/2023, 7:23 AM
    hey, is there any way we can delete glossary term groups which have multiple term groups nested inside them, using the CLI or APIs? Using the UI takes a lot of time.

    jolly-airline-17196

    06/30/2023, 7:26 AM
    Also, does glossary ingestion support something like overwrite_existing=true?

    cool-gpu-21169

    06/30/2023, 3:18 PM
    Hello. We are planning to ingest a proprietary database's metadata into DataHub. Since there is no source available, should I write a custom source, or is there a different method to ingest metadata in this case? I can extract the metadata in the format DataHub is expecting. Any suggestions before we start developing a custom source? Thanks
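    For reference, a minimal, hedged skeleton of the custom Source interface (names are illustrative and based on the metadata-ingestion framework around 0.10.x; details may differ by version):
        # Sketch of a custom ingestion source that yields a single datasetProperties aspect.
        from typing import Dict, Iterable

        from datahub.configuration.common import ConfigModel
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.ingestion.api.common import PipelineContext
        from datahub.ingestion.api.source import Source, SourceReport
        from datahub.ingestion.api.workunit import MetadataWorkUnit
        from datahub.metadata.schema_classes import DatasetPropertiesClass


        class MyDbSourceConfig(ConfigModel):
            env: str = "PROD"  # add connection settings for the proprietary DB here


        class MyDbSource(Source):
            def __init__(self, config: MyDbSourceConfig, ctx: PipelineContext):
                self.ctx = ctx
                self.config = config
                self.report = SourceReport()

            @classmethod
            def create(cls, config_dict: Dict, ctx: PipelineContext) -> "MyDbSource":
                return cls(MyDbSourceConfig.parse_obj(config_dict), ctx)

            def get_workunits(self) -> Iterable[MetadataWorkUnit]:
                # Replace with real extraction from the proprietary database.
                mcp = MetadataChangeProposalWrapper(
                    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mydb,db.schema.table,PROD)",
                    aspect=DatasetPropertiesClass(name="table", description="example table"),
                )
                yield mcp.as_workunit()

            def get_report(self) -> SourceReport:
                return self.report

            def close(self) -> None:
                pass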

    ancient-kitchen-28586

    07/01/2023, 11:25 AM
    Hi guys, is there anybody using TimeXtender (DWH automation, https://www.timextender.com/)? I'm thinking about making a script or a plugin for ingesting from a TimeXtender repository. Would you be interested in that? Or maybe you're working on something similar?

    average-nail-72662

    07/02/2023, 11:29 PM
    Hi guys, I have a question: is there a transformer for documentation?