# ingestion

    thankful-jackal-96705

    06/23/2023, 9:46 AM
    Hello Team, I am adding a Postgres DB as an ingestion source. I have deployed DataHub in a private Kubernetes cluster and I'm getting the error "could not find a version that satisfies the requirement wheel". Since the cluster is private, packages cannot be installed while adding the source. Any idea which packages need to be installed, and into which Docker image?

    rich-restaurant-61261

    06/23/2023, 8:30 PM
    Hi team, I am trying to connect DataHub with Superset. For the connect_uri I used the Superset ingress, but it throws:
    PipelineInitError: Failed to configure the source (superset): Exceeded 30 redirects.
    The username and password I used here are correct. Does anyone know what this error means and how I can solve it?
    Copy code
    source:
        type: superset
        config:
            connect_uri: 'https://di-superset.aoc.xxx.com'
            username: xxx
            password: xxx
            provider: db

    microscopic-room-90690

    06/26/2023, 5:19 AM
    Hi team, I'm wondering how to distinguish between tables and views in DataHub. For Hive, I create a Hive view, but it is shown as a table in DataHub. For Trino, tables and views share the same URN format. This really puzzles me. Can anyone help?

    acceptable-morning-73148

    06/26/2023, 7:12 AM
    Hi, I'd like to search for text in the Logic part of a view definition. Executing a GraphQL query like this:
    Copy code
    query($urn: String!) {
        dataset(urn: $urn) {
            viewProperties {
                logic
            }
        }
    }
    retrieves the viewProperties and the logic attribute:
    Copy code
    {
      "data": {
        "dataset": {
          "viewProperties": {
            "logic": "let source = Sql.Database ........"
          }
        }
      },
      "extensions": {}
    }
    How can I search the contents of the logic attribute? For example, this query doesn't produce any results:
    Copy code
    query ($source: String!) {
        searchAcrossEntities(input: {
            start: 0, 
            count: 100, 
            query: "", 
            orFilters: [{
                and: [
                    {field: "logic", condition: CONTAIN, negated: false, values: ["Sql.Database"]}
                ]}]
            }
        ) {
            searchResults {
                entity {
                    urn,
                    __typename
                }
            }
        }
    }
    Note how I'm trying to match the logic attribute to a value it might contain.

    fast-judge-41877

    06/26/2023, 8:27 AM
    hi team, I am trying to add a new plugin 'couchbase' to ingestion using the command pip install -e '.[couchbase]'. The installation is successful, but when I check plugins using the command datahub check plugins, I get the error below:
    terminate called after throwing an instance of 'std::bad_cast'
      what():  std::bad_cast
    Aborted
    Can anyone help me with this error? Thanks in advance.

    billions-rose-75566

    06/26/2023, 8:50 AM
    Hi DataHub, when we send data through the kafka-sink, which topics will get a new message?

    microscopic-room-90690

    06/26/2023, 9:29 AM
    Hi team, I found it very useful to use a Python script to ingest metadata: https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/examples/library. Even though we can use a different data source by changing the 'platform', I'm wondering how to ingest the source-specific settings of each source, such as the S3 path. Can anyone help?
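    For reference, a minimal, untested sketch of one way to do this with the programmatic pipeline API (the bucket, prefix, and server URL below are placeholders): source-specific options such as the S3 path go in the source's config block, exactly as they would in a YAML recipe.
        # Sketch: run an ingestion recipe from Python so that source-specific
        # config (here the S3 path_specs) can be supplied directly.
        from datahub.ingestion.run.pipeline import Pipeline

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        # hypothetical bucket/prefix; adjust to your layout
                        "path_specs": [{"include": "s3://my-bucket/bdp/ingest/*/*.parquet"}],
                        "aws_config": {"aws_region": "us-east-1"},
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()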

    millions-addition-50023

    06/26/2023, 9:52 AM
    What plugins does LinkedIn's DataHub have?

    millions-addition-50023

    06/26/2023, 9:53 AM
    What plugins does LinkedIn's DataHub have? Please answer in Chinese.

    fierce-restaurant-41034

    06/26/2023, 3:06 PM
    Hi all, I have ingested dbt & Snowflake into DataHub and have gotten the "Composed Of" relationships between dbt and Snowflake objects as expected. Now I am trying to delete the dbt platform from DataHub with the "hard" delete option. Although the dbt objects were deleted, I still see them as "Composed Of" on the Snowflake objects. Looking into the database, I found rows with the sibling aspect of Snowflake objects that contained the relation to dbt. When I deleted those rows from the DB, dbt was completely removed from the UI. This looks like a good solution for me, but is it the best way to delete siblings? Why weren't the siblings removed by the hard delete? It looks like the URNs of Snowflake and dbt differ because of capital letters, but I don't know whether this can affect the deletion (as in the pic). Thanks

    rich-restaurant-61261

    06/26/2023, 7:19 PM
    Hi Team, I successfully ingested Trino data into DataHub. When I browse the data, I see there is a Validation tab under the table, and it is greyed out. Does anyone know what it is planned to provide, and how can I enable it?

    ripe-stone-30144

    06/26/2023, 8:28 PM
    Hi guys! Could you please advise if it is possible to ingest Cassandra's metadata?

    ambitious-bird-91607

    06/26/2023, 9:27 PM
    Hi there! I've noticed a discrepancy between the metadata stored in schemaMetadata and editableSchemaMetadata. After making changes directly in my ClickHouse (schemaMetadata), I've observed that the DataHub user interface still displays the metadata stored in editableSchemaMetadata without reflecting the changes made. I would like to better understand how this situation is handled and whether there is any mechanism to automatically synchronize the metadata between both sources. Should I manually update editableSchemaMetadata to reflect the changes made in schemaMetadata? Does DataHub always give priority to editableSchemaMetadata, regardless of any recent updates in schemaMetadata?

    billions-journalist-13819

    06/27/2023, 6:49 AM
    Hi, Team... Compared to other DBs, Databricks Unity Catalog provides insufficient stats information. I hope more stats information for Databricks Unity Catalog can be added. Would this be possible?

    quiet-scientist-40341

    06/27/2023, 8:20 AM
    Hi, Everyone. Has anyone run into this issue? I emit a DataProcessInstance for a DataJob using the Java client, but it does not work.
    public void testDataProcessInstance() throws IOException, ExecutionException, InterruptedException, URISyntaxException {
        KafkaEmitterConfig.KafkaEmitterConfigBuilder builder = KafkaEmitterConfig.builder();
        KafkaEmitterConfig config = builder.build();
        KafkaEmitter emitter = new KafkaEmitter(config);
        if (emitter.testConnection()) {
            DataFlowUrn dataFlowUrn = new DataFlowUrn("test_01", "mannual_001", "PROD");
            String flowId = DigestUtils.md5DigestAsHex(dataFlowUrn.toString().getBytes());
            Urn jobFlowRunUrn = Urn.createFromTuple("dataProcessInstance", flowId);
            DataJobUrn dataJobUrn = new DataJobUrn(dataFlowUrn, "mannual_job_001");
            String jobId = DigestUtils.md5DigestAsHex(dataJobUrn.toString().getBytes());
            Urn jobRunUrn = Urn.createFromTuple("dataProcessInstance", jobId);

            // dataProcessInstanceProperties aspect
            DataProcessInstanceProperties dataProcessInstanceProperties = new DataProcessInstanceProperties();
            dataProcessInstanceProperties.setName(jobId);
            AuditStamp auditStamp = new AuditStamp();
            auditStamp.setTime(System.currentTimeMillis());
            auditStamp.setActor(Urn.createFromString("urn:li:corpuser:datahub"));
            dataProcessInstanceProperties.setCreated(auditStamp);
            dataProcessInstanceProperties.setType(DataProcessType.BATCH_SCHEDULED);
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceProperties)
                .build();
            Future<MetadataWriteResponse> requestFuture = emitter.emit(mcpw, null);
            MetadataWriteResponse metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceProperties=====");

            // dataProcessInstanceRelationships aspect
            DataProcessInstanceRelationships dataProcessInstanceRelationships = new DataProcessInstanceRelationships();
            System.out.println("====jobFlowRunUrn=====" + jobFlowRunUrn.toString());
            System.out.println("====dataFlowUrn=====" + dataFlowUrn.toString());
            dataProcessInstanceRelationships.setParentInstance(jobFlowRunUrn);
            dataProcessInstanceRelationships.setParentTemplate(dataJobUrn);
            mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceRelationships)
                .build();
            requestFuture = emitter.emit(mcpw, null);
            metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceRelationships=====");

            // dataProcessInstanceRunEvent aspect
            DataProcessInstanceRunEvent dataProcessInstanceRunEvent = new DataProcessInstanceRunEvent();
            dataProcessInstanceRunEvent.setStatus(DataProcessRunStatus.STARTED);
            dataProcessInstanceRunEvent.setTimestampMillis(System.currentTimeMillis());
            dataProcessInstanceRunEvent.setAttempt(1);
            mcpw = MetadataChangeProposalWrapper.builder()
                .entityType("dataProcessInstance")
                .entityUrn(jobRunUrn)
                .upsert()
                .aspect(dataProcessInstanceRunEvent)
                .build();
            requestFuture = emitter.emit(mcpw, null);
            metadataWriteResponse = requestFuture.get();
            System.out.println(metadataWriteResponse.getResponseContent());
            System.out.println("===========dataProcessInstanceRunEvent=====");
        }
        System.out.println("====================ab");
    }

    microscopic-room-90690

    06/27/2023, 9:54 AM
    Hi all, I want to use make_dataset_urn in a Python script to ingest metadata from an S3 source. I tried make_dataset_urn(platform="s3", name="bdp/ingest/test/test/account") and got the result below. It shows the path I need, but the name of the dataset is expected to be account. Can anyone help?
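    For reference, a minimal sketch of what make_dataset_urn produces: the URN name is the full path, and a friendlier display name such as account is normally set separately (for example by the S3 source itself or via a datasetProperties aspect).
        from datahub.emitter.mce_builder import make_dataset_urn

        # The URN embeds the full path; the UI display name comes from elsewhere.
        urn = make_dataset_urn(platform="s3", name="bdp/ingest/test/test/account", env="PROD")
        print(urn)
        # -> urn:li:dataset:(urn:li:dataPlatform:s3,bdp/ingest/test/test/account,PROD)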

    ancient-queen-15575

    06/27/2023, 12:20 PM
    I'm trying to use the s3 source, but have paths that look like:
    • bucket_name/db_name/2022/10/03/customers_20221003.csv
    • bucket_name/db_name/2022/10/03/customers_20221003_modified.csv
    • bucket_name/db_name/2022/10/03/account_renewals_03102022.csv
    So the format is something like bucket_name/db_name/yyyy/mm/dd/table_name_[yyyymmdd|ddmmyyyy]_[modified|].csv. The problem I have is that the {table} name is in the final filename, rather than a subdirectory in the path, and that it has variable text after the table name. Is it possible for datahub to read the right table name in these circumstances? If not, could I create a transformer to do this? I'm unclear on where I'd start if I do have to create a transformer.
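    For reference, the documented path_spec style puts {table} at a directory level, roughly as in the hedged sketch below (placeholder names, untested); whether {table} can also be matched inside the file name itself, as in the layout above, is exactly the open question.
        # Hypothetical s3 source config fragment showing the usual shape of a
        # path_spec, where {table} names a directory level.
        source_config = {
            "path_specs": [{"include": "s3://bucket_name/db_name/{table}/*.csv"}],
            "aws_config": {"aws_region": "us-east-1"},  # placeholder region
        }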

    numerous-address-22061

    06/27/2023, 7:14 PM
    Is there any way to bring the XComs from Airflow task runs into DataHub? I already have the runs and the time, run_id, status, etc. But seeing the XComs would be awesome.

    dazzling-rainbow-96194

    06/27/2023, 8:37 PM
    Hi All, I am trying to do an ingest from Snowflake, and I am using a role that has access to only one schema. I am filtering just 5 tables in the initial ingest to test the setup, but I see that DataHub is somehow scanning all schemas available in Snowflake, even the ones that are not accessible to the user and the role. It does say "Skipping operations for table <table name>, as table schema is not accessible", but since it is scanning everything, it is timing out because we have a lot of objects in Snowflake. Is there a way to restrict this so that we can keep the ingest limited to the schema and tables of interest?
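    For what it's worth, a hedged sketch of the recipe-level filters usually used to narrow a Snowflake ingest (all names below are placeholders; whether these patterns also prevent the initial schema scan is part of the question here):
        # Hypothetical snowflake source config restricting the run to one schema
        # and an explicit allow-list of tables (values are regex patterns).
        snowflake_config = {
            "account_id": "my_account",   # placeholder
            "warehouse": "MY_WH",         # placeholder
            "username": "datahub_user",   # placeholder
            "password": "...",
            "database_pattern": {"allow": ["^MY_DB$"]},
            "schema_pattern": {"allow": ["^MY_SCHEMA$"]},
            "table_pattern": {"allow": ["^MY_DB\\.MY_SCHEMA\\.(TABLE_A|TABLE_B)$"]},
            "include_table_lineage": False,
            "profiling": {"enabled": False},
        }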

    blue-honey-61652

    06/28/2023, 6:39 AM
    Hi everyone (I am using the PowerBI ingestion), is there any way to prevent the ingestion from using a cached access token? Right now I have to wait for the access token to expire each time I change something in the Azure app permissions to see whether it has the expected result...

    shy-dog-84302

    06/28/2023, 12:27 PM
    Hi, I see that the BigQuery metadata ingestion plugin adds a nice externalUrl to the containers and datasets, pointing to the GCP console. Similarly, I would like to add an external link to all my Kafka topics ingested from the Kafka plugin; that gives my users more info about a topic. Is there an out-of-the-box config or solution available for that? I tried exploring the GraphQL API to see if there is a mutation I can use to update it offline, but could not find one. I only found the updateDataset(s) mutation, which adds a link in the institutionalMemory. Not sure if this is the best way to go. Any experience/suggestions on this would be greatly appreciated 🙂 Thanks.
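    One possible workaround, as a hedged sketch rather than an out-of-the-box feature (the topic name and URLs are placeholders): emit a datasetProperties aspect with externalUrl set for each topic via the Python emitter. Note that upserting datasetProperties replaces that aspect, so existing fields such as the description may need to be fetched and merged first.
        # Sketch: set externalUrl on a Kafka topic dataset by emitting datasetProperties.
        from datahub.emitter.mce_builder import make_dataset_urn
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetPropertiesClass

        emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL

        topic_urn = make_dataset_urn(platform="kafka", name="my.topic.name", env="PROD")
        props = DatasetPropertiesClass(
            name="my.topic.name",
            externalUrl="https://my-kafka-ui.example.com/topics/my.topic.name",  # placeholder link
        )
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=topic_urn, aspect=props))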

    blue-holiday-20644

    06/29/2023, 9:26 AM
    Hi, we're trying to get dbt integration under MWAA for Datahub ingestion and are hitting dependency issues. For a basic requirements.txt containing:
    Copy code
    acryl-datahub==0.10.4
    acryl-datahub-airflow-plugin==0.10.4
    dbt-core==1.4.0
    dbt-redshift==1.4.0
    We get:
    Copy code
    ERROR: Cannot install acryl-datahub-airflow-plugin and dbt-core because these package versions have conflicting dependencies.
    This error makes it look like a straight clash between the datahub plugin and dbt-core. But it's really a three-way clash. When we install those two in a requirements file locally they are compatible. Even adding apache-airflow==2.5.1 to a local requirements.txt works. It is the combination of the datahub plugin, dbt and MWAA that clashes. We're removing the 0.10.4 version locks next, but if anyone's encountered this or has any ideas that would be great.

    dazzling-rainbow-96194

    06/29/2023, 12:11 PM
    Hi, we are attempting to ingest Snowflake as a source using the Kubernetes setup, and the ingestion succeeded. However:
    1. On the first run I disabled lineage and profiling and kept only "Include tables and views". I see that a couple of tables don't appear in the list of tables.
    2. I did a 2nd run with a filter to limit to only 5 tables using a pattern, while keeping lineage and profiling on. I see that it has picked up only one table and shows no lineage or profiling information.
    Is there something else that needs to be done that I am missing? What's the best way to debug and make sure all tables and views appear?

    quiet-exabyte-77821

    06/29/2023, 8:10 PM
    Hi, is there a way to make the compiled SQL script of dbt models visible within the DataHub UI? Thanks

    steep-vr-39297

    06/30/2023, 1:55 AM
    Hi, Tim. When I debug with the master branch, it runs fine, but when I run with the datahub-protobuf-0.10.5-SNAPSHOT.jar file via build, I get an error. I need your help.

    jolly-airline-17196

    06/30/2023, 7:23 AM
    hey, is there any way we can delete glossary term groups which have multiple term groups nested inside them, using the CLI or APIs? Using the UI takes a lot of time.

    jolly-airline-17196

    06/30/2023, 7:26 AM
    Also, does glossary ingestion support something like overwrite_existing=true?

    cool-gpu-21169

    06/30/2023, 3:18 PM
    Hello. We are planning to ingest a proprietary database's metadata into DataHub. Since there is no source available, should I write a custom source, or is there a different method to ingest metadata in this case? I can extract the metadata in the format DataHub is expecting. Any suggestions before we start developing a custom source? Thanks
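    For reference, a minimal, hedged skeleton of the custom Source interface (names are illustrative and based on the metadata-ingestion framework around 0.10.x; details may differ by version):
        # Sketch of a custom ingestion source that yields a single datasetProperties aspect.
        from typing import Dict, Iterable

        from datahub.configuration.common import ConfigModel
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.ingestion.api.common import PipelineContext
        from datahub.ingestion.api.source import Source, SourceReport
        from datahub.ingestion.api.workunit import MetadataWorkUnit
        from datahub.metadata.schema_classes import DatasetPropertiesClass


        class MyDbSourceConfig(ConfigModel):
            env: str = "PROD"  # add connection settings for the proprietary DB here


        class MyDbSource(Source):
            def __init__(self, config: MyDbSourceConfig, ctx: PipelineContext):
                self.ctx = ctx
                self.config = config
                self.report = SourceReport()

            @classmethod
            def create(cls, config_dict: Dict, ctx: PipelineContext) -> "MyDbSource":
                return cls(MyDbSourceConfig.parse_obj(config_dict), ctx)

            def get_workunits(self) -> Iterable[MetadataWorkUnit]:
                # Replace with real extraction from the proprietary database.
                mcp = MetadataChangeProposalWrapper(
                    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mydb,db.schema.table,PROD)",
                    aspect=DatasetPropertiesClass(name="table", description="example table"),
                )
                yield mcp.as_workunit()

            def get_report(self) -> SourceReport:
                return self.report

            def close(self) -> None:
                pass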

    ancient-kitchen-28586

    07/01/2023, 11:25 AM
    Hi guys, is there anybody using TimeXtender (DWH automation, https://www.timextender.com/)? I'm thinking about making a script or a plugin for ingesting from a TimeXtender repository. Would you be interested in that? Or maybe you're working on something similar?

    average-nail-72662

    07/02/2023, 11:29 PM
    Hi guys, I have a question: is there a transformer for documentation?