# ingestion
  • s

    strong-kite-83354

    10/19/2022, 3:00 PM
    Hi All - I've been experimenting with ingesting data to the Validation tab which is now working nicely for the "assertions" side of things but I see that there are also "Tests" which can be populated - as in this dataset on the demo site: https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:postgres,calm[…]fle_shop.customers,PROD)/Validation?is_lineage_mode=false How do I populate "Tests" in the validation tab? Is there any example Python code? The closest I got was finding a TestInfo aspect.
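    A hedged sketch (not an answer from the thread): the "Tests" sub-tab appears to be driven by the dataset's testResults aspect, so one option is to emit that aspect with the Python emitter, assuming your DataHub version includes the generated TestResults classes. The dataset and test URNs below are hypothetical.
    Copy code
    # Sketch: attach a passing test result to a dataset via the testResults aspect.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        TestResultClass,
        TestResultsClass,
        TestResultTypeClass,
    )

    dataset_urn = make_dataset_urn("postgres", "jaffle_shop.customers", "PROD")  # hypothetical
    test_urn = "urn:li:test:has_owner"  # hypothetical test entity URN

    emitter = DatahubRestEmitter("http://localhost:8080")  # hypothetical GMS endpoint
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="testResults",
            aspect=TestResultsClass(
                passing=[TestResultClass(test=test_urn, type=TestResultTypeClass.SUCCESS)],
                failing=[],
            ),
        )
    )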
  • s

    shy-lion-56425

    10/19/2022, 3:39 PM
    Hi All, is there any support for files (TSV/CSV) without headers? For example, an option to specify the schema/column names?
  • g

    green-lion-58215

    10/19/2022, 5:34 PM
    Hello, is there a way to map tables to domains using the dbt ingestion? I see that we can use domain transformers, but what I want to do is allow developers to provide the domain name within the meta section of dbt models. Similar to how we can add glossary terms and ownership in the meta section, can we add domains as well?
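    A hedged sketch (not from the thread) of the transformer route mentioned above, expressed as a programmatic recipe; the file paths, URN pattern, and domain URN are hypothetical, and this does not cover a meta-mapping-based option.
    Copy code
    # Sketch: map dbt datasets to a domain by URN pattern via a transformer.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "./target/manifest.json",  # hypothetical path
                    "catalog_path": "./target/catalog.json",    # hypothetical path
                    "target_platform": "postgres",
                },
            },
            "transformers": [
                {
                    "type": "pattern_add_dataset_domain",
                    "config": {
                        "domain_pattern": {
                            # regex on dataset URN -> list of domain URNs (hypothetical)
                            "rules": {".*finance.*": ["urn:li:domain:finance"]}
                        }
                    },
                }
            ],
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()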
  • b

    billowy-book-26360

    10/19/2022, 8:58 PM
    I use an MSSQL recipe to ingest tables from Schema-A. When I navigate to Schema-A in DataHub, it lists the tables as expected. I now add a glossary term (my Data Concept label) to Schema-A. My expectation is that when I navigate to my Data Concept glossary term, I'll see Schema-A (or the tables in Schema-A) under the Related Entities tab, but the tab is empty. How do I achieve this, short of labeling each table at ingestion time?
  • f

    few-carpenter-93837

    10/20/2022, 7:08 AM
    Hey all, just to confirm: is field-level lineage only available via the Python emitter? Is it not doable through the CLI and recipes?
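    A hedged sketch (not from the thread) of emitting column-level lineage with the Python emitter, which is the route the message refers to; the table and column names are hypothetical.
    Copy code
    # Sketch: emit fine-grained (column-level) lineage for one downstream dataset.
    from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        FineGrainedLineageClass,
        FineGrainedLineageDownstreamTypeClass,
        FineGrainedLineageUpstreamTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream = make_dataset_urn("postgres", "public.orders_raw", "PROD")      # hypothetical
    downstream = make_dataset_urn("postgres", "public.orders_clean", "PROD")  # hypothetical

    column_lineage = FineGrainedLineageClass(
        upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
        upstreams=[make_schema_field_urn(upstream, "order_id")],
        downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
        downstreams=[make_schema_field_urn(downstream, "order_id")],
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # hypothetical GMS endpoint
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=downstream,
            aspectName="upstreamLineage",
            aspect=UpstreamLineageClass(
                upstreams=[UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)],
                fineGrainedLineages=[column_lineage],
            ),
        )
    )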
  • a

    alert-fall-82501

    10/20/2022, 10:47 AM
    Error while pulling images. Going to attempt to move on to docker compose up assuming the images have been built locally
  • a

    alert-fall-82501

    10/20/2022, 10:47 AM
    Can anybody advise on this? I'm trying to install DataHub.
  • b

    billowy-alarm-46123

    10/20/2022, 11:40 AM
    Not sure if this is the right place, but I'll try. I've seen a lot of suggestions to use the CLI for marking objects as deleted, but I'm not able to find an API (a programmatic way) for it. We have an ETL pipeline which updates a certain platform; I would like to run something (not the CLI) and mark all entities of a single platform as deleted, so that when I then run the ETL, I can mark the entities that still exist as not deleted. Can someone point me to documentation or something regarding a programmatic way of deleting entities. Thank you
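    A minimal sketch (not from the thread) of one programmatic route, assuming the Python SDK's REST emitter: a soft delete is just the "status" aspect with removed=True, so it can be emitted per URN without the CLI. The server URL and dataset name below are hypothetical.
    Copy code
    # Sketch: soft-delete an entity by emitting the "status" aspect (removed=True).
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # hypothetical GMS endpoint

    dataset_urn = make_dataset_urn("mysql", "my_db.my_table", "PROD")  # hypothetical dataset

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="status",
            aspect=StatusClass(removed=True),  # soft delete; emit removed=False to restore
        )
    )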
  • b

    brainy-crayon-53549

    10/20/2022, 12:04 PM
    Is there any way to customize dashboard reports in DataHub?
  • b

    billowy-book-26360

    10/19/2022, 8:28 PM
    I have an S3 structure like:
    test-s3-bucket
    └── orders
        ├── year=2021
        │   └── month=12
        │       └── 1.parquet
        └── year=2022
            ├── month=01
            │   └── 1.parquet
            └── month=02
                └── 1.parquet
    I'm using {table} in path_spec to ingest at the table level (thanks Tamas!) but am unclear about partition_key[i], partition[i] in https://datahubproject.io/docs/generated/ingestion/sources/s3#path-specs. Do they exist to capture partitioning metadata, or do they specify S3 scanning more precisely & efficiently?
    Q1) Which of these should I use and why? My partitioned directories are all consistently named like above.
    # explicit partition_key[i], partition[i] for the full path
    s3://foo/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    # hardcoded year, month partition keys
    s3://foo/{table}/year={partition[0]}/month={partition[1]}/*.parquet
    # wildcard directories
    s3://foo/{table}/{partition_key[0]}={partition[0]}/*/*.parquet
    Q2) For each table, I'd like to capture the min & max partition metadata to know the data range. I have a custom script that lists the min and max file for a given table. Can I incorporate this into the {table}-level recipe to capture the min/max during ingestion? I'd like to avoid the data profiling route due to data volume/egress cost.
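    A hedged sketch of the first option above (explicit partition_key/partition tokens) as a programmatic recipe; the bucket name, region, and credentials handling are hypothetical, and which option is preferable is exactly the open question in the message.
    Copy code
    # Sketch: table-level S3 ingestion with explicit partition tokens in path_spec.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "platform": "s3",
                    "path_spec": {
                        "include": "s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet"
                    },
                    "aws_config": {"aws_region": "us-east-1"},  # hypothetical region
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()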
  • g

    green-tent-78669

    10/20/2022, 2:52 PM
    Hi, I'm trying to add stats for tables with profiling. The ingestion completes, but in the DataHub UI the Stats tab is still disabled. Thanks
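    A hedged sketch (the source type is not stated in the message; postgres is assumed) of the profiling block that needs to be enabled before the Stats tab fills in; host and credentials are hypothetical.
    Copy code
    # Sketch: enable profiling in a SQL-source recipe, run programmatically.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",  # assumed source type
                "config": {
                    "host_port": "localhost:5432",   # hypothetical
                    "database": "mydb",              # hypothetical
                    "username": "datahub",           # hypothetical
                    "password": "datahub",           # hypothetical
                    "profiling": {"enabled": True},  # required for table/column stats
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()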
  • r

    rapid-fall-7147

    10/20/2022, 5:59 PM
    Hi All, we have a MongoDB ingestion job using the DataHub CLI, but the CLI runs an aggregation against the system.views collection. We want to restrict that, but don't see an option in the YAML; is there any way to do that? Also, is there a way to see the actual query being fired? I tried the --debug option but had no luck.
    Copy code
    'errmsg': 'not authorized on vv-db to execute command { aggregate: "system.views", pipeline: [ { $addFields: { temporary_doc_size_field: { $bsonSize: "$$ROOT" } } }, { $match: { temporary_doc_size_field: { $lt: 16793600 } } }, { $project: { temporary_doc_size_field: 0 } }, { $sample: { size: 1000 } } ], allowDiskUse: true, cursor: {},
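    A hedged sketch of excluding the system collections via the mongodb source's collection_pattern deny list, assuming that option exists in the CLI version in use; the connection details are hypothetical.
    Copy code
    # Sketch: deny system.* collections so the sampling aggregation never touches them.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                "config": {
                    "connect_uri": "mongodb://localhost:27017",  # hypothetical
                    "username": "datahub",                       # hypothetical
                    "password": "datahub",                       # hypothetical
                    # broad regex so it matches whether or not the name is db-qualified
                    "collection_pattern": {"deny": [".*system\\..*"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()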
  • m

    mysterious-advantage-78411

    10/21/2022, 1:38 PM
    Hi all, could someone tell me how to make the hierarchy of projects work for Tableau? It seems there are duplicates across different projects / sub-projects after a successful ingestion from Tableau. Are there any suggestions for working with Tableau projects / sub-projects?
  • b

    best-umbrella-88325

    10/21/2022, 1:40 PM
    Hi All. We've deployed Datahub on EKS, which has given us 2 classic load balancers, one for GMS and the other for Frontend. However, while ingesting metadata from the UI, it fails every time. Nothing major in the logs. We provided the host of the LB which hosts GMS, but that didn't help either. It works from the CLI though, when I configure the GMS host using datahub init. Are we missing something? Thanks in advance. This is the recipe that works from the CLI but not from the UI. Error logs attached as part of thread.
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://a35f8626d7XXXXXbeec24fdaa5720-XXX.us-west-1.elb.amazonaws.com:8080/'
    source:
        type: s3
        config:
            path_spec:
                include: 's3://XX-bkt/*.*'
            platform: s3
            aws_config:
                aws_access_key_id: XXXXXXX
                aws_region: us-west-1
                aws_secret_access_key: XXXXXXXXX
    pipeline_name: 'urn:li:dataHubIngestionSource:f751376f-ec1a-4dee-a71f-7f4f96c3cdda'
  • h

    high-gigabyte-86638

    10/21/2022, 2:20 PM
    Hello all, I have a custom source dataset and I want to use the Queries column in DataHub (see the picture). Do you know how I can ingest this? Thank you!
  • g

    gentle-camera-33498

    10/21/2022, 2:44 PM
    Hello everyone, I'm trying to ingest assertion metadata following this example, but no evaluations are appearing in the UI. Does anyone know what I could be doing wrong? Ingestion steps I coded: 1. Create the AssertionInfo object and emit it. 2. Emit the 'dataPlatformInstance' aspect for the created assertion. 3. Create the 'assertionRunEvent' aspect and emit it. All these steps use the same table and assertion URN.
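    A hedged sketch of step 3 from the message (emitting the assertionRunEvent timeseries aspect), since evaluations are rendered from completed run events; the assertion and dataset URNs are hypothetical, and the field names follow the generated schema classes as I understand them rather than the linked example.
    Copy code
    # Sketch: emit one completed assertion run with a SUCCESS result.
    import time

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AssertionResultClass,
        AssertionResultTypeClass,
        AssertionRunEventClass,
        AssertionRunStatusClass,
        ChangeTypeClass,
    )

    assertion_urn = "urn:li:assertion:my-assertion-id"                     # hypothetical
    dataset_urn = make_dataset_urn("postgres", "public.my_table", "PROD")  # hypothetical

    run_event = AssertionRunEventClass(
        timestampMillis=int(time.time() * 1000),
        assertionUrn=assertion_urn,
        asserteeUrn=dataset_urn,
        runId="2022-10-21-run-1",
        status=AssertionRunStatusClass.COMPLETE,  # only completed runs show evaluations
        result=AssertionResultClass(type=AssertionResultTypeClass.SUCCESS),
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # hypothetical GMS endpoint
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="assertion",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=assertion_urn,
            aspectName="assertionRunEvent",
            aspect=run_event,
        )
    )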
  • a

    agreeable-park-13466

    10/21/2022, 5:32 PM
    Hi Team, I have a long list of datasets which I need to soft delete. Is there any way to delete them using the OpenAPI?
  • b

    brave-farmer-39785

    10/21/2022, 9:50 PM
    Hi team, I have created a new entity called vendor with aspect vendorInfo under the project metadata-models-custom. It is deployed, and I ingested an instance value using a curl POST request with the ingestProposal action. When I use curl to get the aspect: curl --header "X-RestLi-Protocol-Version:2.0.0" "http://localhost:8080/entitiesV2/urn:li:vendor:SandP?aspects=List(vendorInfo)", it seems to work fine:
    { "urn": "urn:li:vendor:SandP", "aspects": { "vendorKey": { "created": { "actor": "urn:li:corpuser:__datahub_system", "time": 1666387872419 }, "name": "vendorKey", "type": "VERSIONED", "version": 0, "value": { "id": "SandP" } }, "vendorInfo": { "created": { "actor": "urn:li:corpuser:__datahub_system", "time": 1666385764774 }, "name": "vendorInfo", "type": "VERSIONED", "systemMetadata": { "registryVersion": "0.0.1", "runId": "no-run-id-provided", "registryName": "mycompany-dq-model", "lastObserved": 1666385764780 }, "version": 0, "value": { "name": "SandP", "phone": "8007528878", "url": "https://www.spglobal.com/", "contact": "Dimitra Manis" } } }, "entityName": "vendor" }
    The index has also been created:
    { "vendorindex_v2": { "aliases": {}, "mappings": { "properties": { "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "runId": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "urn": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } }, "settings": { "index": { "creation_date": "1666385765356", "number_of_shards": "1", "number_of_replicas": "1", "uuid": "5P1DKXUKRmCC5D1VRSlNtQ", "version": { "created": "7090399" }, "provided_name": "vendorindex_v2" } } } }
    However, when I try to search for the entity by the name 'SandP': curl -X POST http://localhost:8080/entities?action=search --data "@sandp.json", where sandp.json has the following data: { "input": "SandP", "entity": "vendor", "start": 0, "count": 10 }, it returns an empty value:
    Copy code
    {
      "value": {
        "numEntities": 0,
        "pageSize": 0,
        "from": 0,
        "metadata": {},
        "entities": []
      }
    }
    Here is the vendorInfo aspect:
    @Aspect = { "name": "vendorInfo" }
    record VendorInfo {
      @Searchable = { "fieldType": "KEYWORD", "enableAutocomplete": true, "queryByDefault": true, "boostScore": 10.0 }
      name: string
      phone: string
      contact: optional string
      url: optional string
    }
    Here is the error message from the Docker log:
    Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/vendorindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"query_shard_exception","reason":"[query_string] analyzer [custom_keyword] not found","index_uuid":"5P1DKXUKRmCC5D1VRSlNtQ","index":"vendorindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"vendorindex_v2","node":"Z4CXuUUEQuSblawDWxJSow","reason":{"type":"query_shard_exception","reason":"[query_string] analyzer [custom_keyword] not found","index_uuid":"5P1DKXUKRmCC5D1VRSlNtQ","index":"vendorindex_v2"}}]},"status":400}
    Any idea what is missing? Thanks!
  • p

    proud-table-38689

    10/21/2022, 11:05 PM
    If I create a custom DataHub source, does that get added to the DataHub Actions process?
  • m

    microscopic-tailor-94417

    10/24/2022, 8:55 AM
    Hi all, I have BigQuery sources and use the bigquery and bigquery_usage modules for ingestion. I also wonder what the bigquery_beta module is used for? When I use it for ingestion, I am not able to see any differences from the previous ingestion. I would be very happy if you could help me.
  • f

    few-air-56117

    10/24/2022, 10:19 AM
    Hi folks, I tried to do a stateful ingestion in DataHub from BigQuery (in order to delete tables/views if they no longer exist in BigQuery), but even though the job finishes with no errors, the tables that were deleted from BigQuery are still in DataHub. Thx
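    A hedged sketch of the pieces stateful ingestion needs in order to soft-delete tables that disappeared from BigQuery: a pipeline_name that stays constant across runs plus the stateful_ingestion block; the project id is hypothetical.
    Copy code
    # Sketch: stateful BigQuery ingestion that removes stale tables/views.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "pipeline_name": "bigquery-prod",  # must be identical on every run
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-gcp-project",  # hypothetical
                    "stateful_ingestion": {
                        "enabled": True,
                        "remove_stale_metadata": True,  # soft-deletes tables missing this run
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()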
  • r

    rhythmic-school-70923

    10/24/2022, 1:22 PM
    Hi all, I'm trying to integrate DataHub with Spark, but in my Spark job log I see a NullPointerException: ERROR DatahubSparkListener: java.lang.NullPointerException at datahub.spark.DatahubSparkListener.processExecution (DatahubSparkListener.java:296). Does anyone know what it is? My Spark job reads from an S3 (MinIO) folder and writes to a Delta table, also on S3.
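    Not a fix for the NullPointerException itself, but for context, a hedged sketch of how the listener is usually wired into a PySpark session; the jar version, server URL, and app name are hypothetical.
    Copy code
    # Sketch: register the DataHub Spark listener on a session.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("minio-to-delta")  # hypothetical app name
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.9.0")  # hypothetical version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")  # hypothetical GMS endpoint
        .getOrCreate()
    )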
  • a

    abundant-airport-72599

    10/24/2022, 4:53 PM
    hey all, I’m working on posting lineage from our flink applications to datahub where the upstream and downstream kafka topics that make up the graph are in some cases dynamic -- we won’t know what they are until we encounter them during runtime. I’d like to be able to remove elements from the graph when we haven’t seen them for a while, but that’s hard to do from the context of a running application where we may simply not have seen them yet. Wondering if any of you have wrestled with this same scenario and have any ideas on how to handle? Some kind of ttl on specific upstream/downstream flows would be ideal but I don’t think datahub has any kind of support for that? My thinking is that tagging a last_seen timestamp on the dataflows and leaving for someone to manually prune in the UI could work fine, but wondering if there’s a better approach? Thank you!
  • s

    steep-midnight-37232

    10/24/2022, 4:59 PM
    Hi guys, I'm so happy the column lineage feature is now available! I was trying to check it in my DataHub instance with Snowflake and Looker, but I'm only able to see the list of columns and not the lineage. I have updated DataHub to v0.9, re-ingested Looker and Snowflake data, and switched on the "Show column" control in the lineage page UI, but there are no links between Snowflake tables and Looker. Do you know if I have to change/add some parameter in the recipes? Or is there something I need to add or define to see the column lineage? Thanks 🙂
  • i

    important-night-50346

    10/24/2022, 7:14 PM
    Hi. I'm trying to configure Airflow metadata ingestion and it does not seem to work with a custom datahub.cluster set to something like MY_CLUSTER (instead of dev, prod, qa). Could you please advise if the cluster has to be something specific or whether any string is accepted? We do not capture tags_info, if it matters.
  • r

    rough-activity-61346

    10/25/2022, 12:01 AM
    Is there a way to run an Ingestion created in the UI through the CLI?
  • l

    loud-journalist-47725

    10/25/2022, 6:31 AM
    Hi, I'm trying to ingest a postgres data source and run a transformer through a YAML recipe. The transformer fails with 'extra fields not permitted' when using '`replace_existing`' and '`semantics`'. My acryl-datahub CLI client version is 0.8.43
  • l

    lively-sugar-7233

    10/25/2022, 7:23 AM
    Hello. First time being here. I have a question regarding Hive ingestion. I'm ingesting via the CLI in a Linux environment where the Hive client is installed with Kerberos authorization enabled. Simple ingestion works fine, but a problem occurs when I try profiling. Problem: the Hive I'm ingesting from is configured so that queries that could require large resources are not allowed unless some configuration, such as "hive.mapred.mode", is set to "nonstrict". This is the error message I'm getting while ingesting with profiling:
    Copy code
    ...If you know what you are doing, please set hive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
    What I tried: this kind of error is very common when querying, and usually I would put a "set" statement in front of the actual query. However, as I don't have control over DataHub's queries, I edited the client's hive-site.xml (on the machine where the ingestion runs), located in the directory set by the "HIVE_CONF_DIR" environment variable. Consequence: the same error occurs even after modifying hive-site.xml with "hive.strict.checks.cartesian.product" set to "false" and "hive.mapred.mode" set to "nonstrict". What I want to know: 1. Doesn't DataHub read the hive-site.xml specified by the "HIVE_CONF_DIR" environment variable? Is there a way to feed a Hive configuration file when running Hive ingestion? 2. Or is there a way to add "set" statements to the Hive ingestion queries without going into development mode? Thank you for reading. I think DataHub is awesome and hope to find a way to integrate it into our workflow.
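    A hedged sketch of one possible route, assuming the Hive source passes its "options" block through to SQLAlchemy/PyHive so that session configuration can be supplied via connect_args; the host and the exact passthrough behaviour are assumptions, not confirmed in the thread.
    Copy code
    # Sketch: feed Hive session settings to the ingestion connection.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # hypothetical
                    "options": {
                        "connect_args": {
                            # passed to PyHive's connection as session configuration (assumed)
                            "configuration": {
                                "hive.mapred.mode": "nonstrict",
                                "hive.strict.checks.cartesian.product": "false",
                            }
                        }
                    },
                    "profiling": {"enabled": True},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()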
  • b

    bland-orange-13353

    10/25/2022, 8:48 AM
    This message was deleted.
  • b

    bland-orange-13353

    10/25/2022, 1:27 PM
    This message was deleted.