chilly-ability-77706
09/28/2022, 7:19 PM
future-animal-95178
09/28/2022, 7:56 PM
wonderful-notebook-20086
09/28/2022, 10:50 PM
refined-energy-76018
09/29/2022, 3:35 AM
AIRFLOW__LINEAGE__BACKEND can be set to an empty string to 'disable' the integration. When I set AIRFLOW__DATAHUB__CONN_ID to an empty string or remove the [datahub] configuration from airflow.cfg entirely, I still get the error messages in the task logs, which is noisy.
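A hedged aside, assuming the acryl-datahub-airflow-plugin reads a [datahub] section and Airflow's usual AIRFLOW__SECTION__KEY environment-variable mapping (key names should be checked against the plugin version in use); a minimal sketch of switching the integration off via environment variables, e.g. in a docker-compose or Helm values file:
environment:
  AIRFLOW__LINEAGE__BACKEND: ""        # empty value disables the lineage backend
  AIRFLOW__DATAHUB__ENABLED: "false"   # assumption: the plugin honours an 'enabled' flag in its [datahub] section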
limited-breakfast-31442
09/29/2022, 3:55 AM
limited-breakfast-31442
09/29/2022, 3:57 AM
Traceback (most recent call last):
File "C:\Users\user\anaconda3\lib\site-packages\datahub\cli\ingest_cli.py", line 197, in run
pipeline = Pipeline.create(
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 317, in create
return cls(
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 160, in __init__
self._record_initialization_failure(e, "Failed to set up framework context")
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 129, in _record_initialization_failure
raise PipelineInitError(msg) from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to set up framework context
[2022-09-29 11:48:37,229] ERROR {datahub.entrypoints:195} - Command failed:
Failed to set up framework context due to
'Failed to connect to DataHub' due to
'HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001D222581AC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))'.
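For context on the error above: when no sink/server is configured, the CLI typically falls back to http://localhost:8080, so "connection refused" usually just means nothing is listening there. A hedged sketch of pointing the recipe at the real GMS address (host is a placeholder):
sink:
  type: datahub-rest
  config:
    server: "http://<gms-host>:8080"     # placeholder -- wherever datahub-gms is actually reachable
    # token: "<personal-access-token>"   # only needed if metadata service authentication is enabled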
better-insurance-34701
09/29/2022, 4:19 AM
datahub --debug ingest -c /git/dwh_dev/datahub.yml
[2022-09-29 11:13:51,202] DEBUG {datahub.telemetry.telemetry:210} - Sending init Telemetry
[2022-09-29 11:13:52,261] DEBUG {datahub.telemetry.telemetry:243} - Sending Telemetry
[2022-09-29 11:13:52,726] INFO {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.8.45
[2022-09-29 11:13:52,746] DEBUG {datahub.cli.ingest_cli:196} - Using config: {'source': {'type': 'dbt', 'config': {'manifest_path': '/git/dwh_dev/target/manifest.json', 'catalog_path': '/git/dwh_dev/target/catalog.json', 'test_results_path': '/git/dwh_dev/target/run_results.json', 'target_platform': 'postgres', 'load_schemas': False, 'meta_mapping': {'business_owner': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'BUSINESS_OWNER'}}, 'data_steward': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'DATA_STEWARD'}}, 'technical_owner': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'TECHNICAL_OWNER'}}, 'has_pii': {'match': True, 'operation': 'add_tag', 'config': {'tag': 'has_pii'}}, 'data_governance.team_owner': {'match': 'Finance', 'operation': 'add_term', 'config': {'term': 'Finance_test'}}, 'source': {'match': '.*', 'operation': 'add_tag', 'config': {'tag': '{{ $match }}'}}}, 'query_tag_mapping': {'tag': {'match': '.*', 'operation': 'add_tag', 'config': {'tag': '{{ $match }}'}}}}}}
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.sink.datahub_rest:116} - Setting env variables to override config
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.sink.datahub_rest:118} - Setting gms config
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.run.pipeline:174} - Sink type:datahub-rest,<class 'datahub.ingestion.sink.datahub_rest.DatahubRestSink'> configured
[2022-09-29 11:13:52,814] INFO {datahub.ingestion.run.pipeline:175} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.sink.datahub_rest:116} - Setting env variables to override config
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.sink.datahub_rest:118} - Setting gms config
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.reporting.datahub_ingestion_run_summary_provider:120} - Ingestion source urn = urn:li:dataHubIngestionSource:cli-151c2b7711eb626e440af8c75a9082e9
[2022-09-29 11:13:52,819] DEBUG {datahub.emitter.rest_emitter:247} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.28.1' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' --data '{"proposal": {"entityType": "dataHubIngestionSource", "entityUrn": "urn:li:dataHubIngestionSource:cli-151c2b7711eb626e440af8c75a9082e9", "changeType": "UPSERT", "aspectName": "dataHubIngestionSourceInfo", "aspect": {"value": "{\"name\": \"[CLI] dbt\", \"type\": \"dbt\", \"platform\": \"urn:li:dataPlatform:unknown\", \"config\": {\"recipe\": \"{\\\"source\\\": {\\\"type\\\": \\\"dbt\\\", \\\"config\\\": {\\\"manifest_path\\\": \\\"${DBT_PROJECT_ROOT}/target/manifest.json\\\", \\\"catalog_path\\\": \\\"${DBT_PROJECT_ROOT}/target/catalog.json\\\", \\\"test_results_path\\\": \\\"${DBT_PROJECT_ROOT}/target/run_results.json\\\", \\\"target_platform\\\": \\\"postgres\\\", \\\"load_schemas\\\": false, \\\"meta_mapping\\\": {\\\"business_owner\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"BUSINESS_OWNER\\\"}}, \\\"data_steward\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"DATA_STEWARD\\\"}}, \\\"technical_owner\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"TECHNICAL_OWNER\\\"}}, \\\"has_pii\\\": {\\\"match\\\": true, \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"has_pii\\\"}}, \\\"data_governance.team_owner\\\": {\\\"match\\\": \\\"Finance\\\", \\\"operation\\\": \\\"add_term\\\", \\\"config\\\": {\\\"term\\\": \\\"Finance_test\\\"}}, \\\"source\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"{{ $match }}\\\"}}}, \\\"query_tag_mapping\\\": {\\\"tag\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"{{ $match }}\\\"}}}}}}\", \"version\": \"0.8.45\", \"executorId\": \"__datahub_cli_\"}}", "contentType": "application/json"}}}' '<http://localhost:8080/aspects?action=ingestProposal>'
[2022-09-29 11:13:52,849] DEBUG {datahub.ingestion.run.pipeline:269} - Reporter type:datahub,<class 'datahub.ingestion.reporting.datahub_ingestion_run_summary_provider.DatahubIngestionRunSummaryProvider'> configured.
[2022-09-29 11:13:52,982] DEBUG {datahub.telemetry.telemetry:243} - Sending Telemetry
[2022-09-29 11:13:53,555] DEBUG {datahub.entrypoints:168} - File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 196, in __init__
131 def __init__(
132 self,
133 config: PipelineConfig,
134 dry_run: bool = False,
135 preview_mode: bool = False,
136 preview_workunits: int = 10,
137 report_to: Optional[str] = None,
138 no_default_report: bool = False,
139 ):
(...)
192 self._record_initialization_failure(e, "Failed to create source")
193 return
194
195 try:
--> 196 self.source: Source = source_class.create(
197 self.config.source.dict().get("config", {}), self.ctx
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/source/dbt.py", line 1001, in create
999 @classmethod
1000 def create(cls, config_dict, ctx):
--> 1001 config = DBTConfig.parse_obj(config_dict)
1002 return cls(config, ctx, "dbt")
File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
ValidationError: 1 validation error for DBTConfig
load_schemas
extra fields not permitted (type=value_error.extra)
The above exception was the direct cause of the following exception:
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run
111 def run(
112 ctx: click.Context,
113 config: str,
114 dry_run: bool,
115 preview: bool,
116 strict_warnings: bool,
117 preview_workunits: int,
118 suppress_error_logs: bool,
119 test_source_connection: bool,
120 report_to: str,
121 no_default_report: bool,
122 no_spinner: bool,
123 ) -> None:
(...)
193 _test_source_connection(report_to, pipeline_config)
194
195 try:
196 logger.debug(f"Using config: {pipeline_config}")
--> 197 pipeline = Pipeline.create(
198 pipeline_config,
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create
306 def create(
307 cls,
308 config_dict: dict,
309 dry_run: bool = False,
310 preview_mode: bool = False,
311 preview_workunits: int = 10,
312 report_to: Optional[str] = None,
313 no_default_report: bool = False,
314 raw_config: Optional[dict] = None,
315 ) -> "Pipeline":
316 config = PipelineConfig.from_dict(config_dict, raw_config)
--> 317 return cls(
318 config,
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__
131 def __init__(
132 self,
133 config: PipelineConfig,
134 dry_run: bool = False,
135 preview_mode: bool = False,
136 preview_workunits: int = 10,
137 report_to: Optional[str] = None,
138 no_default_report: bool = False,
139 ):
(...)
198 )
199 logger.debug(f"Source type:{source_type},{source_class} configured")
200 logger.info("Source configured successfully.")
201 except Exception as e:
--> 202 self._record_initialization_failure(
203 e, f"Failed to configure source ({source_type})"
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure
128 def _record_initialization_failure(self, e: Exception, msg: str) -> None:
--> 129 raise PipelineInitError(msg) from e
---- (full traceback above) ----
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run
pipeline = Pipeline.create(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create
return cls(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__
self._record_initialization_failure(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure
raise PipelineInitError(msg) from e
PipelineInitError: Failed to configure source (dbt)
[2022-09-29 11:13:53,555] DEBUG {datahub.entrypoints:198} - DataHub CLI version: 0.8.45 at /home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/__init__.py
[2022-09-29 11:13:53,556] DEBUG {datahub.entrypoints:201} - Python version: 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] at /home/thinh/datahub_venv/bin/python3 on Linux-5.15.0-48-generic-x86_64-with-glibc2.35
[2022-09-29 11:13:53,556] DEBUG {datahub.entrypoints:204} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'linkedin/datahub': {'version': 'v0.8.45', 'commit': '21a8718b1093352bc1e3a566d2ce0297d2167434'}}, 'managedIngestion': {'defaultCliVersion': '0.8.42', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
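The root cause is in the ValidationError above: DBTConfig on CLI 0.8.45 rejects load_schemas as an extra field. A hedged sketch of the same recipe with that key dropped (values copied from the log; meta_mapping/query_tag_mapping left out for brevity):
source:
  type: dbt
  config:
    manifest_path: "/git/dwh_dev/target/manifest.json"
    catalog_path: "/git/dwh_dev/target/catalog.json"
    test_results_path: "/git/dwh_dev/target/run_results.json"
    target_platform: postgres
    # load_schemas: false   # removed -- this CLI version rejects it ("extra fields not permitted")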
crooked-holiday-47153
09/29/2022, 8:53 AM
source:
  type: snowflake
  config:
    ...
    schema_pattern:
      allow:
        - ^SANDBOX_DB\.P_USERA$
    database_pattern:
      allow:
        - ^SANDBOX_DB$
    table_pattern:
      allow:
        - ^SANDBOX_DB\.P_USERA\.LAB_STRUCTURE$
    ...
I used the same config for snowflake-usage ingestion as well.
Both of them finish successfully, but the table doesn't show up when searching for it in the catalog.
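A hedged debugging sketch: depending on the connector version, schema_pattern may be matched against the bare schema name rather than DB.SCHEMA, so loosening the patterns and re-running is a quick way to tell whether the filters are what is hiding the table (tighten them again once it appears):
source:
  type: snowflake
  config:
    ...
    database_pattern:
      allow:
        - ^SANDBOX_DB$
    schema_pattern:
      allow:
        - .*P_USERA.*          # deliberately permissive while debugging
    table_pattern:
      allow:
        - .*LAB_STRUCTURE.*    # deliberately permissive while debugging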
ancient-policeman-73437
09/29/2022, 9:14 AM
ancient-policeman-73437
09/29/2022, 9:15 AM
The csv-enricher documentation says that it is possible to use a CSV to join descriptions, tags and so on at the field level, but it doesn't describe how to do that. Could you give more information about it, please? For example, how should the URN of a field look?
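A hedged sketch of how field-level enrichment is usually addressed: the CSV's resource column holds the dataset URN and the subresource column holds the field path, which together correspond to a schemaField URN of the form urn:li:schemaField:(<dataset urn>,<fieldPath>); the column names below are per the csv-enricher docs and may differ by CLI version:
source:
  type: csv-enricher
  config:
    filename: ./enrich.csv      # assumption: a local CSV with resource/subresource/... columns
    write_semantics: PATCH      # merge with existing metadata rather than overriding it
# an illustrative row:
#   resource    = urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.myschema.mytable,PROD)
#   subresource = my_column_name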
nice-helmet-40615
09/29/2022, 10:54 AM
gifted-diamond-19544
09/29/2022, 11:18 AM
alert-fall-82501
09/29/2022, 1:10 PM
chilly-truck-63841
09/29/2022, 2:56 PM
One table is being ingested as a table entity and pulling in the schema/stats/etc. as expected, while other tables in the same DB/schema are being ingested as dataset entities and are then missing the schema and other metadata -- any ideas as to what may be causing this?
creamy-pizza-80433
09/30/2022, 2:30 AM
Is there a way to set up the path specs so that datahub ingests the files to the file > table_a > a.csv format instead of file > staging > table_a > a.csv?
I've tried running datahub ingest in the staging folder with the path spec */*.csv, but to no avail.
Thank you.
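A hedged sketch, assuming the s3/data-lake style source is in use: the {table} placeholder in a path_spec decides which directory level is treated as the table, so placing it after the fixed staging prefix is the usual knob here (bucket name and exact include string are placeholders; whether staging still appears in the browse path depends on the source version):
source:
  type: s3          # or the local data-lake source, depending on where the CSVs live
  config:
    path_specs:
      - include: "s3://my-bucket/staging/{table}/*.csv"   # table_a becomes the table; staging is just a fixed prefix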
careful-action-61962
09/30/2022, 8:49 AM
careful-action-61962
09/30/2022, 8:53 AM
able-controller-81727
09/30/2022, 11:05 AM
orange-flag-48535
09/30/2022, 11:23 AM
DataMap dataMap = new DataMap();
dataMap.put("entityType", "dataset");
dataMap.put("entityUrn", "urn:li:dataset:(urn:li:dataPlatform:myDp,myDB.myTable,DEV)");
dataMap.put("changeType", "UPSERT");
dataMap.put("aspectName", "status");
DataMap aspectMap = new DataMap();
DataMap aspectValueMap = new DataMap();
aspectValueMap.put("removed", true);
aspectMap.put("value", aspectValueMap);
aspectMap.put("contentType", "application/json");
dataMap.put("aspect", aspectMap);
MetadataChangeProposal mcp = new MetadataChangeProposal(dataMap);
RestEmitter client = RestEmitter.create(b -> b.server("http://localhost:8080"));
client.emit(mcp, callback);
My alternative would be to fall back to the OpenAPI endpoint (https://demo.datahubproject.io/openapi/swagger-ui/index.html#/Entities/deleteEntities), but I'd rather use RestEmitter and avoid doing raw HTTP myself. Thanks.
alert-fall-82501
09/30/2022, 12:18 PM
alert-fall-82501
09/30/2022, 12:21 PM
[5:49 PM] 'Failed to connect to DataHub' due to
[2022-09-30, 09:12:06 UTC] {{subprocess.py:89}} INFO - 'HTTPSConnectionPool(host='datahub-gms.amer-prod.xxx.com', port=8080): Max retries exceeded with url: /config (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f7342c77990>, 'Connection to datahub-gms.amer-prod.xxx.com timed out. (connect timeout=30)'))'
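A hedged note on the log above: the client is attempting HTTPS on port 8080, while datahub-gms in a Kubernetes deployment is normally plain HTTP on 8080 (HTTPS usually only via an ingress on 443), and a connect timeout can equally mean the worker simply cannot reach that host/port. The value worth double-checking, sketched as a recipe sink (host copied from the log; scheme and port are the things to verify):
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.amer-prod.xxx.com:8080"   # or the HTTPS ingress URL without :8080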
stocky-truck-96371
09/30/2022, 1:37 PM
ancient-policeman-73437
09/30/2022, 2:21 PM
best-sunset-26241
09/30/2022, 11:55 PM
sparse-coat-12944
10/01/2022, 5:25 AM
limited-forest-73733
09/30/2022, 4:33 PM
green-honey-91903
10/02/2022, 7:11 PM
SQL compilation error: syntax error line 1 at position 29 unexpected 'START'. syntax error line 1 at position 28 unexpected '('.
SELECT APPROX_COUNT_DISTINCT(START)
FROM "SURVEYMONKEY"."FIVETRAN_AUDIT"
I believe the issue is that START is a column in the FIVETRAN_AUDIT tables and START is also a reserved Snowflake keyword. The solution at the query level is to wrap the reserved keyword in quotes: APPROX_COUNT_DISTINCT("START").
Since datahub is executing these queries, should this be fixed within datahub? The tables themselves are Fivetran service tables, so I believe there's no ability to map/rename these columns. Has anyone else run into this? I'm on the latest datahub via a helm/k8s deployment.
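Quoting the START column in the generated profiling SQL would indeed have to happen inside DataHub itself; as a hedged stop-gap, profiling for just the affected Fivetran audit tables can usually be excluded so the rest of the ingestion keeps working (the pattern key and its matching semantics should be checked against the snowflake source docs for the CLI version in use):
source:
  type: snowflake
  config:
    ...
    profiling:
      enabled: true
    profile_pattern:
      deny:
        - .*\.FIVETRAN_AUDIT$   # skip profiling for tables named FIVETRAN_AUDIT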
limited-forest-73733
10/03/2022, 11:15 AM
limited-forest-73733
10/03/2022, 11:19 AM
steep-airplane-60304
10/03/2022, 11:19 AM
My config file looks like this.
source:
  type: "postgres"
  config:
    # Coordinates
    host_port: "localhost:5435"
    database: "srcdb"
    # Credentials
    username: "source"
    password: "gsRABSy6xvWsSTE3"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
I am getting the following error: