chilly-ability-77706
09/28/2022, 7:19 PM
future-animal-95178
09/28/2022, 7:56 PM
wonderful-notebook-20086
09/28/2022, 10:50 PM
refined-energy-76018
09/29/2022, 3:35 AM
AIRFLOW__LINEAGE__BACKEND can be set to an empty string to 'disable' the integration. When I set AIRFLOW__DATAHUB__CONN_ID to an empty string or remove the [datahub] configuration from airflow.cfg entirely, I still get the error messages in the task logs, which is noisy.
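A hedged aside, assuming the acryl-datahub-airflow-plugin reads a [datahub] section and Airflow's usual AIRFLOW__SECTION__KEY environment-variable mapping (key names should be checked against the plugin version in use); a minimal sketch of switching the integration off via environment variables, e.g. in a docker-compose or Helm values file:
environment:
  AIRFLOW__LINEAGE__BACKEND: ""        # empty value disables the lineage backend
  AIRFLOW__DATAHUB__ENABLED: "false"   # assumption: the plugin honours an 'enabled' flag in its [datahub] section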
limited-breakfast-31442
09/29/2022, 3:55 AM
limited-breakfast-31442
09/29/2022, 3:57 AM
Traceback (most recent call last):
File "C:\Users\user\anaconda3\lib\site-packages\datahub\cli\ingest_cli.py", line 197, in run
pipeline = Pipeline.create(
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 317, in create
return cls(
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 160, in __init__
self._record_initialization_failure(e, "Failed to set up framework context")
File "C:\Users\user\anaconda3\lib\site-packages\datahub\ingestion\run\pipeline.py", line 129, in _record_initialization_failure
raise PipelineInitError(msg) from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to set up framework context
[2022-09-29 11:48:37,229] ERROR {datahub.entrypoints:195} - Command failed:
Failed to set up framework context due to
'Failed to connect to DataHub' due to
'HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001D222581AC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))'.
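For context on the error above: when no sink/server is configured, the CLI typically falls back to http://localhost:8080, so "connection refused" usually just means nothing is listening there. A hedged sketch of pointing the recipe at the real GMS address (host is a placeholder):
sink:
  type: datahub-rest
  config:
    server: "http://<gms-host>:8080"     # placeholder -- wherever datahub-gms is actually reachable
    # token: "<personal-access-token>"   # only needed if metadata service authentication is enabled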
better-insurance-34701
09/29/2022, 4:19 AM
datahub --debug ingest -c /git/dwh_dev/datahub.yml
[2022-09-29 11:13:51,202] DEBUG {datahub.telemetry.telemetry:210} - Sending init Telemetry
[2022-09-29 11:13:52,261] DEBUG {datahub.telemetry.telemetry:243} - Sending Telemetry
[2022-09-29 11:13:52,726] INFO {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.8.45
[2022-09-29 11:13:52,746] DEBUG {datahub.cli.ingest_cli:196} - Using config: {'source': {'type': 'dbt', 'config': {'manifest_path': '/git/dwh_dev/target/manifest.json', 'catalog_path': '/git/dwh_dev/target/catalog.json', 'test_results_path': '/git/dwh_dev/target/run_results.json', 'target_platform': 'postgres', 'load_schemas': False, 'meta_mapping': {'business_owner': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'BUSINESS_OWNER'}}, 'data_steward': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'DATA_STEWARD'}}, 'technical_owner': {'match': '.*', 'operation': 'add_owner', 'config': {'owner_type': 'user', 'owner_category': 'TECHNICAL_OWNER'}}, 'has_pii': {'match': True, 'operation': 'add_tag', 'config': {'tag': 'has_pii'}}, 'data_governance.team_owner': {'match': 'Finance', 'operation': 'add_term', 'config': {'term': 'Finance_test'}}, 'source': {'match': '.*', 'operation': 'add_tag', 'config': {'tag': '{{ $match }}'}}}, 'query_tag_mapping': {'tag': {'match': '.*', 'operation': 'add_tag', 'config': {'tag': '{{ $match }}'}}}}}}
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.sink.datahub_rest:116} - Setting env variables to override config
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.sink.datahub_rest:118} - Setting gms config
[2022-09-29 11:13:52,814] DEBUG {datahub.ingestion.run.pipeline:174} - Sink type:datahub-rest,<class 'datahub.ingestion.sink.datahub_rest.DatahubRestSink'> configured
[2022-09-29 11:13:52,814] INFO {datahub.ingestion.run.pipeline:175} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.sink.datahub_rest:116} - Setting env variables to override config
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.sink.datahub_rest:118} - Setting gms config
[2022-09-29 11:13:52,818] DEBUG {datahub.ingestion.reporting.datahub_ingestion_run_summary_provider:120} - Ingestion source urn = urn:li:dataHubIngestionSource:cli-151c2b7711eb626e440af8c75a9082e9
[2022-09-29 11:13:52,819] DEBUG {datahub.emitter.rest_emitter:247} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.28.1' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' --data '{"proposal": {"entityType": "dataHubIngestionSource", "entityUrn": "urn:li:dataHubIngestionSource:cli-151c2b7711eb626e440af8c75a9082e9", "changeType": "UPSERT", "aspectName": "dataHubIngestionSourceInfo", "aspect": {"value": "{\"name\": \"[CLI] dbt\", \"type\": \"dbt\", \"platform\": \"urn:li:dataPlatform:unknown\", \"config\": {\"recipe\": \"{\\\"source\\\": {\\\"type\\\": \\\"dbt\\\", \\\"config\\\": {\\\"manifest_path\\\": \\\"${DBT_PROJECT_ROOT}/target/manifest.json\\\", \\\"catalog_path\\\": \\\"${DBT_PROJECT_ROOT}/target/catalog.json\\\", \\\"test_results_path\\\": \\\"${DBT_PROJECT_ROOT}/target/run_results.json\\\", \\\"target_platform\\\": \\\"postgres\\\", \\\"load_schemas\\\": false, \\\"meta_mapping\\\": {\\\"business_owner\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"BUSINESS_OWNER\\\"}}, \\\"data_steward\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"DATA_STEWARD\\\"}}, \\\"technical_owner\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_owner\\\", \\\"config\\\": {\\\"owner_type\\\": \\\"user\\\", \\\"owner_category\\\": \\\"TECHNICAL_OWNER\\\"}}, \\\"has_pii\\\": {\\\"match\\\": true, \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"has_pii\\\"}}, \\\"data_governance.team_owner\\\": {\\\"match\\\": \\\"Finance\\\", \\\"operation\\\": \\\"add_term\\\", \\\"config\\\": {\\\"term\\\": \\\"Finance_test\\\"}}, \\\"source\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"{{ $match }}\\\"}}}, \\\"query_tag_mapping\\\": {\\\"tag\\\": {\\\"match\\\": \\\".*\\\", \\\"operation\\\": \\\"add_tag\\\", \\\"config\\\": {\\\"tag\\\": \\\"{{ $match }}\\\"}}}}}}\", \"version\": \"0.8.45\", \"executorId\": \"__datahub_cli_\"}}", "contentType": "application/json"}}}' '<http://localhost:8080/aspects?action=ingestProposal>'
[2022-09-29 11:13:52,849] DEBUG {datahub.ingestion.run.pipeline:269} - Reporter type:datahub,<class 'datahub.ingestion.reporting.datahub_ingestion_run_summary_provider.DatahubIngestionRunSummaryProvider'> configured.
[2022-09-29 11:13:52,982] DEBUG {datahub.telemetry.telemetry:243} - Sending Telemetry
[2022-09-29 11:13:53,555] DEBUG {datahub.entrypoints:168} - File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 196, in __init__
131 def __init__(
132 self,
133 config: PipelineConfig,
134 dry_run: bool = False,
135 preview_mode: bool = False,
136 preview_workunits: int = 10,
137 report_to: Optional[str] = None,
138 no_default_report: bool = False,
139 ):
(...)
192 self._record_initialization_failure(e, "Failed to create source")
193 return
194
195 try:
--> 196 self.source: Source = source_class.create(
197 self.config.source.dict().get("config", {}), self.ctx
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/source/dbt.py", line 1001, in create
999 @classmethod
1000 def create(cls, config_dict, ctx):
--> 1001 config = DBTConfig.parse_obj(config_dict)
1002 return cls(config, ctx, "dbt")
File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
ValidationError: 1 validation error for DBTConfig
load_schemas
extra fields not permitted (type=value_error.extra)
The above exception was the direct cause of the following exception:
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run
111 def run(
112 ctx: click.Context,
113 config: str,
114 dry_run: bool,
115 preview: bool,
116 strict_warnings: bool,
117 preview_workunits: int,
118 suppress_error_logs: bool,
119 test_source_connection: bool,
120 report_to: str,
121 no_default_report: bool,
122 no_spinner: bool,
123 ) -> None:
(...)
193 _test_source_connection(report_to, pipeline_config)
194
195 try:
196 logger.debug(f"Using config: {pipeline_config}")
--> 197 pipeline = Pipeline.create(
198 pipeline_config,
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create
306 def create(
307 cls,
308 config_dict: dict,
309 dry_run: bool = False,
310 preview_mode: bool = False,
311 preview_workunits: int = 10,
312 report_to: Optional[str] = None,
313 no_default_report: bool = False,
314 raw_config: Optional[dict] = None,
315 ) -> "Pipeline":
316 config = PipelineConfig.from_dict(config_dict, raw_config)
--> 317 return cls(
318 config,
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__
131 def __init__(
132 self,
133 config: PipelineConfig,
134 dry_run: bool = False,
135 preview_mode: bool = False,
136 preview_workunits: int = 10,
137 report_to: Optional[str] = None,
138 no_default_report: bool = False,
139 ):
(...)
198 )
199 logger.debug(f"Source type:{source_type},{source_class} configured")
200 logger.info("Source configured successfully.")
201 except Exception as e:
--> 202 self._record_initialization_failure(
203 e, f"Failed to configure source ({source_type})"
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure
128 def _record_initialization_failure(self, e: Exception, msg: str) -> None:
--> 129 raise PipelineInitError(msg) from e
---- (full traceback above) ----
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run
pipeline = Pipeline.create(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create
return cls(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__
self._record_initialization_failure(
File "/home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure
raise PipelineInitError(msg) from e
PipelineInitError: Failed to configure source (dbt)
[2022-09-29 11:13:53,555] DEBUG {datahub.entrypoints:198} - DataHub CLI version: 0.8.45 at /home/thinh/datahub_venv/lib/python3.10/site-packages/datahub/__init__.py
[2022-09-29 11:13:53,556] DEBUG {datahub.entrypoints:201} - Python version: 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] at /home/thinh/datahub_venv/bin/python3 on Linux-5.15.0-48-generic-x86_64-with-glibc2.35
[2022-09-29 11:13:53,556] DEBUG {datahub.entrypoints:204} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'linkedin/datahub': {'version': 'v0.8.45', 'commit': '21a8718b1093352bc1e3a566d2ce0297d2167434'}}, 'managedIngestion': {'defaultCliVersion': '0.8.42', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
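The root cause is in the ValidationError above: DBTConfig on CLI 0.8.45 rejects load_schemas as an extra field. A hedged sketch of the same recipe with that key dropped (values copied from the log; meta_mapping/query_tag_mapping left out for brevity):
source:
  type: dbt
  config:
    manifest_path: "/git/dwh_dev/target/manifest.json"
    catalog_path: "/git/dwh_dev/target/catalog.json"
    test_results_path: "/git/dwh_dev/target/run_results.json"
    target_platform: postgres
    # load_schemas: false   # removed -- this CLI version rejects it ("extra fields not permitted")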
crooked-holiday-47153
09/29/2022, 8:53 AM
source:
  type: snowflake
  config:
    ...
    schema_pattern:
      allow:
        - ^SANDBOX_DB\.P_USERA$
    database_pattern:
      allow:
        - ^SANDBOX_DB$
    table_pattern:
      allow:
        - ^SANDBOX_DB\.P_USERA\.LAB_STRUCTURE$
    ...
I used the same config for snowflake-usage ingestion as well.
Both of them finish successfully, but the table doesn't show up when searching for it in the catalog.
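A hedged debugging sketch: depending on the connector version, schema_pattern may be matched against the bare schema name rather than DB.SCHEMA, so loosening the patterns and re-running is a quick way to tell whether the filters are what is hiding the table (tighten them again once it appears):
source:
  type: snowflake
  config:
    ...
    database_pattern:
      allow:
        - ^SANDBOX_DB$
    schema_pattern:
      allow:
        - .*P_USERA.*          # deliberately permissive while debugging
    table_pattern:
      allow:
        - .*LAB_STRUCTURE.*    # deliberately permissive while debugging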
ancient-policeman-73437
09/29/2022, 9:14 AM
ancient-policeman-73437
09/29/2022, 9:15 AM
The csv-enricher documentation says that it is possible to use a CSV to join descriptions, tags and so on at the field level, but it doesn't describe how to do that. Could you give more information about it, please? For example, how should the URN of a field look?
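A hedged sketch of how field-level enrichment is usually addressed: the CSV's resource column holds the dataset URN and the subresource column holds the field path, which together correspond to a schemaField URN of the form urn:li:schemaField:(<dataset urn>,<fieldPath>); the column names below are per the csv-enricher docs and may differ by CLI version:
source:
  type: csv-enricher
  config:
    filename: ./enrich.csv      # assumption: a local CSV with resource/subresource/... columns
    write_semantics: PATCH      # merge with existing metadata rather than overriding it
# an illustrative row:
#   resource    = urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.myschema.mytable,PROD)
#   subresource = my_column_name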
nice-helmet-40615
09/29/2022, 10:54 AM
gifted-diamond-19544
09/29/2022, 11:18 AM
alert-fall-82501
09/29/2022, 1:10 PM
chilly-truck-63841
09/29/2022, 2:56 PM
One table is being ingested as a table entity and pulling in the schema/stats/etc. as expected, while other tables in the same DB/schema are being ingested as dataset entities and are then missing the schema and other metadata -- any ideas as to what may be causing this?
creamy-pizza-80433
09/30/2022, 2:30 AM
Is there a way to set up the path specs so that datahub ingests the files to the file > table_a > a.csv format instead of file > staging > table_a > a.csv?
I've tried running datahub ingest in the staging folder with the path spec */*.csv, but to no avail.
Thank you.
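A hedged sketch, assuming the s3/data-lake style source is in use: the {table} placeholder in a path_spec decides which directory level is treated as the table, so placing it after the fixed staging prefix is the usual knob here (bucket name and exact include string are placeholders; whether staging still appears in the browse path depends on the source version):
source:
  type: s3          # or the local data-lake source, depending on where the CSVs live
  config:
    path_specs:
      - include: "s3://my-bucket/staging/{table}/*.csv"   # table_a becomes the table; staging is just a fixed prefix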
careful-action-61962
09/30/2022, 8:49 AM
careful-action-61962
09/30/2022, 8:53 AM
able-controller-81727
09/30/2022, 11:05 AM
orange-flag-48535
09/30/2022, 11:23 AM
DataMap dataMap = new DataMap();
dataMap.put("entityType", "dataset");
dataMap.put("entityUrn", "urn:li:dataset:(urn:li:dataPlatform:myDp,myDB.myTable,DEV)");
dataMap.put("changeType", "UPSERT");
dataMap.put("aspectName", "status");
DataMap aspectMap = new DataMap();
DataMap aspectValueMap = new DataMap();
aspectValueMap.put("removed", true);
aspectMap.put("value", aspectValueMap);
aspectMap.put("contentType", "application/json");
dataMap.put("aspect", aspectMap);
MetadataChangeProposal mcp = new MetadataChangeProposal(dataMap);
RestEmitter client = RestEmitter.create(b -> b.server("http://localhost:8080"));
client.emit(mcp, callback);
My alternative would be to fall back to the OpenAPI endpoint (https://demo.datahubproject.io/openapi/swagger-ui/index.html#/Entities/deleteEntities), but I'd rather use RestEmitter and avoid doing raw HTTP myself. Thanks.
alert-fall-82501
09/30/2022, 12:18 PM
alert-fall-82501
09/30/2022, 12:21 PM
[5:49 PM] 'Failed to connect to DataHub' due to
[2022-09-30, 09:12:06 UTC] {{subprocess.py:89}} INFO - 'HTTPSConnectionPool(host='datahub-gms.amer-prod.xxx.com', port=8080): Max retries exceeded with url: /config (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f7342c77990>, 'Connection to datahub-gms.amer-prod.xxx.com timed out. (connect timeout=30)'))'
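A hedged note on the log above: the client is attempting HTTPS on port 8080, while datahub-gms in a Kubernetes deployment is normally plain HTTP on 8080 (HTTPS usually only via an ingress on 443), and a connect timeout can equally mean the worker simply cannot reach that host/port. The value worth double-checking, sketched as a recipe sink (host copied from the log; scheme and port are the things to verify):
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.amer-prod.xxx.com:8080"   # or the HTTPS ingress URL without :8080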
stocky-truck-96371
09/30/2022, 1:37 PM
ancient-policeman-73437
09/30/2022, 2:21 PM
best-sunset-26241
09/30/2022, 11:55 PM
sparse-coat-12944
10/01/2022, 5:25 AM
limited-forest-73733
09/30/2022, 4:33 PM
green-honey-91903
10/02/2022, 7:11 PM
SQL compilation error: syntax error line 1 at position 29 unexpected 'START'. syntax error line 1 at position 28 unexpected '('.
SELECT APPROX_COUNT_DISTINCT(START)
FROM "SURVEYMONKEY"."FIVETRAN_AUDIT"
I believe the issue is that START is a column in the FIVETRAN_AUDIT tables and START is also a reserved Snowflake keyword. The solution at the query level is to wrap the reserved keyword in quotes: APPROX_COUNT_DISTINCT("START").
Since datahub is executing these queries, should this be fixed within datahub? The tables themselves are Fivetran service tables, so I believe there's no ability to map/rename these columns. Has anyone else run into this? I'm on the latest datahub via a helm/k8s deployment.
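Quoting the START column in the generated profiling SQL would indeed have to happen inside DataHub itself; as a hedged stop-gap, profiling for just the affected Fivetran audit tables can usually be excluded so the rest of the ingestion keeps working (the pattern key and its matching semantics should be checked against the snowflake source docs for the CLI version in use):
source:
  type: snowflake
  config:
    ...
    profiling:
      enabled: true
    profile_pattern:
      deny:
        - .*\.FIVETRAN_AUDIT$   # skip profiling for tables named FIVETRAN_AUDIT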
limited-forest-73733
10/03/2022, 11:15 AM
limited-forest-73733
10/03/2022, 11:19 AM
steep-airplane-60304
10/03/2022, 11:19 AM
My config file looks like this.
source:
  type: "postgres"
  config:
    # Coordinates
    host_port: "localhost:5435"
    database: "srcdb"
    # Credentials
    username: "source"
    password: "gsRABSy6xvWsSTE3"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
I am getting the following error: