few-grass-66826
08/10/2022, 2:43 PM
jolly-balloon-85466
08/10/2022, 3:51 PM
"2022-08-10 15:00:55.500831 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Failed to execute 'datahub ingest'",
1072
'2022-08-10 15:00:55.506885 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Caught exception EXECUTING '
1071
'task_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
1070
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
1069
' self.event_loop.run_until_complete(task_future)\n'
1068
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
1067
' return f.result()\n'
1066
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
1065
' raise self._exception\n'
1064
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
1063
' result = coro.send(None)\n'
1062
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
1061
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
1060
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
jolly-balloon-85466
08/10/2022, 3:53 PM
'failures': {'lineage-gcp-logs': ["Error was 'datasetId'"]},
jolly-balloon-85466
08/10/2022, 3:53 PM
rapid-house-76230
08/10/2022, 10:22 PM
microscopic-mechanic-13766
08/11/2022, 7:49 AM
source:
  type: postgres
  config:
    host_port: 'postgresql:5432'
    database: <db>
    username: <usr>
    password: <psswd>
    include_tables: true
    include_views: true
    profiling:
      enabled: True
      max_workers: 20
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
But after the source is created, it looks like this:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
source:
  type: postgres
  config:
    include_tables: true
    database: <db>
    profiling:
      max_workers: 20
      enabled: true
    host_port: 'postgresql:5432'
    include_views: true
    username: <usr>
    password: <psswd>
pipeline_name: 'urn:li:dataHubIngestionSource:7c70f090-79cc-432a-a757-7c01b8c091b9'
Is this intended? If so, may I know why the structure changes?
Thanks in advance!!
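One plausible explanation (an assumption, not something confirmed in this thread): the UI parses the recipe into a plain mapping and re-serializes it when the ingestion source is saved, and YAML mappings are unordered, so the reordered recipe is semantically identical to the original. The extra pipeline_name key is just the URN of the ingestion source that was created. A tiny Python illustration of that round trip:
# Round-tripping a recipe through a plain dict keeps the content but not the
# key order; the recipe below is a trimmed-down placeholder.
import yaml

original = """
source:
  type: postgres
  config:
    host_port: 'postgresql:5432'
    include_tables: true
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
"""

parsed = yaml.safe_load(original)
round_tripped = yaml.dump(parsed)                 # keys come back in a different order
print(round_tripped)
print(parsed == yaml.safe_load(round_tripped))    # True: same content either way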
alert-fall-82501
08/11/2022, 8:22 AM
alert-fall-82501
08/11/2022, 8:22 AM
alert-fall-82501
08/11/2022, 8:22 AM
raise KeyError(f"Did not find a registered class for {key}")
KeyError: 'Did not find a registered class for s3'
famous-florist-7218
08/11/2022, 9:18 AM
echoing-farmer-38304
08/11/2022, 1:10 PM
source:
  type: "delta-lake"
  config:
    base_path: "s3://my-bucket/my-folder/sales-table"
    s3:
      aws_config:
I tried to use it with MinIO credentials, but it doesn't see my data (the run succeeds but loads nothing). In my opinion this happens because of trouble with base_path.
Is there any way to use delta lake ingestion with MinIO credentials? And if there is no ready solution and we want this feature, should we implement it as an additional module, or can we just make some changes to the current module so it can use both AWS and MinIO?
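For what it's worth, a minimal sketch (not a confirmed solution) of pointing the delta-lake source at MinIO by setting aws_endpoint_url in aws_config, run through the Python Pipeline API. The same endpoint field appears in another recipe later in this thread, but the endpoint, keys, bucket, and GMS URL below are placeholders, and whether the source fully supports MinIO is exactly the open question here:
# Sketch only: all values are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "delta-lake",
            "config": {
                "base_path": "s3://my-bucket/my-folder/sales-table",
                "s3": {
                    "aws_config": {
                        "aws_access_key_id": "minio-access-key",
                        "aws_secret_access_key": "minio-secret-key",
                        "aws_region": "us-east-1",
                        # Point the S3 client at the MinIO server instead of AWS.
                        "aws_endpoint_url": "http://minio:9000",
                    }
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()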
gifted-knife-16120
08/11/2022, 1:29 PM
athena platform, and I would like to delete 2 of them. How can it be done?
damp-queen-61493
08/11/2022, 8:26 PM
v0.8.43 (dev env)
## Recipe
source:
  type: kafka
  config:
    platform_instance: poc_cluster_0
    connection:
      bootstrap: 'xxxxxx.gcp.confluent.cloud:9092'
      consumer_config:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        sasl.username: '${CLUSTER_API_KEY_ID}'
        sasl.password: '${CLUSTER_API_KEY_SECRET}'
      schema_registry_url: 'https://xxxxx.gcp.confluent.cloud'
      schema_registry_config:
        basic.auth.user.info: '${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}'
colossal-sandwich-50049
08/11/2022, 8:50 PM
I'm running datahub ingest -c my-delta-recipe.yml locally and getting the error below; can someone assist?
Note, I am running DataHub using datahub quickstart.
###### Recipe
source:
  type: "delta-lake"
  config:
    env: "PROD"
    platform_instance: "my-delta-lake"
    platform: "delta-lake"
    base_path: "s3://my-bucket/data/v3/"
    s3:
      aws_config:
        aws_region: "eu-west-1"
        aws_endpoint_url: "http://<local-ip>:4566"
sink:
  type: "datahub-rest"
  config:
    server: "http://<local-ip>:8080" # local IP
###### Logs
[2022-08-11 16:46:47,941] INFO {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43
[2022-08-11 16:46:48,004] INFO {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://10.12.242.238:8080
[2022-08-11 16:46:49,063] ERROR {logger:26} - Please set env variable SPARK_VERSION
[2022-08-11 16:46:49,565] INFO {datahub.cli.ingest_cli:119} - Starting metadata ingestion
[2022-08-11 16:46:49,565] INFO {datahub.cli.ingest_cli:123} - Source (delta-lake) report:
{'workunits_produced': '0',
'workunit_ids': [],
'warnings': {},
'failures': {},
'cli_version': '0.8.43',
'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
'py_version': '3.9.9 (main, Nov 21 2021, 03:23:42) \n[Clang 13.0.0 (clang-1300.0.29.3)]',
'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
'os_details': 'macOS-12.4-x86_64-i386-64bit',
'filtered': []}
[2022-08-11 16:46:49,565] INFO {datahub.cli.ingest_cli:126} - Sink (datahub-rest) report:
{'records_written': '0', 'warnings': [], 'failures': [], 'gms_version': 'v0.8.43'}
[2022-08-11 16:46:50,061] ERROR {datahub.entrypoints:188} - Command failed with argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'. Run with --debug to get full trace
[2022-08-11 16:46:50,061] INFO {datahub.entrypoints:191} - DataHub CLI version: 0.8.43 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
kind-whale-32412
08/11/2022, 8:54 PM
In the graph.get_aspect_v2 part you always have to make a GET request to the server first to obtain all existing tags, then append if it's a new tag, and then emit it to DataHub.
I find this design a little odd: the client side has to know what all the tags are, while the server side is completely stateless.
I attempted to bypass getting the aspect and just constructed a MetadataChangeProposalWrapper with GlobalTagsClass(tags=[tag_association_to_add]), regardless of the current state. I noticed that this removes all the other tags; I was expecting it to append only the tag I am trying to add, not remove the others.
Is this intended by design? Is there a way to change this with a flag or some other way to submit?
One big issue here is the race condition: if I am submitting these changes through kafka events (or even in a synchronous, parallel way) and there happen to be multiple MCPWs for the same column, other tags could be lost.
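For reference, a minimal sketch of the read-modify-write flow being described, based on the get_aspect_v2 / MetadataChangeProposalWrapper pattern from the SDK examples; the GMS URL, dataset URN, and tag name are placeholders:
# Read-modify-write of the globalTags aspect; all identifiers are placeholders.
from datahub.emitter.mce_builder import make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.myschema.mytable,PROD)"
tag_urn = make_tag_urn("needs-review")

# 1. GET the current globalTags aspect (this is the extra round trip in question).
current_tags = graph.get_aspect_v2(
    entity_urn=dataset_urn,
    aspect="globalTags",
    aspect_type=GlobalTagsClass,
)

# 2. Append locally; emitting GlobalTagsClass(tags=[...]) on its own replaces the
#    whole aspect, which is why the other tags disappear.
if current_tags is None:
    current_tags = GlobalTagsClass(tags=[])
if tag_urn not in [assoc.tag for assoc in current_tags.tags]:
    current_tags.tags.append(TagAssociationClass(tag=tag_urn))

# 3. Emit the merged aspect back as an UPSERT.
graph.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="globalTags",
        aspect=current_tags,
    )
)
Note this sketch does nothing about the race condition raised above: two writers running the same pattern concurrently can still overwrite each other's tags.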
dazzling-insurance-83303
08/11/2022, 9:55 PM
KeyError: 'Did not find a registered class for simple_add_dataset_domain'
I am able to associate a domain with database tables via the domains section under the source configuration. I am trying to group all the ownership and domain associations in the same section, i.e. transformers.
Per the documentation, the transformers section should look like this:
transformers:
  - type: "simple_add_dataset_domain"
    config:
      semantics: OVERWRITE
      domains:
        - urn:li:domain:engineering
My construct, which throws the error, is as follows:
transformers:
  - type: "simple_add_dataset_ownership"
    ...
  - type: "simple_add_dataset_domain"
    config:
      domains:
        - "urn:li:domain:xxxxxxxx-xxxx-xxxx-xxxx-bffdfb8b977d" # Finance
Thoughts?
Thanks!
alert-fall-82501
08/12/2022, 6:19 AM
alert-fall-82501
08/12/2022, 6:19 AM
alert-fall-82501
08/12/2022, 6:23 AM
full-chef-85630
08/12/2022, 6:49 AM
bright-cpu-56427
08/12/2022, 7:15 AM
average-rocket-98592
08/12/2022, 11:56 AM
fancy-thailand-73281
08/12/2022, 2:00 PM
We are running v0.8.36 in an AWS EKS cluster with helm charts, using AWS MSK (Kafka) with SASL/SCRAM for the bootstrap servers and ZooKeeper TLS. Everything works fine up to here.
But we are not able to ingest data (Snowflake) from the UI; I see 'N/A' when I try to run ingestion.
The DataHub docs (https://datahubproject.io/docs/ui-ingestion/) say that we need to enable datahub-actions, so we deployed the public.ecr.aws/datahub/acryl-datahub-actions pods.
The pods are not running (CrashLoopBackOff) and we see the error logs below:
[2022-08-11 19:56:59,004] ERROR {datahub.entrypoints:138} - File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 77, in run
    67  def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:

KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Failed to create consumer: No provider for SASL mechanism GSSAPI: recompile librdkafka with libsasl2 or openssl support. Current build options: PLAIN SASL_SCRAM OAUTHBEARER"}
2022/08/11 19:56:59 Command exited with error: exit status 1
Could someone please help us? Thanks in advance.
rapid-fall-7147
08/12/2022, 4:29 PM
delightful-zebra-4875
08/15/2022, 11:46 AM
chilly-sundown-93656
08/15/2022, 1:18 PM
query = """
mutation {
  createIngestionSource(input: {
    name: "$name",
    type: "kafka",
    description: "$name",
    config: {
      recipe: "$recipe",
      executorId: "default"
    }
  })
}
"""
The rationale is to allow users from different teams to add their own data stores and see what we are already ingesting.
I'm not sure whether this is a legitimate method, because the service has a lot of trouble pulling topics from several clusters in parallel.
There are a lot of MySQL, Kafka, and Schema Registry exceptions, and sometimes I need to rerun the execution to make it pass.
I'd appreciate any advice on this matter.
Thanks
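A rough sketch of issuing the same mutation with GraphQL variables instead of string substitution, so the recipe only needs json.dumps rather than manual quote escaping. The /api/graphql path, the UpdateIngestionSourceInput type name, and the token handling are assumptions to check against your GMS GraphQL schema; the recipe and source name are placeholders:
# Sketch: create an ingestion source via the GraphQL API using variables.
import json
import requests

GMS_GRAPHQL = "http://datahub-gms:8080/api/graphql"   # assumed endpoint
TOKEN = "<personal-access-token>"                      # if metadata auth is enabled

MUTATION = """
mutation createSource($input: UpdateIngestionSourceInput!) {
  createIngestionSource(input: $input)
}
"""

recipe = {
    "source": {
        "type": "kafka",
        "config": {"connection": {"bootstrap": "broker:9092"}},
    }
}

variables = {
    "input": {
        "name": "team-a-kafka",
        "type": "kafka",
        "description": "team-a-kafka",
        "config": {
            # The recipe travels as a single JSON string inside the input.
            "recipe": json.dumps(recipe),
            "executorId": "default",
        },
    }
}

resp = requests.post(
    GMS_GRAPHQL,
    json={"query": MUTATION, "variables": variables},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())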
bright-receptionist-94235
08/15/2022, 4:49 PM
boundless-mechanic-19488
08/15/2022, 6:18 PM
source:
  type: glue
  config:
    aws_region: eu-central-1
    aws_access_key_id: YYY
    aws_secret_access_key: XXX
I've attached the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabases",
        "glue:GetTables"
      ],
      "Resource": [
        "arn:aws:glue:eu-central-1:XXX:catalog",
        "arn:aws:glue:eu-central-1:XXX:database/*",
        "arn:aws:glue:eu-central-1:XXX:table/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDataflowGraph",
        "glue:GetJobs"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
Does anyone have a hint? The pipeline always executes successfully in about 2.5 seconds, ingests 0 values, and does not report any issues other than a client-server incompatibility warning:
Your client version 0.8.42 is older than your server version 0.8.43. Upgrading the cli to 0.8.43 is recommended.
quick-megabyte-61846
08/12/2022, 1:17 PM
❯ datahub get --urn "urn:li:dataPlatform:dbt"
{
  "dataPlatformInfo": {
    "datasetNameDelimiter": ".",
    "displayName": "dbt",
    "logoUrl": "/assets/platforms/dbtlogo.png",
    "name": "dbt",
    "type": "OTHERS"
  },
  "dataPlatformKey": {
    "platformName": "dbt"
  }
}
alert-fall-82501
08/16/2022, 6:21 AM