lively-jackal-83760
09/08/2022, 10:37 AM
The calling method's class, org.springframework.beans.factory.xml.XmlBeanDefinitionReader, is available from the following locations:
jar:file:/Users/dseredenko/.m2/repository/org/springframework/spring-beans/5.3.21/spring-beans-5.3.21.jar!/org/springframework/beans/factory/xml/XmlBeanDefinitionReader.class
jar:file:/Users/dseredenko/.m2/repository/io/acryl/datahub-client/0.8.44/datahub-client-0.8.44.jar!/org/springframework/beans/factory/xml/XmlBeanDefinitionReader.class
It looks like datahub-client bundles its own copy of the Spring Framework, which conflicts with mine. I don't see any Spring dependency in the datahub-client POM, and I don't understand how to exclude Spring from datahub-client.
Can someone help?
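A quick way to confirm whether the Spring classes really ship inside the datahub-client jar itself (path taken from the log above; adjust to your local ~/.m2 layout):

jar tf ~/.m2/repository/io/acryl/datahub-client/0.8.44/datahub-client-0.8.44.jar | grep 'org/springframework/' | head

If they do, a Maven <exclusion> on the datahub-client dependency won't help, since exclusions only remove transitive dependencies, not classes bundled into the jar.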
bland-orange-13353
09/08/2022, 11:41 AM
agreeable-belgium-70840
09/08/2022, 12:33 PM
astonishing-byte-5433
09/08/2022, 5:55 PM
volumes:
- ../mysql/init.sql:/docker-entrypoint-initdb.d/init.sql
It throws the following error when run for the first time on a new machine, causing the mysql container to fail:
Can't initialize batch_readline - may be the input source is a directory or a block device.
After running the compose file a second time it doesn't throw this error again and everything works fine.
I also removed the volume from the script and it seems to run fine on a fresh machine.
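One possible cause (an assumption, not confirmed in the thread): if ../mysql/init.sql does not exist on the host the first time compose runs, Docker creates that path as an empty directory, and the mysql entrypoint then fails with exactly this batch_readline message. A quick check on the fresh machine:

# if this turns out to be a directory, Docker created it; remove it and restore the real file before re-running compose
ls -ld ../mysql/init.sql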
cuddly-butcher-39945
09/08/2022, 2:42 PM
kind-lunch-14188
09/09/2022, 2:05 PM
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project_id.dataset.table,DEV)" --aspect schemaMetadata | jq >schema.json
Then I tried to upload this schema to a new, "empty" DataHub. To achieve that I ran
datahub ingest -c ingest_schema.dhub.yaml
where ingest_schema.dhub.yaml is as follows:
# see https://datahubproject.io/docs/generated/ingestion/sources/file for complete documentation
source:
  type: "file"
  config:
    filename: schema.json
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "https://gms.topic-dev.name.dev"
Anyway, I'm still receiving the same error:
Command failed with com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket.
In fact there is no UsageAggregation in the generated JSON file. My best guess is that I'm generating the JSON file the wrong way. Are you able to give me a hint on how to do it correctly?
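One possible explanation (an assumption, not confirmed here): the file source expects MetadataChangeEvent/MCP-shaped records, while datahub get returns a plain dictionary of aspects, so the reader tries to match the JSON against other record types such as UsageAggregation and fails. If your CLI version supports datahub put, writing the aspect straight back to the new instance may be simpler than going through the file source:

# sketch: assumes the CLI is pointed at the new instance (e.g. via datahub init or DATAHUB_GMS_URL)
# and that schema.json contains just the schemaMetadata aspect value
datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project_id.dataset.table,DEV)" \
  --aspect schemaMetadata -d schema.json

If datahub get wrapped the aspect in an outer object keyed by aspect name, something like jq '.schemaMetadata' schema.json can extract the bare aspect first.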
brainy-tent-14503
09/09/2022, 2:41 PM
The spark-lineage smoke tests fail because of a difference in the browse path. I am working off the main branch with only a few slight fixes to get the test to run (see diff). I am wondering if the golden json files might be out of date from what is expected, or if something else is off. For example, the tests expected this:
'com.linkedin.common.BrowsePaths': {
    'paths': ['/spark/spark_spark-master_7077/pythonhdfsin2hivecreatetable']
}
however the actual output looks like this, with the /pythonhdfsin2hivecreatetable part missing:
'com.linkedin.common.BrowsePaths': {
    'paths': ['/spark/spark_spark-master_7077']
}
The missing part shows up in the properties as appName, so ... maybe I should update the expected golden json?
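If it helps to see how widespread the drift is before regenerating anything, a small jq sketch (the filename is a placeholder for whichever golden file the test compares against) that dumps every BrowsePaths value in a file:

jq '[.. | .["com.linkedin.common.BrowsePaths"]? | select(. != null) | .paths[]]' golden_file.json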
billowy-truck-48700
09/09/2022, 7:16 PM
bitter-lizard-32293
09/09/2022, 8:00 PM
We're using the searchAcrossLineage graphQL query and we've been seeing super high latencies (>10s) while executing it. We spent some time digging into things and it looks like we're spending the bulk of our time in the getLineage call in the ESGraphQueryDao class (we use ES as our graph store too). I did find one minor bug in that the search lineage results were meant to be cached, but that is actually not being done - https://github.com/datahub-project/datahub/pull/5892. This does help us fix repeated calls for the same URN, but first-time calls are still taking a while. Does anyone have any recommendations on how we could tune / speed things up here?
Ballpark-wise, our graph_service_v1 index has around 36M docs (4.8GB on disk) and is currently running 1 shard and 1 replica (I wonder if this is too low).
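On the shard/replica question, a minimal sketch, assuming you can hit the Elasticsearch cluster directly (host and replica count below are illustrative): replicas can be raised on the live index, whereas the shard count is fixed at index creation and would require a reindex or split to change.

curl -XPUT 'http://localhost:9200/graph_service_v1/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 2}}'

Extra replicas mainly spread read load across nodes; they won't speed up a single deep getLineage traversal on their own.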
great-branch-515
09/11/2022, 7:06 PM
late-rocket-94535
09/12/2022, 7:45 AM
transformers:
- type: "simple_add_dataset_ownership"
config:
owner_urns:
- "urn:li:corpGroup:edwh"
ownership_type: "TECHNICAL_OWNER"
and after the last DataHub upgrade (v0.8.44) I see errors like:
[2022-09-12, 07:08:52 UTC] {process_utils.py:173} INFO - [2022-09-12, 07:08:52 UTC] {pipeline.py:57} ERROR - failed to write record with workunit txform-urn:li:dataPlatform:postgres-edwh.raw_tn_marketing.t_customer-TEST-ownership with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid\n\n\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:149)', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid', 'status': 422}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid\n\n\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:149)', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid', 'status': 422})
Is this a bug or not?
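A debugging sketch, not a fix: to see exactly what the transformer emits for lastModified.actor (the field the 422 complains about), you can temporarily point the same recipe at the file sink and inspect the Ownership aspect it writes out, for example:

sink:
  type: file
  config:
    filename: ./transformed_output.json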
sparse-quill-63288
09/12/2022, 12:27 PM
witty-journalist-16013
09/12/2022, 3:50 PM
aws_connection is an extra field.
witty-journalist-16013
09/12/2022, 3:52 PM
'1 validation error for DBTConfig\n'
'aws_connection\n'
' extra fields not permitted (type=value_error.extra)\n',
witty-journalist-16013
09/12/2022, 3:57 PM
early-airplane-84388
09/12/2022, 6:24 PM
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '8cc88fff-6200-465e-9d09-703160d417fc',
'infos': ['2022-09-12 13:20:11.799891 [exec_id=8cc88fff-6200-465e-9d09-703160d417fc] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-12 13:20:11.801144 [exec_id=8cc88fff-6200-465e-9d09-703160d417fc] INFO: Caught exception EXECUTING '
'task_id=8cc88fff-6200-465e-9d09-703160d417fc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
' validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
' File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
' File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
'debug_mode\n'
' extra fields not permitted (type=value_error.extra)\n']}
Execution finished with errors.
I'm running DataHub on GCP Kubernetes Engine and have also tried uninstalling and reinstalling with Helm.
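A guess at the cause (not confirmed here): debug_mode is a newer argument sent along with UI ingestion runs, so an older acryl-datahub-actions executor image that predates it rejects it with exactly this pydantic "extra fields not permitted" error; aligning the actions image with the rest of the deployment may help. A hypothetical Helm values sketch (key names depend on your chart version, so check the chart's values.yaml):

acryl-datahub-actions:
  image:
    repository: acryldata/datahub-actions
    tag: v0.0.7   # illustrative tag only; match the version your chart/app expects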
thousands-solstice-2498
09/13/2022, 5:51 AM
great-branch-515
09/13/2022, 6:22 AM
fresh-cricket-75926
09/13/2022, 8:31 AM
lively-jackal-83760
09/13/2022, 11:43 AM
hallowed-dog-79615
09/13/2022, 2:57 PM
great-branch-515
09/02/2022, 10:12 AM
kind-whale-32412
09/13/2022, 11:42 PM
(Result window is too large) for paths that have more than 10,000 assets. I created an issue explaining it in detail:
https://github.com/datahub-project/datahub/issues/5928
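While the issue is open, a possible stopgap (a sketch, assuming direct Elasticsearch access; the index name and limit below are illustrative): the message comes from Elasticsearch's index.max_result_window setting, which defaults to 10,000 and can be raised per index, at the cost of more memory per deep query.

curl -XPUT 'http://localhost:9200/datasetindex_v2/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"max_result_window": 50000}}'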
famous-florist-7218
09/14/2022, 8:11 AM
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
I've deployed a Spark k8s operator with the latest datahub-spark-lineage jar (0.8.44-2). It has failed to emit metadata to GMS. The error is: Application end event received, but start event missing for appId spark-xxx. That's why I need to enable log4j debug mode.
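One common way to get those two log4j lines picked up by the driver and executors (a sketch, assuming a log4j 1.x properties file shipped alongside the job; the file path is a placeholder): ship the file with --files and point log4j at it via the extra Java options. With the Spark operator, the same spark.* keys can go under sparkConf in the SparkApplication spec.

spark-submit \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  ...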
purple-balloon-66501
09/14/2022, 8:53 AM
limited-forest-73733
09/14/2022, 10:46 AM
witty-lamp-55264
09/14/2022, 12:46 PM
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
After searching, I found that the latest Kubernetes versions no longer serve some API versions, such as policy/v1beta1 for PodDisruptionBudget. I want to know whether there will be a newer version that works with the latest Kubernetes versions, or whether there is another solution.
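A quick way to confirm what your cluster actually serves for PodDisruptionBudget (policy/v1 has been available since Kubernetes 1.21, and policy/v1beta1 was removed in 1.25), before looking for a chart release that renders policy/v1:

kubectl api-versions | grep '^policy/'
kubectl explain poddisruptionbudget --api-version=policy/v1 | head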
witty-lamp-55264
09/14/2022, 12:46 PM
salmon-angle-92685
09/14/2022, 12:51 PM
source:
  type: snowflake
  config:
    account_id: ${DH_SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${DH_SNOWFLAKE_WAREHOUSE}
    username: ${DH_SNOWFLAKE_USER}
    password: ${DH_SNOWFLAKE_PASSWORD}
    role: ${DH_SNOWFLAKE_ROLE}
    include_tables: True
    include_views: True
    ignore_start_time_lineage: true
    stateful_ingestion:
      enabled: True
      remove_stale_metadata: True
    profiling:
      enabled: true
    profile_pattern:
      allow:
        - 'DATABASE_NAME.SCHEMA_NAME.*'
    database_pattern:
      allow:
        - "OIP"
    schema_pattern:
      allow:
        - "SCHEMA_NAME"
    table_pattern:
      deny:
        - '.*\._AIRBYTE_.*'
pipeline_name: "snowflake_ingestion"
sink:
  type: datahub-rest
  config:
    server: ${DATAHUB_SERVER}
Thank you guys in advance!
bright-diamond-60933
09/14/2022, 4:02 PM
mlfeatureindex_v2 is duplicated with different timestamps appended to the name. Sometimes we also notice that the pod doesn't even start up when the standalone consumers flag is enabled: