bland-orange-13353
03/01/2024, 12:18 PM
miniature-mouse-35911
03/01/2024, 4:49 PM
source:
  type: athena
  config:
    aws_region: us-east-1
    work_group: primary
    s3_staging_dir: 's3://datahubpoc-data/athena-results/'
    catalog_name: datahubpoc-gluecatalog
    aws_role_arn: 'arn:aws:iam::<awsaccountid>:role/test-datahubec2-poc-role'
    profiling:
      enabled: 'True'
bulky-island-74277
03/04/2024, 3:10 AM
gifted-diamond-19544
03/04/2024, 8:58 AM
Athena ingestion. I was looking into the permissions, and it seems that Datahub needs permissions to run queries on Athena, as well as to get objects from S3. Are these permissions necessary if I just want to ingest metadata from Athena (meaning, no profiling)?
bland-application-65186
03/04/2024, 10:01 AM
s3://my-bucket/{dept}/tests/{table}/*.avro
# specifying keywords to be used in display name
What's the expected result of using {dept}?
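(For context, a minimal sketch of where that pattern would sit in an s3 source recipe; the bucket and region are placeholders. Per the comment in the docs example, a named keyword like {dept} is captured from the path and used in the dataset's display name, while {table} marks the table folder.)
source:
  type: s3
  config:
    path_specs:
      - include: 's3://my-bucket/{dept}/tests/{table}/*.avro'
    aws_config:
      aws_region: us-east-1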
purple-addition-48342
03/04/2024, 8:29 PM
boundless-bear-68728
03/05/2024, 1:41 AM
datahub-action service. Currently, I have assigned 6Gi with a max of up to 8Gi, but I can still see the service consuming around 7.6Gi of memory, and during this time the application UI becomes unresponsive. Is there any resolution to this issue? Currently, I am trying to ingest metadata for just 1 Snowflake DB with all advanced options turned on. Do I need to cut down on the number of schemas I am trying to ingest, or should I push the datahub-action service for more memory?
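(If narrowing the ingestion scope turns out to be the answer, a minimal sketch of limiting the Snowflake source to specific schemas via schema_pattern; the schema name is a placeholder and the connection details are elided.)
source:
  type: snowflake
  config:
    # ... connection details ...
    schema_pattern:
      allow:
        - '^MY_SCHEMA$'
    profiling:
      enabled: false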
elegant-salesmen-99143
03/05/2024, 8:07 AM
The env parameter is about to be deprecated. It said to use platform_instance instead. But it looks like platform_instance is for different use cases and works differently.
For example, I had a recipe that had env: STG. I tried replacing it with platform_instance: STG, but now when I look at the database structure, I have a container PROD at the upper level (PROD is the default value for env), and inside it I have an STG container with my database.
Is that the expected behavior?
Environment isn't the same thing as instance, so how do I specify the environment now?
After env is deprecated, what will happen to the databases that have PROD as the default value for env, not specified in the recipe? Will they behave differently from those where env: PROD is specified in the recipe?
I did this while on Datahub 12.1; I haven't upgraded to 13.0 yet, as I wanted to try replacing env first.
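(For reference, a minimal sketch assuming env and platform_instance remain independent settings: env sets the environment/fabric part of the URN, while platform_instance names a specific deployment of the platform and shows up as its own container. The source type and instance name below are placeholders.)
source:
  type: postgres
  config:
    env: STG                      # environment / fabric of the emitted URNs
    platform_instance: stg_pg_01  # a named instance of the platform
    # ... connection details ...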
able-jelly-63005
03/05/2024, 9:24 AM
few-accountant-12561
03/05/2024, 1:20 PM
few-piano-98292
03/05/2024, 7:24 PM
boundless-bear-68728
03/05/2024, 10:53 PM
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/dispatcher/default_dispatcher.py", line 30, in dispatch_async
    res = executor.execute(request)
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/reporting_executor.py", line 94, in execute
    self._datahub_graph.emit_mcp(completion_mcp)
  File "/usr/local/lib/python3.10/site-packages/datahub/emitter/rest_emitter.py", line 245, in emit_mcp
    self._emit_generic(url, payload)
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': 'HTTPConnectionPool(host=\'datahub-datahub-gms\', port=8080): Max retries exceeded with url: /aspects?action=ingestProposal (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'datahub-datahub-gms\', port=8080): Read timed out. (read timeout=30)"))'})
2024-03-05T22:44:20.989419034Z
Can you please help me with this issue?
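(One thing worth checking, as a sketch only: the "read timeout=30" in the error matches the default client read timeout, so raising timeout_sec on the datahub-rest sink may help, assuming GMS is just slow to respond rather than down. The server value is a placeholder.)
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
    timeout_sec: 120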
fresh-river-19527
03/06/2024, 12:23 PM
some-alligator-9844
03/06/2024, 2:43 PM
some-alligator-9844
03/06/2024, 2:46 PM
['env is deprecated and will be removed in a future release. Please use platform_instance instead.']
recipe.yaml
source:
  type: hive
  config:
    platform_instance: ANA.OCE.DEV
    env: DEV
    host_port: 'xxxxxxx.visa.com:10000'
    username: xxxxxxx
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
sink:
  type: datahub-rest
  config:
    server: '${DATAHUB_GMS_HOST}'
    token: '${DATAHUB_GMS_TOKEN}'
    max_threads: 1
Datahub CLI version: 0.12.1.3
happy-branch-193
03/06/2024, 3:12 PM
incalculable-sundown-8765
03/06/2024, 7:18 PM
datahub delete. I want to hard delete everything related to redshift.
However, I encounter this issue:
% datahub delete --platform redshift --dry-run
[2024-03-06 20:13:35,266] INFO {datahub.cli.delete_cli:341} - Using DataHubGraph: configured to talk to http://localhost:8080
[2024-03-06 20:13:36,009] ERROR {datahub.entrypoints:201} - Command failed: ('Unable to get metadata from DataHub', {'message': '401 Client Error: Unauthorized for url: http://localhost:8080/api/graphql'})
Do I need a token to run the command? If so, how can I include the token in the command?
Thank you.
Datahub version: v0.12.1
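(A minimal sketch of one way to supply the token: the CLI reads ~/.datahubenv, which datahub init can generate; the server and token values below are placeholders. Setting the DATAHUB_GMS_TOKEN environment variable is an alternative.)
gms:
  server: 'http://localhost:8080'
  token: '<personal-access-token>'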
modern-orange-37660
03/06/2024, 9:31 PM
cuddly-dinner-641
03/07/2024, 4:02 PM
flat-bear-65100
03/08/2024, 2:10 AM
'container': ['urn:li:container:8e7ba34c02ebac26523e12b245223254',
'urn:li:container:8f14caa5a1220e7890ee5ca61d5c570d',
'urn:li:container:cee410e83a7898b2dda07dc3440c7cfd',
'urn:li:container:83e2422984342072527ec4f411c231e8',
'urn:li:container:4be6f93ced89cf3af76c4d5aa0a4313f',
'urn:li:container:fbf321045931666f19a792a7bcbd2d2e',
'urn:li:container:7702dc6c60dc4dbdd8ba26f3dc6464ad',
'urn:li:container:8da75ef4e929ee8bdc0dc8287d16cd2b',
'urn:li:container:1d0508f2f359898db300c54bd57ad670',
'urn:li:container:196bbcab079fa9315eb6badccfa8befb',
'... sampled of 21 total elements']},
'aspects': {'dataset': {'datasetProperties': 26, 'schemaMetadata': 26, 'operation': 26, 'container': 26, 'browsePathsV2': 52, 'status': 26},
'container': {'containerProperties': 21,
'status': 21,
'dataPlatformInstance': 21,
'subTypes': 21,
'browsePathsV2': 42,
'container': 20}},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'filtered': [],
'start_time': '2024-03-07 21:02:59.023267 (19.02 seconds ago)',
'running_time': '19.02 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 328,
'records_written_per_second': 16,
'warnings': [],
'failures': [],
'start_time': '2024-03-07 21:02:58.256726 (19.79 seconds ago)',
'current_time': '2024-03-07 21:03:18.048629 (now)',
'total_duration_in_seconds': 19.79,
'max_threads': 15,
'gms_version': 'v0.13.0',
'pending_requests': 0}
flat-bear-65100
03/08/2024, 2:11 AM
quick-guitar-82682
03/08/2024, 5:17 AM
some-zoo-21364
03/08/2024, 10:26 AM
default_args = {
'owner': 'mygroup',
}
and the group YAML file contains:
id: mygroup
display_name: "My Group"
email: "mygroup@example.com"
however, triggering the DAG creates a new user with type CORP_USER and urn urn:li:corpuser:mygroup, instead of mapping it to the group entity with urn urn:li:corpGroup:mygroup.
gifted-coat-97302
03/08/2024, 12:21 PM
document_missing_exception. There seems to be data in the metadata_aspect_v2 database table, but nothing in Elasticsearch, and nothing is visible in the datahub-frontend either.
Datahub Details:
• Version: 0.12.1 (using docker images with this version)
• deployment type: Kubernetes (AWS EKS)
• deployment method: Custom internal Helm chart
◦ Frontend deployment separately
◦ GMS deployed with multiple replicas
▪︎ with MCE/MAE turned off
▪︎ metadata-auth enabled
▪︎ Hazelcast enabled (although we are having problems with this, so currently only running one replica)
◦ MAE consumer deployment separately with 2 replicas
◦ MCE consumer deployment separately with 1 replica
Further details in the thread; any help will be much appreciated.
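(Since metadata_aspect_v2 has data but Elasticsearch is empty, one commonly suggested remedy is re-running the restore-indices upgrade. A rough sketch below, assuming a Kubernetes Job around the acryldata/datahub-upgrade image; the image tag, configmap name, and env wiring are placeholders that would need to match how your GMS pods get their SQL/Kafka/Elasticsearch settings.)
apiVersion: batch/v1
kind: Job
metadata:
  name: datahub-restore-indices
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: datahub-upgrade
          image: acryldata/datahub-upgrade:v0.12.1
          args: ["-u", "RestoreIndices"]
          # assumption: reuse the same env as GMS (EBEAN_DATASOURCE_*, KAFKA_*, ELASTICSEARCH_*)
          envFrom:
            - configMapRef:
                name: datahub-gms-env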
miniature-magician-74764
03/08/2024, 7:49 PM
• Athena (top): urn:li:dataset:(urn:li:dataPlatform:athena,dq_cat_test.mod_cat1_test1,PROD)
• dbt (bottom): urn:li:dataset:(urn:li:dataPlatform:dbt,AwsDataCatalog.dq_cat_test.mod_cat1_test1,PROD)
Is there a way to add the correct Data Catalog into the Athena ingestion URN? Working with siblings would be impossible due to the volume and the data mesh schema we are developing.
athena_ingestion_nonprod.py
# The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": "us-east-2",
                "work_group": "primary",
                "query_result_location": "REDACTED",
                "catalog_name": "AwsDataCatalog",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "REDACTED",
                "token": "REDACTED",
            },
        },
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()
recipe.dhub.dbt_nonprod.yaml
source:
  type: "dbt"
  config:
    # Coordinates
    # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
    manifest_path: "REDACTED/manifest.json"
    catalog_path: "REDACTED/catalog.json"
    sources_path: "REDACTED/sources.json" # optional for freshness
    test_results_path: "REDACTED/run_results.json" # optional for recording dbt test results after running dbt test
    # Options
    target_platform: "athena" # e.g. bigquery/postgres/etc.
    # incremental_lineage: False # for when we want to remove the previous lineage
    entities_enabled: # Multiple dbt projects
      sources: "no"
sink:
  type: "datahub-rest"
  config:
    server: "REDACTED"
    token: "REDACTED"
boundless-bear-68728
03/08/2024, 10:13 PM
ripe-machine-72145
03/09/2024, 2:03 PM
worried-agent-2446
03/10/2024, 2:42 PM
clean-magazine-98135
03/11/2024, 2:42 AM
rich-barista-93413
03/11/2024, 9:24 AM