late-notebook-97260
09/05/2023, 9:17 AM
query getDatasetUpstreams($urn: String!) {
  downstream: searchAcrossLineage(
    input: {
      urn: $urn,
      direction: DOWNSTREAM,
      count: 1000,
      orFilters: [
        {
          and: [
            {
              field: "filter_tags",
              values: ["critical"],
              condition: CONTAIN
            }
          ]
        }
      ]
    }
  ) {
    total
    searchResults {
      degree
      entity {
        type
        urn
      }
    }
  }
}
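For reference, a query like the one above would normally be sent to a GraphQL endpoint as a JSON body with the query text and its variables. The sketch below only builds that body and reads URNs back out of a response with the same shape as the selection set; it makes no network call. The helper names `build_lineage_payload` and `extract_urns` and the example URN are assumptions for illustration, and the filter field is copied unchanged from the message rather than verified.

```python
import json

# Query text as posted above; only $urn is a variable.
LINEAGE_QUERY = """
query getDatasetUpstreams($urn: String!) {
  downstream: searchAcrossLineage(
    input: {
      urn: $urn, direction: DOWNSTREAM, count: 1000,
      orFilters: [{and: [{field: "filter_tags", values: ["critical"], condition: CONTAIN}]}]
    }
  ) {
    total
    searchResults { degree entity { type urn } }
  }
}
"""


def build_lineage_payload(urn: str) -> dict:
    """Return the JSON body for a GraphQL POST (hypothetical helper)."""
    return {"query": LINEAGE_QUERY, "variables": {"urn": urn}}


def extract_urns(response: dict) -> list:
    """Collect entity URNs from a response shaped like the query's selection set."""
    results = response["data"]["downstream"]["searchResults"]
    return [item["entity"]["urn"] for item in results]


if __name__ == "__main__":
    # The URN below is only an example value, not taken from the original message.
    payload = build_lineage_payload(
        "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
    )
    print(json.dumps(payload["variables"]))
```

The payload would then be POSTed with whatever HTTP client and access token the deployment uses; those details are left out here because they depend on the environment.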
shy-diamond-99510
09/05/2023, 12:39 PM
melodic-match-91677
09/06/2023, 3:34 AM
eager-monitor-4683
09/06/2023, 6:39 AM
wonderful-library-51057
09/06/2023, 10:24 PM
hdfs://my-orders folder. A new file is uploaded each day, but I want my logical dataset to be "orders". I have an Airflow job that runs every week on the last week's files. For example, the Airflow job run build-weekly-summary/__scheduled_2023-09-09T01:00:00 reads [hdfs://my-orders/2023-09-03.parquet, hdfs://my-orders/2023-09-04.parquet, etc.] and writes hdfs://order-summary/2023-09-09.parquet.
I want to track lineage that shows which files were read and written by a specific run of a job, but I can't find a way to register a file like hdfs://my-orders/2023-09-06.parquet to the orders dataset in DataHub. Effectively I want to:
1. Go to DataHub and click on the "Orders" logical dataset.
2. See that this dataset is composed of 24 files in the data lake, including hdfs://my-orders/2023-09-03.parquet.
3. Click on hdfs://my-orders/2023-09-03.parquet and see (via lineage) that it was read by the build-weekly-summary/__scheduled_2023-09-09T01:00:00 job.
4. See that this job passed all the validation checks.
5. See that this job also wrote out a file to hdfs://order-summary/2023-09-09.parquet.
Is that possible with DataHub? It seems like the S3 data lake tooling supports something similar, but HDFS is tied to Hive? If it's not natively supported, would it be realistic to implement a custom source?
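To make the model described in steps 1 to 5 concrete, here is a minimal standard-library sketch of the relationships being asked for: one URN per file, and a job run that records which file URNs it read and wrote. It does not use the DataHub SDK and says nothing about what DataHub supports natively. The `hdfs` platform name, the choice of the file path as the dataset name, and the names `hdfs_file_urn`, `JobRun`, and `readers_of` are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


def hdfs_file_urn(path: str, env: str = "PROD") -> str:
    """Build a dataset URN for one file (assumes an 'hdfs' platform, path as name)."""
    prefix = "hdfs://"
    name = path[len(prefix):] if path.startswith(prefix) else path
    return f"urn:li:dataset:(urn:li:dataPlatform:hdfs,{name},{env})"


@dataclass
class JobRun:
    """One scheduled run of a job and the file URNs it touched (hypothetical model)."""
    job: str
    run_id: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def readers_of(file_urn: str, runs: List[JobRun]) -> List[str]:
    """Return 'job/run_id' for every run that read the given file."""
    return [f"{run.job}/{run.run_id}" for run in runs if file_urn in run.inputs]


if __name__ == "__main__":
    run = JobRun(
        job="build-weekly-summary",
        run_id="__scheduled_2023-09-09T01:00:00",
        inputs=[
            hdfs_file_urn("hdfs://my-orders/2023-09-03.parquet"),
            hdfs_file_urn("hdfs://my-orders/2023-09-04.parquet"),
        ],
        outputs=[hdfs_file_urn("hdfs://order-summary/2023-09-09.parquet")],
    )
    print(readers_of(hdfs_file_urn("hdfs://my-orders/2023-09-03.parquet"), [run]))
```

A custom source would need to produce records carrying roughly this information; how they map onto DataHub entities is the open question in the message above.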
Edited: replaced S3 path examples with HDFS, as that's the store we're actually using in our environment.
dazzling-rainbow-96194
09/07/2023, 4:50 PM
wonderful-library-51057
09/07/2023, 8:56 PM
stale-guitar-30481
09/07/2023, 10:53 PM
What does li stand for in the URN expression?
e.g. urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)
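For context, `li` is the URN namespace and refers to LinkedIn, where DataHub originated. The rest of the example URN breaks down into an entity type (`dataset`) and a tuple of platform, name, and environment. The sketch below splits a dataset URN of exactly this shape using only the standard library; `DatasetUrn` and `parse_dataset_urn` are hypothetical names for illustration, not part of any DataHub package.

```python
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split 'urn:li:dataset:(<platform urn>,<name>,<env>)' into its parts."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    # Platform comes first and env last; split from both ends so that
    # a name containing commas is kept in one piece.
    platform_urn, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    platform = platform_urn.rsplit(":", 1)[-1]
    return DatasetUrn(platform, name, env)


if __name__ == "__main__":
    print(parse_dataset_urn(
        "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
    ))
    # DatasetUrn(platform='hive', name='fct_users_created', env='PROD')
```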
shy-kangaroo-51257
09/08/2023, 6:37 AM
best-monitor-90704
09/08/2023, 8:32 AM
bumpy-computer-90932
09/08/2023, 12:44 PM
orange-gpu-90973
09/08/2023, 2:34 PM
elegant-machine-46829
09/08/2023, 6:18 PM
ERROR: Could not find a version that satisfies the requirement acryl-datahub[datahub-kafka,datahub-rest,redshift]==@cliMajorVersion@
Execution finished with errors.
{'exec_id': 'f7a40783-a3ea-4c35-8161-47f449c22e4b',
'infos': ['2023-09-08 16:58:42.877515 INFO: Starting execution for task with name=RUN_INGEST',
"2023-09-08 16:58:55.261806 INFO: Failed to execute 'datahub ingest'",
'2023-09-08 16:58:55.264998 INFO: Caught exception EXECUTING task_id=f7a40783-a3ea-4c35-8161-47f449c22e4b, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
Is there any easy way this can be fixed? I'm not even sure which container this is coming from. Thanks for any help.
icy-umbrella-3214
09/08/2023, 7:50 PM
datahub docker start
but I use Colima instead of Docker since I am on a MacBook M1, and it fails.
icy-umbrella-3214
09/08/2023, 7:50 PM
❯ datahub docker quickstart
Detected M1 machine
[2023-09-08 12:50:31,809] INFO {datahub.cli.quickstart_versioning:144} - Saved quickstart config to /Users/edmondoporcu/.datahub/quickstart/quickstart_version_mapping.yaml.
[2023-09-08 12:50:31,810] INFO {datahub.cli.docker_cli:645} - Using quickstart plan: composefile_git_ref='master' docker_tag='head'
Docker doesn't seem to be running. Did you start it?
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7cadfc70b5d2 postgres:15.4-alpine3.17 "docker-entrypoint.s…" 11 days ago Up 7 days 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp postgres-arroyo
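One thing that may be worth checking here: `docker ps` succeeds, so the Docker CLI is reaching Colima, but a tool that connects to Docker through the default socket or the `DOCKER_HOST` environment variable might not find it. The sketch below assumes Colima exposes its socket at `~/.colima/<profile>/docker.sock` (with `default` as the profile name) and prints an `export` line if that socket exists. The helper name `colima_docker_host` is hypothetical, and whether this resolves the quickstart failure is untested.

```python
import os
from pathlib import Path


def colima_docker_host(home: str, profile: str = "default") -> str:
    """Return a DOCKER_HOST value for Colima's socket (path layout is an assumption)."""
    return f"unix://{home}/.colima/{profile}/docker.sock"


if __name__ == "__main__":
    host = colima_docker_host(str(Path.home()))
    socket_path = host[len("unix://"):]
    if os.path.exists(socket_path):
        # Paste the printed line into the shell before re-running the quickstart.
        print(f'export DOCKER_HOST="{host}"')
    else:
        print(f"No socket found at {socket_path}; is Colima running with this profile?")
```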
icy-umbrella-3214
09/08/2023, 7:50 PM
best-monitor-90704
09/11/2023, 1:02 AM
best-monitor-90704
09/11/2023, 1:03 AM
microscopic-spring-39376
09/11/2023, 4:27 AM
best-monitor-90704
09/11/2023, 5:28 AM
shy-kangaroo-51257
09/11/2023, 8:48 AM
shy-kangaroo-51257
09/11/2023, 9:18 AM
wonderful-library-51057
09/11/2023, 4:28 PM
alert-angle-39401
09/11/2023, 8:19 PM
alert-angle-39401
09/11/2023, 8:19 PM
alert-angle-39401
09/11/2023, 8:19 PM
alert-angle-39401
09/11/2023, 8:19 PM
alert-angle-39401
09/11/2023, 8:20 PM
alert-angle-39401
09/11/2023, 8:20 PM
alert-angle-39401
09/11/2023, 8:27 PM