wooden-football-7175
02/08/2022, 3:54 PM
mysterious-portugal-30527
02/09/2022, 12:43 AM
version 0.8.25
Running the Docker QuickStart on Linux and connecting through Chrome on an MBP, adding an ingestion through the web application. Choosing Execute fails.
Why is this failing:
sink:
type: datahub-rest
config:
server: '<http://localhost:8080>'
Log shows:
"ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by "
"NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fae83a81a30>: Failed to establish a new connection: [Errno 111] "
"Connection refused'))\n",
"2022-02-09 00:27:27.263935 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Failed to execute 'datahub ingest'",
'2022-02-09 00:27:27.269727 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Caught exception EXECUTING '
'task_id=e989b898-fb4d-4eec-9d9c-965a78650cb9, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Curl shows:
curl <http://localhost:8080/config>
{
"models" : { },
"versions" : {
"linkedin/datahub" : {
"version" : "v0.8.25",
"commit" : "306fe0b5ffe3e59857ca5643136c8b29d80d4d60"
}
},
"statefulIngestionCapable" : true,
"retention" : "true",
"noCode" : "true"
}
What am I missing??
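For reference, a minimal connectivity sketch assuming the default quickstart Docker network: when the recipe runs inside the actions container, localhost is that container rather than the host curl was run from, so GMS would be reached via its service name (assumed here to be datahub-gms:8080).
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Sketch only: the service name and port are assumptions based on the default
# quickstart compose file, not a confirmed fix for this report.
emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
emitter.test_connection()  # hits /config and raises if GMS is unreachable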
shy-island-99768
02/09/2022, 7:35 AM
full_name: project-p-p:stats.active_stats
name: active_stats
owners:
- email: <mailto:abel@vanmoof.com|abel@vanmoof.com>
notes:
description: Collect stats...
usage:
- department_name:
example_usage:
- hello
bigquery_link: <https://bigquery.googleapis.com/bigquery/v2/projects/blabla/datasets/bla/tables/active_stats>
columns:
- name: frame_number
description:
is_primary_key:
aliases: []
unit:
relations: []
- name: created_at
description:
is_primary_key:
aliases: []
unit:
relations: []
- name: product_id
description:
is_primary_key:
aliases: []
unit:
relations: []
plain-farmer-27314
02/09/2022, 2:24 PM
We now support the ability to ignore specific users when calculating Top Users of a Dataset/Column — this is useful when you want to exclude users designated for maintenance/automated execution.
So we can yeet our airflow user out of datahub 🙂
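A minimal sketch of how that exclusion might look, assuming the usage sources' user_email_pattern filter is the knob behind this feature; the project, account, and regex are made up:
from datahub.ingestion.run.pipeline import Pipeline

# Sketch only: assumes user_email_pattern is the relevant filter; values are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {
                "projects": ["my-project"],
                "user_email_pattern": {
                    # Drop the automation/service account from Top Users calculations.
                    "deny": ["airflow@.*"]
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.pretty_print_summary()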
lively-fall-12210
02/09/2022, 4:03 PM
domain.domain_key.allow
and domain.domain_key.deny
are used. Are they intended to extract domain names from the topic name by a capturing group in the regex? Or are they used to only keep topics that belong to a certain domain? Does somebody have an example? The documentation is a bit short here. Thanks a lot!
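A sketch of how that mapping is usually read, assuming domain_key stands for the domain name or urn and the allow/deny regexes are plain filters (not capturing groups) that decide which topics get assigned to that domain; shown here for the kafka source with made-up values:
from datahub.ingestion.run.pipeline import Pipeline

# Sketch only: the domain urn, patterns, and bootstrap address are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {"bootstrap": "localhost:9092"},
                "domain": {
                    # The "domain_key" is the domain itself; the patterns pick its topics.
                    "urn:li:domain:sales": {
                        "allow": ["^sales_.*"],
                        "deny": ["^sales_tmp_.*"],
                    }
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)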
wooden-football-7175
02/09/2022, 6:25 PM
glue pipelines
that I imported from the aws source. I could manage to use the Airflow backend for lineage,
but I do not find documentation on how to configure glue
as a job to connect two different `datasets` (also glue). Does anyone have any reference? Thanks in advance!!
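One possible way to stitch that lineage by hand, as a sketch only and not a documented glue-source feature: emit a dataJobInputOutput aspect for the glue job so it links the two datasets. All urns and names below are made up.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DataJobInputOutputClass

# Sketch only: orchestrator, ids, and dataset names are placeholders.
job_urn = builder.make_data_job_urn(
    orchestrator="glue", flow_id="my_glue_workflow", job_id="my_glue_job", cluster="PROD"
)
io_aspect = DataJobInputOutputClass(
    inputDatasets=[builder.make_dataset_urn("glue", "db.input_table", "PROD")],
    outputDatasets=[builder.make_dataset_urn("glue", "db.output_table", "PROD")],
)
mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=job_urn,
    aspectName="dataJobInputOutput",
    aspect=io_aspect,
)
DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)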
handsome-football-66174
02/09/2022, 7:50 PM
rich-policeman-92383
02/09/2022, 8:13 PM
glamorous-house-64036
02/09/2022, 10:18 PM
source:
type: postgres
config:
# Coordinates
host_port: URL:5432
database: DATABASENAME
# Credentials
username: user
password: password
#Options
include_tables: True
include_views: True
sink:
type: "datahub-rest"
config:
server: "<http://localhost:9002/api/gms>" # this path is what the UI ingestion tool suggests; I also tried the default "<http://localhost:8080>" with the same result
Both the postgres and datahub-rest plugins look enabled.
Upd: Error log moved into thread.
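For reference, a sketch of the same recipe driven from Python with the sink pointed straight at GMS (assumed to be localhost:8080 in a quickstart-style setup; the /api/gms path above appears to go through the frontend on 9002 instead). Host, database, and credentials are the placeholders from the recipe.
from datahub.ingestion.run.pipeline import Pipeline

# Sketch only: connection values are the placeholders from the recipe above.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "URL:5432",
                "database": "DATABASENAME",
                "username": "user",
                "password": "password",
                "include_tables": True,
                "include_views": True,
            },
        },
        # Assumption: GMS is reachable directly on port 8080 in this setup.
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.pretty_print_summary()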
rich-winter-40155
02/10/2022, 4:22 AM
broad-tomato-45373
02/10/2022, 6:31 AM
extraVolumes:
- name: user-props
configMap:
name: user-props
extraVolumeMounts:
- name: user-props
mountPath: /datahub-frontend/conf/user.props
3. upgraded the helm chart with the new values
helm upgrade --install -f values.yaml -f overide_chart_values.yml datahub datahub/datahub
But I didn't succeed in getting the new users to log in.
I am very much new to K8s and helm charts.
Any help would be much appreciated.
gray-spoon-5206
02/10/2022, 6:36 AM
adorable-flower-19656
02/10/2022, 6:57 AM
square-machine-96318
02/10/2022, 6:58 AM
few-air-56117
02/10/2022, 7:33 AM
great-dusk-47152
02/10/2022, 8:10 AM
rhythmic-kitchen-64860
02/10/2022, 8:12 AM
datahub.ingestion.run.pipeline
the whole code is
from datahub.ingestion.run.pipeline import Pipeline

# The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "username": "postgres",
                "password": "strongpass",
                "database": "northwind",
                "host_port": "localhost:5432",
                "database_alias": "test",
                "schema_pattern": {
                    "allow": [
                        "public"
                    ]
                },
                "table_pattern": {
                    "allow": [
                        "test.public.region",
                        "test.public.suppliers"
                    ]
                }
            }
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://localhost:8080"
            }
        }
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()
and I want to try to make a scheduler that runs that config; is it possible to do that?
thank you.
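A minimal sketch of one way to schedule it, assuming a plain long-running Python process is acceptable; in practice a cron entry or an Airflow/Prefect job calling the same function is the more common setup. The recipe values are the ones from the message above, abbreviated.
import time

from datahub.ingestion.run.pipeline import Pipeline

# Same recipe as above (abbreviated); fill in the full source/sink config.
RECIPE = {
    "source": {
        "type": "postgres",
        "config": {
            "username": "postgres",
            "password": "strongpass",
            "database": "northwind",
            "host_port": "localhost:5432",
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}

def run_ingestion() -> None:
    # Build a fresh Pipeline per run so every execution starts clean.
    pipeline = Pipeline.create(RECIPE)
    pipeline.run()
    pipeline.pretty_print_summary()

# Re-run the ingestion once a day.
while True:
    run_ingestion()
    time.sleep(24 * 60 * 60)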
few-air-56117
02/10/2022, 8:54 AM
source:
type: bigquery-usage
config:
projects:
- p1
- p2
credential:
project_id:
private_key_id:
private_key: '${PRIVATE_KEY}'
client_email:
client_id:
sink:
type: datahub-rest
config:
server:
I got this error
'1 validation error for BigQueryUsageConfig\n'
'credential\n'
' extra fields not permitted (type=value_error.extra)\n',
so it looks like I can't add credential on bigquery-usage (on bigquery it works)
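A sketch of a possible workaround, assuming the bigquery-usage source in this version does not accept an inline credential block: fall back to the standard GOOGLE_APPLICATION_CREDENTIALS mechanism that the Google client libraries honour. The key path is a placeholder.
import os

from datahub.ingestion.run.pipeline import Pipeline

# Sketch only: the key path is a placeholder; project ids are from the recipe above.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {"projects": ["p1", "p2"]},
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.pretty_print_summary()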
silly-beach-19296
02/10/2022, 12:19 PM
crooked-van-51704
02/10/2022, 2:15 PM
dbt? I have been seeing an issue when I try to ingest a dbt project; it causes a DuplicateKeyException.
When I disable the dbt node creation using disable_dbt_node_creation: True
it works fine, so it must be related to the dbt-specific metadata.
Oddly, I can disable the dbt nodes, do the ingestion successfully, re-enable the dbt nodes, and then ingestion works without any errors.
The specific error I see in the stack trace is this
'Caused by: java.sql.BatchUpdateException: Duplicate entry '
"'urn:li:dataset:(urn:li:dataPlatform:snowflake,citibike_tripdata.' for key 'metadata_aspect_v2.PRIMARY'\n"
limited-cricket-18852
02/10/2022, 4:29 PM
ambitious-guitar-89068
02/11/2022, 5:01 AM
curved-truck-53235
02/11/2022, 1:47 PM
narrow-bird-99605
02/11/2022, 2:51 PM
handsome-football-66174
02/11/2022, 4:21 PMentityUrn=builder.make_data_job_urn(
orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"
),
modern-monitor-81461
02/11/2022, 6:09 PM
none. In MySQL, a database is pretty much the same as a schema, so it is unclear to me what the database should be... Seeing none is not what I was expecting. Is it there to represent the fact that it is absent from MySQL? If so, would it be better to simply not create that container?
Now, if I want to document the hostname of the MySQL server in DataHub, is that when I need to use platform instances? I thought platform instances were used to differentiate different instances of MySQL servers, am I right? Looking for guidance here on how to use those concepts, since I want to apply them to data lakes in an Iceberg source I am currently working on.
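A sketch of how platform instances are usually wired up, assuming the standard platform_instance source setting: one name per server, so datasets from different MySQL hosts do not collide. The hostname and credentials are made up.
from datahub.ingestion.run.pipeline import Pipeline

# Sketch only: all connection values are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "mysql-prod-01.internal:3306",
                # Identifies this particular server; it becomes part of the dataset urns.
                "platform_instance": "mysql-prod-01",
                "username": "datahub",
                "password": "example-password",
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)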
cool-painting-92220
02/12/2022, 12:13 AM
mysterious-nail-70388
02/14/2022, 8:25 AM
proud-accountant-49377
02/14/2022, 9:58 AM
red-napkin-59945
02/14/2022, 6:20 PM