rhythmic-flag-69887
02/19/2022, 12:29 PM
source:
  type: postgres
  config:
    host_port: '***.com:5432'
    database: **
    username: **
    password: **
However, I'm getting an error. Is this error because I wrote something wrong in the recipe?
ConnectionRefusedError: [Errno 111] Connection refused
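A hedged aside: [Errno 111] is a TCP-level failure (nothing accepted the connection at the resolved host and port), not a recipe syntax problem. A minimal sketch to test reachability outside DataHub, with placeholders standing in for the masked values above:

import socket

# Placeholders for the masked host_port from the recipe above.
host, port = "example.com", 5432

try:
    # Same TCP handshake the ingestion source attempts first.
    with socket.create_connection((host, port), timeout=5):
        print("TCP connection succeeded; the recipe can reach the database")
except OSError as exc:
    print(f"TCP connection failed: {exc}")  # Errno 111 reproduces here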
witty-painting-90923
02/22/2022, 10:28 AM
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    # This configuration is analogous to a recipe configuration.
    {
        "source": {
            "type": "elasticsearch",
            "config": {
                "env": ENV,
                "host": es_connection_host_port,
                "username": es_connection_login,
                "password": es_connection_password,
                "index_pattern": {"deny": [es_deny_index_pattern]},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
        "transformers": [
            {
                "type": "set_dataset_browse_path",
                "config": {
                    # ENV, PLATFORM and DATASET_PARTS are literal template
                    # variables here, so no f-string is needed.
                    "path_templates": ["/ENV/PLATFORM/EsComments/DATASET_PARTS"]
                },
            },
            {
                "type": "simple_add_dataset_tags",
                "config": {"tag_urns": ["urn:li:tag:EsComments"]},
            },
        ],
    }
)
pipeline.run()
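One small, hedged addition when driving pipelines programmatically: the Pipeline class also exposes status helpers, so failures surface instead of passing silently (method availability may vary by version):

# After pipeline.run():
pipeline.raise_from_status()     # raise if the source or sink reported failures
pipeline.pretty_print_summary()  # print a human-readable ingestion summary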
lively-fall-12210
02/22/2022, 2:24 PM
source:
  type: kafka
  config:
    connection:
      bootstrap: 'my-broker:9092'
      schema_registry_url: 'http://my-schema-registry:8081'
    topic_patterns:
      deny:
        - ^_.+
    domain:
      'urn:li:domain:3215d470-9bb9-4cdf-be43-e971047b4b72':
        allow:
          - '^foo\.bar*'
      'urn:li:domain:a518ea17-b705-4e59-94be-75cd1c600ca7':
        allow:
          - '^foo\.bazz*'
      'urn:li:domain:ea46bbbe-33c2-4a7e-bedd-665037df50fc':
        allow:
          - '^foo\.blub*'
sink:
  type: datahub-rest
  config:
    server: 'http://my-datahub:8080'
However, when executing the recipe, I get this validation error:
'1 validation error for KafkaSourceConfig\n'
'domain\n'
' extra fields not permitted (type=value_error.extra)\n',
According to the source, my deployment of DataHub should support the domain field already. Am I doing something subtly wrong here? Thank you very much for your support!
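A hedged aside: "extra fields not permitted" is pydantic rejecting a field the installed client doesn't know, so the acryl-datahub version running the recipe matters more than the server deployment here. A quick check:

# Print the client version actually executing the recipe; if it predates the
# kafka source's `domain` config, upgrading acryl-datahub should fix this.
import datahub
print(datahub.__version__)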
plain-farmer-27314
02/22/2022, 2:39 PM
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - return self.main(*args, **kwargs)
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - rv = self.invoke(ctx)
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - return ctx.invoke(self.callback, **ctx.params)
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - return callback(*args, **kwargs)
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "discord_data/python/bin/datahub/datahub_looker_ingest", line 33, in datahub_looker_ingest
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - 'sink': {'type': 'datahub-rest', 'config': {'server': f'{server_url}'}},
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - return cls(config, dry_run=dry_run, preview_mode=preview_mode)
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 116, in __init__
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - preview_mode=preview_mode,
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/common.py", line 41, in __init__
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/graph/client.py", line 47, in __init__
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - ca_certificate_path=self.config.ca_certificate_path,
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py", line 121, in __init__
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - allowed_methods=self._retry_methods,
[2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
Any thoughts on what could be causing this? It could also be that another dependency needs to be updated.
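This does look like a dependency-version issue rather than anything in the recipe: the allowed_methods keyword on urllib3's Retry only exists from urllib3 1.26 onward (older releases call it method_whitelist), and datahub's rest emitter passes allowed_methods. A hedged check:

# If this prints something below 1.26, the pinned urllib3 predates the
# `allowed_methods` kwarg and upgrading it should clear the TypeError.
import urllib3
print(urllib3.__version__)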
fierce-alligator-27212
02/22/2022, 4:18 PM
[2022-02-22 10:38:24,592] INFO {datahub.cli.ingest_cli:86} - Starting metadata ingestion
[2022-02-22 10:38:24,592] INFO {datahub.ingestion.source.sql.bigquery:320} - Populating lineage info via GCP audit logs
[2022-02-22 10:38:25,997] INFO {datahub.ingestion.source.sql.bigquery:381} - Start loading log entries from BigQuery
[2022-02-22 11:00:36,725] INFO {datahub.ingestion.source.sql.bigquery:520} - Creating lineage map: total number of entries=0, number skipped=0.
[2022-02-22 11:00:36,726] INFO {datahub.ingestion.source.sql.bigquery:316} - Built lineage map containing 0 entries.
Config:
source:
  type: bigquery
  config:
    project_id: <GCP Project ID>
    env: prod
    include_table_lineage: True
    start_time: 2022-02-20 00:00:00Z
    end_time: 2022-02-21 00:00:00Z
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
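The empty lineage map suggests the audit-log query over that window returned nothing. A hedged, DataHub-independent sanity check that the window actually contains BigQuery audit entries (assumes the google-cloud-logging package and configured credentials; the project id is a placeholder and the filter is illustrative, not the exact one the source uses):

import itertools
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="my-gcp-project")  # placeholder project id
log_filter = (
    'protoPayload.serviceName="bigquery.googleapis.com" '
    'AND timestamp >= "2022-02-20T00:00:00Z" AND timestamp < "2022-02-21T00:00:00Z"'
)
entries = client.list_entries(filter_=log_filter, page_size=10)
sample = list(itertools.islice(entries, 5))
print(f"found {len(sample)} audit log entries (0 would explain the empty lineage map)")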
handsome-football-66174
02/22/2022, 9:05 PM
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: localhost:9092
      schema_registry_url: http://localhost:8081
I believe we need to point the schema registry to something other than the above?
kafka:
  bootstrap:
    server: "<bootstrap server>"
  zookeeper:
    server: "<zookeeper server>"
  schemaregistry:
    url: "http://prerequisites-cp-schema-registry:8081"
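If the recipe runs inside the same cluster as the prerequisites, pointing schema_registry_url at the in-cluster service from the helm values above seems right; localhost:8081 only works where a port-forward or local registry exists. A hedged reachability check using the standard Schema Registry REST API:

import requests

# In-cluster hostname taken from the helm values above; adjust to wherever
# the ingestion actually runs.
resp = requests.get("http://prerequisites-cp-schema-registry:8081/subjects", timeout=5)
print(resp.status_code, resp.json())  # expect 200 and a JSON list of subjects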
modern-monitor-81461
02/23/2022, 8:12 PMgroups_pattern
and users_pattern
since I only want to ingest specific users and groups. My AD contains thousands of entries and it creates a huge log of filtered items, which is just polluting the logs and not having any real value. I still want the logs since when things go sideways, I need to know what is going on, so redirecting the logs to /dev/null
is not an option. I could hack it with grep, but I'd like to know if there is way to disable some reporting? From me reading the code, I don't think there is, but I might have missed something. I think the reporting is done via introspection of a dataclass
, so the filtered
list is being printed if defined. Would there be a way (by modifying the existing code) to disable that list using a param passed to the AzureADSourceReport
constructor? And instead of recording all the filtered names, I could simply keep a count...
@dataclass
class AzureADSourceReport(SourceReport):
    filtered: List[str] = field(default_factory=list)

    def report_filtered(self, name: str) -> None:
        self.filtered.append(name)
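A minimal sketch of that suggested modification, assuming a new flag on the report (log_filtered_names is a hypothetical name, not an existing DataHub option):

from dataclasses import dataclass, field
from typing import List

from datahub.ingestion.api.source import SourceReport


@dataclass
class AzureADSourceReport(SourceReport):
    filtered: List[str] = field(default_factory=list)
    filtered_count: int = 0
    # Hypothetical switch: when False, only the count is kept, so thousands
    # of filtered AD entries no longer flood the report output.
    log_filtered_names: bool = True

    def report_filtered(self, name: str) -> None:
        self.filtered_count += 1
        if self.log_filtered_names:
            self.filtered.append(name)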
fierce-airplane-70308
02/23/2022, 10:29 PM
from typing import List

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineage,
)
from datahub.metadata.schema_classes import ChangeTypeClass

# Construct upstream tables.
upstream_tables: List[UpstreamClass] = []
upstream_table_1 = UpstreamClass(
    dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.USERS", "PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
upstream_tables.append(upstream_table_1)
upstream_table_2 = UpstreamClass(
    dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.JOBS", "PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
upstream_tables.append(upstream_table_2)

# Construct a lineage object.
upstream_lineage = UpstreamLineage(upstreams=upstream_tables)

# Construct a MetadataChangeProposalWrapper object. Note: upstreamLineage is a
# dataset aspect, but the entityUrn below is built with make_dashboard_urn;
# that mismatch is likely why this does not behave as expected.
lineage_mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dashboard_urn(platform="QlikSense", name="14542bf2-65a8-46ee-b140-953a2f67ebee"),
    aspectName="upstreamLineage",
    aspect=upstream_lineage,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(lineage_mcp)
better-orange-49102
02/24/2022, 6:18 AM
version: 1
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/linkedin/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    terms:
      - name: Sensitive
        description: Sensitive Data
        custom_properties:
          is_confidential: false
That particular field doesn't show up in MySQL, and omitting it seems to cause a display bug, as discussed here: https://datahubspace.slack.com/archives/C029A3M079U/p1644386207180329
hundreds-memory-3344
02/24/2022, 5:31 PM
When I set tags in DatasetPropertiesClass, the tag does not appear in DataHub.
1. If I simply append a string to tags, doesn't it get ingested?
2. Do I need to put a URN in tags?
I attach the code I used as a sample:
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint

dataset_properties = DatasetPropertiesClass(
    description="This is Google Sample",
    externalUrl="https://www.google.com",
    customProperties={},
    tags=['Active'],
)
metadata_event = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
    aspectName="datasetProperties",
    aspect=dataset_properties,
)
emitter.emit(metadata_event)
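For what it's worth, the tags list on datasetProperties is not what the UI reads; visible tags come from the globalTags aspect and are tag URNs, not bare strings. A hedged sketch reusing the same dataset (assumes the emitter endpoint from the sample above):

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

# Tags must be URNs wrapped in TagAssociationClass, not bare strings.
tags_aspect = GlobalTagsClass(
    tags=[TagAssociationClass(tag=builder.make_tag_urn("Active"))]
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint
emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
        aspectName="globalTags",
        aspect=tags_aspect,
    )
)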