most-market-73884
11/10/2022, 3:35 PM
I am using datahub_action to push checkpoint results into DataHub. In DataHub, I am using database_alias to use a different name for a Postgres schema, but the URNs generated by Great Expectations use the original name, so the results won't show up in DataHub. There is platform_instance_map for platforms; is there something similar for the database name?
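For reference, platform_instance_map is configured on the DataHubValidationAction inside the checkpoint's action_list; a minimal sketch (the server URL and instance name are placeholders):
action_list:
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://localhost:8080  # placeholder GMS address
      platform_instance_map:
        postgres: my_postgres_instance  # placeholder platform instance name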
bland-orange-13353
11/10/2022, 5:00 PM
worried-flower-88750
11/10/2022, 5:06 PM
dazzling-insurance-83303
11/10/2022, 6:43 PM
A question regarding Postgres `COMMENT` ingestion: I am working on making Postgres table/column `COMMENT`s available in DataHub documentation, but for some reason the `COMMENT`s are not getting picked up.
The CLI and DataHub versions we are using are 0.9.1.
The table DDL is as below:
CREATE TABLE IF NOT EXISTS public.accounts
(
    id bigint NOT NULL DEFAULT nextval('accounts_id_seq'::regclass),
    account_uuid character varying COLLATE pg_catalog."default",
    status character varying COLLATE pg_catalog."default",
    created_at timestamp(6) without time zone NOT NULL,
    updated_at timestamp(6) without time zone NOT NULL,
    CONSTRAINT accounts_pkey PRIMARY KEY (id)
);
ALTER TABLE IF EXISTS public.accounts OWNER to mse_accounting_qa_user;
COMMENT ON TABLE public.accounts IS 'Representation of a user account.';
COMMENT ON COLUMN public.accounts.account_uuid IS 'Unique identifier for the account across all services';
COMMENT ON COLUMN public.accounts.status IS 'The current status of the account, default("created")';
-- Recipe file (redacted) is as below
# accounts
source:
  type: postgres
  config:
    # Coordinates
    host_port: xxxx:65432
    database: accounts_db
    # Credentials
    username: datahub_user
    password: ${DATAHUB_USER_DB_PWD}
    env: 'QA'
    # allow or deny tables for ingestion
    table_pattern:
      allow:
        - .*
      deny: []
    # allow or deny schemas for ingestion
    schema_pattern:
      allow:
        - .*
      deny:
        - "information_schema"
    # allow or deny views for ingestion - 'schema_name.view_name'
    view_pattern:
      allow:
        - .*
      deny: []
    # PostgreSQL DataHub profiler settings
    # See README.md for details
    profile_pattern:
      allow:
        - .*
      deny: []
    profiling:
      enabled: true # default false
      profile_table_level_only: False # default false
      include_field_sample_values: False # default is True.
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpGroup:d94f1f51-xxxx-4cbc-xxxx-3197b0d9862d" # Team accounts
        - "urn:li:corpGroup:ccbf944a-xxxx-4b39-xxxx-65d19ae967d6" # Data Dictionary
  - type: "simple_add_dataset_domain"
    config:
      domains:
        - "urn:li:domain:xxxxxxx-51bc-4f87-bc2f-b44dfb8b977d" # Domain
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "xxxx:9999"
      producer_config:
        security.protocol: "ssl"
        ssl.ca.location: "/secrets/vault_ca_chain.pem"
        ssl.certificate.location: "/secrets/vault_cert.pem"
        ssl.key.location: "/secrets/vault_key.pem"
      schema_registry_url: "https://schema-registryxxx"
      schema_registry_config:
        ssl.ca.location: "/secrets/vault_ca_chain.pem"
        ssl.certificate.location: "/secrets/vault_cert.pem"
        ssl.key.location: "/secrets/vault_key.pem"
# for `- type: "simple_add_dataset_domain"` to work
datahub_api:
  server: "https://datahub-gms.xxxx:443"
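One way to check whether the COMMENTs are being extracted at all (a debugging sketch, not part of the original recipe) is to temporarily swap the Kafka sink for DataHub's file sink and look for the descriptions in the emitted schema metadata:
sink:
  type: file
  config:
    filename: ./accounts_db_mces.json  # placeholder path; table/column descriptions should appear in the SchemaMetadata aspects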
Could someone please advise if anything is amiss?
TIA! 🙏
handsome-football-66174
11/10/2022, 8:23 PM
better-spoon-77762
11/11/2022, 3:05 AM
There are services under metadata-io/src/main/java/com/linkedin/metadata/service/, e.g. DomainService, TagService, OwnerService, etc. These are only called from the unit tests as of now. Can someone share what the long-term plan for using these is?
average-dinner-25106
11/11/2022, 4:20 AM
lively-jackal-83760
11/11/2022, 8:34 AM
mysterious-advantage-78411
11/11/2022, 1:54 PM
better-spoon-77762
11/12/2022, 6:02 AM
I am running a dbt ingestion with the following recipe:
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: "/Users/asif/dbt_data/manifest.json"
    catalog_path: "/Users/asif/dbt_data/catalog.json"
    sources_path: "/Users/asif/dbt_data/sources.json"
    # Options
    target_platform: "snowflake" # e.g. bigquery/postgres/etc.
    platform_instance: "snowflake-1" # The instance of the platform that all assets produced by this recipe belong to
sink:
  type: datahub-rest # default datahub-rest
  config:
    server: "https://localhost:9002/api/gms"
    extra_headers:
      token: xxxxx
transformers:
  - type: "simple_add_dataset_properties"
    config:
      semantics: OVERWRITE
      properties:
        prop1: value1
        prop2: value2
But I keep getting this error:
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 142, in run_pipeline_async
    return await loop.run_in_executor(
File "/usr/local/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 133, in run_pipeline_to_completion
    raise e
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 125, in run_pipeline_to_completion
    pipeline.run()
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 376, in run
    for record_envelope in self.transform(
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/transformer/base_transformer.py", line 217, in transform
    transformed_aspect = self.transform_aspect(
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/transformer/add_dataset_properties.py", line 95, in transform_aspect
    assert in_dataset_properties_aspect
AssertionError
Can someone help with what could be causing this?
bulky-salesclerk-62223
11/12/2022, 9:10 PM
I'm seeing a strange issue with glossary terms created via ingestion (we're on v0.9.2 of the CLI and of DataHub).
• When created, I can see the Glossary Term on the DBT Dataset in DataHub (screenshot), which has been automatically assigned using meta_mapping (sketched below).
• If I click on the glossary term, it says it exists and I can see all the related entities (screenshot).
• However, if I go to the glossary, the glossary term isn't displayed (screenshot).
• While on the phantom glossary term's page, where I can see the entities etc., if I click on the three dots and try to move it into a term group, it says "Unknown Error Occurred" (screenshot).
I've noticed that I can type absolutely anything into the urn in the URL (/glossaryTerm/urn:li:glossaryTerm:<ANYTHING>/Related%20Entities?is_lineage_mode=false): any string I put where <ANYTHING> is gives me a glossary view of that term. However, terms created in the UI are given a long UUID, which you can see in the URL.
• Terms created in the UI persist in the Glossary menu and can be moved into groups.
• Terms created via the datahub ingestion CLI (the API) can do neither of the above.
• Creating a term in the UI first, then syncing terms up via the ingestion CLI, doesn't link the UI-created term to the term assigned to your datasets via meta_mapping, because they have different urn:li:<BLABLA> values: the UI one is a UUID, while the dbt ingestion one is a friendly name.
Any ideas?
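For context, the terms in question come out of a dbt meta_mapping block along these lines (a minimal sketch; the meta key and term name are illustrative, not the poster's actual config):
source:
  type: dbt
  config:
    # ... manifest/catalog paths as usual ...
    meta_mapping:
      has_pii:  # hypothetical key in a dbt model's meta block
        match: True
        operation: "add_term"
        config:
          term: "PII"  # yields a friendly-name urn like urn:li:glossaryTerm:PII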
Edit: I believe this issue is related but not a duplicate of this: https://datahubspace.slack.com/archives/C029A3M079U/p1666343681646089
(cc @bulky-soccer-26729 @gifted-bird-57147)
great-computer-16446
11/14/2022, 8:04 AM
ripe-belgium-29225
11/14/2022, 9:31 AM
breezy-portugal-43538
11/14/2022, 9:55 AM
green-hamburger-3800
11/14/2022, 1:48 PM
green-hamburger-3800
11/14/2022, 3:19 PM
I keep seeing this warning:
15:18:28.491 [qtp1830908236-10] WARN c.d.a.a.AuthenticatorChain:70 - Authentication chain failed to resolve a valid authentication. Errors: [(com.datahub.authentication.authenticator.DataHubSystemAuthenticator,Failed to authenticate inbound request: Authorization header is missing Authorization header.), (com.datahub.authentication.authenticator.DataHubTokenAuthenticator,Failed to authenticate inbound request: Request is missing 'Authorization' header.)]
and I'm not sure where it's coming from, since no ingestion/usage exists at the moment (I'd guess it's an internal thing and I might be missing some configuration?).
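For what it's worth, the authenticator chain in that warning is driven by the metadata-service authentication flag; a sketch of where it lives in a quickstart-style docker-compose file (an assumption about the deployment, not a diagnosis):
datahub-gms:
  environment:
    - METADATA_SERVICE_AUTH_ENABLED=true  # when true, inbound requests without an Authorization header fail this chain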
Thanks (=
bright-motherboard-35257
11/14/2022, 3:28 PM
I'm getting this error:
'[2022-11-14 15:21:02,184] ERROR {logger:26} - Please set env variable SPARK_VERSION\n'
'JAVA_HOME is not set\n'
I have JAVA_HOME set...
$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-11.el8.x86_64/
My SPARK_HOME is set...
$ echo $SPARK_HOME
/opt/spark
My pyspark version == 3.0.3
$ pyspark --version
22/11/14 09:23:51 WARN Utils: Your hostname, sa1x-eam-p1 resolves to a loopback address: 127.0.0.1; using 172.30.230.254 instead (on interface ens3)
22/11/14 09:23:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.3
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_322
Branch HEAD
Compiled by user ubuntu on 2021-06-17T04:08:22Z
Revision 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8
Url https://github.com/apache/spark
Type --help for more information.
My SPARK_VERSION is set...
$ echo $SPARK_VERSION
3.0.3
ancient-apartment-23316
11/14/2022, 6:58 PM
acceptable-terabyte-34789
11/15/2022, 7:57 AM
We are sinking DataHub's PlatformEvent_v1 topic to S3 with Kafka Connect, but the records come out looking like this:
{
"_1": "\u0003\u0000*�?Y�&CZ�=9�+�%��݊��a\"entityChangeEvent�\u0005{\"auditStamp\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1668075837259},\"entityUrn\":\"urn:li:domain:xxxxxxx\",\"entityType\":\"domain\",\"modifier\":\"urn:li:corpuser:datahub\",\"category\":\"OWNER\",\"operation\":\"ADD\",\"version\":0,\"parameters\":{\"ownerType\":\"TECHNICAL_OWNER\",\"ownerUrn\":\"urn:li:corpuser:datahub\"}} application/json"
}
These are the properties we used with StringConverter as key.converter and value.converter:
connector.class=io.confluent.connect.s3.S3SinkConnector
behavior.on.null.values=ignore
s3.region=eu-west-1
flush.size=1
schema.compatibility=NONE
tasks.max=2
topics=PlatformEvent_v1
key.converter.schemas.enable=false
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
s3.bucket.name=xxxxxxxxx
key.converter=org.apache.kafka.connect.json.JsonConverter
Then we tried changing to JsonConverter, but it throws the following error:
[Worker-073fad87bf643ddc8] Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'entityChangeEvent': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
How can we consume the records properly?
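One reading of the garbled payload above (an assumption based on the binary prefix before 'entityChangeEvent') is that PlatformEvent_v1 is Avro-encoded via the schema registry, which neither StringConverter nor JsonConverter can decode; a sketch of the corresponding converter settings, in the same properties format (the registry URL is a placeholder):
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
key.converter=org.apache.kafka.connect.storage.StringConverter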
Thank you!
acceptable-terabyte-34789
11/15/2022, 9:49 AM
gifted-rocket-7960
11/15/2022, 2:46 PM
I'm trying to set up fine-grained lineage and getting this error:
Provided urn urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,DEV),c1) is invalid: Failed to find entity with name datasetField in EntityRegistry
curl 'http://localhost:8080/entities/urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,DEV)'
{"value":{"com.linkedin.metadata.snapshot.DatasetSnapshot":{"urn":"urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,DEV)","aspects":[{"com.linkedin.metadata.key.DatasetKey":{"origin":"DEV","name":"bar4","platform":"urn:li:dataPlatform:postgres"}},{"com.linkedin.common.BrowsePaths":{"paths":["/dev/postgres"]}},{"com.linkedin.schema.SchemaMetadata":{"fields":[{"fieldPath":"c1","description":"test fine grained lineage","type":{"type":{"com.linkedin.schema.StringType":{}}},"nativeDataType":"VARCHAR(50)"}],"schemaName":"customer","version":0,"platformSchema":{"com.linkedin.schema.MySqlDDL":{"tableSchema":"col1"}},"platform":"urn:li:dataPlatform:postgres","hash":"hash"}},{"com.linkedin.dataset.DatasetProperties":{"description":"bar2 DataSet"}},{"com.linkedin.common.DataPlatformInstance":{"platform":"urn:li:dataPlatform:postgres"}}]}}}
cuddly-dream-15899
11/15/2022, 4:48 PM
worried-branch-76677
11/15/2022, 4:49 PM
I'm getting this error when emitting metadata:
[{'error': 'Unable to emit metadata to DataHub GMS',
  'info': {'message': "HTTPSConnectionPool(host='datahub-gms.net', port=443): Max retries exceeded "
                      "with url: /aspects?action=ingestProposal (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of "
                      "protocol (_ssl.c:2384)')))"}}],
witty-television-74309
11/15/2022, 4:51 PM
gentle-tailor-78929
11/15/2022, 5:17 PM
When I run ./gradlew build, I get the following error:
/datahub/metadata-models/build.gradle': 1: unable to resolve class io.datahubproject.GenerateJsonSchemaTask
@ line 1, column 1.
import io.datahubproject.GenerateJsonSchemaTask
Any ideas on what the issue may be? Thanks!
gentle-tailor-78929
11/15/2022, 5:21 PM
Has anyone tried running datahub docker quickstart --build-locally with podman-compose instead of docker-compose?
miniature-plastic-43224
11/15/2022, 5:45 PM
little-breakfast-38102
11/15/2022, 10:16 PM
best-napkin-60434
11/16/2022, 1:21 AM
best-napkin-60434
11/16/2022, 1:21 AM