hallowed-kilobyte-916
05/22/2023, 10:08 AM
import logging
import time
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
    InstitutionalMemoryClass,
)
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
    """A helper function to extract simple . path notation from the v2 field path"""
    if not field_path.startswith("[version=2.0]"):
        # not a v2 field path; assume it is already a simple path
        return field_path
    # this is a v2 field path
    tokens = [t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))]
    return ".".join(tokens)
dictionary = {
    "a": "b",
    "c": "d",
}
# Inputs -> column documentation, dataset
documentation_to_add = "The unique application (service) correlation id on service now"
dataset_name = "a/b/20230511.csv"
dataset_urn = make_dataset_urn(platform="s3", name=dataset_name, env="PROD")
def add_dict(graph, dataset_urn, column, documentation_to_add):
    need_write = False
    field_info_to_set = EditableSchemaFieldInfoClass(
        fieldPath=column, description=documentation_to_add,
    )
    # Some helpful variables to fill out objects later
    now = int(time.time() * 1000)  # milliseconds since epoch
    current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
    current_editable_schema_metadata = graph.get_aspect(
        entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass,
    )
    if current_editable_schema_metadata:
        for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
            if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
                # we have some editable schema metadata for this field
                field_match = True
                if documentation_to_add != fieldInfo.description:
                    fieldInfo.description = documentation_to_add
                    need_write = True
    else:
        # create a brand new editable schema metadata aspect
        current_editable_schema_metadata = EditableSchemaMetadataClass(
            editableSchemaFieldInfo=[field_info_to_set], created=current_timestamp,
        )
        need_write = True
    if need_write:
        event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
            entityUrn=dataset_urn, aspect=current_editable_schema_metadata,
        )
        graph.emit(event)
        log.info(f"Documentation added to dataset {dataset_urn}")
    else:
        log.info("Documentation already exists and is identical, omitting write")
# Connect to the DataHub GMS endpoint
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))
for column, documentation_to_add in dictionary.items():
    print(f"column: {column} and documentation_to_add: {documentation_to_add}")
    add_dict(graph, dataset_urn, column, documentation_to_add)
However, the code just tells me the documentation already exists, while it doesn't actually exist in DataHub:
column: a and documentation_to_add: b
INFO:__main__:Documentation already exists and is identical, omitting write
column: c and documentation_to_add: d
INFO:__main__:Documentation already exists and is identical, omitting write
Am I missing anything? I am following the instructions from here:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_column_documentation.py
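A likely culprit, judging from the code above: when the editableSchemaMetadata aspect already exists but contains no entry for the given column, the for loop never matches, nothing is appended, and need_write stays False. The linked upstream example tracks a field_match flag and appends the new field info when no match is found; a minimal sketch of that missing branch, using the same variables as in add_dict above:

import logging
log = logging.getLogger(__name__)

# Inside add_dict, when the aspect already exists:
field_match = False
for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
    if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
        # we have some editable schema metadata for this field
        field_match = True
        if documentation_to_add != fieldInfo.description:
            fieldInfo.description = documentation_to_add
            need_write = True
if not field_match:
    # this field has no editable schema metadata yet -> add it
    current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info_to_set)
    need_write = True

With that branch in place, the second and later columns should trigger a write instead of falling through to the "already exists" log line.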
lemon-scooter-69730
05/22/2023, 3:27 PM
lemon-scooter-69730
05/22/2023, 5:05 PM
dbt
for example it's a_specific_dbt_instance
hundreds-airline-29192
05/23/2023, 4:43 AM
enough-elephant-56787
05/23/2023, 6:44 AM
hundreds-airline-29192
05/23/2023, 8:12 AM
hundreds-airline-29192
05/23/2023, 9:21 AM
adamant-sunset-13770
05/23/2023, 2:48 PM
We tried capture_table_label_as_tag: true
and observed that it works for BigQuery tables.
• Could you please confirm whether this is the intended functionality?
Additionally, we would greatly appreciate any available workarounds.
Thank you and best regards,
Stian
hundreds-airline-29192
05/24/2023, 8:07 AM
https://files.slack.com/files-pri/TUMKD5EGJ-F059XGR8W4Q/image.png
hundreds-airline-29192
05/24/2023, 8:08 AM
hundreds-airline-29192
05/24/2023, 8:08 AM
agreeable-table-54007
05/24/2023, 8:47 AM
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (unity-catalog): type object 'Retry' has no attribute 'DEFAULT_METHOD_WHITELIST'
I tried a lot of stuff, installing various library versions etc., but it's not working.
If you have a solution to use only the UI connector and not the CLI, I'll gladly take it. Thanks.
Hope you have a great day, guys.
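For what it's worth, that particular AttributeError usually points at a urllib3 version mismatch: Retry.DEFAULT_METHOD_WHITELIST was renamed to DEFAULT_ALLOWED_METHODS in urllib3 1.26 and removed entirely in 2.0, so a dependency still using the old name breaks on urllib3 >= 2. A quick check you can run in the environment that executes the ingestion (a generic sketch, not DataHub-specific):

# Checks whether the installed urllib3 still exposes the old attribute name.
import urllib3
from urllib3.util.retry import Retry

print(urllib3.__version__)
# False on urllib3 >= 2.0, where DEFAULT_METHOD_WHITELIST was removed
print(hasattr(Retry, "DEFAULT_METHOD_WHITELIST"))

If that prints False, pinning urllib3 to a pre-2.0 release (e.g. installing "urllib3<2") in that environment is a common workaround.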
echoing-evening-57052
05/24/2023, 12:01 PM
boundless-nail-65912
05/24/2023, 1:24 PM
adamant-sunset-13770
05/24/2023, 1:55 PM
The connector queries `{project_id}.INFORMATION_SCHEMA.SCHEMATA`, which automatically sets region-us for the query (documentation). We have all our data in region-eu, but there does not seem to be any way of specifying the region.
1. Can you confirm that region-us is being set?
2. Do you know of any workarounds?
All the best,
Stian
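On the region question: BigQuery itself lets you qualify INFORMATION_SCHEMA views with a region prefix, so the query can be pointed at region-eu explicitly. A sketch using the google-cloud-bigquery client (whether the DataHub connector exposes a way to inject this qualifier is a separate question to verify):

# Lists schemata from a region-qualified INFORMATION_SCHEMA view instead of the
# default, which resolves to region-us when no region is specified.
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT schema_name FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA"
for row in client.query(query):  # iterating the job waits for and yields result rows
    print(row.schema_name)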
bored-truck-17085
05/25/2023, 12:14 AM
The stats tab shows "2023-05-20". Has anyone faced this issue before?
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: "/home/datahub/datahub/manifest.json"
    catalog_path: "/home/datahub/datahub/catalog.json"
    sources_path: "/home/datahub/datahub/sources.json"
    test_results_path: "/home/datahub/datahub/run_results.json"
    # Options
    stateful_ingestion:
      enabled: false
    entities_enabled:
      models: 'YES'
      sources: 'YES'
      seeds: 'NO'
      test_definitions: 'YES'
      test_results: 'YES'
    target_platform: bigquery
pipeline_name: "source_tests"
sink:
  type: "datahub-rest"
  config:
    server: http://localhost:8080
    token: ${DATAHUB_CLI_ACCESS_TOKEN}
silly-ambulance-51171
05/25/2023, 7:33 AM
hundreds-airline-29192
05/25/2023, 8:03 AM
Can I display my Spark insert-into-BigQuery flow on DataHub?
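If this is about lineage, the DataHub Spark agent can emit it by attaching a listener to the Spark session. A minimal PySpark sketch, assuming the io.acryl:datahub-spark-lineage package (the version string below is a placeholder to fill in) and a REST sink on localhost:

# Attaches the DataHub lineage listener so Spark jobs (including BigQuery writes)
# report their input/output datasets to DataHub.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bq-insert-lineage")
    # placeholder version; pick the release matching your DataHub server
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .getOrCreate()
)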
echoing-branch-87829
05/25/2023, 8:59 AM
[2023-05-25 07:55:30,344] ERROR {datahub.entrypoints:195} - Command failed: [Errno 13] Permission denied: '/root/xxxx/target/manifest.json'
Have tried giving permission to the folder but am struggling to get past this permission issue, if anyone could help please.
echoing-branch-87829
05/25/2023, 8:59 AM
echoing-branch-87829
05/25/2023, 9:00 AM
source:
  type: dbt
  config:
    manifest_path: /root/virtualone/virtualone/target/manifest.json
    catalog_path: /root/virtualone/virtualone/target/catalog.json
    sources_path: /root/virtualone/virtualone/target/sources.json
    test_results_path: /root/virtualone/virtualone/target/run_results.json
    target_platform: snowflake
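One thing worth checking for the Errno 13 above: on most systems /root itself is mode 0700, so every directory along the path needs to be traversable by whichever user runs the ingestion, not just the target folder. A small sketch to find where access actually fails (path copied from the recipe; run it as the same user that runs datahub ingest):

# Walks the path from / down and reports the first component the current user
# cannot traverse or read; Errno 13 often comes from a parent directory.
import os

path = "/root/virtualone/virtualone/target/manifest.json"
current = "/"
for part in path.strip("/").split("/"):
    current = os.path.join(current, part)
    mode = os.R_OK if current == path else os.X_OK
    if not os.access(current, mode):
        print(f"no access at: {current}")
        break
else:
    print("all components accessible")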
echoing-branch-87829
05/25/2023, 9:00 AM
ripe-stone-30144
05/25/2023, 9:56 AM
quiet-exabyte-77821
05/25/2023, 10:31 AM
ancient-queen-15575
05/25/2023, 10:50 AM
curl --location --request POST 'https://dev.mydomain.org/api/graphql' \
--header 'Authorization: Bearer <datahub token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:s3,land-dev.insurance/calculations,DEV)\") { domain { associatedUrn domain { urn properties { name } } } } }", "variables":{}}'
But I can't connect when trying to run a recipe. My environment variable for the token is called DATAHUB_GMS_TOKEN and has the same value as is used in the curl request. For DATAHUB_GMS_URL the value is https://dev.mydomain.org:8080.
I'm not understanding why a query to GraphQL would work but connecting to the GMS port wouldn't.
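A plausible explanation, given the two URLs: the working curl goes to https://dev.mydomain.org on the default HTTPS port, through the DataHub frontend, which also proxies GMS under /api/gms, while the recipe targets port 8080 directly, and that port may simply not be exposed on the host. A sketch to probe the proxied endpoint with the same token (the /api/gms path is the assumption to verify for your deployment):

# Probes GMS through the frontend proxy instead of the direct :8080 port.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(
    DatahubClientConfig(
        server="https://dev.mydomain.org/api/gms",  # same host/port as the working curl
        token="<datahub token>",
    )
)
graph.test_connection()  # raises if the endpoint is unreachable or the token is rejected

If that succeeds, pointing DATAHUB_GMS_URL at the proxied path instead of :8080 should let the recipe connect too.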
helpful-guitar-93961
05/25/2023, 11:51 AM
flat-engineer-75197
05/25/2023, 12:22 PM
helpful-guitar-93961
05/25/2023, 1:11 PM
adventurous-pillow-74569
05/25/2023, 1:47 PM
adamant-sunset-13770
05/25/2023, 3:37 PM