brief-insurance-68141
09/23/2021, 9:55 PM
adamant-pharmacist-61996
09/24/2021, 2:38 AM
brave-market-65632
09/24/2021, 5:03 AM
node 1
> term 1
> term 2
node 2
> term a
> term b
and if one were to introduce a new node and terms collection at an arbitrary location in the tree,
the file should be defined like this. Is this a fair assumption?
node 1
> term 1
> term 2
node 2
node 3
> term c
> term d
This did work for me. Wondering if I'm missing something here. This meant that I had to repeat the
name and description configs for the nodes. One could write an abstraction to generate the yaml file.
Would it make sense to simply have a parent_urn or something like that in the nodes config to make it
canonical to attach a node and term collection at any point in the tree?
Something like
node 3
parent_node: node 1.node 2 or simply node 2
> term c
> term d
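To make the suggestion concrete, here is a purely hypothetical sketch in the business glossary YAML shape; parent_node is not an existing config key, it is just the kind of thing I have in mind:
nodes:
  - name: node 1
    description: "..."
    terms:
      - name: term 1
      - name: term 2
  - name: node 3
    description: "..."
    # hypothetical key, does not exist in the current glossary config
    parent_node: node 2
    terms:
      - name: term c
      - name: term d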
Thanks!
mysterious-monkey-71931
09/24/2021, 5:41 AM
dbhistory
topics for datahub?
witty-keyboard-20400
09/24/2021, 8:37 AM
adventurous-scooter-52064
09/25/2021, 6:53 AM
connection.schema_registry_url
? 😢
https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub#config-details-1
nice-planet-17111
09/27/2021, 1:31 AM
datahub-rest
. The app (datahub) and bigquery are on the same private project. When I try to sink through console or through file, it succeeds without error. However, sinking through datahub-rest fails with ConnectionError
☹️ Is there something I'm missing?
Here's my recipe...
source:
  type: bigquery
  config:
    project_id: <my_project_id>
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
error message:
ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10ba85d00>: Failed to establish a new connection: [Errno 61] Connection refused'
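For reference, Connection refused on localhost:8080 usually just means nothing is listening at that address from wherever the recipe is executed; a minimal sketch of the same sink pointed at an explicit GMS address (the host below is a placeholder, not a known value):
sink:
  type: "datahub-rest"
  config:
    # placeholder; replace with the address where datahub-gms actually listens
    server: "http://<gms-host>:8080"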
nice-planet-17111
09/27/2021, 8:32 AM
bumpy-activity-74405
09/27/2021, 11:11 AM
removed=true
. Curious to know if anyone had success with this or any other approach.
stocky-noon-61140
09/27/2021, 12:56 PM
bland-orange-13353
09/27/2021, 7:46 PM
astonishing-lunch-91223
09/27/2021, 8:44 PM
linkedin/datahub-ingestion
container (ingest -c /workspace/data_recipe.yml
):
source:
  type: "file"
  config:
    filename: "/workspace/bootstrap_mce.json"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
and I’m using this `bootstrap_mce.json`: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json with DataHub version v0.8.14. Any ideas why I’m getting the errors from the attached log? Basically I’m hitting that No root resource defined for path '/corpUsers'
issue again.
adventurous-scooter-52064
09/28/2021, 6:24 AM
numerous-cricket-19689
09/28/2021, 6:44 PM
rough-eye-60206
09/28/2021, 8:52 PM
File "/Users/vn0d5ac/Library/Python/3.7/lib/python/site-packages/datahub/emitter/rest_emitter.py", line 94, in test_connection
f"This version of {__package_name__} requires GMS v0.8.0 or higher"
ValueError: This version of acryl-datahub requires GMS v0.8.0 or higher
brief-insurance-68141
09/28/2021, 11:04 PM
sparse-energy-27188
09/29/2021, 2:02 AM
breezy-guitar-97226
09/30/2021, 10:48 AM
remove_existing_browse_paths: true
)
Thanks!
red-smartphone-15526
09/30/2021, 11:34 AM
adorable-portugal-3397
09/30/2021, 1:48 PM
chilly-nail-87894
09/30/2021, 6:19 PM
polite-flower-25924
10/02/2021, 8:51 AM
redshift-usage
statistics was added with this PR in v0.8.15. This connector requires Redshift Super User privileges to run queries against the svv_table_info
and svl_user_info
tables, and also to explore other users' queries.
What's the approach you follow to pass a Redshift super user to this connector? I'm not sure the data platform team will allow us to use super user credentials in a connector. I guess @witty-state-99511 can give better suggestions here 🙂
tall-controller-60779
10/04/2021, 11:29 AM
witty-keyboard-20400
10/04/2021, 1:51 PM
datahub docker nuke
. But this removes all the containers, and a subsequent datahub docker quickstart
results in pulling all the containers again over the network.
witty-keyboard-20400
10/04/2021, 2:12 PM
datahub ingest -c sample.yml
which points to the latest checked-out bootstrap_mce.json, I see only 4 records ingested.
[user@localhost datahub]$ datahub ingest -c ./metadata-ingestion/examples/mce_files/sample.yml
[2021-10-04 19:35:09,834] INFO {datahub.cli.ingest_cli:57} - Starting metadata ingestion
[2021-10-04 19:35:09,858] INFO {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:0
[2021-10-04 19:35:09,886] INFO {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:1
[2021-10-04 19:35:09,907] INFO {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:2
[2021-10-04 19:35:09,964] INFO {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:3
[2021-10-04 19:35:09,964] INFO {datahub.cli.ingest_cli:59} - Finished metadata ingestion
Source (file) report:
{'failures': {},
'warnings': {},
'workunit_ids': ['file://./sample.json:0', 'file://./sample.json:1', 'file://./sample.json:2', 'file://./sample.json:3'],
'workunits_produced': 4}
Sink (datahub-rest) report:
{'failures': [], 'records_written': 4, 'warnings': []}
Pipeline finished successfully
My sample.yml is:
source:
  type: "file"
  config:
    filename: "./sample.json"
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
However, when I execute datahub docker ingest-sample-data, I see there are 82 records ingested:
...
'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:79',
'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:80',
'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:81'],
'workunits_produced': 82}
Sink (datahub-rest) report:
{'failures': [], 'records_written': 82, 'warnings': []}
What is the difference between manually ingesting against the bootstrap_mce.json vs the ingest-sample-data
command?
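One thing I notice while writing this up: the recipe above points the file source at ./sample.json rather than directly at the checked-out bootstrap_mce.json, so the two runs may simply be reading different files (an assumption on my part). A sketch pointing the file source straight at the example file in the repo:
source:
  type: "file"
  config:
    # path assumes the recipe is run from the repo root, as in the command above
    filename: "./metadata-ingestion/examples/mce_files/bootstrap_mce.json"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"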
@mammoth-bear-12532 @big-carpet-38439
witty-keyboard-20400
10/04/2021, 3:58 PM
"fields": [
  {
    "fieldPath": "[version=2.0].[type=boolean].field_foo_2",
    "jsonPath": null,
    "nullable": false,
    "description": {
      "string": "Foo field description"
    },
    "type": {
      "type": {
        "com.linkedin.pegasus2avro.schema.BooleanType": {}
      }
    },
    "nativeDataType": "varchar(100)",
    "globalTags": {
      "tags": [{ "tag": "urn:li:tag:NeedsDocumentation" }]
    },
    "recursive": false
  },
  ....
]
I checked the type for fieldPath; it's declared just as:
fieldPath: SchemaFieldPath
...and SchemaFieldPath is defined as:
typeref SchemaFieldPath = string
Question: Is there any significance to mentioning version
and type: boolean
in the SchemaFieldPath: "[version=2.0].[type=boolean].field_foo_2"
?
nice-planet-17111
10/05/2021, 6:01 AM
bigquery-usage
?
• options.credentials_path, or extra_client_options.credentials_path, does not work (fails to run the file -> got an unexpected keyword argument or extra fields not permitted)
• I tried export GOOGLE_APPLICATION_CREDENTIALS -> the file runs but it stops with the error: the caller does not have permission
• bigquery ingestion under the same environment & configs works without errors.
witty-keyboard-20400
10/05/2021, 8:03 AM
"nativeDataType": "INTEGER(unsigned=True)"
while in the glue_mces_golden.json, I see
"nativeDataType": "int",
Does the nativeDataType attribute refer to the data type natively supported by the source systems?
OR, are both formats supported by DataHub?
witty-keyboard-20400
10/05/2021, 3:32 PM
{
  "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
    "upstreams": [
      {
        "auditStamp": {
          "time": 1581407189000,
          "actor": "urn:li:corpuser:jdoe",
          "impersonator": null
        },
        "dataset": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)",
        "type": "TRANSFORMED"
      }
    ]
  }
}
Shouldn't there be a feature to track lineage at field level?
@green-football-43791 @big-carpet-38439
rough-eye-60206
10/05/2021, 10:50 PM