glamorous-library-1322
08/05/2022, 2:38 PM
Profiling a Druid source works with profile_table_level_only: true, but when it gets to columns it gets stuck on the null count (datahub/ingestion/source/ge_data_profiler.py throws an error in get_column_nonnull_count). Side note: unfortunately the query that Great Expectations tries to run against Druid to count all the nulls is not allowed 😞. There is an option to disable the null count in the Druid data source, include_field_null_count: false, but this does not stop the error (or make any difference). Does anybody have experience with profiling Druid data sources? I'm currently running 0.8.36, I run the ingestion via the client, and my ingestion yaml is very simple (below).
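For reference, a minimal sketch of where the two profiling flags mentioned above sit in a recipe, written as a programmatic pipeline. The druid host_port and GMS server are placeholders, not values from this thread:

from datahub.ingestion.run.pipeline import Pipeline

# Same shape as a YAML recipe, expressed as a dict.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "druid",
            "config": {
                "host_port": "localhost:8082",  # placeholder broker address
                "profiling": {
                    "enabled": True,
                    # False so column-level profiling actually runs
                    "profile_table_level_only": False,
                    # intended to skip the null-count metric per column
                    "include_field_null_count": False,
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS
        },
    }
)
pipeline.run()
pipeline.raise_from_status()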
brave-tomato-16287
08/05/2022, 3:14 PM
I have a folder structure root / Operations / [Operations] Common reports / workbooks*. Operations is included in the projects section of the yaml and it is ingested, but items in the subfolder, for example [Operations] Common reports, do not get ingested.
bulky-keyboard-25193
08/05/2022, 3:40 PM
I was ingesting from postgres and saw that composite types do not seem to be supported. Anything I'm missing before I look at the code?
bulky-keyboard-25193
08/05/2022, 4:13 PM
It looks like the postgres source goes through sqlalchemy. Looking there I see that it views composite types as a collection of columns, like (c1, c2, c3, …), and it expects you to access them via the ORM: https://docs.sqlalchemy.org/en/14/orm/composites.html. So I guess I need to write my own ingest code to get my composite types into DataHub?
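To illustrate what the linked SQLAlchemy docs mean by treating a composite as a collection of columns accessed via the ORM (the Point/Vertex names follow the 1.4 docs example and have nothing DataHub-specific in them):

from sqlalchemy import Column, Integer
from sqlalchemy.orm import composite, declarative_base

Base = declarative_base()

class Point:
    """Value object backing a composite of two integer columns."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __composite_values__(self):
        # SQLAlchemy flattens the composite back into its column values.
        return self.x, self.y

    def __eq__(self, other):
        return isinstance(other, Point) and (other.x, other.y) == (self.x, self.y)

class Vertex(Base):
    __tablename__ = "vertices"
    id = Column(Integer, primary_key=True)
    x1 = Column(Integer)
    y1 = Column(Integer)
    # The ORM exposes the pair of columns as a single Point attribute.
    start = composite(Point, x1, y1)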
gifted-knife-16120
08/06/2022, 9:46 AM
cold-autumn-7250
08/07/2022, 9:05 AM
victorious-tomato-25942
08/07/2022, 11:59 AM
source:
  type: postgres
  config:
    host_port: ddddd
    database: ddddd
    username: dddd
    password: xxxx
    include_tables: True
    include_views: True
    table_pattern:
      deny: '*.gateway_raw_*'
    profiling:
      enabled: True
      turn_off_expensive_profiling_metrics: True
sink:
  type: "datahub-rest"
  config:
    server: xxxx
    token: xxxx
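One thing worth noting about the recipe above: DataHub's table_pattern allow/deny entries are regular expressions, not globs, so a value like *.gateway_raw_* will not even compile. A quick check (the replacement pattern is only a suggestion):

import re

# Glob-style pattern from the recipe: invalid as a regex ("nothing to repeat").
try:
    re.compile(r"*.gateway_raw_*")
except re.error as exc:
    print("invalid regex:", exc)

# Regex equivalent that denies any table whose name contains gateway_raw_.
pattern = re.compile(r".*\.gateway_raw_.*")
print(bool(pattern.match("public.gateway_raw_events")))  # True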
aloof-oil-31167
08/07/2022, 2:25 PM
TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
Has anyone seen this error? 🙏
lemon-answer-80661
08/07/2022, 3:15 PM
crooked-rose-22807
08/08/2022, 8:16 AM
I'm looking at ignore_old_state and ignore_new_state for dbt stateful_ingestion. I don't quite get how I can check or monitor the checkpoint to see these flags working on my data. Can someone clarify where I can check, or point me to any useful articles? TQVM
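As a rough sketch of where those two flags live (assuming a dbt source reading local manifest/catalog files and a local GMS; the pipeline name and all paths are placeholders): ignore_old_state makes a run behave as if no previous checkpoint exists, while ignore_new_state skips committing a new checkpoint at the end of the run.

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        # Checkpoints are tracked per pipeline_name, so keep it stable across runs.
        "pipeline_name": "dbt_stateful_demo",
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "./target/manifest.json",  # placeholder
                "catalog_path": "./target/catalog.json",    # placeholder
                "target_platform": "postgres",
                "stateful_ingestion": {
                    "enabled": True,
                    "remove_stale_metadata": True,
                    "ignore_old_state": False,  # True => pretend there is no prior checkpoint
                    "ignore_new_state": False,  # True => do not write a checkpoint for this run
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()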
mysterious-nail-70388
08/08/2022, 8:20 AM
aloof-oil-31167
08/08/2022, 12:22 PM
'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
'com.linkedin.common.Ownership: ERROR :: /owners/0/owner :: "Provided urn Allegro" is invalid\n'
'\n'
'\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:142)\n'
'\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)\n'
'\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n'
This is my recipe:
source:
  type: delta-lake
  config:
    env: $ENV
    platform_instance: "riskified-delta-lake"
    base_path: $DELTA_TABLE_PATH # test one table, and then make this recipe work for entire bucket
    s3:
      aws_config:
        aws_role: $AWS_ROLE_NAME
        aws_region: "us-east-1"
        env: $ENV
        aws_access_key_id: ""
        aws_secret_access_key: ""
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - $OWNER
sink:
  type: "datahub-rest"
  config:
    server: "https://riskified.acryl.io/gms"
    token: $DATAHUB_TOKEN
Any ideas?
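The validation message above ("Provided urn Allegro" is invalid) suggests $OWNER is expanding to a bare name, while simple_add_dataset_ownership expects fully qualified owner URNs. A small illustration of the expected shape (the names are made up):

from datahub.emitter.mce_builder import make_group_urn, make_user_urn

# Owner URNs are corpuser or corpGroup URNs, e.g.:
print(make_user_urn("jdoe"))        # urn:li:corpuser:jdoe
print(make_group_urn("data-team"))  # urn:li:corpGroup:data-team

# So in the recipe, $OWNER should expand to something like:
#   owner_urns:
#     - "urn:li:corpuser:jdoe"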
alert-football-80212
08/08/2022, 12:41 PM
little-twilight-71687
08/08/2022, 3:30 PM
The docs say: "Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the max_rows recipe parameter (see below). JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object."
I have many JSON files which cannot be ingested because of:
could not infer schema for file s3://path/to/file.json: 'Trailing data'
It looks like DataHub uses ujson for this. How can I work around the problem, and/or when will this be fixed?
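For context, ujson raises "Trailing data" when a file contains more than one top-level JSON value, e.g. newline-delimited JSON, which matches the error above. A quick reproduction (the file contents are made up):

import ujson

single_object = '{"a": 1}'
json_lines = '{"a": 1}\n{"a": 2}\n'  # NDJSON: two top-level values in one file

print(ujson.loads(single_object))  # parses fine

try:
    ujson.loads(json_lines)
except ValueError as err:
    # ujson only accepts one JSON document per input, hence "Trailing data".
    print("failed:", err)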
victorious-pager-14424
08/08/2022, 3:43 PM
I have a question about the platform string parameter. If I pass a new data platform name or URN to this parameter, will it assign all ingested data to the new platform?
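For sources that expose a platform setting, that string ends up as the dataPlatform portion of every dataset URN the run emits, which is what ties the datasets to that platform in the UI. A tiny illustration (the platform and table names are invented):

from datahub.emitter.mce_builder import make_dataset_urn

# The platform string becomes the dataPlatform part of each dataset URN.
urn = make_dataset_urn(platform="my_custom_platform", name="db.schema.table", env="PROD")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:my_custom_platform,db.schema.table,PROD)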
bright-receptionist-94235
08/08/2022, 8:07 PM
cuddly-apple-7818
08/08/2022, 9:42 PM
lemon-zoo-63387
08/09/2022, 2:36 AM
famous-florist-7218
08/09/2022, 6:42 AM
'[2022-08-09 06:34:00,334] ERROR {datahub.ingestion.run.pipeline:126} - No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.\n'
'[2022-08-09 06:36:45,327] ERROR {logger:26} - Please set env variable SPARK_VERSION
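Both errors point at environment variables missing from the process that runs the ingestion (data-lake profiling spins up Spark through a JVM, and the Spark-based profiler checks SPARK_VERSION). A minimal sketch of setting them before the run, with placeholder values:

import os

# Placeholder values: point JAVA_HOME at an installed JDK and declare the
# Spark major.minor version available to the profiler.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")
os.environ.setdefault("SPARK_VERSION", "3.0")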
few-grass-66826
08/09/2022, 11:19 AM
alert-football-80212
08/09/2022, 2:52 PM
source:
  type: "kafka"
  config:
    # Coordinates
    env: PROD
    connection:
      bootstrap: some_url
      consumer_config:
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: user_name
        sasl.password: some_password
      schema_registry_url: some_scheme_url
    topic_patterns:
      allow:
        - some_topic_name
    topic_subject_map:
      some_topic_name-value: some_schema_name
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - some_owner_name
shy-parrot-64120
08/09/2022, 6:33 PM
curved-magazine-23582
08/10/2022, 1:52 AM
steep-soccer-91284
08/10/2022, 6:33 AM
kind-whale-32412
08/10/2022, 7:15 AM
Is there a way to add tags to a schema field with MetadataChangeProposalWrapper if I am building a custom ingestion? I couldn't find a way to do that with the Java library, and I couldn't see any reference either (it exists for the GraphQL API, https://datahubproject.io/docs/graphql/mutations/, but I couldn't find anything for MCPW).
An example GraphQL API query that I'm trying to replicate with MCPW is like this:
{
  "operationName": "addTags",
  "variables": {
    "input": {
      "tagUrns": [
        "urn:li:tag:someTag"
      ],
      "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:plato,something.here,PROD)",
      "subResource": "_file_name",
      "subResourceType": "DATASET_FIELD"
    }
  },
  "query": "mutation addTags($input: AddTagsInput!) { addTags(input: $input) }"
}
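The question is about the Java emitter, but as an illustration of the aspect involved, here is a rough sketch with the Python emitter: field-level tags live in the editableSchemaMetadata aspect, keyed by fieldPath, and that aspect can be sent through a MetadataChangeProposalWrapper. The server URL and actor are placeholders, and note that an UPSERT like this overwrites any existing editable schema metadata instead of merging into it:

import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
    GlobalTagsClass,
    TagAssociationClass,
)

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:plato,something.here,PROD)"
now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

# Tag one schema field (the GraphQL subResource) via editableSchemaMetadata.
field_info = EditableSchemaFieldInfoClass(
    fieldPath="_file_name",
    globalTags=GlobalTagsClass(tags=[TagAssociationClass(tag="urn:li:tag:someTag")]),
)

mcpw = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=dataset_urn,
    aspectName="editableSchemaMetadata",
    aspect=EditableSchemaMetadataClass(
        editableSchemaFieldInfo=[field_info],
        created=now,
        lastModified=now,
    ),
)

DatahubRestEmitter("http://localhost:8080").emit(mcpw)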
alert-football-80212
08/10/2022, 9:30 AM
busy-umbrella-4099
08/10/2022, 9:35 AM
limited-forest-73733
08/10/2022, 10:29 AM
microscopic-mechanic-13766
08/10/2022, 11:07 AM
I am getting this error when ingesting from Hive:
TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found
I am currently using datahub-gms version v0.8.42 (the release 4f35a6c where file:///etc/datahub/plugins/auth/resources is fixed), 0.8.42 for the CLI, and acryldata/datahub-actions:v0.0.4.
My recipe is the following:
source:
  type: hive
  config:
    database: null
    host_port: 'hive-server:10000'
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive-server
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
I have seen this same error in messages from almost a year ago, where the problem was that some libraries were missing. I added those libraries, but I still get the same error or a very similar one.
I have also seen that it can happen when the authentication protocol does not match, but in my case Hive uses Kerberos:
<property>
  <name>hive.server2.authentication</name>
  <value>kerberos</value>
</property>
elegant-salesmen-99143
08/10/2022, 1:40 PM