wooden-chef-22394
07/11/2022, 9:47 AM
better-bird-87143
07/11/2022, 1:12 PM
rich-policeman-92383
07/11/2022, 1:25 PM
plain-guitar-45103
07/11/2022, 5:32 PM
'PyDeltaTableError: Failed to load checkpoint: Failed to read checkpoint content: Failed to read S3 object content: Request ID: None '
'Body: <?xml version="1.0" encoding="UTF-8"?>\n'
'<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. '
'Please send all future requests to this '
endpoint.</Message><Endpoint>databricks-lake-dev-us-west-2.s3-us-west-2.amazonaws.com</Endpoint><Bucket>databricks-lake-dev-us-west-2</Bucket><RequestId>N79MDMFJ56EK9V74</RequestId><HostId>h0jvugKsFOWOKy/EPr8NELkO85lO7YQYBKR0H33LqZ7U3HkjFB2iUOM2Ne/3reGDzbzKxfEYPMg=</HostId></Error>\n'
I noticed people are experiencing similar issues with the delta-rs module when they don't pass the correct S3 region to the DeltaTable class. I experimented with this behavior in my local Jupyter notebook: when I pass the correct region, the notebook instantiates a DeltaTable object properly; when I pass the wrong region, I get exactly the same error as when ingesting with DataHub. This leads me to believe that the DataHub code is not handling the AWS region correctly. I then dug into the DataHub source code a bit and realized that the read_delta_table method is missing the region parameter. I believe that is the reason I am getting this failure. Can someone please confirm my suspicion? I am happy to open an issue on GitHub or pair up any time today to investigate further! Thanks in advance!
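For reference, a minimal sketch of the notebook experiment described above, assuming a deltalake (delta-rs Python) build that accepts storage_options; the table path and credentials are placeholders:

from deltalake import DeltaTable

# Placeholder table path and credentials; the point is the region key.
storage_options = {
    "AWS_ACCESS_KEY_ID": "xxxxx",
    "AWS_SECRET_ACCESS_KEY": "xxxxxx",
    # With the bucket's real region the table loads; with a wrong or missing
    # region, the same PermanentRedirect error as above is raised.
    "AWS_REGION": "us-west-2",
}

dt = DeltaTable("s3://my-bucket/my_relative_path", storage_options=storage_options)
print(dt.version())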
plain-guitar-45103
07/11/2022, 5:33 PM
source:
  type: delta-lake
  config:
    base_path: 'my_base_path'
    relative_path: 'my_relative_path'
    s3:
      aws_config:
        aws_access_key_id: xxxxx
        aws_secret_access_key: xxxxxx
        aws_region: us-west-2
sink:
  type: datahub-rest
  config:
    server: 'http://172.17.0.1:8080'
mysterious-nail-70388
07/12/2022, 2:59 AM
steep-vr-39297
07/12/2022, 3:25 AM
jdbc:hive2://hive_host:10001/;transportMode=http;httpPath=cliservice
It's a recipe file
source:
  type: hive
  config:
    host_port: hive_host:10001
    database: db_name
    username: id
    password: pw
    options:
      connect_args:
        http_path: "/cliservice"
        auth: LDAP
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
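As an aside, a minimal sketch (my own assumption, not the DataHub code path) for testing the same endpoint directly with pyhive over the Thrift HTTP transport; with transportMode=http on the server, a client that uses the binary SASL transport typically fails with exactly the "TSocket read 0 bytes" shown in the traceback below:

import base64
from pyhive import hive
from thrift.transport import THttpClient

# Build an HTTP transport explicitly; host, port, credentials and the
# /cliservice path are taken from the recipe above.
transport = THttpClient.THttpClient("http://hive_host:10001/cliservice")
# LDAP auth over the HTTP transport is plain Basic auth on each request.
credentials = base64.b64encode(b"id:pw").decode()
transport.setCustomHeaders({"Authorization": "Basic " + credentials})

conn = hive.connect(thrift_transport=transport, database="db_name")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())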
The error message is:
[2022-07-12 12:09:54,931] ERROR {datahub.entrypoints:184} - File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/entrypoints.py", line 149, in main
....
---- (full traceback above) ----
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/entrypoints.py", line 149, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/upgrade/upgrade.py", line 333, in wrapper
res = func(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/telemetry/telemetry.py", line 338, in wrapper
raise e
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/telemetry/telemetry.py", line 290, in wrapper
res = func(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/utilities/memory_leak_detector.py", line 102, in wrapper
res = func(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/cli/ingest_cli.py", line 131, in run
raise e
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/cli/ingest_cli.py", line 117, in run
pipeline.run()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 217, in run
self.preview_workunits if self.preview_mode else None,
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/ingestion/source/sql/sql_common.py", line 712, in get_workunits
for inspector in self.get_inspectors():
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/ingestion/source/sql/sql_common.py", line 516, in get_inspectors
with engine.connect() as conn:
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2263, in connect
return self._connection_cls(self, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 104, in __init__
else engine.raw_connection()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2370, in raw_connection
self.pool.unique_connection, _connection
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
return fn()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 304, in unique_connection
return _ConnectionFairy._checkout(self)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
rec = pool._do_get()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
self._dec_overflow()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
with_traceback=exc_tb,
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
return self._create_connection()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
return _ConnectionRecord(self)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
self.__connect(first_connect_check=True)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
with_traceback=exc_tb,
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
connection = pool._invoke_creator(self)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
return dialect.connect(*cargs, **cparams)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 508, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/pyhive/hive.py", line 126, in connect
return Connection(*args, **kwargs)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/pyhive/hive.py", line 267, in __init__
self._transport.open()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/thrift_sasl/__init__.py", line 93, in open
status, payload = self._recv_sasl_message()
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/thrift_sasl/__init__.py", line 115, in _recv_sasl_message
payload = self._trans_read_all(length)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/thrift_sasl/__init__.py", line 210, in _trans_read_all
return read_all(sz)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/thrift/transport/TTransport.py", line 62, in readAll
chunk = self.read(sz - have)
File "/users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/thrift/transport/TSocket.py", line 167, in read
message='TSocket read 0 bytes')
TTransportException: TSocket read 0 bytes
[2022-07-12 12:09:54,942] INFO {datahub.entrypoints:188} - DataHub CLI version: 0.8.40.2 at /users/user/workspace/datahub/datahub-env/lib64/python3.6/site-packages/datahub/__init__.py
[2022-07-12 12:09:54,942] INFO {datahub.entrypoints:191} - Python version: 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] at /users/user/workspace/datahub/datahub-env/bin/python3 on Linux-3.10.0-693.2.2.el7.x86_64-x86_64-with-centos-7.9.2009-Core
[2022-07-12 12:09:54,942] INFO {datahub.entrypoints:193} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.40', 'commit': '5bb7fe3691e153ff64137a8bdd64ec1473b6095f'}}, 'managedIngestion': {'defaultCliVersion': '0.8.40', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
Please help me...
lemon-zoo-63387
07/12/2022, 4:04 AM
'File "/tmp/datahub/ingest/venv-5e0fb85d-849d-4862-af24-8090d2718e47/lib/python3.9/site-packages/pytds/tds.py", line 1343, in '
'parse_prelogin\n'
" raise tds_base.Error('Client does not have encryption enabled but it is required by server, '\n"
'\n'
'DBAPIError: (pytds.tds_base.Error) Client does not have encryption enabled but it is required by server, enable encryption and try '
'connecting again\n'
'(Background on this error at: http://sqlalche.me/e/13/dbapi)\n'
future-helmet-59694
07/12/2022, 5:41 AM
We are emitting events via the DatahubRestEmitter's emit() method. Is there a better way to do this? How should we handle emitting multiple events as if they were a single transaction, with a possible rollback in case of any error during the ingestion process?
Thanks in advance! 🙂
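A minimal sketch of one way to approach this, assuming the Python REST emitter: to my knowledge the REST API has no server-side transactions, so a common workaround is to build and validate all events first, emit them one by one, and compensate on failure. The helper name and the soft-delete compensation below are my own illustration, not an official pattern:

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

def emit_all_or_compensate(mcps):
    """Emit a batch; on failure, soft-delete the entities already written."""
    emitted = []
    try:
        for mcp in mcps:
            emitter.emit_mcp(mcp)
            emitted.append((mcp.entityType, mcp.entityUrn))
    except Exception:
        # "Rollback" by soft-deleting whatever was already written.
        for entity_type, urn in emitted:
            emitter.emit_mcp(
                MetadataChangeProposalWrapper(
                    entityType=entity_type,
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=urn,
                    aspectName="status",
                    aspect=StatusClass(removed=True),
                )
            )
        raise

Note that soft-deleting is only a reasonable compensation when the batch creates the entities; for aspects attached to pre-existing entities you would need to capture and restore the previous aspect values instead.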
silly-ice-4153
07/12/2022, 8:52 AM
source acryl-datahub-airflow-plugin==0.8.35.6: EntryPoint(name='acryl-datahub-airflow-plugin', value='datahub_airflow_plugin.datahub_plugin:DatahubPlugin', group='airflow.plugins')
I also set lazy_loading to False.
I added this to a test DAG:
# imports assumed for this snippet (not shown in the original message):
from airflow.operators.python import PythonOperator
from datahub_provider.entities import Dataset

task2 = PythonOperator(
    task_id='Execute_Test_Script',
    python_callable=main,
    dag=dag,
    inlets={
        "datasets": [
            Dataset("postgres", "postgres.test.y"),
        ],
    },
    outlets={
        "datasets": [
            Dataset("postgres", "postgres.test.y"),
            Dataset("postgres", "postgres.test.x"),
            Dataset("postgres", "postgres.test.z"),
        ],
    },
)
I also added the datahub_rest_default connection to the Airflow connections, but I don't see anything in the logs indicating that it is emitting data. Does someone have an idea what could be wrong?
quick-article-20863
07/12/2022, 3:00 PM
quick-article-20863
07/12/2022, 3:00 PM
witty-butcher-82399
07/12/2022, 3:22 PM
"BigQuery doesn't need platform instances because project ids in BigQuery are globally unique."
While the feature is not required in terms of uniqueness (the project id is already included in the URN), setting the DataPlatformInstance aspect with the project id would enable using the project id for platform-instance faceting (as a filter in searches). WDYT about this? I could open a PR if there is some agreement on it.
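For illustration, a rough sketch of what attaching the project id as a DataPlatformInstance aspect could look like with the Python emitter; this is my own illustration rather than a committed design, and the project, dataset, and server values are placeholders:

from datahub.emitter.mce_builder import (
    make_data_platform_urn,
    make_dataplatform_instance_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DataPlatformInstanceClass

project_id = "my-gcp-project"  # hypothetical project id

# Use the BigQuery project id as the platform instance for a dataset.
aspect = DataPlatformInstanceClass(
    platform=make_data_platform_urn("bigquery"),
    instance=make_dataplatform_instance_urn("bigquery", project_id),
)
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn("bigquery", f"{project_id}.my_dataset.my_table"),
    aspectName="dataPlatformInstance",
    aspect=aspect,
)
DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)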
numerous-bird-27004
07/12/2022, 8:23 PM
rich-policeman-92383
07/13/2022, 6:05 AM
wonderful-egg-79350
07/13/2022, 8:16 AM
crooked-holiday-47153
07/13/2022, 12:17 PM
breezy-portugal-43538
07/13/2022, 12:40 PM
faint-advantage-18690
07/13/2022, 1:11 PM
transform_aspect() method?
lively-ice-56461
07/13/2022, 2:34 PM
>datahub --debug ingest -c ./mssql.yml
[2022-07-13 17:23:17,777] DEBUG {datahub.telemetry.telemetry:201} - Sending init Telemetry
[2022-07-13 17:23:18,130] DEBUG {datahub.telemetry.telemetry:234} - Sending Telemetry
[2022-07-13 17:23:18,302] INFO {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.40.3rc2
......
[2022-07-13 17:24:16,697] INFO {datahub.ingestion.reporting.datahub_ingestion_reporting_provider:143} - Committing ingestion run summary for pipeline:'pipeline_name',instance:'mssql_localhost:1433_master', job:'common_ingest_from_sql_source'
[2022-07-13 17:24:16,698] DEBUG {datahub.emitter.rest_emitter:224} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.28.1' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' --data '{"proposal": {"entityType": "dataJob", "entityUrn": "urn:li:dataJob:(urn:li:dataFlow:(datahub,pipeline_name_mssql_localhost:1433_master,prod),common_ingest_from_sql_source)", "changeType": "UPSERT", "aspectName": "datahubIngestionRunSummary", "aspect": {"value": "< report data>"}}
[2022-07-13 17:24:16,759] INFO {datahub.ingestion.reporting.datahub_ingestion_reporting_provider:169} - Committed ingestion run summary for pipeline:'pipeline_name',instance:'mssql_localhost:1433_master', job:'common_ingest_from_sql_source'
[2022-07-13 17:24:16,760] INFO {datahub.ingestion.run.pipeline:296} - Successfully committed changes for DatahubIngestionReportingProvider.
[2022-07-13 17:24:16,760] INFO {datahub.cli.ingest_cli:133} - Finished metadata pipeline
[2022-07-13 17:24:16,760] DEBUG {datahub.telemetry.telemetry:234} - Sending Telemetry
Source (mssql) report:
{'workunits_produced': 86,
'workunit_ids': [<workunit ids here>],
'warnings': {'database.schema.view': ['unable to map type BIT() to metadata schema']},
'failures': {},
'cli_version': '0.8.40.3rc2',
'cli_entry_location': '\\lib\\site-packages\\acryl_datahub-0.8.40.3rc2-py3.8.egg\\datahub\\__init__.py',
'py_version': '3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)]',
'py_exec_path': 'Scripts\\python.exe',
'os_details': 'Windows-10-10.0.19041-SP0',
'tables_scanned': 5,
'views_scanned': 1,
'entities_profiled': 0,
'filtered': [],
'soft_deleted_stale_entities': [],
'query_combiner': None}
Sink (datahub-rest) report:
{'records_written': 86,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2022, 7, 13, 17, 23, 30, 127613),
'downstream_end_time': datetime.datetime(2022, 7, 13, 17, 24, 15, 739817),
'downstream_total_latency_in_seconds': 45.612204,
'gms_version': 'v0.8.40'}
Pipeline finished with 1 warnings in source producing 86 workunits
[2022-07-13 17:24:18,093] DEBUG {datahub.telemetry.telemetry:234} - Sending Telemetry
Where can I look at this report info in DataHub?
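One hedged option, if the run summary is not surfaced in the UI of this version: read the committed aspect back from GMS directly. A sketch using the generic /aspects read path; the URN is the dataJob URN from the curl-equivalent line in the log above, and localhost:8080 is assumed:

import urllib.parse
import requests

# dataJob URN taken from the curl-equivalent line in the log above.
urn = (
    "urn:li:dataJob:(urn:li:dataFlow:(datahub,"
    "pipeline_name_mssql_localhost:1433_master,prod),"
    "common_ingest_from_sql_source)"
)
resp = requests.get(
    "http://localhost:8080/aspects/" + urllib.parse.quote(urn, safe=""),
    params={"aspect": "datahubIngestionRunSummary", "version": 0},
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
)
resp.raise_for_status()
print(resp.json())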
kind-whale-32412
07/13/2022, 7:15 PM
gray-hair-27030
07/13/2022, 7:42 PM
How can I load the Postgres data so that it appears as datasets in DataHub? It currently appears as a table. I'm attaching a screenshot of my configuration.
powerful-planet-87080
07/13/2022, 9:12 PM
powerful-planet-87080
07/13/2022, 9:12 PM
powerful-planet-87080
07/13/2022, 9:13 PM
powerful-planet-87080
07/13/2022, 9:15 PM
colossal-sandwich-50049
07/13/2022, 9:35 PM
1. I am trying to find the datahub-protobuf module described in this documentation (https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-protobuf) but can't seem to find it on Maven; can someone advise?
2. After running the code below with the Java emitter (written in Scala), I have found that some of the methods on DatasetProperties (e.g. setTags, setQualifiedName) don't alter anything in the user interface; can someone advise? Follow-up: I notice, based on Maven, that datahub-client is fairly new; would it be fair to say that its functionality is still fairly limited?
// imports assumed for this snippet (not shown in the original message):
import java.util.Collections
import com.linkedin.common.url.Url
import com.linkedin.data.template.{SetMode, StringArray, StringMap}
import com.linkedin.dataset.DatasetProperties
import datahub.client.rest.RestEmitter
import datahub.event.MetadataChangeProposalWrapper

val emitter: RestEmitter = RestEmitter.create(b => b
  .server("http://localhost:8080")
  .extraHeaders(Collections.singletonMap("Custom-Header", "custom-val"))
)
// emitter.testConnection()

val tags = new StringArray()
tags.add("featureStore")
tags.add("bi")

val url = new Url("https://www.denofgeek.com/")

val customProperties = new StringMap()
customProperties.put("governance", "disabled")
customProperties.put("otherProp", "someValue")

val mcpw = MetadataChangeProposalWrapper.builder()
  .entityType("dataset")
  .entityUrn("urn:li:dataset:(urn:li:dataPlatform:delta-lake,fraud.feature-stores.feature-store-v1,PROD)")
  .upsert
  .aspect(
    new DatasetProperties()
      .setName("feature-store")
      .setDescription("some feature store desc")
      .setTags(tags, SetMode.DISALLOW_NULL) // SetMode.IGNORE_NULL
      .setQualifiedName("fraudFeatureStore")
      .setExternalUrl(url)
      // .setUri(new URI("https://www.geeksforgeeks.org/"))
      .setCustomProperties(customProperties)
  )
  .build

val requestFuture = emitter.emit(mcpw, null).get()
echoing-alligator-70530
07/13/2022, 10:03 PM
wonderful-egg-79350
07/14/2022, 7:39 AM
loud-kite-94877
07/14/2022, 8:04 AM
'File "/tmp/datahub/ingest/venv-dc280a46-0332-4755-a38c-552445dc2860/lib/python3.9/site-packages/jpype/_jvmfinder.py", line 212, in '
'get_jvm_path\n'
' raise JVMNotFoundException("No JVM shared library file ({0}) "\n'
'\n'
'JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.\n'
'[2022-07-14 07:54:21,301] INFO {datahub.entrypoints:176} - DataHub CLI version: 0.8.40 at '
'/tmp/datahub/ingest/venv-dc280a46-0332-4755-a38c-552445dc2860/lib/python3.9/site-packages/datahub/__init__.py\n'
'[2022-07-14 07:54:21,301] INFO {datahub.entrypoints:179} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
'[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-dc280a46-0332-4755-a38c-552445dc2860/bin/python3 on '
This error appeared when running kafka-connect ingestion through the UI.