boundless-student-48844
09/08/2022, 3:36 PM
The :metadata-ingestion:lint task failed due to lint errors when running the mypy command. There are 72 errors, listed in thread. A suggestion - do you think the lint check could be enforced on PRs to metadata-ingestion for better QA? 😅
mypy src/ tests/ examples/
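A minimal sketch of what such a PR gate could look like, assuming it runs as a CI step and simply fails the build when mypy reports errors (the script name and wiring are illustrative, not part of the repo):

# check_lint.py - illustrative PR gate: fail the build if mypy reports errors
import subprocess
import sys

# Same invocation as above; adjust the paths if the project layout differs.
result = subprocess.run(["mypy", "src/", "tests/", "examples/"])

# A non-zero exit code from mypy fails the CI job, blocking the PR.
sys.exit(result.returncode)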
clean-tomato-22549
09/09/2022, 5:33 AM
jolly-library-86177
09/09/2022, 8:56 AM
silly-finland-62382
09/09/2022, 9:14 AM
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/nishchay.agarwal@meesho.com/services_classification.csv")
df.write.mode("overwrite").saveAsTable("new_p")
When I run this command via a Databricks cluster, the pipeline is created successfully with the name given in the cluster spark conf spark.datahub.databricks.cluster (shell_dbx), but
when I run a delta table command, I get this error:
22/09/09 09:06:56 ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.Project is not supported yet. Please contact datahub team for further support.
Also, I am not able to see the schema of the dataset that I built using spark-lineage, and both the upstream & downstream tables show as the same per the screenshot (that's not expected).
Also, can you help me with how to enable Delta catalog support on Databricks? It's not working there.
fresh-cricket-75926
09/09/2022, 10:26 AM
rich-battery-25772
09/09/2022, 11:05 AM
pub struct DeltaTableLoadOptions {
    // ...
    /// Indicates whether DeltaTable should track files.
    /// This defaults to `true`
    ///
    /// Some append-only applications might have no need of tracking any files.
    /// Hence, DeltaTable will be loaded with significant memory reduction.
    pub require_files: bool,
}
The main problem is that the flag can't be set from the Python deltalake library (the library would need to be changed to expose it).
A related question is how we could calculate the number of files in an alternative way.
• DataHub's code (use of the DeltaTable class):
https://github.com/datahub-project/datahub/blob/083ab9bc0e7b9d8ba293afcf9fae4ffb71c4f86c/metadata-ingestion/src/datahub/ingestion/source/delta_lake/delta_lake_utils.py#L24
• Deltalake’s python library:
- DeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/deltalake/table.py#L72
- RawDeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/src/lib.rs#L78
• Deltalake’s rust library:
- DeltaTableBuilder class (require_files is in the options: DeltaTableLoadOptions field): https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/rust/src/builder.rs#L116
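One possible alternative for the file-count question, sketched below, is to count files through the files() listing that the Python deltalake DeltaTable already exposes (table path is a placeholder); note this still loads the table's add actions, so it does not give the memory savings that require_files=False targets:

# Illustrative only: count the files tracked by a Delta table via the
# Python deltalake bindings. The path below is a placeholder.
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/path/to/table")

# files() returns the list of data-file paths from the table's add actions,
# so the number of files is simply its length.
file_count = len(dt.files())
print(file_count)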
witty-butcher-82399
09/09/2022, 2:06 PM
There is a process_commit function in pipeline.py. It checks whether there are errors or not, and depending on that and the commit policy, it will commit the checkpoint or not.
https://github.com/datahub-project/datahub/blob/23b929ea10daded7447f806f8860447626[…]e573a6/metadata-ingestion/src/datahub/ingestion/run/pipeline.py
However, I don't see such a behaviour with the ingestion events themselves, which means the ingestion pipeline could be publishing some events via the Sink while not committing the checkpoint.
In my opinion, the publishing policy in the Sink should be aligned with the committing policy. WDYT?
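As an illustration of the asymmetry being described, a simplified, hypothetical sketch of the commit side (names do not match the real pipeline.py): the checkpoint is committed only when the error count and the commit policy allow it, while the Sink may already have published events for the same run.

# Hypothetical, simplified sketch of the behaviour described above.
from enum import Enum

class CommitPolicy(Enum):
    ALWAYS = "always"
    ON_NO_ERRORS = "on_no_errors"

def process_commit(num_errors: int, policy: CommitPolicy) -> bool:
    """Return True if the checkpoint should be committed."""
    if policy is CommitPolicy.ALWAYS:
        return True
    # With ON_NO_ERRORS, any error means the checkpoint is NOT committed,
    # even though the Sink may already have emitted events for this run.
    return num_errors == 0

# Example: one error + ON_NO_ERRORS -> checkpoint skipped, but events
# published earlier by the Sink are not rolled back.
print(process_commit(1, CommitPolicy.ON_NO_ERRORS))  # False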
busy-glass-61431
09/12/2022, 5:11 AM
creamy-controller-55842
09/12/2022, 8:22 AM
many-hairdresser-79517
09/12/2022, 10:03 AM
famous-florist-7218
09/12/2022, 10:59 AM
McpEmitter: REST Emitter Configuration is missing. Any thoughts?
22/09/12 17:54:35 ERROR DatahubSparkListener: Application end event received, but start event missing for appId local-1662980072825
Spark version: v3.1.1
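The "REST Emitter Configuration is missing" message suggests the listener could not find its REST emitter settings (typically spark.datahub.rest.server). A minimal sketch of how the DataHub Spark listener is usually wired up, with placeholder package version and server URL; on Databricks the same keys are normally set in the cluster's Spark config rather than in code:

# Illustrative only - placeholder values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datahub-lineage-example")
    # Package version is a placeholder.
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
    # Register the DataHub listener so lineage events are captured.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Without this, the listener has no REST emitter configuration.
    .config("spark.datahub.rest.server", "http://<datahub-gms-host>:8080")
    .getOrCreate()
)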
important-answer-79732
09/12/2022, 11:04 AM
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '7f529d57-21f5-4d39-a8e8-2b92580692ab',
'infos': ['2022-09-12 10:22:14.801662 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-12 10:22:14.855554 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Caught exception EXECUTING '
'task_id=7f529d57-21f5-4d39-a8e8-2b92580692ab, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
' validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
' File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
' File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
'debug_mode\n'
' extra fields not permitted (type=value_error.extra)\n']}
Execution finished with errors.
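The root cause in the trace is pydantic rejecting an unexpected debug_mode field. A minimal reproduction of that behaviour with a simplified stand-in model (field and model here are illustrative, not the executor's real definition):

# Simplified stand-in; not the executor's real SubProcessIngestionTaskArgs.
from pydantic import BaseModel, Extra, ValidationError

class TaskArgs(BaseModel):
    recipe: str = ""

    class Config:
        extra = Extra.forbid  # unknown keys are rejected outright

try:
    # Passing a key the model does not declare reproduces the failure mode.
    TaskArgs.parse_obj({"recipe": "...", "debug_mode": "true"})
except ValidationError as e:
    print(e)  # debug_mode: extra fields not permitted (type=value_error.extra)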
chilly-scientist-91160
09/12/2022, 11:34 AM
busy-glass-61431
09/12/2022, 11:40 AM
silly-finland-62382
09/12/2022, 5:18 PM
bland-sundown-49496
09/12/2022, 10:49 PM
stocky-truck-96371
09/13/2022, 7:45 AM
great-branch-515
09/13/2022, 9:15 AM
(pymysql.err.OperationalError) (3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')
(Background on this error at: <http://sqlalche.me/e/13/e3q8>) due to
'(3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')'.
Any idea?
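The server is enforcing TLS (require_secure_transport=ON), so the client has to connect over SSL. A minimal sketch of what that looks like at the SQLAlchemy/pymysql level, with placeholder host, credentials, and CA path; how these map into a DataHub recipe's options is an assumption here (typically via connect_args):

# Illustrative only - placeholder credentials and CA path.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@mysql-host:3306/db",
    # pymysql only opens a TLS connection when ssl args are provided;
    # without them the server rejects the login with error 3159.
    connect_args={"ssl": {"ca": "/path/to/ca.pem"}},
)

with engine.connect() as conn:
    print("connected over TLS")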
better-orange-49102
09/13/2022, 2:21 PM
brave-pencil-21289
09/13/2022, 2:23 PM
gentle-camera-33498
09/13/2022, 2:33 PM
cool-actor-73767
09/13/2022, 7:19 PM
rhythmic-sundown-12093
09/13/2022, 6:13 AM
source:
  type: "dbt"
  config:
    # Coordinates
    # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
    manifest_path: "${DBT_PROJECT_ROOT}/target/manifest.json"
    catalog_path: "${DBT_PROJECT_ROOT}/target/catalog.json"
    test_results_path: "${DBT_PROJECT_ROOT}/target/run_results.json" # optional for recording dbt test results after running dbt test
    # Options
    target_platform: "redshift" # e.g. bigquery/postgres/etc.
sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
many-hairdresser-79517
09/13/2022, 4:15 AM
polite-art-12182
09/14/2022, 5:38 AM
"retries exceeded with url: /nifi-api/access/token (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] "
Any help resolving this without having to re-configure NiFi certs would be appreciated.
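The verification failure usually means the NiFi certificate (or its CA) is not in the trust store the client uses. One generic way around it without touching NiFi, sketched below, is to point the requests-based client at a CA bundle that contains that certificate; whether the nifi source honours REQUESTS_CA_BUNDLE in a given setup is an assumption here, and the URL, credentials, and paths are placeholders:

# Illustrative only - placeholder URL, credentials, and CA bundle path.
import os
import requests

# requests honours REQUESTS_CA_BUNDLE for certificate verification,
# so exporting it before running ingestion is one option...
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/nifi-ca-bundle.pem"

# ...or verification can be pointed at the bundle explicitly per request.
resp = requests.post(
    "https://nifi-host:8443/nifi-api/access/token",
    data={"username": "user", "password": "password"},
    verify="/path/to/nifi-ca-bundle.pem",
)
print(resp.status_code)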
blue-boots-43993
09/14/2022, 5:39 AM
bumpy-journalist-41369
09/14/2022, 7:14 AM
bland-orange-13353
09/14/2022, 7:30 AM
microscopic-mechanic-13766
09/14/2022, 8:23 AM
source:
  type: kafka
  config:
    platform_instance: <platform_instance>
    connection:
      consumer_config:
        security.protocol: SASL_PLAINTEXT
        sasl.username: <user>
        sasl.mechanism: PLAIN
        sasl.password: <password>
      bootstrap: 'broker1:9092'
      schema_registry_url: '<http://schema-registry:8081>'
When I ran it, I got the following error:
File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines\n'
' line_bytes = await ingest_process.stdout.readline()\n'
' File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
' raise ValueError(e.args[0])\n'
'ValueError: Separator is not found, and chunk exceed the limit\n']}
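That ValueError comes from asyncio's StreamReader: a single line of subprocess output exceeded the reader's buffer limit before a newline was found. A minimal reproduction of the mechanism (the tiny limit here is only for illustration):

# Minimal illustration of the asyncio behaviour behind the error above.
import asyncio

async def main():
    # Stand-in for the executor's stdout reader, with a deliberately tiny limit.
    reader = asyncio.StreamReader(limit=64)
    # Feed more bytes than the limit without any newline separator.
    reader.feed_data(b"x" * 1024)
    try:
        await reader.readline()
    except ValueError as e:
        print(e)  # Separator is not found, and chunk exceed the limit

asyncio.run(main())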
Note that the recipe worked in previous versions (the current version is v0.8.44).
Thanks in advance!
thankful-vr-12699
09/14/2022, 8:48 AM