Leonardo David Treiger Herszenhaut Brettas
08/10/2024, 5:26 PM
Kacper Leƛniara
08/13/2024, 7:58 AM
Matt Glover
08/22/2024, 7:27 AM
Mark Druffel
08/22/2024, 9:31 PM
`to_` methods (i.e. `to_csv`, `to_delta`, etc.) to the ibis.TableDataset? Or perhaps there should be a different ibis Dataset?
Details
I'm trying to pre-process some badly formed CSV files in my pipeline. I know I can use a pandas node separately, but I prefer the ibis API, so I tried to use TableDataset. I have the following data catalog entries:
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection:
    backend: pandas
  load_args:
    sep: ","

preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection:
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
The pipeline code looks like this:
from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess",
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize",
            ),
        ]
    )
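For context, the node functions referenced above aren't shown in the thread; here is a minimal hypothetical sketch of what they might look like with the ibis expression API (the column names are made up):

import ibis.expr.types as ir
from ibis import _  # ibis deferred expression helper


def preprocess_raw(raw: ir.Table) -> ir.Table:
    # Hypothetical cleanup: normalize column names and drop rows with a null key
    # ("id" is a made-up column name).
    return raw.rename("snake_case").filter(_.id.notnull())


def standardize(preprocessed: ir.Table) -> ir.Table:
    # Hypothetical standardization: cast a column and drop duplicate rows
    # ("amount" is a made-up column name).
    return preprocessed.mutate(amount=_.amount.cast("float64")).distinct()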
I jump into an ipython session with `kedro ipython` and run `catalog.load("preprocessed")`, and get the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which is coming from Ibis. After looking at the backend setup, I see `database` isn't a valid argument.
I removed `database` and reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
Then I tried removing `table_name` as well and got the obvious error that I need a `table_name` or a `filepath`: `DatasetError: Must provide at least one of filepath or table_name.` No doubt!
preprocessed:
  type: ibis.TableDataset
  connection:
    backend: pandas
Then I tried adding a `filepath` and got the error `DatasetError: Must provide table_name for materialization.`, which I can see in TableDataset's `_write` method.
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection:
    backend: pandas
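For reference, a minimal vanilla-ibis sketch (assuming an ibis version that still ships the pandas backend) of why the `database` key is rejected: the pandas backend connects to a dictionary of in-memory DataFrames rather than a database file.

import ibis
import pandas as pd

# The pandas backend's do_connect() takes a mapping of table names to
# DataFrames; there is no `database` argument, hence the TypeError above.
df = pd.DataFrame({"a": [1, 2, 3]})
con = ibis.pandas.connect({"raw": df})
t = con.table("raw")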
Bruk Tafesse
08/27/2024, 11:20 AM
predictions:
  type: pandas.GBQTableDataset
  dataset: ...
  table_name: table_name
  project: ....
  save_args:
    if_exists: replace
Is there a way to configure the `table_name` when creating a pipeline job using the Vertex AI SDK? I am using compiled pipelines, btw. Thanks!
Lukas Innig
08/29/2024, 9:11 PM
Vishal Pandey
09/05/2024, 11:49 AM
time="2024-09-05T11:37:29.010Z" level=info msg="capturing logs" argo=true
cp: cannot stat '/home/kedro/data/*': No such file or directory
time="2024-09-05T11:37:30.011Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
@Artur Dobrogowski Can you help?
Vishal Pandey
09/10/2024, 3:34 PM
Mark Druffel
09/13/2024, 6:44 PM
Invalid Input Error: Could not set option "schema" as a global option.
bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  connection:
    backend: duckdb
    database: data.duckdb
    schema: bronze
I can reproduce this error with vanilla ibis:
con = ibis.duckdb.connect(database="data.duckdb", schema = "bronze")
Found a related question on ibis' GitHub; it sounds like duckdb can't set the schema globally, so it has to be done in the table functions. Wondering if this would require a change to ibis.TableDataset, and if so, would this pattern work the same with other backends?
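A minimal sketch of that per-table approach with vanilla ibis (hypothetical, and the exact argument name depends on the ibis version):

import ibis

# No global schema on the connection; pass it per table instead.
con = ibis.duckdb.connect(database="data.duckdb")
# Recent ibis versions route the duckdb schema through the `database` argument
# of .table() (older versions exposed a `schema=` argument instead).
x = con.table("x", database="bronze")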
Deepyaman Datta
09/16/2024, 12:53 PM
`pandera.io.deserialize_schema` under the hood in its schema resolver, and that seems to be only implemented in pandera for pandas, is that right?
Vishal Pandey
09/18/2024, 4:59 PM
LĂ­via Pimentel
09/19/2024, 3:30 PM
Vishal Pandey
09/25/2024, 8:47 AM
volume:
  # Storage class - use null (or no value) to use the default storage
  # class deployed on the Kubernetes cluster
  storageclass: # default
  # The size of the volume that is created. Applicable for some storage
  # classes
  size: 1Gi
  # Access mode of the volume used to exchange data. ReadWriteMany is
  # preferred, but it is not supported on some environments (like GKE)
  # Default value: ReadWriteOnce
  #access_modes: [ReadWriteMany]
  # Flag indicating if the data-volume-init step (copying raw data to the
  # fresh volume) should be skipped
  skip_init: False
  # Allows to specify user executing pipelines within containers
  # Default: root user (to avoid issues with volumes in GKE)
  owner: 0
  # Flag indicating if volume for inter-node data exchange should be
  # kept after the pipeline is deleted
  keep: False
2.
# Optional section to allow mounting additional volumes (such as EmptyDir)
# to specific nodes
extra_volumes:
  tensorflow_step:
  - mount_path: /dev/shm
    volume:
      name: shared_memory
      empty_dir:
        cls: V1EmptyDirVolumeSource
        params:
          medium: Memory
Vishal Pandey
09/26/2024, 8:07 AM
`--env`, `--nodes`, `--pipelines`, which we pass using the `kedro run` command. So for any given deployment-related plugin, like airflow or kubeflow, how can we supply these arguments?
George p
10/03/2024, 11:53 PM
Alexandre Ouellet
10/15/2024, 5:17 PM
Thiago José Moser Poletto
10/17/2024, 5:25 PM
Mark Druffel
10/18/2024, 7:38 PM
raw_tracks:
  type: ibis.TableDataset
  table_name: raw_tracks
  connection:
    backend: pyspark
    database: comms_media_dev.dart_extensions
def load(self) -> ir.Table:
    return self.connection.table(self._table_name)
I think updating `load()` seems fairly simple; something like the code below works. But was the initial intent that we could pass a catalog / database through the config here? If yes on the latter, I think perhaps I'm not using the Spark config properly, or Databricks is doing something strange... I posted a question about that here for context.
def load(self) -> ir.Table:
    return self.connection.table(name=self._table_name, database=self._database)
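For comparison, a sketch of what that call would do with vanilla ibis on the pyspark backend (assumes a Databricks-style catalog.schema string, as in the config above, and a recent ibis version):

import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
con = ibis.pyspark.connect(spark)
# On Databricks, "comms_media_dev.dart_extensions" is a catalog.schema pair;
# recent ibis versions accept it through the `database` argument of .table().
tracks = con.table("raw_tracks", database="comms_media_dev.dart_extensions")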
Thabo Mphuthi
11/20/2024, 5:49 AM
Nok Lam Chan
11/27/2024, 6:35 AM
Himanshu Sharma
12/12/2024, 10:16 AM
Failed to execute command group with error Container `0341a555koec4794bb36cf074f0386h-execution-wrapper` failed with status code `1` and it was not possible to extract the structured error Container `0341a555koec4794bb36cf074f0386h-execution-wrapper` exited with code 1 due to error None and we couldn't read the error due to GetErrorFromContainerFailed { last_stderr: Some("exec /mnt/azureml/cr/j/0341a555koec4794bb36cf074f0386h/cap/lifecycler/wd/execution-wrapper: no such file or directory\n") }.
Pipeline screenshot from Azure ML:
Guillaume Tauzin
02/10/2025, 4:45 PM
Philipp Dahlke
02/13/2025, 11:03 AM
`kedro_mlflow.io.artifacts.MlflowArtifactDataset`: I followed the instructions for building the container from the kedro-docker repo, but when running, those artifacts want to access my local Windows path instead of the container's path. Do you guys know what additional settings I have to make? All my settings are pretty much vanilla. The `mlflow_tracking_uri` is set to null.
"{dataset}.team_lexicon":
type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
dataset:
type: pandas.ParquetDataset
filepath: data/03_primary/{dataset}/team_lexicon.pq
metadata:
kedro-viz:
layer: primary
preview_args:
nrows: 5
Traceback (most recent call last):
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'
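One way to check what the container actually resolves is the sketch below (a hypothetical debugging snippet, not from the thread; replace the experiment name with yours). A Windows-style artifact location here would be consistent with the `/C:` error.

import mlflow

# Run this inside the container to see which tracking and artifact paths
# are in play for the experiment.
print("tracking URI:", mlflow.get_tracking_uri())
client = mlflow.MlflowClient()
exp = client.get_experiment_by_name("my_experiment")  # hypothetical name
if exp is not None:
    print("artifact location:", exp.artifact_location)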
Bibo Bobo
02/16/2025, 12:18 PM
`log_table` method in kedro-mlflow. So I wonder what would be the right way to log additional data from a node, something that is not yet supported by the plugin?
Right now I just do something like this at the end of the node function:
mlflow.log_table(data_for_table, output_filename)
But I am concerned, as I am not sure it will always work and will always log the data to the correct run, because I was not able to retrieve the active run id from inside the node with `mlflow.active_run()` (it returns `None` all the time).
I need this because I want to use the Evaluation tab in the UI to manually compare some outputs of different runs.
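For reference, a minimal sketch of the pattern described above (hypothetical node and artifact names; this is just the documented `mlflow.log_table(data, artifact_file)` call rather than a kedro-mlflow feature):

import mlflow
import pandas as pd


def evaluate_predictions(predictions: pd.DataFrame) -> pd.DataFrame:
    # log_table() stores the table as a JSON artifact on whichever run is
    # active when the node executes; like other fluent logging calls it may
    # start a new run if none is active, which is what makes the
    # "correct run" concern above relevant.
    mlflow.log_table(data=predictions, artifact_file="eval/predictions.json")
    return predictions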
Yifan
02/20/2025, 2:33 PM
kedro-mlflow 0.14.3 specific to Python 3.9. It seems that a fix is already merged in the repo. When will the fix be released? Thanks!
Ian Whalen
02/25/2025, 3:38 PM
Juan Luis
02/25/2025, 4:58 PM
Juan Luis
03/11/2025, 4:43 PM
kedro-azureml 0.9.0 and kedro-vertexai 0.12.0, with support for the most recent Kedro and Python versions. You can thank GetInData for it!
Merel
03/26/2025, 10:39 AM
0.19.12 and the changes we did to the databricks starter (https://github.com/kedro-org/kedro-starters/pull/267) might have broken the resource creation for the kedro-databricks plugin @Jens Peder Meldgaard. When I do `kedro databricks bundle`, the resources folder gets created, but it's empty. (cc: @Sajid Alam)
Merel
03/27/2025, 8:31 AM
kedro-databricks works, and I was wondering whether it makes sense to use any of the other runners (`ThreadRunner` or `ParallelRunner`)? As far as I understand, for every node we use these run parameters: `--nodes name, --conf-source self.remote_conf_dir, --env self.env`. Would it make sense to allow for adding the runner type too? Or if you want parallel running, should you use the Databricks cluster setup for that? I'm not very familiar with all the run options in Databricks, so I'm trying to figure out where to use Kedro features and where Databricks. (cc: @Rashida Kanchwala)