Leonardo David Treiger Herszenhaut Brettas
08/10/2024, 5:26 PM
Kacper Leƛniara
08/13/2024, 7:58 AM
Matt Glover
08/22/2024, 7:27 AM
Mark Druffel
08/22/2024, 9:31 PM
`to_` methods (i.e. `to_csv`, `to_delta`, etc.) to the ibis.TableDataset? Or perhaps there should be a different ibis Dataset?
Details
I'm trying to pre-process some badly formed CSV files in my pipeline. I know I can use a pandas node separately, but I prefer the ibis API, so I tried to use TableDataset. I have the following data catalog entries:
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection:
    backend: pandas
  load_args:
    sep: ","

preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection:
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
The pipeline code looks like this:
from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess",
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize",
            ),
        ]
    )
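For context, the node functions referenced above aren't shown in the thread; here is a minimal hypothetical sketch of what they might look like with the ibis expression API (the column names are made up):

import ibis.expr.types as ir
from ibis import _  # ibis deferred expression helper


def preprocess_raw(raw: ir.Table) -> ir.Table:
    # Hypothetical cleanup: normalize column names and drop rows with a null key
    # ("id" is a made-up column name).
    return raw.rename("snake_case").filter(_.id.notnull())


def standardize(preprocessed: ir.Table) -> ir.Table:
    # Hypothetical standardization: cast a column and drop duplicate rows
    # ("amount" is a made-up column name).
    return preprocessed.mutate(amount=_.amount.cast("float64")).distinct()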
I jump into an ipython session with `kedro ipython` and run `catalog.load("preprocessed")`, and get the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which is coming from Ibis. After looking at the backend setup, I see `database` isn't a valid argument.
I removed `database` and reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
Then I tried removing `table_name` as well and got the obvious error that I need a `table_name` or a `filepath`: `DatasetError: Must provide at least one of filepath or table_name.` No doubt!
preprocessed:
  type: ibis.TableDataset
  connection:
    backend: pandas
Then I tried adding a `filepath` and got the error `DatasetError: Must provide table_name for materialization.`, which I can see in TableDataset's `_write` method.
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection:
    backend: pandas
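For reference, a minimal vanilla-ibis sketch (assuming an ibis version that still ships the pandas backend) of why the `database` key is rejected: the pandas backend connects to a dictionary of in-memory DataFrames rather than a database file.

import ibis
import pandas as pd

# The pandas backend's do_connect() takes a mapping of table names to
# DataFrames; there is no `database` argument, hence the TypeError above.
df = pd.DataFrame({"a": [1, 2, 3]})
con = ibis.pandas.connect({"raw": df})
t = con.table("raw")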
Bruk Tafesse
08/27/2024, 11:20 AM
predictions:
  type: pandas.GBQTableDataset
  dataset: ...
  table_name: table_name
  project: ....
  save_args:
    if_exists: replace
Is there a way to configure the `table_name` when creating a pipeline job using the Vertex AI SDK? I am using compiled pipelines, btw. Thanks!
Lukas Innig
08/29/2024, 9:11 PM
Vishal Pandey
09/05/2024, 11:49 AM
time="2024-09-05T11:37:29.010Z" level=info msg="capturing logs" argo=true
cp: cannot stat '/home/kedro/data/*': No such file or directory
time="2024-09-05T11:37:30.011Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
@Artur Dobrogowski Can you help?
Vishal Pandey
09/10/2024, 3:34 PM
Mark Druffel
09/13/2024, 6:44 PM
Invalid Input Error: Could not set option "schema" as a global option.
bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  connection:
    backend: duckdb
    database: data.duckdb
    schema: bronze
I can reproduce this error with vanilla ibis:
con = ibis.duckdb.connect(database="data.duckdb", schema = "bronze")
Found a related question on ibis' GitHub; it sounds like duckdb can't set the schema globally, so it has to be done in the table functions. Wondering if this would require a change to ibis.TableDataset, and if so, would this pattern work the same with other backends?
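A minimal sketch of that per-table approach with vanilla ibis (hypothetical, and the exact argument name depends on the ibis version):

import ibis

# No global schema on the connection; pass it per table instead.
con = ibis.duckdb.connect(database="data.duckdb")
# Recent ibis versions route the duckdb schema through the `database` argument
# of .table() (older versions exposed a `schema=` argument instead).
x = con.table("x", database="bronze")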
Deepyaman Datta
09/16/2024, 12:53 PM
`pandera.io.deserialize_schema` under the hood in its schema resolver, and that seems to be only implemented in pandera for pandas, is that right?
Vishal Pandey
09/18/2024, 4:59 PM
LĂ­via Pimentel
09/19/2024, 3:30 PM
Vishal Pandey
09/25/2024, 8:47 AM
volume:
  # Storage class - use null (or no value) to use the default storage
  # class deployed on the Kubernetes cluster
  storageclass: # default
  # The size of the volume that is created. Applicable for some storage
  # classes
  size: 1Gi
  # Access mode of the volume used to exchange data. ReadWriteMany is
  # preferred, but it is not supported on some environments (like GKE)
  # Default value: ReadWriteOnce
  #access_modes: [ReadWriteMany]
  # Flag indicating if the data-volume-init step (copying raw data to the
  # fresh volume) should be skipped
  skip_init: False
  # Allows to specify user executing pipelines within containers
  # Default: root user (to avoid issues with volumes in GKE)
  owner: 0
  # Flag indicating if volume for inter-node data exchange should be
  # kept after the pipeline is deleted
  keep: False
2.
# Optional section to allow mounting additional volumes (such as EmptyDir)
# to specific nodes
extra_volumes:
  tensorflow_step:
  - mount_path: /dev/shm
    volume:
      name: shared_memory
      empty_dir:
        cls: V1EmptyDirVolumeSource
        params:
          medium: Memory
Vishal Pandey
09/26/2024, 8:07 AM
`--env`, `--nodes`, `--pipelines`, which we pass using the `kedro run` command. So for any given deployment-related plugin, like airflow or kubeflow, how can we supply these arguments?
George p
10/03/2024, 11:53 PM
Alexandre Ouellet
10/15/2024, 5:17 PM
Thiago José Moser Poletto
10/17/2024, 5:25 PM
Mark Druffel
10/18/2024, 7:38 PM
raw_tracks:
  type: ibis.TableDataset
  table_name: raw_tracks
  connection:
    backend: pyspark
    database: comms_media_dev.dart_extensions
def load(self) -> ir.Table:
    return self.connection.table(self._table_name)
I think updating `load()` seems fairly simple; something like the code below works. But was the initial intent that we could pass a catalog / database through the config here? If yes on the latter, I think perhaps I'm not using the Spark config properly, or Databricks is doing something strange... I posted a question about that here for context.
def load(self) -> ir.Table:
    return self.connection.table(name=self._table_name, database=self._database)
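For comparison, a sketch of what that call would do with vanilla ibis on the pyspark backend (assumes a Databricks-style catalog.schema string, as in the config above, and a recent ibis version):

import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
con = ibis.pyspark.connect(spark)
# On Databricks, "comms_media_dev.dart_extensions" is a catalog.schema pair;
# recent ibis versions accept it through the `database` argument of .table().
tracks = con.table("raw_tracks", database="comms_media_dev.dart_extensions")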
Thabo Mphuthi
11/20/2024, 5:49 AM
Nok Lam Chan
11/27/2024, 6:35 AM
Himanshu Sharma
12/12/2024, 10:16 AM
Failed to execute command group with error Container `0341a555koec4794bb36cf074f0386h-execution-wrapper` failed with status code `1` and it was not possible to extract the structured error Container `0341a555koec4794bb36cf074f0386h-execution-wrapper` exited with code 1 due to error None and we couldn't read the error due to GetErrorFromContainerFailed { last_stderr: Some("exec /mnt/azureml/cr/j/0341a555koec4794bb36cf074f0386h/cap/lifecycler/wd/execution-wrapper: no such file or directory\n") }.
Pipeline screenshot from Azure ML:
Guillaume Tauzin
02/10/2025, 4:45 PM
Philipp Dahlke
02/13/2025, 11:03 AM
`kedro_mlflow.io.artifacts.MlflowArtifactDataset`: I followed the instructions for building the container from the kedro-docker repo, but when running, those artifacts want to access my local Windows path instead of the container's path. Do you guys know what additional settings I have to make? All my settings are pretty much vanilla. The `mlflow_tracking_uri` is set to null.
"{dataset}.team_lexicon":
type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
dataset:
type: pandas.ParquetDataset
filepath: data/03_primary/{dataset}/team_lexicon.pq
metadata:
kedro-viz:
layer: primary
preview_args:
nrows: 5
Traceback (most recent call last):
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'
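One way to check what the container actually resolves is the sketch below (a hypothetical debugging snippet, not from the thread; replace the experiment name with yours). A Windows-style artifact location here would be consistent with the `/C:` error.

import mlflow

# Run this inside the container to see which tracking and artifact paths
# are in play for the experiment.
print("tracking URI:", mlflow.get_tracking_uri())
client = mlflow.MlflowClient()
exp = client.get_experiment_by_name("my_experiment")  # hypothetical name
if exp is not None:
    print("artifact location:", exp.artifact_location)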
Bibo Bobo
02/16/2025, 12:18 PM
`log_table` method in kedro-mlflow. So I wonder what would be the right way to log additional data from a node, something that is not yet supported by the plugin?
Right now I just do something like this at the end of the node function:
mlflow.log_table(data_for_table, output_filename)
But I am concerned, as I am not sure it will always work and will always log the data to the correct run, because I was not able to retrieve the active run id from inside the node with `mlflow.active_run()` (it returns `None` all the time).
I need this because I want to use the Evaluation tab in the UI to manually compare some outputs of different runs.
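For reference, a minimal sketch of the pattern described above (hypothetical node and artifact names; this is just the documented `mlflow.log_table(data, artifact_file)` call rather than a kedro-mlflow feature):

import mlflow
import pandas as pd


def evaluate_predictions(predictions: pd.DataFrame) -> pd.DataFrame:
    # log_table() stores the table as a JSON artifact on whichever run is
    # active when the node executes; like other fluent logging calls it may
    # start a new run if none is active, which is what makes the
    # "correct run" concern above relevant.
    mlflow.log_table(data=predictions, artifact_file="eval/predictions.json")
    return predictions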
Yifan
02/20/2025, 2:33 PM
kedro-mlflow 0.14.3 specific to Python 3.9. It seems that a fix is already merged in the repo. When will the fix be released? Thanks!
Ian Whalen
02/25/2025, 3:38 PM
Juan Luis
02/25/2025, 4:58 PM
Juan Luis
03/11/2025, 4:43 PM
kedro-azureml 0.9.0 and kedro-vertexai 0.12.0, with support for the most recent Kedro and Python versions. You can thank GetInData for it!
Merel
03/26/2025, 10:39 AM
0.19.12 and the changes we did to the databricks starter (https://github.com/kedro-org/kedro-starters/pull/267) might have broken the resource creation for the kedro-databricks plugin @Jens Peder Meldgaard. When I do `kedro databricks bundle`, the resources folder gets created, but it's empty. (cc: @Sajid Alam)
Merel
03/27/2025, 8:31 AM
kedro-databricks works, and I was wondering whether it makes sense to use any of the other runners (`ThreadRunner` or `ParallelRunner`)? As far as I understand, for every node we use these run parameters: `--nodes name, --conf-source self.remote_conf_dir, --env self.env`. Would it make sense to allow for adding the runner type too? Or if you want parallel running, should you use the Databricks cluster setup for that? I'm not very familiar with all the run options in Databricks, so I'm trying to figure out where to use Kedro features and where Databricks. (cc: @Rashida Kanchwala)