# questions

    Afaque Ahmad

    06/12/2023, 8:14 AM
Hi folks, I'm working with reference to this feature on kedro-plugins. How should I set up my local development environment? I cannot find a `requirements.txt` file.

    Abhishek Bhatia

    06/12/2023, 1:06 PM
Hi Team, I am developing a Kedro pipeline in which I pass around a `MemoryDataSet` between nodes. By default, Kedro deep-copies the memory dataset, which leads to loss of information, so I created a catalog entry with `copy_mode` set to `assign`. This solves our basic problem of objects being retained as-is, but it messes up the DAG order displayed in Kedro-Viz. Any solutions?
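
For reference, a minimal sketch of the `copy_mode` setting being described, expressed in Python rather than YAML (the dataset name is hypothetical):

```python
from kedro.io import DataCatalog, MemoryDataSet

# copy_mode="assign" passes the object by reference instead of deep-copying
# it between nodes, which is what preserves the information mentioned above.
catalog = DataCatalog({"fitted_model": MemoryDataSet(copy_mode="assign")})
```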

    Jose Nuñez

    06/12/2023, 3:32 PM
Hello fellow Kedroids 🤖! I'm having a very strange issue when saving a file to parquet. I'm getting this error:

```
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}). Duplicate column names found: ['timestamp', 'lims_BFIL CO3', 'lims_BFIL Ca %', ...]
```

It's basically showing all the columns inside the dataframe (here I'm showing only 3 of them). My catalog entry looks like this:

```yaml
data_sql:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/data_sql.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
  layer: III
```

I'm using kedro==0.18.8, pandas==2.0.1, pyarrow==12.0.0. The problem is quite similar to this issue from 2022: https://github.com/kedro-org/kedro/discussions/1286, but in my case removing the load and save args as the OP mentions doesn't solve the problem. This is quite puzzling, since I did a `df.to_clipboard()` inside the node before returning my output, opened it in a Jupyter notebook, and saw no problems with the dataframe; I can even save it to parquet without any issues. So that makes me think the problem comes from Kedro(?). Anyway, as a workaround I'm saving the dataframe as CSV and it's working just fine, but I'd like to find a way to make parquet work again, since this is a huge file. Thanks in advance 🦜!
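
One way to narrow this down, sketched under the assumption that pyarrow (not pandas) is rejecting non-unique column names at write time:

```python
import pandas as pd

# Hypothetical helper: call this inside the node before returning the
# dataframe to see exactly which column names collide.
def report_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    dupes = df.columns[df.columns.duplicated()].tolist()
    if dupes:
        raise ValueError(f"Non-unique column names: {sorted(set(dupes))}")
    return df
```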

    Trevor

    06/12/2023, 4:52 PM
Is it possible to import an already packaged Kedro pipeline in a separate script and assign node return values to new variables for use later in the script? I've been trying to get people on our team on board with Kedro, and a couple of us would be really interested in being able to use the `MemoryDataSet` returned by nodes as pieces of larger scripts. Up until now, I've only needed to import `main` and that has worked for our purposes so far.
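
A rough sketch of one way to do this with a Kedro session rather than `main`; `session.run()` returns the pipeline's free (non-persisted) outputs as a dict, though the exact bootstrap call depends on whether the project is packaged (paths and dataset names below are hypothetical):

```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# For a packaged project you would call configure_project(package_name)
# instead of bootstrap_project.
bootstrap_project("/path/to/project")
with KedroSession.create(project_path="/path/to/project") as session:
    outputs = session.run()  # dict of MemoryDataSet outputs keyed by name

trained_model = outputs["trained_model"]  # hypothetical dataset name
```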

    Jared T

    06/12/2023, 4:54 PM
Hi all, I am having an issue defining a pipeline using namespaces in multiple modular pipelines. I am following the structure of the spaceflights tutorial and I am getting this error:

```
ValueError: Duplicate keys found in <project repo>/conf/base/parameters/prepare.yml and:
- <project repo>/conf/base/parameters/ingest.yml: train_pipeline
```

I have the `train_pipeline` namespace in both the ingest and prepare modular sub-pipelines; here are the respective YAMLs:

```yaml
    # The following is a list of parameters for ingest pipeline for each namespace (train, inference)
    
    
    # Parameters for train namespace
    train_pipeline:
      ingestion_options:
        #Portfolio to use
        portfolio_name: has_meds_portfolio.HasMedsPortfolio
        # Feature store sub-pipes, only one for now.
        feature_store_subpipe_name: BasicFeaturePipeline
        # Expected output columns
        expected_columns:
          datetime: datetime64[ns]
          patient_id: int64
          age_days: int64
          Male: int64
          binary_smoking_status: object
          overall_censorship_time: datetime64[ns]
          months_until_overall_censorship: int64
          death_date: datetime64[ns]
    
    # Parameters for inference namespace
    # currently same as train but this will change
# first updated to Nightly Portfolio then to
    # an api call to the valuation queue.
    inference_pipeline:
      ingestion_options:
        #Portfolio to use
        portfolio_name: has_meds_portfolio.HasMedsPortfolio
        # Feature store sub-pipes, only one for now.
        feature_store_subpipe_name: BasicFeaturePipeline
        # Expected output columns
        expected_columns:
          datetime: datetime64[ns]
          patient_id: int64
          age_days: int64
          Male: int64
          binary_smoking_status: object
          overall_censorship_time: datetime64[ns]
          months_until_overall_censorship: int64
          death_date: datetime64[ns]
```

```yaml
    # all parameters for prepare pipeline are in train_pipeline namespace
    train_pipeline:
      preparation_options:
        # target params
        target_death_buffer_months: 2
        
        # split params 
        splitter: TimeSeriesSplit
        holdout_size: 0.3
```

Am I not allowed to use the same namespace in multiple modular pipelines?

    CHIRAG WADHWA

    06/13/2023, 4:34 AM
Hi all, I have recently come across this error: `kedro-datasets 1.4.0 does not provide the extra 'pickle.pickledataset'`. Does kedro-datasets not support pickle datasets? Context: I'm removing `kedro.extras` datasets from our asset codebase and using kedro-datasets instead.

    Abhishek Bhatia

    06/13/2023, 10:21 AM
Hi Team, I have a basic doubt about using `PartitionedDataSet`. In the pipeline below, I have a node which returns a dictionary whose values are pandas dataframes, so I define a `PartitionedDataSet` catalog entry for it. If I run the nodes only up to this node, the files do get saved in the correct location but the output is an empty dictionary. If I add an identity node, the correct key-value pairs are returned. Is this the desired behaviour?
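
This may be related to the documented lazy-loading behaviour: a downstream node receives a dict mapping partition id to a load callable, not the saved data itself. A minimal sketch of consuming one:

```python
import pandas as pd

# partitions maps partition id -> a zero-argument callable that loads it;
# calling each one materializes the underlying dataframe.
def concat_partitions(partitions: dict) -> pd.DataFrame:
    return pd.concat(load() for load in partitions.values())
```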

    Jose Nuñez

    06/13/2023, 1:39 PM
Hi Kedroids 🦜🤖! I updated my Kedro-Viz to the latest version, but now I'm unable to preview datasets as in the previous version... I got used to that feature 😄! Is there any way to get that back? I was checking the settings but there is nothing there either. Thanks in advance!

    Jeremi DeBlois-Beaucage

    06/13/2023, 4:32 PM
Hi team, did anyone use Kedro in a multi-GPU training setup? I would love to ask a few questions on how best to set up the repo. We are using Databricks and MLflow, and are trying to assess whether Kedro can handle multi-GPU training in a straightforward way. Thanks!

    Andreas_Kokolantonakis

    06/14/2023, 12:19 PM
Hello everyone, I am using Kedro-Docker and I am running into an issue where Docker cannot find the globals I am specifying for my environments, e.g. I want to run `kedro run --env=dev` from Docker and I am getting `ValueError: Failed to format pattern '${s3_root_path}': no config value found, no default provided`. What's the best way to fix it? Thank you in advance!

    Rafał Nowak

    06/14/2023, 4:49 PM
Hello, I am using Kedro with DVC for data version control. DVC is based on `gto`, which depends on `semver >= 3`. Unfortunately I cannot install `kedro-viz`, since `kedro-viz 6.3.0` depends on `semver < 3`. Is there any reason why `kedro-viz` is limited to `semver < 3`? The current `semver` release is `3.0.1`. Could anyone from the kedro-viz team relax this dependency limitation?

    Alexandre Ouellet

    06/14/2023, 7:07 PM
I believe I have found a bug when running the same pipeline with different parameters. For instance, I have the following pipeline: function X -> versioned dataset -> function Y. If I start this pipeline twice and the 2nd pipeline's X node finishes earlier, I don't get the expected dataset.

    Khangjrakpam Arjun

    06/15/2023, 12:08 PM
Hi team, I am trying to save a plotly figure to HTML for reporting purposes. Is there a way to save a plotly figure as an HTML plot in the kedro catalog? I tried using the following class:

```yaml
type: kedro.extras.datasets.pandas.HTMLDataSet
```

On using the above class I am getting this error:

```
kedro.io.core.DataSetError: An exception occurred when parsing config for DataSet 'boxplot_figures_cfa':
Class 'kedro.extras.datasets.pandas.HTMLDataSet' not found or one of its dependencies has not been installed.
```

Does this class even exist?
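
A hedged workaround sketch: `pandas.HTMLDataSet` does not appear to exist in `kedro.extras`, but a node can render the figure to an HTML string itself, and the catalog entry can then be a plain `text.TextDataSet` pointed at a `.html` filepath:

```python
import plotly.graph_objects as go

# Hypothetical node: turn the figure into a self-contained HTML string;
# register its output as text.TextDataSet in the catalog.
def figure_to_html(fig: go.Figure) -> str:
    return fig.to_html(include_plotlyjs="cdn")
```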

    Javier del Villar

    06/15/2023, 6:51 PM
Hi all! I was trying the collaborative experiment tracking feature https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1686144020499809 Is it possible that "Notes" are not being shared? I should be seeing a note a coworker left me; I can see everything else.

    Georgi Iliev

    06/16/2023, 7:56 AM
Hi team! I need advice on using ONNX files and uploading them to S3 automatically using "only" the catalog definition. Broadly speaking, the main flow of what we're trying to build is the following: 1. There is a process that trains and creates some files (PCA, scaler, some K-Means models, etc.) and saves them as Pickle to use them between different nodes. 2. Once the main pipeline is done, we're ready to distribute the model to our services. 3. We're using ONNX because our services are not built in Python and the ONNX libraries we use are a bit faster. 4. So taking this into account, we now have a publish pipeline that takes these Pickle files, converts them to ONNX using `convert_sklearn`, and then uploads them to S3. So, my main question here is: is there a way to implement this so the transformation and the S3 upload are done automatically? • I know that we can specify an S3 path in the catalog, but I didn't see how to set the `.onnx` file type.
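
One possible shape for this, sketched loosely on the custom-dataset pattern from the Kedro docs (the class and paths are hypothetical); with an `s3://` filepath in the catalog, fsspec would handle the upload:

```python
from pathlib import PurePosixPath

import fsspec
import onnx
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path

class ONNXDataSet(AbstractDataSet):
    """Loads/saves an onnx.ModelProto via fsspec, so filepath can be s3://..."""

    def __init__(self, filepath: str):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(protocol)

    def _load(self) -> onnx.ModelProto:
        with self._fs.open(get_filepath_str(self._filepath, self._protocol), "rb") as f:
            return onnx.load(f)

    def _save(self, model: onnx.ModelProto) -> None:
        with self._fs.open(get_filepath_str(self._filepath, self._protocol), "wb") as f:
            f.write(model.SerializeToString())

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath), "protocol": self._protocol}
```

The node that calls `convert_sklearn` would then just return the ONNX model, and a catalog entry pointing this dataset at `s3://...` would take care of the rest.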

    Khangjrakpam Arjun

    06/16/2023, 8:23 AM
Hi Team, is there a way to save a plot as a PDF/PNG/JPEG in the kedro catalog? I tried using the `kedro.extras.datasets.matplotlib.MatplotlibWriter` class to save a figure object as a .png file in the kedro catalog and I got the below error:

```
'Figure' object has no attribute 'save'
```

Is there a way to use the `savefig` method instead of the `save` method to save a figure object in the kedro catalog?
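
If the figure is a matplotlib one, no `save`/`savefig` switch should be needed: as far as I know, `MatplotlibWriter` calls the figure's own `savefig()` internally, so the error above suggests something other than a matplotlib `Figure` (e.g. a plotly figure) is being passed. A minimal sketch:

```python
import matplotlib.pyplot as plt

# Hypothetical node: return a matplotlib Figure and register the output as
# kedro.extras.datasets.matplotlib.MatplotlibWriter with a .png filepath.
def make_boxplot(df) -> plt.Figure:
    fig, ax = plt.subplots()
    df.plot.box(ax=ax)
    return fig
```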

    Camilo López

    06/16/2023, 12:18 PM
Hi Team, I'm deploying Kedro with Databricks Workflows. We have a way to break down each node of the Kedro pipeline into a task of a Databricks Workflows job. The issue is that each task takes ~10 seconds to create the Kedro session, which generates a lot of overhead for the pipeline. Is there a way to create the Kedro session faster, or a recommendation for avoiding these 10 additional seconds per node?

    Guilherme Parreira

    06/16/2023, 12:28 PM
Hi guys. I am using Kedro with Python `3.10.6` (photo attached). For `auto-sklearn` I will need to downgrade my Kedro project to Python `3.9`. I already installed Python `3.9.16` with `pyenv`. What would be my next steps? (Do I need to change the Python version in the `Pipfile` to `3.9` and individually change the kernel of the notebook?) If I change the kernel version of my notebook manually, it is not recognized as being part of the project (second photo attached). Thanks in advance!

    Vici

    06/16/2023, 1:08 PM
Hi everybody! I'm currently analyzing a large number of signals (N≈100), where the source data is organized as a partitioned data set. I want to make a plotly plot of each of these signals, such that I can explore the plots nicely, e.g. in `kedro viz`. I saved the plots as follows:

```yaml
plots:
  type: PartitionedDataSet
  path: data/08_reporting/plots
  dataset:
    type: plotly.JSONDataSet
  filename_suffix: '.json'
```

Saving all the plots worked just fine (and I was able to load and show individual JSONs via `fig = plotly.io.read_json(file); fig.show()`). But it turns out that when you save plots in bulk like this, they cannot be displayed in kedro viz. Is there a way to allow accessing bulk-saved plots from kedro-viz (e.g., clicking the partitioned dataset in kedro viz, then having the option to select a specific plot), without forcing me to literally have a hundred JSONDataSets cluttering kedro viz? Thank you so much 😊 Edit: I'm also open to other (non-kedronic) ideas regarding the exploration of a large bulk of plotly plots.
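
On the non-kedronic side, one sketch (assuming each saved figure holds a single trace) is to merge everything into one figure with a dropdown, so a single `plotly.JSONDataSet` stays previewable:

```python
import plotly.graph_objects as go

def combine_with_dropdown(figures: dict[str, go.Figure]) -> go.Figure:
    combined = go.Figure()
    names = list(figures)
    for i, name in enumerate(names):
        for trace in figures[name].data:
            trace.visible = i == 0  # show only the first signal initially
            combined.add_trace(trace)
    # one dropdown button per source figure, toggling trace visibility
    buttons = [
        dict(label=name, method="update",
             args=[{"visible": [j == i for j in range(len(names))]}])
        for i, name in enumerate(names)
    ]
    combined.update_layout(updatemenus=[dict(buttons=buttons)])
    return combined
```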

    Sebastian Cardona Lozano

    06/16/2023, 2:14 PM
Hi Kedroids. In my pipeline, I have this logic for 2 nodes: node 1 reads a table and executes a data process only on new items that are not in the table; node 2 executes transformations on those new items and appends them to the same table. I'm getting this error:

```
CircularDependencyError: Circular dependencies exist among these items: [node1 ...., node2]
```

Yes, the output of node 2 is an input for node 1. My goal is to not process all the items every time I run the pipeline, but only the new items not in that table. How can I do this? Thanks!! 🙂
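
One common way to break this kind of cycle is to register the same file under two catalog entries, one that node 1 reads and one that node 2 writes; the DAG stays acyclic while state round-trips via disk. A hedged sketch with hypothetical names:

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

# Two entries, same file: "items" is node 1's input and "items_updated"
# is node 2's output. Kedro treats them as distinct datasets.
catalog = DataCatalog({
    "items": ParquetDataSet(filepath="data/02_intermediate/items.parquet"),
    "items_updated": ParquetDataSet(filepath="data/02_intermediate/items.parquet"),
})
```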

    Nok Lam Chan

    06/17/2023, 10:33 AM
Hi, I wonder if anyone has experience using Kedro with Prefect 2.0? How different is it from Prefect 1?

    Abhishek Bhatia

    06/17/2023, 1:15 PM
Hi Team, is there a way to have multiple nested partitions in `PartitionedDataSet`? It seems Kedro assumes the keys to be flat strings, so neither specifying tuples as keys nor a nested dictionary specification works.

    Abhishek Bhatia

    06/19/2023, 7:46 AM
Hi folks! I have a `PartitionedDataSet` laid out like this:

```
    scenario_x/
    ├── iter_1/
    │   ├── run_1.csv
    │   ├── run_2.csv
    │   └── run_3.csv
    └── iter_2/
        ├── run_1.csv
        ├── run_2.csv
        └── run_3.csv
    scenario_y/
    ├── iter_1/
    │   ├── run_1.csv
    │   ├── run_2.csv
    │   └── run_3.csv
    └── iter_2/
        ├── run_1.csv
        ├── run_2.csv
        └── run_3.csv
```

The catalog entry is like this:

```yaml
    _partitioned_csvs: &_partitioned_csvs
      type: PartitionedDataSet
      dataset:
        type: pandas.CSVDataSet
        load_args:
          index_col: 0
        save_args:
          index: true
      overwrite: true
      filename_suffix: ".csv"
    
    _partitioned_jsons: &_partitioned_jsons
      type: PartitionedDataSet
      dataset:
        type: json.JSONDataSet
      filename_suffix: ".json"
    
    my_csv_part_ds:
      path: data/07_model_output/my_csv_part_ds
      <<: *_partitioned_csvs
    
    my_json_part_ds:
      path: data/07_model_output/my_json_part_ds
      <<: *_partitioned_jsons
```

When I run the pipeline, the CSV partitioned dataset gets deleted first and then the new one gets written, but the JSON partitioned dataset remains and new ones get added. I need custom behaviour wherein the 2nd level of the partition gets overwritten, not the first-level partition; i.e. in the node which produces the partitioned CSV, the return value is like this:

```python
def node_that_generates_part_ds(scenario, **kwargs):
    res = {"scenario_x/iter_1/run_1": df1, "scenario_x/iter_1/run_2": df2}  # ... and so on
    return res
```

so when the returned `res` keys contain scenario_x, scenario_y should NOT get deleted. Can anyone guide me on how I can achieve this? Thanks! 🙂
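
One hedged option is a small `PartitionedDataSet` subclass that deletes only the first-level prefixes it is about to rewrite instead of wiping the whole path. It leans on private attributes (`_filesystem`, `_normalized_path`, `_overwrite`), so treat it strictly as a sketch:

```python
from kedro.io import PartitionedDataSet

class PrefixOverwritePartitionedDataSet(PartitionedDataSet):
    def _save(self, data: dict) -> None:
        # Delete only the scenarios present in `data` (e.g. "scenario_x"),
        # leaving sibling scenarios untouched.
        if self._overwrite:
            for prefix in {key.split("/", 1)[0] for key in data}:
                path = f"{self._normalized_path}/{prefix}"
                if self._filesystem.exists(path):
                    self._filesystem.rm(path, recursive=True)
        # Temporarily disable the parent's own overwrite-everything step.
        saved = self._overwrite
        self._overwrite = False
        try:
            super()._save(data)
        finally:
            self._overwrite = saved
```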

    marrrcin

    06/19/2023, 7:49 AM
[Custom starters] Is there a way to make sure that some of the prompts from a starter's cookiecutter prompts will be actual `bool`? We experience an issue where all values from the interactive prompts are cast to `str`, which is really inconvenient for `true`/`false` values, because they enforce syntax such as `{%- if cookiecutter.my_flag != "False" %}`.

    Juan Luis

    06/20/2023, 10:42 AM
Just helped a colleague get Kedro-Viz working. Two observations: • Kedro-Viz launched, but `127.0.0.1` was not working; I suspect it's because they were using an SSH connection to a Linux machine on AWS. `localhost` worked perfectly. Any reason to use the IP directly? (user was on Windows) • Their pipelines were huuuuuuge. He asked me about a way to group sub-pipelines visually, but I'm not versed enough. Is there any way to do it?
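
On the grouping question, the usual kedronic answer is namespaces: wrapping a sub-pipeline in a namespace makes Kedro-Viz render it as one collapsible box. A sketch with hypothetical names:

```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline

def make_table(companies, shuttles):  # hypothetical processing step
    return {"rows": len(companies) + len(shuttles)}

data_processing = Pipeline(
    [node(make_table, ["companies", "shuttles"], "model_input_table")]
)

# inputs/outputs listed here keep their global dataset names instead of
# being prefixed by the namespace.
grouped = modular_pipeline(
    data_processing,
    namespace="data_processing",
    inputs={"companies", "shuttles"},
    outputs={"model_input_table"},
)
```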

    Pranav Khurana

    06/20/2023, 11:32 AM
Hi folks, I'm trying to create a custom Kedro dataset (inherited from `AbstractVersionedDataSet`). I have to write a few tests similar to the existing ones for `CSVDataSet`; however, a few of my tests are failing. I need some advice on this, and I'm happy to hop on a call to discuss the details.

    Kevin Mills

    06/20/2023, 7:32 PM
Hi all. I am new to using Kedro. I went through the spaceflights tutorial and other parts of the documentation. Is there a tutorial on how to use the API, by chance?

    Idris Benkhelil

    06/21/2023, 6:02 AM
Hello, thank you for this great library. I am a DS working in France. I have a question: I want to make my pipeline dynamic, i.e.:

```
[etape 1] > [etape 2] > [if score_etape2 <  X] > [etape4]
                        [if score_etape2 >= X] > [etape5]
```

Do you have any indication of how I can do this, or an example of code already implemented? Thanks in advance. Idris
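
A Kedro DAG is static, so conditional routing is usually done inside a node rather than between nodes. A minimal sketch (all names hypothetical):

```python
def etape4():  # placeholder for the real step
    return "etape4 result"

def etape5():  # placeholder for the real step
    return "etape5 result"

def etape3_router(score_etape2: float, x: float):
    # The DAG stays static; the branch happens inside this single node.
    return etape4() if score_etape2 < x else etape5()
```

Hooks, or separate pipelines selected at run time (`kedro run --pipeline=...`), are other common patterns.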

    Marc Gris

    06/21/2023, 7:10 AM
Hi everyone, I was experimenting with `@singledispatchmethod` from the `functools` library to refactor my code and create "per-model-type" implementations of `fit()`, `predict()`, etc. Unfortunately, this results in a `ValueError: Invalid Node definition: first argument must be a function, not 'singledispatchmethod'.` And indeed, in kedro/pipeline/node.py:72:

```python
if not callable(func):
    raise ValueError(
        _node_error_message(
            f"first argument must be a function, not '{type(func).__name__}'."
        )
    )
```

Is this "rejection" of functools.singledispatchmethod an unintended collateral of this check (in which case I could make a pull request to handle it), or are there some things "down the line" that would justify not allowing the use of functools & co? 🙂 Thx
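
If helpful as a stopgap: a module-level `functools.singledispatch` function is a plain callable, so it passes the `callable(func)` check quoted above, unlike the descriptor object that `singledispatchmethod` leaves on a class. A sketch with a hypothetical model type:

```python
from functools import singledispatch

from sklearn.linear_model import LinearRegression

@singledispatch
def fit(model, data):
    raise NotImplementedError(f"No fit() registered for {type(model).__name__}")

@fit.register
def _(model: LinearRegression, data):
    # per-model-type implementation, dispatched on the first argument;
    # assumes `data` is a dataframe with a "y" target column
    return model.fit(data.drop(columns="y"), data["y"])
```

`node(fit, ...)` would then be accepted as a regular function.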

    Marc Gris

    06/21/2023, 9:31 AM
CONFIG CONSOLIDATION / INTERPOLATION: in conf/config.yml I have `random_state: 42`, and in conf/model_training.yml `random_state: ${random_state}`.

```
kedro run
>>> [...]
TypeError: Cannot cast scalar from dtype('<U15') to dtype('int64') according to the rule 'safe'
```

If I get this correctly, the consolidation/interpolation process resulted in `random_state` being assigned the value `"42"` instead of `42`. Granted, I could easily circumvent this issue with `int(params['random_state'])`, but I'm curious and would like to know whether this is expected behavior, and whether there is a more robust/elegant way of handling it. Thx in advance, M
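
For what it's worth: if the templating mechanism substitutes text, the int becomes a string, whereas OmegaConf-style interpolation (used by `OmegaConfigLoader` in recent 0.18.x releases) resolves to the referenced value's type. A small sketch of the latter, assuming that loader is an option here:

```python
from omegaconf import OmegaConf

conf = OmegaConf.create(
    {"random_state": 42, "model_training": {"random_state": "${random_state}"}}
)
# Interpolation resolves to the int 42, not the string "42".
assert conf.model_training.random_state == 42
assert isinstance(conf.model_training.random_state, int)
```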