# questions
  • noam (05/04/2023, 2:55 PM)
    Hi Kedro users, does anyone know how one would implement versioning for a PartitionedDataSet? In other words, does anyone have a convenient solution for enabling versioning (i.e. setting "versioned: True" in the data catalog) for a PartitionedDataSet the same way one can for a PickleDataSet? The following is my code in conf/data/local/catalog.yml:
    ```yaml
    validation_data:
      type: kedro.io.PartitionedDataSet
      path: data/03_primary/validation_data/
      dataset: pickle.PickleDataSet
      filename_suffix: ".df"
      versioned: True
    ```
    Thanks in advance!
  • marrrcin (05/05/2023, 8:45 AM)
    [Kedro Starters] I'm wondering whether custom Kedro starters (especially from plugins) should have the same behaviour as official ones w.r.t. tags. Kedro starters work with `kedro new --starter=spaceflights`, but when we developed our own starter for Kedro Snowflake, it additionally requires the `--checkout=` flag, because of the default mechanism in Kedro:
    ```
    Error: Kedro project template not found at git+https://github.com/getindata/kedro-snowflake. Specified tag 0.18.8. The following tags are available: 0.0.1, 0.1.0, 0.1.1.
    ```
    Is there a way (or if not, I think there should be) for the plugin to specify the default tag to use? The versioning of Kedro should not affect the versioning of custom starters/plugins 🤔
    👀 2  👍 1  👍🏼 1
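    Until starters can declare a default tag, the workaround implied by the error message is to pin one of the starter's own tags via the flag it mentions; the starter alias below is an assumption about how the plugin registers itself:
    ```
    kedro new --starter=kedro-snowflake --checkout=0.1.1
    ```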
  • noam (05/05/2023, 12:01 PM)
    Thank you @Antony Milne and @Deepyaman Datta for responding to my question about data versioning with a PartitionedDataSet (one cannot use the `versioned: True` argument in the data catalog for this kind of dataset). Perhaps it is better that I explain the root issue/challenge, in case there are solutions I am missing.
    The Problem: By default, Kedro overwrites data objects with each run, using the paths set in the data catalog.
    The Question: What is a convenient solution/tech stack for enabling the execution of multiple parallel ML experiments in my Kedro pipeline, while maintaining that…
    1. Each experiment triggers the data to be versioned effectively. Ideally…
       a. When there are changes to the data, the data is copied and assigned a unique ID (sha, md5, timestamp), perhaps with metadata regarding the parameters that were used to generate the data. In this case, it is important that the data is stored in a sensible, organized manner.
       b. When there have been no changes to the data, the same unique ID (and metadata) are used and can be extracted.
    2. The unique IDs (and metadata) for each dataset relevant to the ML run can be extracted and stored alongside the (presumably lighter) results of the experiment.
    3. Given 1. and 2. above, the results are reproducible (they offer point-in-time correctness).
    The solutions I have thus far come across are problematic:
    • Writing a class to set dynamic dataset filepaths
      ◦ The main issue with this approach is that it is incredibly high-maintenance. It requires continuous, careful attention to the parameters used to define the dynamic filepaths.
        ▪ For example, if I set the filepath to `training_data_{a}_{b}` using parameters `a` and `b`, and I then change parameter `c`, changing the composition of the data, the new dataset will overwrite the previous one. If I had wanted to keep them both, I would have had to remember to update the filepath parameters to include `c`. Of course, with many different data-defining parameters, this becomes problematic rather quickly.
    • Use Kedro versioning: the `versioned: True` argument in the catalog underneath the datasets you want versioned
      ◦ The first issue with this approach is that it appears to version all of the data with every new run, presenting a massive storage issue and the need for a custom retention policy to clear useless/outdated data.
      ◦ The second issue is that this doesn't work with PartitionedDataSet datasets.
    Are there any effective solutions I am missing?
    ๐Ÿ‘ 1
    ๐Ÿ‘€ 1
    1000 1
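    One lightweight way to get the "unique ID per parameter set" behaviour sketched in 1a/1b, without hand-maintaining filepath templates, is to fingerprint the entire parameters dictionary, so a change to any parameter (including the forgotten `c`) produces a new ID automatically. A minimal sketch; the function name and the way the ID is spliced into a filepath are illustrative:
    ```python
    import hashlib
    import json

    def params_fingerprint(params: dict) -> str:
        """Stable short ID derived from *all* parameters."""
        blob = json.dumps(params, sort_keys=True, default=str).encode()
        return hashlib.md5(blob).hexdigest()[:8]

    # e.g. params_fingerprint({"a": 1, "b": 2, "c": 3}) -> an 8-char hex ID,
    # which could key a partition or folder: data/03_primary/training_{id}/
    ```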
  • Ofir (05/05/2023, 2:22 PM)
    Does Kedro work with DVC or any Data Version Control solution?
  • Ofir (05/05/2023, 3:04 PM)
    Does Kedro allow dumping experiment outputs (results) into different folders without changing the YAML files? We would like to run the same code (pipeline) but with different input datasets. What's the best way to achieve that without having to copy and paste the Kedro folders and then manually modify the YAML files? I looked at the documentation of `kedro run`, but it doesn't accept output dirs / a workspace as a parameter; it assumes you have the configuration files already in place. Perhaps I need to come up with my own `kedro` wrapper? (to auto-generate the configuration files and then call `kedro run`)
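    One way to do this without copying the project, for what it's worth: Kedro configuration environments let a second `conf/` environment override only the catalog entries that differ, selected at run time. The dataset and environment names below are illustrative:
    ```yaml
    # conf/experiment_b/catalog.yml -- overrides entries from conf/base/catalog.yml
    model_input:
      type: pandas.CSVDataSet
      filepath: data/01_raw/inputs_b.csv

    results:
      type: pickle.PickleDataSet
      filepath: data/07_model_output/experiment_b/results.pkl
    ```
    `kedro run --env=experiment_b` then uses these overrides while everything else still comes from `conf/base`.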
  • Ofir (05/05/2023, 3:11 PM)
    It's like the concept of "virtually forking/cloning" an existing experiment, with slight changes in the I/O.
  • Ofir (05/05/2023, 3:16 PM)
    Please correct me if I'm wrong, but it looks like Kedro's implementation has slightly overlooked the input dataset as a differentiating factor for an experiment. That is, Kedro doesn't consider a different input dataset to be a different session/experiment run.
  • Adrien (05/05/2023, 3:38 PM)
    Has anyone succeeded in deploying a Kedro pipeline on OVH?
  • Jose Nuñez (05/05/2023, 7:51 PM)
    Hi team, I have always had this question but never asked it before. I have a very short & simple pipeline (see the picture), and I have noticed that the Run Command of every function starts with `None.`; for instance, in the picture you can read `kedro run --to-nodes=None.clean_mdt`. Why is this? My pipeline executes just fine without any issues after a regular `kedro run`. If I do a `kedro run --to-nodes=None.clean_mdt` I'll get an error, so I manually need to erase the `None.` before running. Running `kedro run --to-nodes=clean_mdt` instead works just fine.
  • Rob (05/06/2023, 3:59 AM)
    Hi everyone, why are nodes in a namespace collapsed by default when deploying `kedro viz` with host `'0.0.0.0'`? I want to show them expanded by default. Here's an example of the behavior I'm seeing: https://brawlstars-retention-pipeline-6u27jcczha-uw.a.run.app/, and this doesn't happen with localhost. Can someone suggest a way to modify the code used in this thread to show the nodes expanded? Thanks 🙂
  • Dawid Bugajny (05/08/2023, 10:34 AM)
    Hello! I would like to ask if there is any way to create multiple pipelines in a single directory. I want to have two pipelines with the same nodes but different outputs. If I create a function that is not called `create_pipeline` but, for example, `create_pipeline_2` in the same directory, it will not be found by the `find_pipelines()` function (kedro.framework.project). I have seen modular pipelines, but I would still have to create a new directory for the pipelines (correct me if I am wrong).
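    Autodiscovery via `find_pipelines()` indeed only picks up functions named `create_pipeline`, but it can be combined with manual registration. A sketch of `pipeline_registry.py`, assuming a second factory `create_pipeline_2` in the same pipeline package (module and pipeline names are illustrative):
    ```python
    from kedro.framework.project import find_pipelines

    from my_package.pipelines.my_pipe import create_pipeline_2  # hypothetical module

    def register_pipelines() -> dict:
        pipelines = find_pipelines()                  # all create_pipeline() factories
        pipelines["my_pipe_2"] = create_pipeline_2()  # register the extra one by hand
        pipelines["__default__"] = sum(pipelines.values())
        return pipelines
    ```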
  • Andreas_Kokolantonakis (05/09/2023, 2:51 PM)
    Hi everyone, I am facing the following issue when trying to read a CSV with Spark. With pandas it works fine, but with Spark it seems I need some extra configuration. Could you please point me in the right direction? Thank you in advance!
  • Juan Luis (05/09/2023, 3:38 PM)
    Would anybody with a Windows machine be willing to lend me a hand with this? https://github.com/kedro-org/kedro/pull/2568#issuecomment-1539949117 There seems to be a test using `python -m build` that is failing, but I can't reproduce it locally.
  • Elena Mironova (05/09/2023, 3:43 PM)
    Hi team, a question on `kedro micropkg pull` and sdists here. Would anyone know how best to deal with the multiple egg-info error in the screenshot? Context: with `kedro==0.18.3` on a Mac, I used `python -m build --sdist path/to/package` to create the tar.gz inside our Git repo (I know an alternative is available to create the sdist through `kedro micropkg package`, but I can't do it from inside a kedro project). I see the archive, but when doing `kedro micropkg pull` (from the kedro project root), the following error comes up. This may or may not be related to the existing issue.
  • Javier del Villar (05/10/2023, 1:45 PM)
    Hello everybody, I have a problem: I'm unable to load a CSV folder from local storage using `SparkDataSet`. More details in thread.
  • Richard Bownes (05/10/2023, 2:46 PM)
    If I wanted to get the total run time for each node in a pipeline, what's the best method?
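    A common approach is a pair of node hooks that record wall-clock time per node; a minimal sketch (register an instance in `settings.py` under `HOOKS`):
    ```python
    import time

    from kedro.framework.hooks import hook_impl

    class NodeTimerHooks:
        """Log the run time of every node via before/after node hooks."""

        def __init__(self):
            self._starts = {}

        @hook_impl
        def before_node_run(self, node):
            self._starts[node.name] = time.perf_counter()

        @hook_impl
        def after_node_run(self, node):
            elapsed = time.perf_counter() - self._starts.pop(node.name)
            print(f"{node.name}: {elapsed:.2f}s")
    ```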
  • Mate Scharnitzky (05/10/2023, 3:33 PM)
    Hi everyone, we're in the process of relaxing our Kedro dependencies in our repository. The current Kedro version is `==0.18.3`. When we relax it to `~=0.18.3`, pip installs `0.18.8` while compiling, and we get the below error for some of our pipelines:
    ```
    KeyError: 'logging'
    ```
    Can you provide some pointers on what could be the reason behind this? `0.18.*` should have no breaking changes based on the RELEASE notes, so I'm not sure what could explain this. Thanks for the help!
  • Panos P (05/10/2023, 5:08 PM)
    Hello Kedro folks, I have a question on dynamic generation of nodes via parameters in my catalog. I have this parameters.yml file:
    ```yaml
    params:
     - p1
     - p2

    p1: value1
    p2: value2
    ```
    I want to create a pipeline with two nodes, where each node takes one of these params as input, e.g.:
    ```python
    nodes = [node(func, f"params:{p}", f"output_{p}") for p in params]
    pipeline(nodes)
    ```
    Is that possible, and how?
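    The comprehension is nearly right (note the closing parenthesis added above); the catch is that parameters are not injected into `create_pipeline`, so the list has to be loaded from config at pipeline-construction time. A sketch assuming kedro 0.18's `ConfigLoader` (`func` stands in for your own node function):
    ```python
    from kedro.config import ConfigLoader
    from kedro.pipeline import Pipeline, node

    from .nodes import func  # hypothetical node function

    def create_pipeline(**kwargs) -> Pipeline:
        # Parameters are not passed into create_pipeline, so read them directly.
        params = ConfigLoader(conf_source="conf").get("parameters*")["params"]
        return Pipeline(
            [
                node(func, inputs=f"params:{p}", outputs=f"output_{p}", name=f"make_{p}")
                for p in params
            ]
        )
    ```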
  • Brandon Meek (05/10/2023, 11:58 PM)
    Hey everyone, I have a modular pipeline that I'm running several times, the only difference being a small parameter change, but the dataset is really large. The issue I'm running into is that Kedro runs the first node of each modular pipeline first, causing an OOM issue. Is there a way to make it finish running one modular pipeline before beginning the next?
  • Melvin Kok (05/11/2023, 1:21 AM)
    Hi team, is there a way to access the Kedro catalog at some arbitrary point in a run? Context: I'm using hooks to run Great Expectations and need access to the Kedro context and catalog during an Action in Great Expectations. However, I can't pass the catalog from the hook into the Action, because I'm not calling the Action's `run` function directly (Great Expectations is doing it), and the config only allows passing in serializable objects.
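    One workaround is to capture the catalog in an `after_catalog_created` hook and expose it from a module that the Great Expectations Action can import, rather than serialize; a minimal sketch (module and dataset names are illustrative):
    ```python
    from kedro.framework.hooks import hook_impl

    class CatalogStore:
        """Stash the catalog when Kedro creates it, for later lookup."""

        catalog = None

        @hook_impl
        def after_catalog_created(self, catalog):
            CatalogStore.catalog = catalog

    # later, e.g. inside the GE Action (import it, don't pass it through config):
    # from my_package.hooks import CatalogStore
    # df = CatalogStore.catalog.load("some_dataset")
    ```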
  • Artur Dobrogowski (05/11/2023, 1:35 PM)
    Hello, where can I find a list of the available data catalog types? I'm looking for something that parses YAML, but I can't find it in the docs.
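    The available types are listed in the kedro-datasets (formerly `kedro.extras.datasets`) API docs; for parsing a YAML file specifically there is `yaml.YAMLDataSet`, e.g. (entry name and path are illustrative):
    ```yaml
    my_config:
      type: yaml.YAMLDataSet
      filepath: data/01_raw/config.yml
    ```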
  • Erwin (05/11/2023, 1:38 PM)
    Hi team! What are the options to orchestrate jobs using a Docker image in the Azure cloud? I have a Docker image with a Kedro project. I'm looking for options to run this on a daily basis, with observability, alerts, retries, etc. I worked with Data Factory in the past, but I'm not sure whether there is an activity to execute containers from the container registry.
  • Giuseppe Ughi (05/11/2023, 3:34 PM)
    Hi team! I'm currently trying to create a dynamic catalog in Kedro with Jinja2, but I'm struggling to import the list on which to build the catalog entries. I'm new to Jinja, so the solution might be trivial. For reproducibility's sake, I'm basing the following examples on the jinja-example 🙂. When considering a hard-coded list in the file conf/base/catalog.yml as follows:
    ```yaml
    {% for region in ['parasubicular', 'parainsular'] %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {{ region }}.data_right_output:
        type: pandas.CSVDataSet
        filepath: data/03_primary/{{ region }}_output.csv

    {% endfor %}
    ```
    everything works fine. However, I need to iterate over a list that is not practical to hard-code, so I was hoping to have something like the following:
    ```yaml
    regions:
      - 'parasubicular'
      - 'parainsular'

    {% for region in regions %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {{ region }}.data_right_output:
        type: pandas.CSVDataSet
        filepath: data/03_primary/{{ region }}_output.csv

    {% endfor %}
    ```
    but no matter where I define the `regions` list (I tried defining it in different `.yml` files), I stumble on the same error, screenshotted below. Do you by chance know whether I have to save the Jinja pattern in a different file, whether there is a specific place where I have to save the list that I want to read, or whether I have to change the parsing somehow? Thank you in advance!!
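    For what it's worth, the second snippet mixes a YAML key (`regions:`) into the Jinja template, but the template is rendered before the YAML is parsed, so a YAML-defined list is never visible to `{% for %}`. Defining the list as a Jinja variable inside the same template should work:
    ```yaml
    {% set regions = ['parasubicular', 'parainsular'] %}
    {% for region in regions %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {% endfor %}
    ```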
  • Toni - TomTom - Madrid (05/12/2023, 7:52 AM)
    Good morning! Has anyone tried to run Databricks Mosaic with Kedro? Running it locally is challenging: https://stackoverflow.com/questions/72289894/how-to-run-mosaic-locally-outside-databricks This solution did not work, but maybe that's because I am not tuning it properly in Kedro :S. Any help is appreciated! Thanks
    📢 1
  • Nitin Soni (05/12/2023, 12:23 PM)
    Hi, I'm Nitin Soni. I tried to create conditional pipelines; how can I do this? Please suggest something, as I'm totally new to Kedro.
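    Kedro has no built-in conditional nodes; a common pattern is to register the alternative flows as separate pipelines and pick one at run time with `kedro run --pipeline <name>`. A sketch (module and pipeline names are illustrative):
    ```python
    # pipeline_registry.py
    from my_package.pipelines import base, extra  # hypothetical pipeline packages

    def register_pipelines() -> dict:
        base_pipe = base.create_pipeline()
        extra_pipe = extra.create_pipeline()
        return {
            "with_extra": base_pipe + extra_pipe,  # run when the condition holds
            "without_extra": base_pipe,           # run otherwise
            "__default__": base_pipe,
        }
    ```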
  • Ofir (05/14/2023, 8:43 AM)
    A colleague of mine is trying to join Kedro's Slack, but it seems like Kedro's Slack has exceeded its member capacity. Can anyone help with that? Moderators? Admins?
  • Chengjun Jin (05/14/2023, 8:55 PM)
    Hi there, a question regarding Kedro + Great Expectations. I am following the example at https://docs.kedro.org/en/stable/hooks/examples.html#add-data-validation (V3). This is my checkpoint yaml:
    ```yaml
    name: pm_stat_checkpoint
    config_version: 1.0
    template_name:
    module_name: great_expectations.checkpoint
    class_name: Checkpoint
    run_name_template:
    expectation_suite_name:
    batch_request: {}
    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: store_evaluation_params
        action:
          class_name: StoreEvaluationParametersAction
      - name: update_data_docs
        action:
          class_name: UpdateDataDocsAction
          site_names: []
    evaluation_parameters: {}
    runtime_configuration: {}
    validations:
      - batch_request:
          datasource_name: default_pandas_datasource
          data_asset_name: my_runtime_asset_name
          data_connector_name: default_runtime_data_connector_name
        expectation_suite_name: pm_expectation_suite
    profilers: []
    ge_cloud_id:
    expectation_suite_ge_cloud_id:
    ```
    This is the output I get:
    ```
    INFO - Loading data from 'pm_sales_raw' (ParquetDataSet)...
    INFO - FileDataContext loading fluent config
    INFO - Loading 'datasources' ->
    [{'name': 'default_pandas_datasource', 'type': 'pandas'}]
    INFO - Loaded 'datasources' ->
    []
    INFO - Of 1 entries, no 'datasources' could be loaded
    ...
    ...
    DatasourceError: Cannot initialize datasource default_pandas_datasource, error: The given datasource
    could not be retrieved from the DataContext; please confirm that your configuration is accurate.
    ```
    It seems that there is a problem in loading the data source. Did I miss some steps? Thank you
  • Mate Scharnitzky (05/15/2023, 12:03 PM)
    `kedro-datasets` dependencies: Hi team, where do you define dependencies for the `kedro-datasets` package? We ran into some pip resolver issues, and it turned out that `kedro-datasets==1.0.0` and above require kedro to be `kedro~=0.18.4`. We can verify this with `pip install kedro-datasets==0.0.7 --dry-run`, but we can't find where this dependency is actually defined; in `setup.py` it's not mentioned. Thank you! @Kasper Janehag
  • Andrew Doherty (05/15/2023, 2:27 PM)
    Hi all, I am developing a namespace pipeline and have run into an issue when using Neptune to track my experiments. When I pass `"neptune_run"` as an input to a pipeline node, I get the following error:
    ```
    ValueError: Pipeline input(s) {'NAMESPACE.neptune_run'} not found in the DataCatalog
    ```
    where `"NAMESPACE"` is my namespace pipeline name. Is there a way to use Neptune along with namespace pipelines? Thanks again.
    👀 1
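    The usual cause of this error: the modular `pipeline()` wrapper prefixes every input with the namespace unless it is declared as an external input. Listing `neptune_run` under `inputs` keeps the catalog name un-namespaced; a sketch (`base_pipeline` is an illustrative name for the wrapped pipeline):
    ```python
    from kedro.pipeline import pipeline

    namespaced = pipeline(
        base_pipeline,
        namespace="NAMESPACE",
        inputs={"neptune_run"},  # resolves to "neptune_run", not "NAMESPACE.neptune_run"
    )
    ```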
  • Amanda Locatelli (05/15/2023, 2:52 PM)
    Hey everyone, I have to enforce a timeout argument for my Postgres connection in a Kedro project. I have included it in catalog.yml, but I keep receiving this error:
    ```
    lib/python3.7/site-packages/kedro/io/core.py", line 191, in load
        raise DataSetError(message) from exc
    kedro.io.core.DataSetError: Failed while loading data from data set SparkPostgresJDBCDataSet(load_args={'properties': {'connectTimeout': 300, 'driver': org.postgresql.Driver}}, option_args=True, save_args={'properties': {'driver': org.postgresql.Driver}}, table= , url=jdbc:postgresql:).
    An error occurred while calling o49135.setProperty. Trace:
    py4j.Py4JException: Method setProperty([class java.lang.String, class java.lang.Integer]) does not exist
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
            at py4j.Gateway.invoke(Gateway.java:274)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
            at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
            at java.lang.Thread.run(Thread.java:750)
    ```
    Does anyone know what I am doing wrong and how to fix it? Or which files am I supposed to update with this timeout argument?
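    The Py4J trace hints at the cause: `setProperty(String, Integer) does not exist`, i.e. JDBC connection properties must be strings. Quoting the value in catalog.yml, so YAML does not parse it as an integer, may be all that is needed; a sketch of the reported entry (the dataset type path and URL placeholders are assumptions based on the error message):
    ```yaml
    my_table:
      type: my_project.datasets.SparkPostgresJDBCDataSet
      table: my_table
      url: jdbc:postgresql://<host>/<db>
      load_args:
        properties:
          connectTimeout: "300"   # quoted -> java.lang.String, not java.lang.Integer
          driver: org.postgresql.Driver
    ```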