# questions
  • Elior Cohen (11/09/2022, 3:01 PM)
    Question regarding using kedro on managed notebooks. I'm working in an AzureML workspace, which, much like Databricks, has a managed Jupyter instance. I have a working kedro project; by working I mean I can access the data with
    %> kedro ipython
    >>> catalog.load('companies')
    The same kernel is also registered in the managed Jupyter instance. In the first cell I run
    %load_ext kedro.ipython
    %reload_kedro path/to/my/project
    Executing this results in the attached stack trace. What am I doing wrong?
    stacktrace.txt
  • Hervé Lauwerier (11/09/2022, 3:25 PM)
    Hello everyone! Is it possible to use a Jinja-generated catalog in order to generate all the nodes in a pipeline? I haven’t found a way yet to generate every node that would save every table to GCP with an identity function. The catalog contains around 150 tables to back up and I would like to avoid declaring every node by hand.
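    One possible approach, sketched below with made-up table names and a made-up dataset naming convention: build the node list programmatically in create_pipeline from the same list of tables that drives the Jinja template.
    from kedro.pipeline import Pipeline, node


    def identity(df):
        # pass-through function so each node simply re-saves its input dataset
        return df


    def create_pipeline(**kwargs) -> Pipeline:
        # assumption: the same ~150 table names that the Jinja template renders into the catalog
        table_names = ["table_a", "table_b"]
        return Pipeline(
            [
                node(
                    identity,
                    inputs=name,                # source catalog entry
                    outputs=f"{name}_backup",   # GCS-backed catalog entry
                    name=f"backup_{name}",
                )
                for name in table_names
            ]
        )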
  • AVallarino (11/09/2022, 6:27 PM)
    Hello everyone! I posted this question on Discord a few days ago, but I still can’t figure it out. I’m trying the starter Iris project, reading the raw .csv files from S3, but I’m having issues with versions (kedro, boto3, s3fs) using Virtualenv.
  • Zihao Xu (11/09/2022, 9:48 PM)
    Hi team, I am following the tutorial for kedro viz experiment tracking: https://kedro.readthedocs.io/en/stable/tutorial/set_up_experiment_tracking.html#set-up-your-nodes-and-pipelines-to-log-metrics. But I keep getting the error “You don’t have any experiments” within the kedro viz view. Here are a few observations:
    1. Within the spaceflights starter, I do not see the file src/settings.py, so I had to create it myself and paste in the specified content for SQLiteStore.
    2. The tutorial mentions “proceed to set up the tracking datasets or set up your nodes and pipelines to log metrics; these two activities are interchangeable.” But I had to implement both steps to make it work.
    3. After a few kedro runs, I do not see session_store.db appearing within the 09_tracking folder, which could be the reason for my error?
    Any insights would be greatly appreciated! Thanks team!
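    For reference, the settings.py content the tutorial refers to looks roughly like this minimal sketch (the file location and the data-folder path are assumptions based on the 0.18.x starters, where the file usually lives at src/<package_name>/settings.py rather than src/settings.py):
    from pathlib import Path

    from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

    SESSION_STORE_CLASS = SQLiteStore
    # keep the session database under the project's data folder so kedro-viz can pick it up
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}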
  • Amala (11/09/2022, 10:29 PM)
    Hi team, I’m looking for a programmatic way to rerun the failed nodes in a kedro pipeline. The pipeline is triggered by Airflow from the outside, so we were taking the approach of using the kedro-airflow plugin and doing the node/task retry by passing an argument. Looking for ways to do it without the plugin. Any help/pointers? Thanks in advance!
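    One plugin-free sketch, assuming the orchestrator can tell which node failed (the node and pipeline names below are hypothetical): drive the run through KedroSession and resume with from_nodes.
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())
    with KedroSession.create(project_path=Path.cwd()) as session:
        # resume the pipeline from the node that failed in the previous attempt
        session.run(pipeline_name="my_pipeline", from_nodes=["failed_node_name"])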
  • Cyril Verluise (11/10/2022, 1:49 PM)
    Hello there, hope you are doing great! Kedro issue: I'm trying to persist a dict as a YAML file. The related catalog entry is the following:
    optimisation_programme:
      type: yaml.YAMLDataSet
      filepath: data/test/05_model_input/optimisation_programme.yaml
      layer: model_input
    However, it fails with the following error:
    DataSetError: Failed while saving data to data set YAMLDataSet(filepath=/Users/cyril_verluise/Documents/GitHub/ClimaTeX/dist/apps/alhambra/rendered/alhambra/data/test/05_model_input/optimisation_programme.yaml, protocol=file,
    save_args={'default_flow_style': False}).
    'str' object has no attribute '__name__'
    Expected behaviour: from the docs, I understood that the save function is just a wrapper around yaml.dump, which should work with my `optimisation_programme` (a dict). Kedro version: 0.18.3. Any idea?
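    A quick way to narrow this down is to exercise the dataset directly with a plain dict (a sketch; the dict content is made up). If this saves fine, the error likely comes from what the upstream node actually returns, e.g. a string instead of a dict.
    from kedro.extras.datasets.yaml import YAMLDataSet

    ds = YAMLDataSet(filepath="data/test/05_model_input/optimisation_programme.yaml")
    ds.save({"steps": ["a", "b"], "budget": 10})  # a plain dict goes straight through yaml.dump
    print(ds.load())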
  • Zihao Xu (11/10/2022, 2:32 PM)
    Hi team, I have a quick question regarding kedro viz. When I try to re-open kedro viz a second time, after using “^z” to stop the first run, I often receive an error message like the following. Is there anything I should do differently to prevent this from happening (e.g., a different way to terminate)?
    [Errno 48] error while attempting to bind on address ('127.0.0.1', 4141):        server.py:156
                                 address already in use
  • Hervé Lauwerier (11/10/2022, 4:51 PM)
    Hello Team, how do I use a generated catalog with create_pipeline? My catalog is generated by a function and I would like my pipeline to be aware of it, but when I run it, all my datasets are reported as
    not found in the DataCatalog
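    One way to keep the generated catalog and the pipeline in sync, sketched with hypothetical names: register the generated datasets in an after_catalog_created hook driven by the same helper that create_pipeline uses for its inputs and outputs.
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.framework.hooks import hook_impl


    def generate_table_names():
        # hypothetical helper shared between this hook and create_pipeline
        return ["table_a", "table_b"]


    class GeneratedCatalogHooks:
        @hook_impl
        def after_catalog_created(self, catalog):
            # add one dataset per generated name so the pipeline's inputs resolve
            for name in generate_table_names():
                catalog.add(name, CSVDataSet(filepath=f"data/01_raw/{name}.csv"))
    The hook class still needs to be registered in settings.py via the HOOKS tuple.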
  • user (11/11/2022, 9:18 AM)
    Kedro - Getting path to item in the data catalog: I'm training an NLP model using spaCy. I have the preprocessing steps all written as a pipeline, and now I need to do the training. According to spaCy's documentation I need to run the following command: python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy. The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command...
  • Ian Whalen (11/11/2022, 3:19 PM)
    Question on using ParallelRunner: is there a way to supply the number of processes to use from the command line? Couldn’t find anything here.
  • Andrew Stewart (11/11/2022, 4:47 PM)
    Anyone here ever use kedro to run glue jobs? I know someone was talking about a glue runner plugin at one point.
  • Alicja Fras (11/14/2022, 11:05 AM)
    Question on Kedro with Jupyter Notebook. We have pipelines already built and we want to prepare notebooks as a guide for CSTs and other data scientists. Ideally we would like to have a notebook running nodes one by one, showing significant outputs along the way, step by step. What I found in the Kedro resources were rather notebook-to-Kedro-pipeline guides, and what I need is rather the other way around. https://kedro.readthedocs.io/en/stable/tools_integration/ipython.html#use-kedro-with-ipython-and-jupyter Another big limitation is that I can only run the pipeline once if I do not want to restart the notebook kernel, so it is impossible to run the first node, then pause, look at outputs, run the second node, etc. The only option is to run the full pipeline beforehand and only show intermediate outputs afterwards, without the corresponding snippets of code running. All things considered, it seems to me that the best solution here would be to import the nodes and, using access to the catalog, run them one by one as functions, providing all arguments in the notebook. Please let me know if you had a similar challenge and if you found a better solution. 🙂
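    The "import the node functions and drive them via the catalog" route mentioned above can look like this minimal sketch after %load_ext kedro.ipython and %reload_kedro (module, function and dataset names are hypothetical):
    from my_project.pipelines.data_processing.nodes import preprocess_companies  # hypothetical import

    companies = catalog.load("companies")           # inspect the raw input
    preprocessed = preprocess_companies(companies)  # run a single node's function
    preprocessed.head()                             # show the intermediate output in the notebook
    catalog.save("preprocessed_companies", preprocessed)  # optionally persist via the catalog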
  • Sasha Collin (11/15/2022, 6:35 PM)
    Hey 🙂 Is it possible to version a PartitionedDataSet?
  • Zemeio (11/16/2022, 5:11 AM)
    Hello everyone. Does anyone know if there is documentation on how to use Google Cloud Storage files instead of Amazon S3?
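    Since the bundled datasets are fsspec-based, moving from S3 to Google Cloud Storage is usually just a gs:// filepath plus gcsfs-style credentials; a sketch with a made-up bucket and a service-account key (assumes gcsfs is installed):
    from kedro.extras.datasets.pandas import CSVDataSet

    companies = CSVDataSet(
        filepath="gs://my-bucket/01_raw/companies.csv",
        credentials={"token": "conf/local/service-account.json"},  # passed through to gcsfs
    )
    df = companies.load()
    In catalog.yml this is the same entry with a gs:// filepath and a credentials key defined in credentials.yml.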
  • Safouane Chergui (11/16/2022, 10:07 AM)
    Hello everyone, is there a way to force a pipeline to run its nodes in the order they are declared within the pipeline? I know that I can create a dummy output just to force one node to run after another, but I’d like to know if there is a better way to accomplish this. Thanks
  • Sean Westgate (11/16/2022, 10:24 AM)
    Hi Team, is there a way to add the kedro-viz interactive graph to a static website? I found the kedro-static-viz plugin, but it doesn't work with 0.18.3. Looking at the kedro-viz readme, it suggests using a standalone React component with the pipeline.json as a prop. I tried it, but couldn't get it to work - I am not familiar with React. Do you have a simple example using a static website, or do you need a proper React environment? Many thanks
  • Mate Scharnitzky (11/16/2022, 10:33 AM)
    Hi Team, we have developed a kedro project locally and the team we’re supporting has an Amazon EMR Spark cluster environment. We would like to be able to run this kedro project on EMR, but we’re struggling to create a virtual environment, install requirements into that env, etc. given the security protocols. Do you have any recommendations/pointers on what steps we need to take to be able to run this kedro project on EMR? I didn’t find a supporting document in the official documentation.
  • Safouane Chergui (11/16/2022, 3:37 PM)
    Hello, I’d like to know how kedro processes the regex specified in globals_pattern. I’d like to do something like the snippet below, but it doesn’t work:
    TemplatedConfigLoader(
                conf_paths,
                globals_pattern="(globals*)|(another_parameters_file*)",
                globals_dict={"param1": "pandas.CSVDataSet"}
            )
    How can I accomplish this?
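    As far as I can tell from the 0.18 docs, globals_pattern is a glob-style pattern rather than a regex, so | alternation will not match anything; one workaround is a file-naming convention that a single glob covers, sketched here with hypothetical file names:
    from kedro.config import TemplatedConfigLoader

    # matches e.g. globals.yml and globals_extra.yml in the conf environments
    config_loader = TemplatedConfigLoader(
        "conf",
        globals_pattern="globals*",
        globals_dict={"param1": "pandas.CSVDataSet"},
    )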
  • Filip Panovski (11/17/2022, 10:03 AM)
    Hello! Is the yaml Loader part of the ConfigLoader configurable in any meaningful way? Or does kedro implement its own yaml parsing mechanism? We're trying to use some custom filtering that gets passed to the kedro.extras.datasets.dask.ParquetDataSet load_args. Specifically, we want to be able to do something like:
    # catalog.yml
    raw_data:
      type: dask.ParquetDataSet
      filepath: 's3://...'
      load_args:
        filters:
          - !!python/tuple ['year', '=', '2022']
          - !!python/tuple ['day', '=', '3']
          - !!python/tuple ['id', '=', 'someVal']
    dask (via filters, see docs) supports row-filtering on loaded data this way, and yaml (via tuple support in .yml files) supports the above definition. However, yaml unfortunately only supports this using either the non-default FullLoader or the UnsafeLoader (for controlled environments, see here). Is it possible to configure the ConfigLoader to use either of these? An example use case would be to filter only the rows belonging to all day = 3 partitions of any month in year = 2022. I could alternatively write a DataSet that parses this logic from plain string lists, but I was wondering if there's any existing support for something like this.
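    On the plain-string-list fallback mentioned at the end, a sketch of a thin wrapper dataset that turns YAML lists into the tuples dask/pyarrow expect, so the catalog stays standard YAML (class name and module placement are up to you):
    from kedro.extras.datasets.dask import ParquetDataSet


    class TupleFilterParquetDataSet(ParquetDataSet):
        """ParquetDataSet that accepts `filters` written as plain YAML lists."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            filters = self._load_args.get("filters")
            if filters:
                # [['year', '=', '2022'], ...] -> [('year', '=', '2022'), ...]
                self._load_args["filters"] = [tuple(f) for f in filters]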
  • Safouane Chergui (11/17/2022, 10:19 AM)
    Hello everyone, is there a way to use parameters that are passed to kedro run as an input to a node? Here is a quick example:
    • Kedro run command: kedro run --pipeline my_pipeline --params first_param:first_value
    • I’d like to use first_param as an input to a node without having to put it in parameters.yml just to use it as an input to my node. If not, is there a way to use it directly in code?
    Pipeline([
        node(
            do_something,
            inputs="first_param",
            outputs="some_output"
        )
    ])
    Thanks 👍
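    For what it's worth, values passed via --params are merged into the run's parameters, so they can usually be consumed with the params: prefix even when they are not in parameters.yml; a sketch mirroring the example above (do_something is a stand-in):
    from kedro.pipeline import Pipeline, node


    def do_something(first_param):
        # stand-in for the real node function
        return first_param


    pipeline = Pipeline(
        [
            node(
                do_something,
                inputs="params:first_param",  # resolves to the value given via --params first_param:first_value
                outputs="some_output",
            )
        ]
    )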
  • Debanjan Banerjee (11/17/2022, 11:12 AM)
    Hi Team, I created a CustomDataSet and, strangely enough, it works when I invoke kedro run --xxxxx from the terminal, but when I try to do catalog.load(xxxx) in ipython or kedro jupyter, it fails and raises the famous DataSet error: Dataset is not installed. Here is my catalog definition:
    ft_spine_prd:
      type: project_name.extras.datasets.dataset_filename.DataSetClass
      dataset_args:
        arg1
        ....
  • Debanjan Banerjee (11/17/2022, 11:13 AM)
    What would I be doing wrong?
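    One quick check worth trying: confirm the interactive kernel can import the class named under type, since this error is typically raised when that import fails rather than when the entry itself is wrong (dotted path below copied from the catalog definition above):
    import importlib

    module = importlib.import_module("project_name.extras.datasets.dataset_filename")
    print(module.DataSetClass)  # if this raises, the kernel is missing the package or one of its dependencies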
  • Debanjan Banerjee (11/17/2022, 1:32 PM)
    Hi Team, can we make the dataset loads for a kedro node parallel? I have 19 datasets that I need to read in a single function, but from the logs it takes about 2 minutes per dataset, i.e. ~40 minutes of read time. Is there any way we can make the reads parallel? I don't think they conflict in any way, so I don't see why the loading needs to be sequential.
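    The runner loads a node's inputs one after another, so one possible workaround, sketched here under the assumption that the files are CSVs, is to bundle the reads into a single custom dataset that loads them in a thread pool (I/O-bound reads usually thread well):
    from concurrent.futures import ThreadPoolExecutor

    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import AbstractDataSet


    class ConcurrentCSVBundle(AbstractDataSet):
        """Hypothetical dataset returning a dict of DataFrames, loaded concurrently."""

        def __init__(self, filepaths: dict, max_workers: int = 8):
            self._datasets = {name: CSVDataSet(filepath=path) for name, path in filepaths.items()}
            self._max_workers = max_workers

        def _load(self):
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                frames = pool.map(lambda ds: ds.load(), self._datasets.values())
            return dict(zip(self._datasets, frames))

        def _save(self, data):
            raise NotImplementedError("read-only sketch")

        def _describe(self):
            return {"datasets": list(self._datasets)}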
  • Panos P (11/17/2022, 11:17 PM)
    Hello kedro experts, is it possible to create dynamic pipelines based on params? For example, my parameters.yml file:
    pipelines:
     - test1
     - test2
    I want to return 2 pipelines like this:
    pipes = []
    for pipeline in pipelines:
       pipes.append(Pipeline([
          node(do_something, [f"params:{pipeline}"], [f"output_{pipeline}"], tags=pipeline)
       ]))
    In older versions of Kedro I was able to get the params before the creation of the pipelines and then work from there, like this:
    import os
    from typing import Any, Dict

    from kedro.config import ConfigLoader
    from kedro.framework.session import get_current_session


    def get_kedro_env() -> str:
        """Get the kedro --env parameter or default to local.

        Returns:
            The kedro --env parameter
        """
        return os.getenv("KEDRO_ENV", "local")


    def _get_config() -> ConfigLoader:
        """Get the kedro configuration context.

        Returns:
            The kedro configuration context
        """
        try:
            return get_current_session().load_context().config_loader
        except Exception:  # NOQA
            env = get_kedro_env()
            return ConfigLoader(["./conf/base", f"./conf/{env}"])


    def get_params() -> Dict[str, Any]:
        """Get all the parameter values from parameters.yml as a dictionary.

        Returns:
            The parameter values
        """
        return _get_config().get("parameters*", "parameters*/**", "**/parameters*")
    It seems like in kedro 0.18.3 there is no more load_context(). Any thoughts?
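    A minimal 0.18.x version of the same helper, assuming the default conf/ layout: ConfigLoader now takes a conf_source directory instead of a list of paths, and can be instantiated directly since the old session helpers are gone.
    import os
    from pathlib import Path
    from typing import Any, Dict, Optional

    from kedro.config import ConfigLoader


    def get_params(env: Optional[str] = None) -> Dict[str, Any]:
        # fall back to the KEDRO_ENV variable or "local", as in the old helper
        env = env or os.getenv("KEDRO_ENV", "local")
        conf_loader = ConfigLoader(conf_source=str(Path.cwd() / "conf"), env=env)
        return conf_loader.get("parameters*", "parameters*/**", "**/parameters*")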
  • Luis Cano (11/18/2022, 4:53 AM)
    Hi team, beginner question: is it possible to create a partitioned dataset where you store many pickles with models, i.e.:
    s3_path/train_outputs/
                     ├── 202201.pkl
                     ├── 202202.pkl
                     ├── 202203.pkl
                     ├── 202204.pkl
    ...
    and so on. Is there a way of saving pickles this way? Or maybe what would be a better way of doing this? Any thoughts?
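    This is roughly what PartitionedDataSet is designed for; a sketch with a made-up bucket, using pickle.PickleDataSet as the underlying dataset so each dictionary key becomes one .pkl file:
    from kedro.extras.datasets.pickle import PickleDataSet
    from kedro.io import PartitionedDataSet

    models = PartitionedDataSet(
        path="s3://my-bucket/train_outputs",
        dataset=PickleDataSet,
        filename_suffix=".pkl",
    )

    # keys become the file names: 202201.pkl, 202202.pkl, ...
    models.save({"202201": {"model": "a"}, "202202": {"model": "b"}})
    partitions = models.load()     # dict of partition id -> callable that loads that pickle
    first = partitions["202201"]()
    In catalog.yml the equivalent entry uses type: PartitionedDataSet with dataset: pickle.PickleDataSet.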
  • Jo Stichbury (11/18/2022, 10:12 AM)
    Hi all! A request from me to improve the docs 🤓 Has anyone found/made/published any great examples of converting a Jupyter notebook to a Kedro project? I know there's one from DataEngineerOne on YouTube using the natural history archives, but I was looking for something a bit simpler, without it being completely pointless. Something like an Iris-dataset example level of detail.
  • Vaibhav (11/18/2022, 12:05 PM)
    Hey, a small question on the mandatory attributes of [tools.kedro] in pyproject.toml. I noticed that project_name refers to the name of the project where kedro is being used, whereas project_version refers to the version of kedro being used, rather than the version of the project named in project_name; kedro validates a version match as part of its checks. Is it by design that the two attributes refer to different project entities?
  • MarioFeynman (11/18/2022, 1:28 PM)
    Hey! Quick question: is there any easy way to read a Delta table in kedro so that the final object is a pandas dataframe? Or do I have to write a custom DataSet to make this happen? I would like to avoid using the .toPandas() function inside every node for every input, or having to add a decorator to every function to achieve this. The main goal is to manage this with the kedro catalog only.
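    Kedro 0.18.3 ships a Spark-based Delta dataset but, as far as I know, no pandas-flavoured one, so one option is a small custom dataset built on the deltalake (delta-rs) package, sketched here as read-only (package availability and the catalog wiring are assumptions):
    import pandas as pd
    from deltalake import DeltaTable  # assumes the delta-rs Python bindings are installed

    from kedro.io import AbstractDataSet


    class PandasDeltaDataSet(AbstractDataSet):
        """Hypothetical read-only dataset returning a Delta table as a pandas DataFrame."""

        def __init__(self, path: str):
            self._path = path

        def _load(self) -> pd.DataFrame:
            return DeltaTable(self._path).to_pandas()

        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("read-only sketch")

        def _describe(self):
            return {"path": self._path}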
  • Debanjan Banerjee (11/18/2022, 1:49 PM)
    Hey Team, we were recently playing with a remote-connection dataset (it needed credentials.yml from /local) and the --env argument that we can specify to dictate the directionality of the pipeline. We know that if we specify --env, kedro "ignores" /base and /local and overrides them with whatever we pass to --env. Shouldn't kedro search for **credentials.yml separately (not together with parameters.yml/globals.yml)? Every user will have their own credentials, so shouldn't kedro check for credentials.yml separately, say only in /local and not in the --env environment?
  • Doug Warner (11/20/2022, 12:45 PM)
    Hi Team. Loving the look of kedro as a way to enforce modularity in work done by my actuarial team. I've done the spaceflights tutorial, and now I'm implementing my first real pipeline and I have two general "advice" questions. My current aim is modularity and best practice in a data processing pipeline (but with a lower learning curve for data scientists than full-on ETL pipelines). We're talking about "small data", on the order of 100k rows and 50 columns. I'm not so worried about the model building part yet. Questions are:
    1. I'd like to split the data prep pipeline into modular steps (some generic, i.e. "format all Boolean fields", "replace nulls", and some quite specific edge cases, "fix this particular named field"). My instinct is that each of these steps should be a node, but I note in the tutorials that quite often "prep this whole dataset" is one node. Is that just to keep the tutorial simple, or am I going down a bad path?
    2. I have tended in recent times to be a "no pandas in-place operations" stickler (i.e. use "assign" to create a new column rather than df['new_column_name']). I'm seeing in-place operations in the tutorials - is that just to keep things simple, or is there either a reason not to worry, or a performance-overhead reason why I should actually be using in-place operations? (Bearing in mind my "small data" use case.)
    Sorry to ask basic questions - but I'm super keen on what I think of as the "kedro mindset" of "do it neat and tidy first time to avoid problems later on" :-)
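    On both points, small single-purpose nodes and non-mutating pandas work well together in kedro; a toy sketch with invented column and dataset names:
    import pandas as pd

    from kedro.pipeline import Pipeline, node


    def format_booleans(df: pd.DataFrame) -> pd.DataFrame:
        # generic step: coerce Y/N flags to real booleans without mutating the input
        return df.assign(is_active=df["is_active"].eq("Y"))


    def replace_nulls(df: pd.DataFrame) -> pd.DataFrame:
        return df.fillna({"premium": 0.0})


    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(format_booleans, "raw_policies", "policies_boolified"),
                node(replace_nulls, "policies_boolified", "policies_clean"),
            ]
        )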