# questions
  • Leslie Wu (06/21/2023, 5:55 PM)
    Hi everyone, any ideas for getting `kedro viz` to work within Amazon SageMaker Studio? I am in the terminal of a Studio instance.
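    For anyone landing here: a sketch of one approach, assuming Studio's built-in jupyter-server-proxy is available (the port and proxy URL shape below are illustrative, not confirmed):
        # from the Studio terminal: bind to all interfaces, don't try to open a browser
        kedro viz --host=0.0.0.0 --port=4141 --no-browser
        # then open the Studio proxy path in your own browser (URL shape is an assumption):
        # https://<your-studio-domain>/jupyter/default/proxy/4141/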
  • Jonah Blumstein (06/21/2023, 10:07 PM)
    Does anyone know how to override the opt-in usage-analytics prompt when running `%load_ext kedro.ipython` for the first time in a notebook? Basically this, but in a notebook environment: https://github.com/kedro-org/kedro/issues/1640
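    For reference, kedro-telemetry reads consent from a `.telemetry` file in the project root, so creating one before loading the extension should skip the prompt:
        # .telemetry (in the project root)
        consent: false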
  • Mate Scharnitzky (06/22/2023, 2:05 PM)
    Hi team, we're currently using `kedro==0.18.3`, which pins `pytest~=6.2`; that conflicts with `pandera`, a new dependency we want to introduce. `kedro==0.18.5` already has `pytest~=7.2`, so we're not far from resolving this conflict. On the other hand, to upgrade to a higher Kedro version we would need to change our custom `JinjaTemplatedConfigLoader`, which inherits from `AbstractConfigLoader`, since both `0.18.4` and `0.18.5` introduced changes to configuration management, the latter specifically around `OmegaConf`. Also, `0.18.6` fixes some regressions in `0.18.5`. Questions, given the above context:
    • Which Kedro version would you suggest we upgrade to? It seems we need to go to at least `0.18.6`, but maybe we can aim all the way for `0.18.10`?
    • Do you have a migration guide for moving from a custom config loader to OmegaConf, bearing in mind that we also need to use `multi-runner`? Thank you! @Kasper Janehag @Jaskaran Singh Sidana
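    For reference, once on `kedro>=0.18.5` the switch to the built-in OmegaConf-based loader is a one-line change in settings (your custom Jinja templating logic would still need porting separately):
        # src/<your_package>/settings.py
        from kedro.config import OmegaConfigLoader

        CONFIG_LOADER_CLASS = OmegaConfigLoader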
  • Lucas Hattori (06/22/2023, 3:10 PM)
    Hi team, I'm starting to use `kedro-mlflow` for the first time in a project. Would that also be an appropriate topic for questions here? 😅 If so: my Kedro project has a lot of parameters, and many of them are not crucial to log in MLflow experiments. How can I easily select which parameters get logged? I have an idea of how to do it if I were building the MLflow hooks from scratch, but I'd love to leverage `kedro-mlflow` for simplicity.
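    One possible fallback while exploring `kedro-mlflow`'s own options: a small custom hook that logs only an allow-list of parameters. A sketch only; the hook name and the `ALLOWED` keys are made up, and this bypasses kedro-mlflow's parameter logging entirely:
        # src/<your_package>/hooks.py — hypothetical selective parameter logging
        import mlflow
        from kedro.framework.hooks import hook_impl

        ALLOWED = {"model.init.k", "model.init.loss"}  # hypothetical parameter names

        def _flatten(d, prefix=""):
            # yields ("model.init.k", 3)-style pairs from a nested parameters dict
            for key, value in d.items():
                name = f"{prefix}{key}"
                if isinstance(value, dict):
                    yield from _flatten(value, f"{name}.")
                else:
                    yield name, value

        class SelectiveParamsHook:
            @hook_impl
            def before_pipeline_run(self, run_params, pipeline, catalog):
                for name, value in _flatten(catalog.load("parameters")):
                    if name in ALLOWED:
                        mlflow.log_param(name, value)

        # register it in settings.py: HOOKS = (SelectiveParamsHook(),)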
  • Camilo López (06/23/2023, 1:56 AM)
    Hi team, I'm using the new `ManagedTableDataSet` with Databricks Unity Catalog and I haven't found a way to store tables in an external location (ABFS on Azure). With pure Spark you can store an external table via `df.write.mode(mode).option("path", table_path).saveAsTable(f"{catalog_name}.{schema_name}.{table_name}")`, where `table_path` is the path to the external location, e.g. `abfss://container@storage_account.dfs.core.windows.net/raw`. Is there a way to pass this path to `ManagedTableDataSet` when saving data? Or should I go and create a `CustomManagedTableDataSet` with this capability?
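    Until there's a built-in option, a bare-bones custom dataset wrapping the pure-Spark snippet above might look like this. A sketch only: the class name and constructor arguments are made up, and credentials/error handling are omitted:
        from kedro.io import AbstractDataSet
        from pyspark.sql import DataFrame, SparkSession

        class ExternalTableDataSet(AbstractDataSet):
            """Hypothetical dataset that writes a Spark DataFrame as an external table."""

            def __init__(self, table: str, path: str, write_mode: str = "overwrite"):
                self._table = table          # e.g. "catalog.schema.table"
                self._path = path            # e.g. an abfss:// external location
                self._write_mode = write_mode

            def _load(self) -> DataFrame:
                return SparkSession.builder.getOrCreate().read.table(self._table)

            def _save(self, data: DataFrame) -> None:
                data.write.mode(self._write_mode).option("path", self._path).saveAsTable(self._table)

            def _describe(self) -> dict:
                return {"table": self._table, "path": self._path}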
  • Sivasubramanian.S (06/23/2023, 4:02 AM)
    Hi team, I would like to expose some of my Kedro nodes and pipelines as a FastAPI service. Is there any documentation to refer to?
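    There isn't much official documentation on this, but a minimal sketch of wrapping a pipeline run in a FastAPI endpoint could look like the following (the project path and endpoint shape are assumptions to adapt):
        from pathlib import Path

        from fastapi import FastAPI
        from kedro.framework.session import KedroSession
        from kedro.framework.startup import bootstrap_project

        app = FastAPI()
        PROJECT_PATH = Path(__file__).resolve().parents[1]  # assumption: adjust to your layout

        @app.post("/run/{pipeline_name}")
        def run_pipeline(pipeline_name: str) -> dict:
            # each request starts a fresh Kedro session and runs the named pipeline
            bootstrap_project(PROJECT_PATH)
            with KedroSession.create(project_path=PROJECT_PATH) as session:
                session.run(pipeline_name=pipeline_name)
            return {"status": "completed", "pipeline": pipeline_name}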
  • Marc Gris (06/23/2023, 7:58 AM)
    DEPENDENCIES ISOLATION. Hi everyone, let's assume that my `data_processing_node` and my `model_training_node` have conflicting dependencies. How would you handle such an (unfortunately common) situation? I know that in MLflow it is possible to have task-specific venvs. Does Kedro offer such a possibility? If not, how could one circumvent the issue? 🙂 Many thanks in advance, M.
  • Artur Dobrogowski (06/23/2023, 1:19 PM)
    I'm trying to understand OmegaConfigLoader (https://docs.kedro.org/en/latest/kedro.config.OmegaConfigLoader.html). I know I can use it to interpolate environment variables like `${oc.env:SOME_VAR}`, but how do I use it to interpolate parameters defined in `params.yml`? Let's say I have a parameter `some_var: 42` and I want to fall back to it when the environment variable is unset. Is `${oc.env:SOME_VAR, ${some_var}}` correct?
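    For what it's worth, OmegaConf's `oc.env` resolver does accept a default as a second argument, so the proposed syntax is roughly the right shape. An untested sketch; note that some Kedro versions restrict `oc.env` to credentials files, so check your version's docs before relying on it in parameters:
        # parameters — sketch of an env-var lookup with a parameter fallback
        some_var: 42
        effective_var: ${oc.env:SOME_VAR, ${some_var}}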
  • Panos P (06/23/2023, 4:44 PM)
    Hello folks, I have a Kedro project with a lot of parameters and catalog entries. When I run `kedro run` with a different environment I get log messages like `kedro.config.config - INFO - Config from path "/conf/dev" will override the following existing top-level config keys`. These messages appear for about 30 minutes before the pipeline even runs. Do you have any ideas or recommendations for speeding this up?
  • Hoàng Nguyễn (06/26/2023, 5:07 PM)
    Hello, please help me: can I use kedro-fast-api now?
  • Andreas_Kokolantonakis (06/27/2023, 8:06 AM)
    Hi everyone, I would like to pass a date parameter from the command line when executing `kedro run`, so the catalog paths can point to the specified date. What is the best way to do so (e.g. like args in plain Python)?
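    For reference, runtime values can at least be passed with `--params` (colon-separated in 0.18.x):
        kedro run --params "date:2023-06-27"
    Inside the run this is then available to nodes as `params:date`; wiring it into catalog *paths* usually takes a templated or custom config loader (e.g. globals), so treat that part as project-specific.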
  • Zemeio (06/27/2023, 8:53 AM)
    Hey guys, I want to generate different pipelines based on the values of my parameters, so I could have, say, 3 or 15 pipelines depending on a few parameter values. Is that possible? I see that in the pipeline registry I don't have a Kedro context to work with, so maybe I should use a config loader?
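    One way to get at parameters inside the registry is indeed to instantiate a config loader directly. A rough sketch; the `variant` pipeline module and the `n_variants` key are made up:
        # src/<your_package>/pipeline_registry.py — hypothetical parameter-driven registry
        from typing import Dict

        from kedro.config import OmegaConfigLoader
        from kedro.pipeline import Pipeline, pipeline

        from my_package.pipelines import variant  # hypothetical modular pipeline

        def register_pipelines() -> Dict[str, Pipeline]:
            params = OmegaConfigLoader(conf_source="conf")["parameters"]
            pipelines = {
                f"variant_{i}": pipeline(variant.create_pipeline(), namespace=f"variant_{i}")
                for i in range(params["n_variants"])  # hypothetical key
            }
            pipelines["__default__"] = sum(pipelines.values())
            return pipelines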
  • Hugo Evers (06/27/2023, 2:01 PM)
    Hi everyone, I'm trying to pass a dictionary of keyword arguments to a function in a Kedro node, but it doesn't seem to work; instead I have to use a lambda to pass the arguments as separate inputs. For example, I would like a node that looks like this (knowing that best practice is to move `sample_size` to a config):
        node(
            func=train_test_split,
            inputs={"df": "input", "sample_size": 50},
            ...
        ),
    However, this doesn't work and I get an error referring to a separator. I noticed that a similar syntax is allowed in modular pipelines. Is that on purpose? What does work is:
        node(
            func=lambda df: train_test_split(df, sample_size=50),
            inputs="input",
            ...
        )
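    For the record: the values in a node's `inputs` dict must be dataset or parameter *names*, not literal values, which is why the bare `50` trips the name parsing. Moving the value into parameters gives roughly this (output names are illustrative):
        # conf/base/parameters.yml would contain:  sample_size: 50
        node(
            func=train_test_split,
            inputs={"df": "input", "sample_size": "params:sample_size"},
            outputs=["train", "test"],
        ),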
  • Alina Glukhonemykh (06/27/2023, 4:33 PM)
    Hey all, hope you are well! I'm trying to set up experiment tracking and facing issues with MetricsDataSet. Here is the error I get:
        DataSetError: Save path '.../data/08_reporting/init/metrics.json/2023-06-27T16.17.18.857Z/metrics.json' for MetricsDataSet(filepath=.../data/08_reporting/init/metrics.json, protocol=file,
        save_args={'indent': 2}, version=Version(load=None, save='2023-06-27T16.17.18.857Z')) must not exist if versioning is enabled.
    Here is how I define the dataset in the catalog:
        metrics:
          type: tracking.MetricsDataSet
          filepath: data/08_reporting/init/metrics.json
  • Marc Gris (06/28/2023, 11:13 AM)
    Hi everyone, given:
        node(pre_process,
             inputs=['dataset', 'params:pre_process'],
             outputs="pre_processed_dataset")
    Kedro will pass `'params:pre_process'` as a dict to `pre_process`, which results in a rather "opaque" function signature: `def pre_process(df: pd.DataFrame, params: dict): ...` Is there a "Kedro way" of unpacking this dict, so as to have a more "transparent" signature with individual parameters specified? Thx, M
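    One Kedro-native option: nested parameters can be addressed individually with dot notation, which keeps the signature explicit. A sketch with made-up field names:
        node(
            pre_process,
            inputs=[
                "dataset",
                "params:pre_process.threshold",  # hypothetical nested keys
                "params:pre_process.method",
            ],
            outputs="pre_processed_dataset",
        )

        # matching, fully explicit signature:
        # def pre_process(df: pd.DataFrame, threshold: float, method: str) -> pd.DataFrame: ...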
  • Sebastian Cardona Lozano (06/28/2023, 9:28 PM)
    Hi all. Maybe this is a naive question, but how can I change the name of my Kedro project? I understand that some Kedro files use the project name to execute the pipeline. Thanks!
  • Hugo Evers (06/29/2023, 8:23 AM)
    Hi all, the other day I was making a custom dataset for the Hugging Face AudioFolder dataset, which takes a folder as an argument. I gave it the parameter `data_dir` as input instead of `filepath`, and it took me roughly an hour of debugging to figure out why loading the dataset was suddenly dependent on the current working directory and just wouldn't load if I gave it a relative path (data/01_raw/...) instead of workspace/project_name/data/01_raw/…. The issue was that `filepath` has a (buried) custom resolver in the AbstractDataSet base class. So would it be a good idea to add to the custom-dataset docs that `filepath` has that behaviour? And maybe we could add an example of how to make a FolderDataset, since all the current datasets in kedro-datasets point to specific files, but I'd wager there are folks out there who want to read an entire folder's worth of data.
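    In the spirit of the suggestion, a docs example for a folder-based dataset could be as small as this. Illustrative only: read-only, local paths only, no fsspec handling:
        from pathlib import Path
        from typing import Any, Dict, List

        from kedro.io import AbstractDataSet

        class FolderDataSet(AbstractDataSet):
            """Illustrative dataset that lists every file in a directory."""

            def __init__(self, path: str):
                # deliberately not named `filepath`, to sidestep the path-resolution
                # behaviour discussed above; an absolute path is expected
                self._path = Path(path)

            def _load(self) -> List[Path]:
                return sorted(p for p in self._path.iterdir() if p.is_file())

            def _save(self, data: Any) -> None:
                raise NotImplementedError("read-only example")

            def _describe(self) -> Dict[str, Any]:
                return {"path": str(self._path)}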
  • Balazs Konig (06/29/2023, 11:22 AM)
    Hi team 🦜 Hopefully a quick one: what's the best way to save catalog entries to parent directories? I have the structure below, and I want to save to `folder2/data`. When I try relative paths, they seem to get appended to `folder1/projects/project1/<relative_path>` (as in, the dots are added to the path as well). How can I achieve this?
        folder1
          projects
            project1
              conf
              data
              src
        folder2
          data
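    One straightforward workaround is an absolute filepath, since catalog paths outside the project root are allowed. A sketch with made-up names:
        # conf/base/catalog.yml
        shared_output:
          type: pandas.ParquetDataSet
          filepath: /home/user/folder2/data/shared_output.parquet  # hypothetical absolute path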
  • Harry Vargas Rodríguez (06/29/2023, 1:50 PM)
    Hello everyone. I'm trying to load a model I created using Kedro (an sklearn object), but I noticed this artifact can't be loaded outside my Kedro project. When I try `model = pickle.load(open('models/model.pkl', 'rb'))` it fails, and the error says I don't have the module I created in project/src. This is what my catalog looks like:
        best_model:
          type: pickle.PickleDataSet
          filepath: models/model.pkl
          layer: models
    It works just fine after I load Kedro using `%load_ext kedro.ipython`. Thanks in advance for your help.
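    The usual cause: pickle stores the module path of the model's class, so the project's source must be importable wherever you unpickle. A sketch of the workaround outside Kedro (the path is an assumption to adapt):
        import pickle
        import sys

        # make the project's source importable so pickle can resolve the custom module
        sys.path.append("/path/to/project/src")  # assumption: adjust to your layout

        with open("models/model.pkl", "rb") as f:
            model = pickle.load(f)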
  • Ahmed Alawami (06/30/2023, 7:49 AM)
    Hi all. I need to specify a `date_parser` in the catalog. Is there a way to specify a lambda function in the YAML file?
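    YAML can't carry a lambda, so the usual route is to stick to pandas-native options in `load_args`. A sketch; the column name is made up, and `date_format` requires pandas>=2.0:
        my_dataset:
          type: pandas.CSVDataSet
          filepath: data/01_raw/my_dataset.csv
          load_args:
            parse_dates: ["created_at"]   # hypothetical column
            date_format: "%Y-%m-%d"       # pandas>=2.0 replacement for date_parser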
  • Markus Sagen (06/30/2023, 12:32 PM)
    Hi Kedro community 👋 We have started to use Kedro for our projects at my company and want to use Weights & Biases as the experiment logger. If I wanted to create a custom experiment-tracker plugin/extension, is there a guide on how to get started writing your own extensions or plugins?
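    Not a full guide, but the core of a tracking plugin is a hooks class exposed via the `kedro.hooks` entry point. A minimal sketch; the W&B project name and module layout are assumptions:
        # my_kedro_wandb/plugin.py — hypothetical plugin module
        import wandb
        from kedro.framework.hooks import hook_impl

        class WandbHooks:
            @hook_impl
            def before_pipeline_run(self, run_params):
                # start a W&B run and record Kedro's run parameters
                wandb.init(project="my-kedro-project", config=run_params)  # hypothetical project

            @hook_impl
            def after_pipeline_run(self, run_params):
                wandb.finish()

        hooks = WandbHooks()

        # and in the plugin's pyproject.toml:
        # [project.entry-points."kedro.hooks"]
        # wandb_hooks = "my_kedro_wandb.plugin:hooks"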
  • Markus Sagen (06/30/2023, 2:00 PM)
    Is there a way in Kedro to define or register objects, such as a logger, that all pipelines and nodes can access, similar to how nodes get datasets from the catalog file?
  • Emilio Gagliardi (06/30/2023, 11:36 PM)
    Hi everyone, what kind of dataset do I create if I'm scraping data from web pages or grabbing data from RSS feeds? I have a small project where I need to grab data from a few websites regularly (mostly Microsoft notices for various products/services) and store the text in a MongoDB Atlas database I have set up. I looked through the documentation, but the only reference I found was for an HTTP(S) API call. Any guidance greatly appreciated 🙂
  • Markus Sagen (07/01/2023, 8:12 AM)
    Hi again! It seems the Kedro commands `install` and `test` listed in the docs here are deprecated. Is there a preferred place to report issues or contribute fixes to the docs? https://docs.kedro.org/en/stable/development/set_up_vscode.html#setting-up-tasks
  • Choon Ho Loi (07/03/2023, 2:34 PM)
    I would like to use Kedro on EMR. Other than this article, https://kedro.org/blog/how-to-deploy-kedro-pipelines-on-amazon-emr, I can't find many details. Does anyone have a Git repo to share? I'd appreciate it.
  • Hugo Evers (07/03/2023, 3:46 PM)
    Hi all, I'm hitting a weird bug with Kedro-Viz: I have a set of nested modular pipelines to keep my training/test and finetuning pipelines completely DRY across different languages, and to that end I mapped the train and test splits throughout my pipelines so that I can namespace them at the last moment. But when I visualize, I see two unconnected artifacts, Test and Train.
  • Emilio Gagliardi (07/03/2023, 6:39 PM)
    A quick clarification on registering pipelines. When I install the spaceflights demo, the pipeline registry file contains the following:
        def register_pipelines() -> Dict[str, Pipeline]:
            """Register the project's pipelines.

            Returns:
                A mapping from pipeline names to ``Pipeline`` objects.
            """
            pipelines = find_pipelines()
            pipelines["__default__"] = sum(pipelines.values())
            return pipelines
    However, in the spaceflights tutorial videos I'm watching, the host doesn't use the code above; instead they add the following:
        data_processing_pipeline = dp.create_pipeline()
        return {
            "__default__": data_processing_pipeline,
            "dp": data_processing_pipeline,
        }
    So I'm unclear about what I'm supposed to do for my own project: do I just use `sum(pipelines.values())`, or do I manually add pipelines as in the second block? Thanks kindly.
  • Hugo Evers (07/04/2023, 12:54 PM)
    Hi all, when developing modular pipelines the Kedro-Viz tool is quite indispensable; without it, it's really hard to see whether inputs, outputs, and parameters are connected properly. However, the most obvious workflow for developing or refactoring a pipeline in an established project, going by the documentation, requires several steps that could be streamlined.
    The issues:
    1. To have more control over the interactions between the pipelines you want to work on, you can adjust the pipeline_registry to return only those pipelines, which renders the other pipelines unusable (and could lead to bugs down the road). This can be partly solved by filtering the pipelines in the kedro viz CLI command.
    2. For every change you want to visualise, you need to: (a) save the file, (b) stop the current kedro viz, (c) run kedro viz again, (d) switch to the browser window to view the pipeline (quite annoying if you have only one monitor).
    3. Lastly, this does not make for easy debugging or inspection of the Python objects in the pipelines/nodes.
    A halfway solution: using `run_viz` in a Jupyter notebook is great and solves some of this. I personally combine it with nbdev, which lets me convert the notebook to a Python file and then call `run_viz` on the entire project. This suffers from issue 1 even more, but issues 2 and 3 are drastically reduced, mostly because I just save/commit, rerun the notebook, and get the viz inline.
    A better solution: the `run_viz` command almost begs for the ability to call `run_viz(pipeline)`, where pipeline is an actual pipeline object (though it would also be nice to pass the name of a pipeline to filter on, like the CLI command, w.r.t. issue 1). That way one wouldn't need nbdev (a slightly controversial tool) and could develop pipelines more easily, without any adjustment to the original project. Since Kedro-Viz can already filter, I can imagine such changes being possible. Also, debugging Kedro pipelines from the VS Code notebook cell debugger is actually quite nice (I'd argue a lot nicer than using the debug configs). Has anyone faced similar issues, or thought of a different solution?
  • Emilio Gagliardi (07/04/2023, 5:38 PM)
    I have another basic question. I'm learning how to productionize ML apps and have worked through some tutorials, but nothing real. In the tutorials I've seen, when developers want to make their ML model available for inference, they use a framework like FastAPI or Flask so that the consumer can pass data to an endpoint and get back a prediction. What I don't quite understand yet is that with Kedro everything is encapsulated in pipelines, and if I run the project, the default pipeline runs, which could be the data ingestion and the model training. How do we handle inference with Kedro? Do I make an inference pipeline separate from the other pipelines? Do I use FastAPI to create endpoints? In the spaceflights example the purpose is supposedly to generate predictions, but I don't see where inference with the trained model is addressed. Any wisdom is greatly appreciated.
  • Marc Gris (07/05/2023, 10:20 AM)
    Hi everyone, a super-ultra-duper-dummy question. Assuming that in conf/base/parameters.yml I have:
        model:
          init:
            k: 3
            loss: warp
            no_embeddings: 50
            learning_schedule: adagrad
            rho: 0.95
            epsilon: 1.e-6
            random_state: ${random_state}
    How can I "update" a single specific field "locally"? I first tried, in conf/local/parameters.yml:
        model:
          init:
            no_embeddings: 100
    But this completely overwrites the model section and, of course, breaks everything. Granted, I could `cp conf/base/parameters.yml conf/local/parameters.yml` and then update `no_embeddings`, but that ends up being very "noisy" and doesn't really "highlight" the specifics of the local config… Is there a way to do such a local / "surgical" overwrite? Thx 🙂