# questions
  • Nikos Kaltsas, 02/15/2023, 12:11 AM
    Hello, does anyone have a guide / example for running Kedro pipelines on Databricks with dbx?
    ❤️ 1
  • dor zazon, 02/15/2023, 11:01 AM
    Hey, I am trying to set up experiment tracking in Kedro. Everything works fine, but Kedro can't save session metadata into the sqlite3 DB. I get the following error every time I run Kedro:
  • dor zazon, 02/15/2023, 11:02 AM
    The session_store.db is created, but it is locked. I have tried deleting the DB and running again multiple times, but the issue remains.
  • Vassilis Kalofolias, 02/15/2023, 2:42 PM
    Hello, I have a quick question: what is the use case for dataset.confirm()? The documentation is not clear, and it is not implemented in any dataset except IncrementalDataSet.
  • Alexander Johns, 02/15/2023, 6:19 PM
    Hey team, I'm trying to implement a very simple custom dataset that loops through a directory, reads in the specific CSVs that match a string pattern as pandas DataFrames, performs basic cleaning operations on the individual DataFrames, and concatenates them together. The class definition is located at:
    src/<my_project>/extras
    ├── __init__.py
    └── datasets
        ├── __init__.py
        └── <my_custom_dataset>.py
    Catalog entry:
    raw_custom_dataset:
      type: <my_project>.extras.datasets.<my_custom_dataset>.<MyCustomDataSet>
      filepath: 01_raw/folder/*
    When I run the node, I keep getting the following error:
    An exception occurred when parsing config for DataSet 'raw_custom_dataset':
    Class '<my_project>.extras.datasets.<my_custom_dataset>.<MyCustomDataSet>' not found or one of
    its dependencies has not been installed.
    Kedro version: 0.18.3
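As the error text itself says, the class was "not found or one of its dependencies has not been installed", which usually means the import failed, e.g. the project package isn't installed in the active environment (pip install -e src/ on Kedro 0.18.x) or a module/class name is misspelled. Separately, the load logic described above can be sketched as a plain function first and then wired into the custom dataset's load method; this is an illustrative sketch assuming pandas, with a made-up helper name:

```python
from pathlib import Path

import pandas as pd


def load_matching_csvs(folder: str, pattern: str = "*.csv") -> pd.DataFrame:
    """Read every CSV under `folder` whose name matches `pattern`,
    clean each one, and concatenate the results."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path)
        # Per-file cleaning would go here (rename columns, drop rows, ...).
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```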
  • Matthias Roels, 02/15/2023, 8:23 PM
    I want to create a new Kedro project for ML and I am not sure how to properly structure it. I want to have a default pipeline consisting of a feat and a modelling pipeline. Both the feat and modelling pipelines will consist of several sub-pipelines, and I want to make sure that nested pipeline structure is somehow reflected in my project structure. I was thinking about nested dirs in the pipelines folder, e.g.
    pipelines/
    └── feat/
        ├── __init__.py
        ├── pipelines.py   # contains all sub-pipelines in this folder, e.g. feat_sales
        ├── feat_sales/
        │   ├── __init__.py
        │   ├── nodes.py
        │   └── pipelines.py
        └── …
    Would this be the right approach? And if not, what is the recommended way to structure this? Do we use modular pipelines or regular pipelines?
  • Alex Ferrero, 02/16/2023, 10:48 AM
    Hey team, is there any way I can write to a Delta table using the catalog, making an upsert like in SQL? I have seen in Kedro's code that the only supported modes are append, overwrite, error, errorifexists and ignore.
  • Vassilis Kalofolias, 02/16/2023, 11:06 AM
    Hello, I am trying to override a bool parameter using the CLI (running from bash):
    kedro run --params round_occupancy:False
    However, the False is read as a string. Is there a way to pass a boolean instead? Note that the original param is correctly read from the YAML file as a bool.
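Since the CLI value arrives as the string "False", one workaround is to coerce it inside the node. A minimal stdlib sketch; the helper and node names are illustrative, not a Kedro API:

```python
def to_bool(value) -> bool:
    """Coerce a CLI-supplied parameter to bool ('False', 'false', '0' -> False)."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() not in ("false", "0", "no", "")


# Inside a node, accept either the YAML bool or the CLI string form:
def occupancy_node(data, round_occupancy):
    if to_bool(round_occupancy):
        data = round(data)
    return data
```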
  • Keith Edmonds, 02/16/2023, 10:53 PM
    Does Kedro interface with sklearn's Pipeline at all? https://scikit-learn.org/stable/modules/compose.html If there is an ML model built with sklearn's Pipeline and we want to do the data engineering in Kedro, is there a way to look at the whole pipeline in Kedro?
  • Sebastian Pehle, 02/17/2023, 9:36 AM
    I am working on Windows and want to store my project on a network folder. However, when I want to create a pipeline I get an error about incorrect paths ("source path must be relative to..."). This stems from a variation in pathlib.Path: without .resolve() it gives me the drive-letter path (X:/abc); with .resolve() it gives me the network path (//server.xy/a/b/c/abc). Manually removing .resolve() from all Kedro source files solves the problem. Does someone have a better solution?
  • Solomon Yu, 02/17/2023, 3:34 PM
    Edit: documenting a solution to my own question. I'm trying to load a multi-sheet ExcelDataSet through the Catalog. I'm trying to load all sheets this way:
    my_excel_file:
      type: pandas.ExcelDataSet
      filepath: some-excel-file.xlsx
      load_args:
        sheet_name: None
    and I get
    Worksheet named 'None' not found
    Is there a way to load all sheets through the catalog? Yes. Edit: not fully documented in Kedro, but in case someone comes across this, remember to use the YAML syntax for None, which is null, ~, or an empty value. Thanks in advance!
    🌟 2
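For reference, the corrected catalog entry with the YAML null literal (filename as in the post above):

```yaml
my_excel_file:
  type: pandas.ExcelDataSet
  filepath: some-excel-file.xlsx
  load_args:
    sheet_name: null   # YAML null -> Python None; pandas then returns a dict of all sheets
```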
  • Chris Santiago, 02/17/2023, 6:26 PM
    Hi, new to Kedro. Why is there a pyproject.toml file in the root project directory and then a separate setup.py in the src directory? I'm trying to understand their separate roles. I'd like to introduce Kedro to my team at work. We use a custom cookiecutter to set up all of our projects so that they're pip-installable across various platforms. Our current setup uses only pyproject.toml, and we've removed the last remnants of setup.py and setup.cfg. Specifically, I'm trying to understand how I could structure a custom starter, incorporating our existing cookiecutter, that would allow for editable installs with extras, but I don't want to disturb any existing Kedro functionality. How does the Kedro CLI use the src/setup.py file, if at all? Same question for the pyproject.toml in the root folder.
  • Ricardo Araújo, 02/18/2023, 3:39 PM
    I feel this might be a basic question, but I can't quite make it work. In a Kedro pipeline (Pipe1) there are two defined pipelines (say pipeA and pipeB), where pipeB is a remapping of the inputs and outputs of pipeA. For organization purposes, I don't want to spin pipeB out into an individual Pipe2. However, in another pipeline (Pipe3) I want to re-use pipeB, but not pipeA. Is there a way to do this?
  • Alexis Eutrope, 02/18/2023, 8:22 PM
    Hi, I have a question (and very likely what I'm trying to do is a Kedro anti-pattern). Basically I'd like to have a node pipeline with a diamond shape: EntryNode --> [IntermediateNode X for X in list] --> OutputNode. Doing this requires each intermediate node to have runtime (in code, not in static catalog.yml files) generation of datasets. I don't want to use them to store any data; those datasets would just be dummy ones in order to keep the dependency/ordering of nodes. Any ideas on how I could deal with that? (Ideally within the create_pipeline file.) Thank you
  • Dustin, 02/20/2023, 3:29 AM
    hi team, I had this issue and thought it would be good to share while looking for advice. I intended to set a "quoting" parameter for saving CSV values in catalog.yml (image 2). In the normal to_csv() function you would use quoting=csv.QUOTE_NONNUMERIC as a parameter, but this won't work in catalog.yml as it doesn't know the 'csv' module. One way is to manually set the desired integer (image 3), but I found that the value actually changed from 3 to 2 (image 1; 3 used to stand for csv.QUOTE_NONNUMERIC but now it is 2) in the latest 'csv' version. Is there any way we could fetch this dynamically (like how quoting=csv.QUOTE_NONNUMERIC works in a normal to_csv()) in the catalog?
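Since catalog.yml can only hold plain values, one option is to look the constant up once and paste the integer into save_args. The quoting constants are plain ints defined by Python's own csv module (this is standard-library behaviour, not Kedro-specific), so a quick check shows the exact values:

```python
import csv

# Print each quoting constant and its integer value, which is what
# would go into catalog.yml's save_args: quoting entry.
for name in ("QUOTE_MINIMAL", "QUOTE_ALL", "QUOTE_NONNUMERIC", "QUOTE_NONE"):
    print(name, "=", getattr(csv, name))
# QUOTE_NONNUMERIC = 2
```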
  • Juan Luis, 02/20/2023, 10:36 AM
    I'm trying to run kedro new in non-interactive ways so it's compatible with Jupyter shell commands (!kedro new ...). I see two ways:
    • yes "Project Name" | kedro new --starter=xxx: works, but it's UNIX-only (I don't think this will work on Windows), assumes there is only one question, and looks a bit arcane.
    • vim kedro.yaml ... && kedro new --starter=xxx --config=kedro.yaml: works, but I'm creating a file that I will only use once, plus it's not very easy to discover what structure the file should have (one has to navigate to the source code of the starter in question, locate the prompts.yml, and mimic those keys).
    I see that this has been unchanged since basically "forever", but I'm wondering what folks' opinions are on having a way to pass these configs to the CLI, something like kedro new --starter=xxx --project_name=yyy
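For what it's worth, the config-file route is less arcane than it looks once the keys are known; a minimal config for the default starter might be (keys mirror its prompts.yml, and other starters may ask for different ones, so treat these names as illustrative):

```yaml
# kedro.yaml, used as: kedro new --starter=xxx --config=kedro.yaml
project_name: My Project
repo_name: my-project
python_package: my_project
output_dir: .
```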
  • Juan Luis, 02/20/2023, 11:53 AM
    also, a totally unrelated question: our docs say "Kedro offers a command (kedro jupyter notebook)", but actually this depends on the starter that got used. For example, projects created with standalone-datacatalog do not have it. Is this a docs issue (we should amend those to explain how to get that command working regardless of the starter used) or a starter issue (all starters should have kedro jupyter notebook)?
  • Lan Bui, 02/20/2023, 2:08 PM
    hi friends! Is there any way to load a Kedro project from a project directory? I recently lost my motherboard but salvaged the drive my project was on. When I reinstalled Kedro, the project is no longer recognized even though all the files are there.
  • Massinissa Saïdi, 02/20/2023, 5:54 PM
    Hello kedroids! Do you know how to pass a boolean parameter in the CLI? kedro run --params key:false or kedro run --params key:False returns the string 'false' or 'False'. I know I can set the parameter to 0 or '' to get the false condition, but is there a better way? thx 🙂
  • Laura Oñate, 02/21/2023, 2:25 AM
    Hi, quick question: approximately how many users are using Kedro?
  • Robertqs, 02/21/2023, 4:43 AM
    Hi guys, I’m facing a strange issue on Windows where the kernels in my JupyterLab instance keep disconnecting. It would normally work for a while after restarting JupyterLab, but the problem comes back afterwards. It doesn't seem to be a resource issue, as this happens when working on a light notebook. Wondering if anyone has encountered a similar issue? Thanks in advance.
    ✅ 1
  • Jan, 02/21/2023, 10:26 AM
    Hi! Has anyone yet created a script / function to delete old experiments systematically? If I were to create one to delete the old folders, how can I remove them from the session_store.db (sqlite)?
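Whatever the deletion criterion ends up being, a safe first step is to inspect which tables the session store actually contains before issuing any DELETE; a stdlib sketch (the table and column names in your session_store.db may differ, which is exactly why listing them first helps):

```python
import sqlite3


def list_tables(db_path: str) -> list:
    """Return the table names in a SQLite file, e.g. Kedro's session_store.db."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [name for (name,) in rows]


# Once the table and its session-id/timestamp columns are known, old runs
# can be dropped with a plain DELETE, e.g. (hypothetical names):
#   conn.execute("DELETE FROM <table> WHERE <timestamp_col> < ?", (cutoff,))
```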
  • Olivier Ho, 02/21/2023, 10:48 AM
    hello! Is there a way to autoincrement micropackage version?
  • Armen Paronikyan, 02/21/2023, 10:56 AM
    Hi guys. I would like to know if there is a way to have experiment tracking deployed on a separate server, so that several Kedro applications can send their data there. Thanks in advance.
  • Nicolas Oulianov, 02/21/2023, 7:55 PM
    Hey, is there any plan to make an interactive Kedro-Viz, where you could plug and unplug data connectors? A bit like the Blender 3D or Unreal Engine scripting system.
  • datajoely, 02/22/2023, 8:00 AM
    Sorry about that - not sure why Stackoverflow kedro tag just posted all of that.
  • Francisco Alejandro Leal Tovar, 02/22/2023, 1:34 PM
    Hello everybody! Has anybody worked with Snowpark in Kedro?
  • Solomon Yu, 02/22/2023, 2:18 PM
    Hiya, trying to figure out params for data processing pipelines. I'd like to set parameters for the catalog config so that catalog.load() can load a dataset with load_args dtype set to a dtypes_dict_var, like:
    my_dataset:
      type: pandas.CSVDataSet
      filepath: path-to-my-file.csv
      load_args:
        parse_dates: ['col_3']
        dtype: dtypes_dict_var
    so that catalog.yml won't be too many lines long. I'd like this dtype dict to live within conf/base/parameters/my_pipeline.yml, as:
    dtypes_dict_var: {
      "col_1": int,
      "col_2": str,
      "col_3": DateTime<'Y-m-d'>, # assumes YAML API syntax will be converted to a datetime object
    }
    Another question here is how to pass a datetime object type to load_args:dtype. I'd like this dtype dict to affect only loading my_dataset, and not be used as a global var if possible. A separate case could be that I'd like to load the same dataset with different dtypes in different pipelines, which could utilise TemplatedConfigLoader. Passing in certain parameters doesn't seem very straightforward tbh :/ Thanks in advance!
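On the "shared dtype dict templated into the catalog" part, Kedro 0.18's TemplatedConfigLoader can substitute values from a globals file (set CONFIG_LOADER_CLASS = TemplatedConfigLoader and CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"} in settings.py). A sketch with illustrative file and key names; note that pandas expects datetime columns via parse_dates rather than dtype, so col_3 stays where it is:

```yaml
# conf/base/globals.yml
dtypes_dict_var:
  col_1: int64
  col_2: object

# conf/base/catalog.yml -- ${...} is substituted by TemplatedConfigLoader
my_dataset:
  type: pandas.CSVDataSet
  filepath: path-to-my-file.csv
  load_args:
    parse_dates: ['col_3']   # datetime columns go here, not in dtype
    dtype: ${dtypes_dict_var}
```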
  • Ian Whalen, 02/22/2023, 2:32 PM
    Not to necro an old thread, but does OmegaConf help with this? Specifically: defining a list of constants in settings.py and looping over it in the Jinja-esque style to define catalog entries. Couldn't immediately tell from the docs, though I haven't had much time to work with the new loader. I am excited too, of course 🙂
  • Shiv Pratap Singh, 02/22/2023, 3:04 PM
    Hi everyone, I am facing an issue while saving a pickle dataset to an on-prem S3 (s3a). Attached are the catalog entry and the error. Any ideas 🙂?