# questions
  • Filip Panovski
    02/06/2023, 3:55 PM
    Hi everyone. I'm not sure if this is the right place to ask, but does anybody have experience with using Airflow vs Prefect >= 2.0.0 to run Kedro pipelines? Currently, only Prefect 1.x is documented to work with Kedro 0.18.x, which is making us hesitate a bit on that end. We're currently evaluating both as a higher-level orchestration platform for our Kedro pipelines, and both seem great for generic workflows, so some community feedback would be much appreciated.
  • Zoran
    02/06/2023, 5:26 PM
    Hi everyone, is it possible to change global parameters (conf/<env>/globals.yml) dynamically at runtime, the way --params does for regular parameters?
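    A minimal sketch of one workaround, assuming Kedro 0.18.x with TemplatedConfigLoader: extra global values can be injected at startup (for example from an environment variable) via globals_dict in settings.py. The KEDRO_GLOBALS_OVERRIDES variable name below is purely hypothetical, not a built-in Kedro feature.

    # settings.py -- a sketch; values from a (hypothetical) environment variable
    # are merged with whatever conf/<env>/globals.yml provides.
    import json
    import os

    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        "globals_pattern": "*globals.yml",
        # e.g. KEDRO_GLOBALS_OVERRIDES='{"bucket": "my-dev-bucket"}'
        "globals_dict": json.loads(os.environ.get("KEDRO_GLOBALS_OVERRIDES", "{}")),
    }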
  • MarioFeynman
    02/07/2023, 3:00 AM
    Hi everyone! If I would like to use Delta tables for update, delete or merge operations, should I do that inside the node? Or is there something I can use for this goal using only catalog entries?
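    For reference, a minimal sketch of the in-node approach, assuming a Spark session and the delta-spark package are available; the table path and join key below are placeholders, not anything from the original question.

    # Node that merges new records into an existing Delta table.
    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame, SparkSession


    def upsert_customers(new_data: DataFrame) -> None:
        spark = SparkSession.builder.getOrCreate()
        target = DeltaTable.forPath(spark, "/data/03_primary/customers_delta")  # placeholder path
        (
            target.alias("t")
            .merge(new_data.alias("s"), "t.customer_id = s.customer_id")  # placeholder key
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )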
  • JOEL WILSON
    02/07/2023, 7:15 AM
    Hi everyone! This might not be a pure Kedro issue, but I'm looking for some input around the Kedro SparkDataSet save method. I'm getting this error on running a Kedro pipeline; I think it has to do with the dependencies / environment variables. Let me know your thoughts. Windows machine, Python 3.7, pyarrow==0.14.0
    java version "1.8.0_341"
    Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
          /_/
    
    Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_341
    Branch HEAD
    Compiled by user yumwang on 2022-10-15T09:47:01Z
    Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
    Url https://github.com/apache/spark
  • Vassilis Kalofolias
    02/07/2023, 3:15 PM
    Hello, thanks for a great framework! After having set up my pipelines I am trying to develop new features in Jupyter (and create new pipelines from there). In that case, from what I understand, it is more convenient to run pipelines manually using SequentialRunner instead of using the Jupyter session. For example, I would like to run the same pipeline in a loop with different partitions of a PartitionedDataSet, and I find it weird to call %reload_ext kedro.ipython in a loop. Is this discouraged practice? What is the benefit of having a session in Jupyter if you develop interactively? (related but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329) Thanks a lot!
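    A minimal sketch of running a pipeline repeatedly from a notebook with SequentialRunner, assuming Kedro 0.18.x; the pipeline name, dataset names and free-input name are hypothetical.

    # Run the same pipeline once per partition, swapping the input in memory each time.
    from kedro.framework.project import pipelines
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.io import MemoryDataSet
    from kedro.runner import SequentialRunner

    bootstrap_project(".")  # project root

    with KedroSession.create(project_path=".") as session:
        context = session.load_context()
        catalog = context.catalog
        partitions = catalog.load("my_partitioned_data")  # dict: partition id -> load callable

        for partition_id, load_partition in partitions.items():
            catalog.add("current_partition", MemoryDataSet(load_partition()), replace=True)
            SequentialRunner().run(pipelines["my_pipeline"], catalog)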
  • Lawrence Shaban
    02/07/2023, 7:32 PM
    Hello everyone, I am having a little problem with the logger. I thought I could print out logger.debug values by updating the project-side logging file (conf/base/logging.yml), setting the console handler level to DEBUG, but it doesn't seem to change anything. I'm trying to output logs from pipeline nodes.
    handlers:
        console:
            class: logging.StreamHandler
            level: DEBUG
            formatter: simple
            stream: ext://sys.stdout
    import logging
    logger = logging.getLogger(__name__)

    def example_node(input):
        logger.debug(input)
        output = input + 1
        return output
    I might just be doing something simple wrong, but any help would be appreciated! It works for info, so I'm just using that for now, but it would be good to have the option of debug! 🙂
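    One likely cause, hedged since it depends on the full logging.yml: the handler level is only a floor, and the logger that emits the record also filters by level. In the default Kedro 0.18.x config the project package logger is set to INFO, so DEBUG records are dropped before they ever reach the console handler. A sketch of the extra section, assuming the package is called my_project (a placeholder):

    loggers:
        my_project:          # hypothetical package name; match your src/<package> folder
            level: DEBUG     # let DEBUG records through to the handlers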
  • Dustin
    02/08/2023, 12:58 AM
    hi team, just a quick question. Let's say I have output O1 from node1, with the associated catalog entry configured so that the content of O1 is saved to CSV. node2 uses O1 as input. The current behaviour is that node2 reloads the data from the O1 file instead of from memory (this is expected, I assume, due to the catalog configuration). Is there any way I could still have O1 saved as CSV (easier for business people to check data quality) while having O1 passed to node2 through memory (faster, and no need to deal with CSV save/load tricks)? Thanks
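    A minimal sketch of one way to get both behaviours, assuming Kedro 0.18.x's CachedDataSet (the dataset name and filepath are placeholders): the wrapped CSV is written on save, but the in-memory copy is reused for subsequent loads within the same run.

    o1:
      type: CachedDataSet
      dataset:
        type: pandas.CSVDataSet
        filepath: data/02_intermediate/o1.csv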
  • Afaque Ahmad
    02/08/2023, 4:46 AM
    Hi folks, I'm trying to run kedro on EMR. The run fails because it is not able to find the conf folder. Is there a way to package the conf folder together when doing kedro package?
  • user
    02/08/2023, 8:28 AM
    kedro dynamic catalog creation only for specific nodes before their run: I have several thousands of files of different types to be processed. I am using dynamic catalog creation with hooks. I first used the after_catalog_created hook, but it is too early and I need those entries only for specific nodes. My attempt is with before_node_run for specific node tags, returning a dictionary with just the dynamically created entries. The node function is **kwargs only. It works, as I can see that the node gets the updated inputs, but the problem is that I need to provide for the node...
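    For reference, a minimal sketch of the pattern being described, assuming Kedro 0.18.x hook specs; the tag, dataset name and file path below are hypothetical. before_node_run may return a dict that overwrites node inputs for that run.

    # Overwrite selected inputs on the fly for nodes carrying a given tag only.
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet


    class DynamicInputsHook:
        @hook_impl
        def before_node_run(self, node, catalog, inputs):
            if "bulk_files" not in node.tags:  # hypothetical tag
                return None
            # Keys must match the node's declared input names; values replace
            # whatever the catalog would have loaded. The path is a placeholder.
            return {
                "raw_files": CSVDataSet(filepath="data/01_raw/batch_0.csv").load(),
            }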
  • David Pérez
    02/08/2023, 10:41 AM
    Hi team, quick question: when doing kedro viz, if we select our main pipeline and expand it, all the modular pipelines appear nested. However, when we collapse it, one of the pipelines is no longer within the main pipeline; it appears isolated on the side. Do you know why this might be happening?
  • Szymon Czop
    02/08/2023, 10:52 AM
    Hi guys, I'm having a problem following the "set up experiment tracking" tutorial for Kedro-Viz. I added the entries to catalog.yml and changed the code in data_science/node.py, but after running kedro run there is no data stored in the 09_tracking folder. Visualisation of nodes and the pipeline is working; everything is set up, but no data is stored. Am I missing something? Some extra package? Please let me know. With regards, Szymon
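    For comparison, a hedged sketch of the two pieces the tutorial relies on, assuming Kedro 0.18.x with kedro-viz installed (the dataset name, metric and paths are placeholders): the tracking dataset must be declared as a node output for anything to land in 09_tracking, and the session store in settings.py must be kedro-viz's SQLiteStore for the experiment tracking UI.

    # catalog.yml
    metrics:
      type: tracking.MetricsDataSet
      filepath: data/09_tracking/metrics.json

    # settings.py
    from pathlib import Path
    from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

    SESSION_STORE_CLASS = SQLiteStore
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}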
  • Massinissa Saïdi
    02/08/2023, 1:59 PM
    Hello kedroids! I have a question about the read priority of credentials files. Suppose I have a conf/base and a conf/prod environment, and my credentials.yml file is in conf/local. If I run kedro run, will conf/local/credentials.yml overwrite conf/base/credentials.yml? And if I run kedro run --env prod, which credentials file will be used? I have the impression that it is the local file that is always used? Thank you
  • Oscar Villa
    02/08/2023, 9:42 PM
    Hi, guys. Maybe somebody knows what the pattern is when you have very long queries? I'm getting data from BigQuery through pandas.GBQQueryDataSet, but the queries are so long that they make catalog.yml look dirty. Is that the right way, or should I store the queries in files and load them from there, or store the queries as views in BigQuery? What do you do? Any suggestion is appreciated. Thanks in advance.
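    A minimal sketch of the file-based option, assuming Kedro 0.18.x's pandas.GBQQueryDataSet, which accepts a filepath to a .sql file instead of an inline sql string; the entry name, path and project are placeholders.

    long_query_data:
      type: pandas.GBQQueryDataSet
      filepath: queries/long_query.sql   # the SQL lives in a file instead of the catalog
      project: my-gcp-project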
  • Ankar Yadav
    02/09/2023, 12:18 PM
    Hi team, I am trying to run kedro on Windows, and when I start my pipeline I get the following error:
    keyerror: "logging"
    I immediately get this message as soon as I run the pipeline, any idea why this is happening?
  • user
    02/09/2023, 2:18 PM
    Parametrize input datasets in kedro: I'm trying to move my project into a kedro pipeline, but I'm struggling with the following step: my prediction pipeline is run by a scheduler. The scheduler supplies all the necessary parameters (dates, country codes etc.). Up until now I had a CLI which would take input parameters such as: python predict --date 2022-01-03 --country UK. The code would then read the input dataset for the given date and country, so the query would be something like: SELECT * FROM...
  • Jorge sendino
    02/09/2023, 5:15 PM
    Hello everyone, is there a way to modify ConfigLoader to namespace catalog and parameter entries using the folder structure inside conf? For example, I have:
    conf/
        catalog/
           ns1/
           ns2/
        parameters/
           ns1/
           ns2/
    Ideally I would modify ConfigLoader to automatically add ns1 and ns2 as namespaces for all entries in the catalog and parameters below that folder. Is this possible?
  • Sebastian Pehle
    02/10/2023, 12:06 AM
    Let's say I have the following: a source (a CSV REST API with time series data and a 'duration to pull' parameter) and a task (weekly preparation of a dataset of historic and recent data, to be used by a BI tool for visualization). What would be the kedroic way to implement this? My guess: define a 'first run / update run' parameter in conf/parameters.yml. If it is a first run, pull all the data there is (duration to pull in last weeks = nan) and save it as a partitioned dataset into 01_raw (yearweek as partition key). If it is an update run, determine the number of weeks to pull by checking what has already been downloaded (the difference between begin (= most recent yearweek folder name in the partitioned dataset) and end (= current yearweek)) and save into the same partitioned dataset (in fact I guess it would happen inside the same node as the 'first run', the only difference being the computed 'duration to pull' parameter). In another node, the report dataset would be prepared (concat all data, save as a multi-sheet xlsx) and saved into 08_reporting. Any advice is appreciated!
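    One thing worth a look, hedged since it may not cover the REST-pull part: Kedro's IncrementalDataSet wraps a partitioned dataset and keeps a checkpoint of the last processed partition, which handles the "what has already been downloaded" bookkeeping. A sketch of a catalog entry, with placeholder names and paths:

    weekly_pulls:
      type: IncrementalDataSet
      path: data/01_raw/weekly_pulls
      dataset: pandas.CSVDataSet
      filename_suffix: ".csv"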
  • Andrew Stewart
    02/10/2023, 5:10 AM
    Anyone happen to get poetry + kedro + jupyter to work in VSCode's notebook UI?
  • Wojciech Szenic
    02/10/2023, 6:37 AM
    Hey guys! I'm trying to avoid doing some ugly, non-kedronic solutions, so perhaps you could help me with my problem. I would like kedro to take in command line arguments such as date or country and then do processing based on these arguments. So for example, I have a trained machine learning model, and a predict pipeline can output predictions. Ideally, this predict pipeline could be run as kedro run --pipeline=predict --date=2023-01-05, and this would ingest the dataset for the 5th of Jan 2023 and run the prediction on it. I'm wondering how I can pass the CLI argument into the dataset catalog?
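    A minimal sketch of one common pattern, assuming Kedro 0.18.x: pass the value through the existing kedro run --params option and swap the input dataset in a before_pipeline_run hook registered in settings.py. The dataset name, file path and parameter key below are hypothetical.

    # Invoked as: kedro run --pipeline=predict --params "date:2023-01-05"
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet


    class RuntimeDateHook:
        @hook_impl
        def before_pipeline_run(self, run_params, pipeline, catalog):
            date = (run_params.get("extra_params") or {}).get("date")
            if date:
                catalog.add(
                    "predict_input",  # the name the predict pipeline already reads
                    CSVDataSet(filepath=f"data/01_raw/input_{date}.csv"),
                    replace=True,
                )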
  • Jong Hyeok Lee
    02/10/2023, 9:32 AM
    Hello everyone! Does anyone know how to pass a list of dataframes as an input to a pipeline node in Kedro? I have a function that takes in a list of dataframes, but it doesn't seem straightforward to implement.
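    A minimal sketch of the usual workaround, assuming each dataframe is its own catalog entry (the names below are hypothetical): list the entries as separate node inputs and collect them into a list inside a small wrapper.

    import pandas as pd
    from kedro.pipeline import node


    def combine(*dfs: pd.DataFrame) -> pd.DataFrame:
        # stand-in for the existing function that expects a list of dataframes
        return pd.concat(list(dfs))


    combine_node = node(
        func=combine,
        inputs=["sales_q1", "sales_q2", "sales_q3"],  # hypothetical catalog entries
        outputs="sales_all",
    )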
  • Sergei Benkovich
    02/12/2023, 8:35 PM
    Is there a way to save a kedro project template? There are the usual things I change and add when setting up a new project. Is it possible to save this as a template and, instead of kedro new, do something like kedro load?
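    Possibly relevant, hedged: Kedro supports custom starters, which are cookiecutter templates that kedro new can pull from a local path or a git repository, so a pre-customised project skeleton can be reused. The repository URL below is hypothetical.

    kedro new --starter=https://github.com/my-org/my-kedro-starter.git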
  • Olivia Lihn
    02/13/2023, 1:49 PM
    Hi everyone! I'm trying to create a hook to overwrite some parameters when the scoring pipeline runs, but it does not seem to be working (the parameters don't get written if not present, nor overwritten if present). The code I'm using is the following:
    def before_pipeline_run(self, run_params, catalog: DataCatalog) -> None:
        """Change feature inclusion parameters for the scoring pipeline."""
        if run_params["pipeline_name"] == "scoring":
            # retrieve feature_list from catalog
            feature_list_df = catalog.load("modeling.feature_selection_report")
            feature_list = list(feature_list_df[feature_list_df.selected == True].feature.unique())

            # get list of feature engineering pipelines
            params = catalog.load("parameters")
            feateng_pipes = [fteng_name for fteng_name in params.keys() if fteng_name.endswith("_fteng")]

            # overwrite parameters
            for pipeline in feateng_pipes:
                catalog.add_all(
                    {
                        f"params:{pipeline}.feature_inclusion_params.feature_list": feature_list,
                        f"params:{pipeline}.feature_inclusion_params.enable_regex": True,
                    },
                    replace=True,
                )
    I also tried using run_params["params"] without any luck, and tried returning the catalog, but no luck. The hook runs (tested with print statements), so my guess is I'm missing something. Thanks!
  • Rob
    02/13/2023, 4:52 PM
    Hi everyone, is there a way to dynamically set the name of an output without manually defining the same outputs with variations in the catalog? Context: I have a pipeline that saves 15 different outputs that are defined in my catalog, but now I need to save each one of them by category, as {category}_output_1.parquet, {category}_output_2.parquet and so on... Any alternative suggestion is welcome 🙂
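    A minimal sketch of one alternative, assuming Kedro 0.18.x: replace each catalog output with a PartitionedDataSet and have the node return a dict keyed by the desired file name, so the categories become partitions rather than separate catalog entries. The entry name and path are placeholders.

    output_1_by_category:
      type: PartitionedDataSet
      path: data/07_model_output/output_1
      dataset: pandas.ParquetDataSet
      filename_suffix: ".parquet"

    The node would then return something like {"retail_output_1": df_retail, "wholesale_output_1": df_wholesale}, producing retail_output_1.parquet and wholesale_output_1.parquet under that path.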
  • Akshay
    02/14/2023, 5:03 AM
    Hello everyone, I am seeing an issue with a PartitionedDataSet not being found in a Kedro pipeline when running on an Azure Databricks notebook. It throws the error: DataSetError: No partitions found in '/mnt/testmount/data/05_model_input/partitions'. ADLS has been mounted to /mnt/testmount/ and the partitions are getting created at /mnt/testmount/data/05_model_input/partitions. Details: I am running Kedro pipelines on an Azure Databricks notebook. There are 4 pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformation and write the data back to ADLS. The third pipeline, 'optimize', has a Spark dataset as input and generates 2 outputs: a PartitionedDataSet and a transformed pandas DataFrame.
    Optimize.partition@spark:
      type: kedro.io.PartitionedDataSet
      dataset:
        <<: *spark_parquet_partitioned
      load_args:
        maxdepth: 1
        withdirs: True
      layer: Data Transformation
      path: /mnt/testmount/data/05_model_input/partitions

    model_input@pandas:
      type: kedro.io.PartitionedDataSet
      dataset:
        <<: *pandas_parquet_partitioned
      load_args:
        maxdepth: 1
        withdirs: True
      layer: Data Transformation
      path: /mnt/testmount/data/05_model_input/model_data
    Note: the pipeline works fine when run in the local environment. Kedro = 0.18.3, Python = 3.8.10, Cluster = Spark 3.2.1
  • Filip Wójcik
    02/14/2023, 9:40 AM
    Hello all! I'm wrapping my head around the following problem/use case, so far with no luck. Imagine you have a data pipeline where you run, e.g., a web scraper every day, so it saves some amount of data (a couple of hundred records, so no big data case) every day. Can we configure a dataset so that we can append to it? I was trying with pandas.CSVDataSet with save_args: mode: "a" and with PartitionedDataSet, but every time the dataset is overwritten. I cannot find any such case in the docs. Should I create my own implementation, deriving from AbstractDataSet? I've heard from many fellow DS Kedro users that a similar use case happens from time to time, so probably I'm not alone. Thanks in advance, and best regards, Kedro is an absolute blast!
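    A minimal sketch of one approach, assuming Kedro 0.18.x: keep the PartitionedDataSet, but have the scraper node return a single new partition keyed by the run date. With the default overwrite: False save behaviour, existing partitions are left in place, so the folder grows by one file per day. Names and paths are placeholders.

    # nodes.py -- the scraper returns {partition_id: dataframe}
    from datetime import date
    import pandas as pd


    def scrape_daily() -> dict:
        records = pd.DataFrame({"value": [1, 2, 3]})  # stand-in for scraped records
        return {date.today().isoformat(): records}

    # catalog.yml
    scraped_data:
      type: PartitionedDataSet
      path: data/01_raw/scraped
      dataset: pandas.CSVDataSet
      filename_suffix: ".csv"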
  • Filip Panovski
    02/14/2023, 12:58 PM
    Can anyone explain to me why Kedro attempts to load all catalog definitions, even if running only a specific pipeline that uses a subset of the catalog? For example, let's say I have a catalog with input, output and wrong entries. wrong has a configuration problem (e.g. no credentials could be found), but I'm running a pipeline mypipeline which only uses input and output. Why does kedro run --pipeline mypipeline fail if wrong is configured improperly in this case? I get that you usually want to be able to view the entire catalog, but is --pipeline <...> not enough information to let Kedro know that I potentially don't want that?
  • Zirui Xu
    02/14/2023, 3:01 PM
    Why is setuptools a Kedro dependency? It gets ignored when I pip-compile a requirements.in that contains kedro, because setuptools is "considered to be unsafe in a requirements file".
  • FlorianGD
    02/14/2023, 4:32 PM
    Hello, is there a reason why pandas.ParquetDataSet does not use pandas all the time? I would like to use it for partitioned data, and I want to use the filters argument that pandas.read_parquet provides, but it is not available for pyarrow.parquet.ParquetDataset.read. Doing a quick test and using pd.read_parquet every time seems to work ok, even though it does not behave exactly the same.