# questions
  • Mohammed Samir
    01/29/2023, 11:06 AM
    Hello, I have a quick question about pipeline and node ordering. I have created 5 pipelines, each with its own nodes. Whenever I run the full environment with
    `kedro run --env env_name`
    the nodes from different pipelines are interleaved in the running order, meaning it runs as below:
    pipeline 1 --> Node 1
    pipeline 2 --> Node 1
    pipeline 2 --> Node 2
    pipeline 3 --> Node 1
    pipeline 1 --> Node 2
    pipeline 3 --> Node 2
    (Note: the node order within each pipeline is correct, but Kedro runs a node from each pipeline in turn.) However, I want them to run in the order below:
    pipeline 1 --> Node 1
    pipeline 1 --> Node 2
    pipeline 2 --> Node 1
    pipeline 2 --> Node 2
    pipeline 3 --> Node 1
    pipeline 3 --> Node 2
    I have the following config in pipeline_registry:
    `return {"__default__": pipeline1 + pipeline2 + pipeline3 + pipeline4 + pipeline5}`
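    For context: Kedro schedules nodes by their dataset dependencies, and summing pipelines only merges their node sets, so the interleaving above is expected. A minimal sketch of one way to force the sequence, assuming hypothetical marker dataset names, is to give each pipeline's first node an input produced by the previous pipeline:
    ```python
    # A sketch: "p1_done" is a hypothetical marker dataset used purely
    # to create a dependency between the two pipelines.
    from kedro.pipeline import Pipeline, node

    def finish_p1():
        return True  # marker emitted by pipeline 1's last node

    def start_p2(p1_done):
        ...  # real work; the input exists only to enforce ordering

    pipeline1 = Pipeline([node(finish_p1, inputs=None, outputs="p1_done", name="p1_last")])
    pipeline2 = Pipeline([node(start_p2, inputs="p1_done", outputs=None, name="p2_first")])
    ```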
  • Rob
    01/29/2023, 6:21 PM
    Hi again everyone! When I set a `spark.yml` file in the configuration folder, to run the code from a `databricks cluster` (using a workflow job, so my `run.py` is in the DBFS), is it required to specify the Spark master URL? Or is there an alternative, such as omitting the `spark.yml`, to let Databricks manage my configuration? (I mean, to omit the manual setting of the master URL.) Thanks in advance!
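    For reference, the SparkHooks example in the Kedro docs builds the session with SparkSession.builder...getOrCreate(), which on Databricks attaches to the cluster's already-running session, so the master URL can simply be left out of spark.yml. A minimal sketch (the settings shown are illustrative, not required):
    ```yaml
    # conf/base/spark.yml -- no spark.master entry; on Databricks
    # getOrCreate() picks up the cluster's existing session and master
    spark.sql.shuffle.partitions: 200
    spark.sql.execution.arrow.pyspark.enabled: true
    ```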
  • Sergei Benkovich
    01/29/2023, 8:01 PM
    Is there any integration with Weights & Biases? Any ideas on how I can run several runs with varying configuration, where each one would be logged by W&B?
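    There is no official plugin referenced in this thread, but a run-level hook is one plausible integration point. A sketch, assuming the wandb client library and a hypothetical project name (register the class in settings.py via HOOKS):
    ```python
    import wandb
    from kedro.framework.hooks import hook_impl

    class WandbHooks:
        @hook_impl
        def before_pipeline_run(self, run_params):
            # run_params carries the CLI arguments, incl. --params overrides
            wandb.init(project="my-kedro-project", config=run_params.get("extra_params") or {})

        @hook_impl
        def after_pipeline_run(self, run_params):
            wandb.finish()
    ```
    Several runs with varying configuration would then be separate `kedro run --params ...` invocations, each logged as its own W&B run.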
  • Antoine Bon
    01/30/2023, 9:00 AM
    Hi, I've been trying to use the `load_version` functionality with a catalog that is built programmatically with a hook, but I fail to do so. From my understanding of the code this is not possible, so I raised the following ticket: https://github.com/kedro-org/kedro/issues/2233. Unless someone knows of a way to do it?
  • Massinissa Saïdi
    01/30/2023, 4:17 PM
    Hello! Has anyone tried kedro + SageMaker + a custom Docker image? Looking closer, I have the impression that it's quite difficult to achieve given the way SageMaker is run, and someone has already faced this problem without an answer. If anyone has any tips I'd love to hear them, thanks 🙂
  • Massinissa Saïdi
    01/30/2023, 5:34 PM
    Another question, sorry: is it possible to get the parameters updated with `--params` in code, via `KedroSession`? I have something like this:
    ```python
    def get_session() -> Optional[MyKedroSession]:
        bootstrap_project(Path.cwd())
        try:
            session = MyKedroSession.create()
        except RuntimeError as exc:
            _log.info(f"Session doesn't exist, creating a new one. Raise: {exc}")
            package_name = str(Path(__file__).resolve().parent.name)
            session = MyKedroSession.create(package_name)
        return session
    
    
    def get_parameters():
        context = get_session().load_context()
        return context.params
    ```
    But `get_parameters` gives the parameters set in the YAML, not the ones updated with `--params`. Thanks!
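    A hedged aside on why this happens: `--params` only reaches the session the CLI creates; a session built in your own code sees nothing of it. `KedroSession.create` accepts an `extra_params` dict that `context.params` merges over the YAML values ("learning_rate" below is a hypothetical key):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())
    with KedroSession.create(extra_params={"learning_rate": 0.01}) as session:
        params = session.load_context().params  # YAML values + the override
    ```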
  • Andrew Stewart
    01/30/2023, 9:59 PM
    What's the use-case difference between
    ```python
    ## from https://kedro.readthedocs.io/en/stable/kedro_project_setup/session.html
    
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from pathlib import Path
    
    bootstrap_project(Path.cwd())
    with KedroSession.create() as session:
        session.run()
    ```
    vs
    ```python
    ## from https://kedro.readthedocs.io/en/stable/tutorial/package_a_project.html
    
    from kedro_tutorial.__main__ import main
    
    main(
        ["--pipeline", "__default__"]
    )  # or simply main() if you don't want to provide any arguments
    ```
  • Alexandra Lorenzo
    01/31/2023, 4:48 PM
    Hello! First, thanks a lot for creating such a community. I'm trying to connect my PartitionedDataSet to my S3 bucket, and I get the following error:
    "create_client() got multiple values for keyword argument 'aws_access_key_id'."
    credentials.yml:
    ```yaml
    dev_s3:
      client_kwargs:
        aws_access_key_id: AWS_ACCESS_KEY_ID
        aws_secret_access_key: AWS_SECRET_ACCESS_KEY
    ```
    catalog.yml:
    ```yaml
    raw_images:
      type: PartitionedDataSet
      dataset:
        type: flair_one.extras.datasets.satellite_image.SatelliteImageDataSet
      credentials: dev_s3
      path: s3://ignchallenge/train
      filename_suffix: .tif
      layer: raw
    ```
    kedro == 0.17.7, s3fs == 0.4.2. Does anyone have an idea? Thanks in advance!
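    For context, the error suggests the access key reaches botocore's create_client() twice. One layout that avoids the nested client_kwargs entirely is s3fs's top-level key/secret parameters; a sketch with placeholder values, untested against this exact setup:
    ```yaml
    # credentials.yml
    dev_s3:
      key: AWS_ACCESS_KEY_ID
      secret: AWS_SECRET_ACCESS_KEY
    ```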
  • João Areias
    01/31/2023, 5:01 PM
    Hi all, I guess I'm a little late to the party, but why is `kedro jupyter convert` being deprecated? And will there be an easy way of turning notebooks into nodes and pipelines following this decision, in Kedro 0.19?
  • Elias
    01/31/2023, 5:54 PM
    What would be the smartest way to query, through the catalog, only data from a database that is newer than 5 years (counting from today or a set end date)?
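    One catalog-level option is pandas.SQLQueryDataSet, which runs an arbitrary query at load time, so the date filter can live in the SQL itself. A sketch assuming a Postgres-style database; the table, column, and credential names are hypothetical:
    ```yaml
    recent_orders:
      type: pandas.SQLQueryDataSet
      sql: >
        SELECT * FROM orders
        WHERE order_date >= CURRENT_DATE - INTERVAL '5 years'
      credentials: db_credentials  # must supply the `con` connection string
    ```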
  • Olivia Lihn
    01/31/2023, 7:28 PM
    Hi everyone! Is there any way of taking a column from a CSV dataset as a parameter for another pipeline? We have a CSV file with features that need to be created, and we need to pass these features as a list to another node as a parameter. Any ideas?
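    One way that needs no parameter machinery at all: a small node can turn the CSV column into a plain list, which then flows to the downstream node as an ordinary in-memory dataset. A sketch with hypothetical dataset and column names:
    ```python
    import pandas as pd
    from kedro.pipeline import Pipeline, node

    def extract_feature_list(features_csv: pd.DataFrame) -> list:
        # "feature_name" is a hypothetical column listing the features to build
        return features_csv["feature_name"].tolist()

    def build_features(data: pd.DataFrame, feature_list: list) -> pd.DataFrame:
        return data[feature_list]

    feature_pipeline = Pipeline([
        node(extract_feature_list, inputs="features_csv", outputs="feature_list"),
        node(build_features, inputs=["model_input", "feature_list"], outputs="model_features"),
    ])
    ```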
  • Andrew Stewart
    02/01/2023, 1:35 AM
    I have a Kedro project where I want to use PySpark when running in a cloud/production environment, but for experimentation in a local environment I don't necessarily want to bother with standing up an entire Spark env. Looking for strategy advice. Solution areas as I see them so far: • somehow make SparkHooks conditional on the environment? • a really, really simple Spark setup (e.g. via Docker or something; I don't want to install Java natively)
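    On the first bullet, a sketch of conditioning the hook on the environment, assuming the SparkHooks pattern from the Kedro docs and that `context.env` carries the `--env` value (worth verifying on your Kedro version):
    ```python
    from kedro.framework.hooks import hook_impl
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    class ConditionalSparkHooks:
        @hook_impl
        def after_context_created(self, context):
            # skip Spark entirely for local experimentation
            if context.env in (None, "local"):
                return
            parameters = context.config_loader.get("spark*", "spark*/**")
            spark_conf = SparkConf().setAll(parameters.items())
            SparkSession.builder.config(conf=spark_conf).getOrCreate()
    ```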
  • Sebastian Cardona Lozano
    02/01/2023, 4:43 AM
    Hi everyone! My team started to use Kedro recently for data science projects; we have found many advantages and are very happy with it. Now we are facing some challenges regarding the implementation of the models on Google Cloud and Vertex AI. I would really appreciate your opinion on these points:
    1. We want to apply the data transformation steps (e.g. one-hot encoding, standardization, missing-value imputation) to the new data when the model is used for prediction. We know that we can do that with scikit-learn pipelines, but there are many disadvantages, which were discussed in this thread. There, some of you recommended the `kedro-mlflow` plugin to achieve what we want. Here are the questions: once you have the mlflow artifact, can we still use the kedro-docker plugin to create the image, or do we have to create the Docker image from scratch? On the other hand, can we still use the other plugins to export the pipeline to Airflow or Vertex Pipelines?
    2. On that basis, we have started to question whether it is better to use mlflow for tracking and model registry, taking advantage of the Kedro plugins, rather than the Vertex AI APIs. I would like to know your opinion or recommendations about how to combine both worlds. Thanks in advance. #C03RKP2LW64 #C03RKPCLYGY
  • Anirudh Dahiya
    02/01/2023, 1:14 PM
    Hi all! I have a Kedro project that is initiated with a PySpark session. To date, I never had any issues running pipelines or opening a Jupyter notebook from my project's directory. However, today I am facing this error:
    ```
    Exception: Java gateway process exited before sending its port number
    ```
    Has anyone faced this error before?
  • Massinissa Saïdi
    02/02/2023, 9:59 AM
    Hello kedroids! Is it possible to get the name of the running `tag` in code (`kedro run --tag NAME`)?
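    Not from the thread, but run-level hooks receive the CLI arguments, so a sketch like this (the hook must be registered in settings.py) can read the tags:
    ```python
    from kedro.framework.hooks import hook_impl

    class TagLoggingHooks:
        @hook_impl
        def before_pipeline_run(self, run_params):
            tags = run_params.get("tags")  # whatever was passed via --tag
            print(f"Running with tags: {tags}")
    ```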
  • Larissa Siqueira
    02/02/2023, 2:28 PM
    Hello everyone! Is it possible to access parameters other than through the node inputs? Our goal is to format the variable names and have them change depending on the global params passed to `kedro run`.
  • Artur Dobrogowski
    02/02/2023, 3:58 PM
    Hi, I'm getting to know Kedro hooks. I want my hook to run only for a specific pipeline. What should the approach be here? Should I detect which pipeline is running in settings.py and register the hook only if the pipeline is correct? Or can I somehow check which pipeline is being run in the hook itself? I don't see how to do it from the given hook parameters: https://kedro.readthedocs.io/en/latest/kedro.framework.hooks.specs.DataCatalogSpecs.html#kedro.framework.hooks.specs.DataCatalogSpecs
  • datajoely
    02/02/2023, 3:58 PM
    Which kind of hook do you want to run?
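    For run-level hooks there is a direct route: `before_pipeline_run` receives `run_params`, whose "pipeline_name" entry holds the `--pipeline` argument, so the hook can guard on it. A sketch ("my_pipeline" is hypothetical):
    ```python
    from kedro.framework.hooks import hook_impl

    class PipelineScopedHooks:
        @hook_impl
        def before_pipeline_run(self, run_params, pipeline, catalog):
            if run_params.get("pipeline_name") != "my_pipeline":
                return  # stay inert for every other pipeline
            ...  # pipeline-specific setup goes here
    ```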
  • Filip Panovski
    02/02/2023, 5:01 PM
    Hello everyone. I have a question with regard to environments, since I'm seemingly misunderstanding them. I searched a bit in this channel, but (unless I grossly misread something) didn't see this specific question. I have a `dask.yml` in my `conf/base` which contains the following (the real config is much larger, but this gets the point across):
    ```yaml
    dask_cloudprovider:
      region: eu-central-1
      instance_type: t3.xlarge
      n_workers: 36
    ```
    And a `dask.yml` in another environment, e.g. `conf/low`, with the following:
    ```yaml
    dask_cloudprovider:
      instance_type: t3.small
      n_workers: 8
    ```
    Which I activate using `kedro run --env=low`. Now, I would have expected the `config_loader` (`TemplatedConfigLoader`) to contain something like `{'dask_cloudprovider': {'region': 'eu-central-1', 'instance_type': 't3.small', 'n_workers': 8}}`. However, it overrides the entire entry, resulting in the `config_loader` containing `{'dask_cloudprovider': {'instance_type': 't3.small', 'n_workers': 8}}`. Is there any way to get what I was expecting out of the box? I don't really want to copy my entire configuration N times for each environment, especially since only a few of the keys change. Is the intended use case for environments different from what I'm trying to use them for (say, only for top-level entries)?
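    For reference, the environment mechanism replaces duplicated top-level keys rather than merging recursively, which matches what is observed above. The expected soft merge corresponds to a deep merge like the sketch below; applying it automatically would need a custom config loader, so this is only an illustration:
    ```python
    def deep_merge(base: dict, override: dict) -> dict:
        """Recursively merge `override` into `base` without mutating either."""
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    base = {"dask_cloudprovider": {"region": "eu-central-1", "instance_type": "t3.xlarge", "n_workers": 36}}
    low = {"dask_cloudprovider": {"instance_type": "t3.small", "n_workers": 8}}
    assert deep_merge(base, low)["dask_cloudprovider"] == {
        "region": "eu-central-1", "instance_type": "t3.small", "n_workers": 8,
    }
    ```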
  • WEN XIN (Jessie 文馨)
    02/03/2023, 4:47 AM
    Hi team, is there any guide on submitting a `spark` job to `EMR` through `livy` for a `kedro` project?
  • Evžen Šírek
    02/03/2023, 10:01 AM
    Hi everyone! Is it possible to use the `fastparquet` engine with the ParquetDataSet? There is the possibility to specify the engine in the catalog entry:
    ```yaml
    dataset:
      type: pandas.ParquetDataSet
      filepath: data/dataset.parquet
      load_args:
        engine: fastparquet
      save_args:
        engine: fastparquet
    ```
    However, when I do that, I get a `DataSetError` with `I/O operation on closed file` when Kedro tries to save the dataset. When I manually save the data with `pandas` and `engine=fastparquet` (which is what Kedro should do according to the docs), it works well. Is this expected? Thanks! :))
    Environment: `python==3.10.4, pandas==1.5.1, kedro==0.18.4, fastparquet==2023.1.0`
  • Massinissa Saïdi
    02/03/2023, 10:45 AM
    Hello kedroids! Has anyone ever used the kedro-argo plugin, and what is their feedback? Is it maintained and reliable with the new versions of Kedro (given that its last update was in 2020)?
  • Veenu Yadav
    02/03/2023, 1:18 PM
    Hi team, I am getting the error `Given configuration path either does not exist or is not a valid directory: /usr/local/airflow/conf/base` while deploying a Kedro pipeline on Apache Airflow with Astronomer. Any clues?
  • Veenu Yadav
    02/03/2023, 1:20 PM
    The directory `/usr/local/airflow/conf/base` is not even present in the webserver container.
  • Sergei Benkovich
    02/03/2023, 3:29 PM
    Hey 🙂 I want to output several HTML files, JSONs, and dataframes at once, as a single report. Is there any way to create them all in a single node and save them to a single zipped file?
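    With only the standard library, a single node can assemble everything into one archive. A sketch that writes the zip directly rather than through the catalog; the path and artifact names are hypothetical:
    ```python
    import json
    import zipfile

    import pandas as pd

    def write_report(df: pd.DataFrame, metrics: dict, html: str) -> str:
        path = "data/08_reporting/report.zip"
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("data.csv", df.to_csv(index=False))
            zf.writestr("metrics.json", json.dumps(metrics))
            zf.writestr("report.html", html)
        return path
    ```
    A catalog-native alternative would be a small custom AbstractDataSet that performs the same zipping in its `_save`.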
  • Rafał Nowak
    02/05/2023, 6:54 PM
    Hello all kedro enthusiasts, I am looking for an implementation of the Kedro dataset `json.JSONDataSet` supporting gzip compression, so that the filepath could be `*.json.gz`. I haven't found such a backend in `kedro.datasets`. Has anyone already implemented such a dataset?
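    One avenue worth trying before writing a custom dataset: JSONDataSet opens its file through fsspec, which can compress transparently via the open arguments. A sketch resting on the untested assumption that the compression kwarg passes through cleanly:
    ```yaml
    my_json:
      type: json.JSONDataSet
      filepath: data/01_raw/data.json.gz
      fs_args:
        open_args_save:
          compression: gzip
        open_args_load:
          compression: gzip
    ```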
  • Sergei Benkovich
    02/05/2023, 8:05 PM
    When saving a model using PickleDataSet with the dill backend, it packages the node in which the model instance was created and run; trying to `dill.load` it then raises
    ```
    ModuleNotFoundError: No module named 'pipelines'
    ```
    Any suggestions on how to handle it?
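    For context: dill pickles custom classes by module reference, so the "pipelines" module path recorded at save time must be importable at load time. The usual workaround is to put the project's source directory on sys.path before loading; a sketch with a hypothetical path:
    ```python
    import sys

    import dill

    # make the saved "pipelines" package importable again
    sys.path.append("/path/to/project/src/my_package")
    with open("data/06_models/model.pkl", "rb") as f:
        model = dill.load(f)
    ```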
  • Ankar Yadav
    02/06/2023, 12:19 PM
    Hi team, one quick question: I am using pandas.CSVDataSet to save a file; however, when I specify `sep` in save_args, it gives me an error:
    ```yaml
    prm_customer:
      type: pandas.CSVDataSet
      filepath: ${base_path}/${folders.prm}/
      save_args:
        index: False
        sep: "|"
    ```
    Any idea how to fix this? I am using `kedro 0.18.1`.
  • Yanni
    02/06/2023, 1:59 PM
    Hi guys, I am a newbie to Kedro and have a question about my Kedro project. I would like to integrate k-fold cross-validation into it. What is the best way to implement this with Kedro? I found many train_test_split examples with Kedro on GitHub, but none of them use cross-validation; the dataset is only split once into training and test sets. What would be the best way to implement this in Kedro? Or is Kedro not useful in this case?
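    Kedro ships no cross-validation construct, but nothing stops a single node from owning the whole CV loop and emitting the scores as a dataset. A sketch with scikit-learn; the column name is hypothetical:
    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate_with_cv(model_input: pd.DataFrame) -> dict:
        X = model_input.drop(columns=["target"])  # "target" is hypothetical
        y = model_input["target"]
        scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
        return {"cv_scores": scores.tolist(), "mean_accuracy": float(scores.mean())}
    ```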
  • Debanjan Banerjee
    02/06/2023, 2:03 PM
    Team, bit of a long shot, but is the Kedro catalog available as a separate data catalog API? Something like https://intake.readthedocs.io/en/latest/catalog.html
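    The catalog can be used standalone by feeding DataCatalog.from_config the parsed YAML yourself; a minimal sketch (the entry name is hypothetical):
    ```python
    import yaml
    from kedro.io import DataCatalog

    with open("conf/base/catalog.yml") as f:
        catalog = DataCatalog.from_config(yaml.safe_load(f))

    df = catalog.load("my_dataset")
    ```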