# questions
  • Muhammad Ghazalli (05/31/2023, 3:04 AM)
    Hi, I'm trying to migrate from kedro 0.17.* to kedro 0.18.* and having difficulty with `credentials.yml`. How do I put a connection string from an environment variable into the `con` variable in the yml files? I get: `DataSetError: Could not parse SQLAlchemy URL from string '${dummy_con}'`.
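    A sketch of one way to do this on recent 0.18.x releases, assuming `OmegaConfigLoader` is set as the `CONFIG_LOADER_CLASS` in `settings.py` (the `db_credentials` and `DUMMY_CON` names below are illustrative); its `oc.env` resolver is permitted in `credentials.yml`:
    ```yaml
    # conf/local/credentials.yml (sketch): oc.env reads the environment variable at load time
    db_credentials:
      con: ${oc.env:DUMMY_CON}
    ```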
  • Lucas Hattori (05/31/2023, 1:52 PM)
    Hi Kedro community 🙂 If I have two modular pipelines and one has a dependency on the other (e.g. `pipe2` depends on an output from `pipe1`), will they always run in the correct order, i.e. `pipe1` -> `pipe2`? If they were regular pipelines I know they would, but I'm not sure about modular pipelines (though I can't imagine why they would be different). Mock code below:
    ```python
    pipe1 = modular_pipeline.pipeline(
        pipe=func,
        namespace="pipe1_ns",
        inputs={"input": "pipe1_ns.input"},
    )
    pipe2 = modular_pipeline.pipeline(
        pipe=func,
        namespace="pipe2_ns",
        inputs={"input": "pipe1_ns.output"},
    )
    ```
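    For reference, a minimal sketch assuming the dataset names line up as in the mock above: Kedro derives execution order from dataset dependencies, not from how the pipelines are combined, so summing them is enough.
    ```python
    # The node in pipe2 consumes "pipe1_ns.output", so it can only run after the
    # node in pipe1 that produces it; the runner resolves this from the dataset names.
    full_pipeline = pipe1 + pipe2

    # Pipeline.nodes is topologically sorted, which makes the order easy to inspect.
    print([n.name for n in full_pipeline.nodes])
    ```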
  • Nan Dong (05/31/2023, 3:20 PM)
    Hello everyone! Long-time Kedro user here; it has completely revolutionized how our data team writes pipelines. We are trying to choose an orchestrator: can anyone provide some insight into the pros and cons of Airflow vs Prefect? We are considering Airflow because the team has experience with it, but some have found it cumbersome to repackage the Kedro project each time the code changes (containerized with Docker). We also have experience with Prefect 1.0, using the script included in the Kedro deployment docs, but are not sure how the migration to Prefect 2.0 will work with Kedro. Thank you all in advance!
    🔥 4
  • Gabriel Bandeira (05/31/2023, 5:12 PM)
    [solved] Hi, team! I'm trying to run `kedro jupyter notebook` but it's failing with `Error: No such command 'jupyter'.` How can I make it work? Package versions are in the thread.
  • Ezekiel Day (05/31/2023, 7:49 PM)
    Hi team, I'm trying to run Kedro with `spark.SparkDataSet`s on Databricks. When running in a notebook, it looks like there is a conflict between the notebook's Spark session and the one the project is trying to create. Can someone assist with resolving this conflict?
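    A hedged sketch of one common approach, assuming the `SparkHooks` pattern from the PySpark starter in the project's own `hooks.py`: reuse the session Databricks already started instead of configuring a competing one.
    ```python
    # hooks.py (sketch): getOrCreate() attaches to the notebook's running Spark
    # session on Databricks rather than building a new one with conflicting config.
    from kedro.framework.hooks import hook_impl
    from pyspark.sql import SparkSession


    class SparkHooks:
        @hook_impl
        def after_context_created(self, context) -> None:
            self._spark_session = SparkSession.builder.getOrCreate()
    ```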
  • Manilson António Lussati (05/31/2023, 8:13 PM)
    Hello everyone, a question: how can I write pytest tests for the Hooks I created? For example, I have an mlflow hook and I want to create a `test_hook_mlflow` that validates the hook is working via pytest, for test coverage.
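    One hedged sketch: hooks are plain Python classes, so the simplest unit test constructs the hook and calls its hook method directly with fakes or mocks. The `MLflowHook` import path and its `before_pipeline_run` signature below are assumptions; adapt them to the real hook.
    ```python
    # tests/test_hooks.py (sketch): exercises the hook method directly under pytest
    from unittest import mock

    from my_project.hooks import MLflowHook  # hypothetical import path


    def test_before_pipeline_run_starts_an_mlflow_run():
        hook = MLflowHook()
        with mock.patch("mlflow.start_run") as start_run:
            hook.before_pipeline_run(run_params={"run_id": "test-run"})
            start_run.assert_called_once()
    ```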
  • Ezekiel Day (06/01/2023, 8:07 PM)
    Hi team, a question regarding a DeprecationWarning. I get a warning that `kedro.extras.datasets` will be deprecated in Kedro 0.19 and that I should install `kedro-datasets`. I have done this, but I still get the warning (even though none of my datasets depends on it). I have one custom dataset which I use in a lot of places. I believe the warning crops up because `kedro.io.core` polls through `_DEFAULT_PACKAGES` to check whether any of those classes exist and only falls back to custom datasets last (the `""` entry), so even though I'm not using `kedro.extras.datasets` it still polls it and produces the warning. It's not breaking anything, but is there anything to be done?
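    If the goal is only to quieten it until 0.19, one hedged workaround is a filter in `settings.py`; the warning category and message pattern below are assumptions, so match them to the exact warning you see.
    ```python
    # settings.py (sketch): suppress only the kedro.extras.datasets deprecation notice
    import warnings

    warnings.filterwarnings(
        "ignore",
        message=".*kedro.extras.datasets.*",
        category=DeprecationWarning,
    )
    ```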
  • Richard Purvis (06/01/2023, 8:18 PM)
    Hello, if I want to set an environment variable before running a pipeline or importing any node libraries, what would be the best way to do that? I've tried setting it via `os.environ` before the import statements in my node scripts, to no avail. Edit: I resolved this by putting the `os.environ["VAR"] = "value"` call at the top of my `settings.py` file. I don't know if this is the best solution, but since this is a workaround until a bugfix lands in one of the project dependencies, I'm happy to leave it there.
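    For anyone searching later, a minimal sketch of the workaround described above, with `VAR` and `value` as in the message:
    ```python
    # settings.py (sketch): runs before pipelines and node modules are imported
    import os

    os.environ["VAR"] = "value"
    ```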
  • Ofir (06/01/2023, 9:08 PM)
    How do you typically deploy Kedro with Kubernetes? We currently use Prefect deployments to host our Kedro DS pipelines on our Kubernetes cluster, but we would love to learn which ways of running Kedro on Kubernetes are popular with other people. One of the issues we are having difficulty with is updating MLflow deployments from the host, i.e. model registry / model deployment.
  • Tomás Rojas (06/02/2023, 6:15 AM)
    Hi everyone. I am trying to use the modular pipeline module but I am getting the error `ModularPipelineError: Failed to map datasets and/or parameters:` followed by a list of datasets that do exist in the catalog. This is the code:
    ```python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    from .nodes import select_columns


    def create_pipeline(**kwargs) -> Pipeline:
        nominal_template = pipeline(
            [
                node(
                    func=select_columns,
                    inputs=["nominal_raw_data_normalized", "params:columns"],
                    outputs="nominal_raw_data_features",
                    name="extracting_columns_nominal_data",
                    namespace="data_preprocessing",
                )
            ]
        )

        faulty_template = pipeline(
            [
                node(
                    func=select_columns,
                    inputs=[f"fault_{i}_raw_data_normalized", "params:columns"],
                    outputs=f"fault_{i}_raw_data_features",
                    name=f"extracting_columns_fault_{i}",
                    namespace="data_preprocessing",
                )
                for i in range(1, 29)
            ]
        )

        reactor_nominal = pipeline(
            pipe=nominal_template,
            inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
            parameters={"params:columns": "params:reactor_columns"},
            namespace="reactor",
        )

        reactor_faulty = pipeline(
            pipe=faulty_template,
            inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
            parameters={"params:columns": "params:reactor_columns"},
            namespace="reactors",
        )

        reactor = reactor_nominal + reactor_faulty

        return reactor
    ```
    Any ideas on what the error is? Maybe I am not using the module correctly. Thanks in advance 🙂
  • Riley Brady (06/02/2023, 4:44 PM)
    I’m having trouble combining multiple tags with AND logic to run a subset of a pipeline. We have a large pipeline with ~700 nodes and I only want to run 20 or so. Each node has tags like:
    ```python
    # PIPELINE1
    # node1
    tags=[
        "task1",
        variable,
        model,
        region,
    ]

    # node2
    tags=[
        "task2",
        variable,
        model,
        region,
    ]
    ```
    I want to run all `node1`s under `PIPELINE1` for a certain variable and model, but over all regions (we work with geospatial data). We run from the kedro CLI, launching AWS Batch jobs. I found in the docs that I could run jobs from a config spec, so I set up the following `config.yml`:
    ```yaml
    run:
      tags: task1, temperature, GFDL-ESM4  # don't declare region so all regions are run
      pipeline: PIPELINE1
      env: dev
      runner: cra_data_pipelines.runner.BatchRunner
    ```
    Then I run `kedro run --config=config.yml`. RESULT: it ends up launching all 700 jobs from PIPELINE1 without any distinction for the tags listed above. I of course just want the 20 or so that meet the AND conditions of those three tags. I recall having this issue back in the fall and asking about it, and at the time I don’t think there was any way to run tags with AND logic. I was told that recent versions of kedro updated this, and since the config page lists multiple tags, I assumed that’s how it should work. Any help would be great here! I would prefer a simple solution like this rather than looping through each node manually in a shell script.
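    One hedged workaround sketch done in `pipeline_registry.py` rather than via CLI tags: each `only_nodes_with_tags` call keeps the nodes carrying that tag, so chaining the calls gives AND semantics across the three tags. The `build_pipeline1` helper and registry keys are placeholders.
    ```python
    # pipeline_registry.py (sketch): registers a pre-filtered subset that can be
    # run with `kedro run --pipeline=PIPELINE1_subset`
    def register_pipelines():
        pipeline1 = build_pipeline1()  # placeholder for however PIPELINE1 is built
        subset = (
            pipeline1
            .only_nodes_with_tags("task1")
            .only_nodes_with_tags("temperature")
            .only_nodes_with_tags("GFDL-ESM4")
        )
        return {
            "PIPELINE1": pipeline1,
            "PIPELINE1_subset": subset,
            "__default__": pipeline1,
        }
    ```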
  • Tomás Rojas (06/03/2023, 5:02 AM)
    Hi, I am running `kedro jupyter lab` on a project and it seems to run OK, but sometimes I get an error, it crashes, and the cell returns `ERROR! Session/line number was not unique in database. History logging moved to new session 668`. Any idea what the issue could be?
  • Artur Janik (06/04/2023, 8:03 PM)
    Hello, https://github.com/kedro-org/kedro/issues/1457 appears to break custom datasets in kedro 0.18.*, contrary to https://docs.kedro.org/en/stable/extend_kedro/custom_datasets.html. I've tested some datasets in my project, and `kedro ipython` appears to accept and tolerate the old way of doing things, with the extras folder, while `kedro run` and `kedro viz` do not and cannot find the dataset definitions. https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets doesn't appear to provide any new advice on how to declare datasets that are not in the extras folder. What is the correct way to declare custom datasets in kedro 0.18.*?
  • Artur Janik (06/05/2023, 12:55 AM)
    Yeah, I don't know. I did the whole playing-around thing of pip installing with -e, and `kedro ipython` is still happy with it, while neither `kedro viz` nor `kedro jupyter` is.
  • Dan Knott (06/05/2023, 1:18 PM)
    Hello kedroids! Does Kedro have any datasets for 3D geometrical data (.stl, .vtk files, etc.)? I had a quick look at the docs but couldn’t see anything! Thanks
  • Nok Lam Chan (06/05/2023, 1:56 PM)
    https://github.com/kedro-org/kedro/issues/2639
  • Joseph Mehltretter (06/05/2023, 5:36 PM)
    Hello!! Is there any way during node runtime to access what version the data catalog will use to save the outputs?
  • Zhe An (06/05/2023, 10:47 PM)
    Hi team, quick question: how do I add this test case? (I am not sure how to load the catalog in the tests folder.) E.g. I have defined a node:
    ```python
    node(
        func=create_class_code_list,
        inputs=[
            "full_raw_data_dump",
            "params:feature_engineering.primary_policy_key",
            "params:feature_engineering.class_code_col",
        ],
        outputs="full_data_with_agg_features",
    )
    ```
    I want to test the inputs: 1. `full_raw_data_dump` is a dataframe from catalog.yaml; I want to test the keys in this df. 2. `params:feature_engineering.primary_policy_key` is a str (a parameter); I want to test the string against a keyword pattern.
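    A hedged sketch of one way to reach the real catalog and parameters from a pytest test, via a `KedroSession`; the project-path arithmetic and the asserted keys and pattern are assumptions to adapt.
    ```python
    # tests/test_node_inputs.py (sketch): loads the project's catalog and params in pytest
    import re
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project


    def test_node_inputs():
        project_path = Path(__file__).resolve().parents[1]  # adjust to your layout
        bootstrap_project(project_path)
        with KedroSession.create(project_path=project_path) as session:
            context = session.load_context()

            df = context.catalog.load("full_raw_data_dump")
            assert {"policy_id", "class_code"}.issubset(df.columns)  # hypothetical keys

            key = context.params["feature_engineering"]["primary_policy_key"]
            assert re.match(r"^policy", key)  # hypothetical keyword pattern
    ```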
  • charles (06/05/2023, 11:15 PM)
    hey folks - I've got a python module named `database` sitting in the same repo as my kedro project. I can't seem to get it recognized; I tried using hooks and altering the PYTHONPATH, but nothing, just `ModuleNotFoundError`. Any idea how I can overcome this? Layout: `database/` and `kdr_project/src/project/pipelines/mypipeline/` (the file in there is where I am importing `database`).
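    A hedged workaround sketch assuming the layout above: have `settings.py` put the repo root (the folder containing `database/`) on `sys.path` before pipelines are imported. Packaging `database` and installing it into the environment (e.g. `pip install -e`) is usually the cleaner long-term fix.
    ```python
    # src/project/settings.py (sketch): parents[0] is src/project, parents[1] is src,
    # parents[2] is kdr_project, parents[3] is the repo root holding database/;
    # adjust the index if the real layout differs.
    import sys
    from pathlib import Path

    REPO_ROOT = Path(__file__).resolve().parents[3]
    sys.path.insert(0, str(REPO_ROOT))
    ```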
  • Iñigo Hidalgo (06/06/2023, 7:40 AM)
    Hi, I would like to set a default copy_mode for datasets of a certain type: Ibis Tables should always be passed through as "assign". I would like to build a query on an Ibis table over multiple nodes, which would imply creating lots of MemoryDatasets, and I would like to avoid specifying a catalog entry for each one just to set its copy_mode. https://github.com/kedro-org/kedro/blob/39f2168b81c550873c685eea42f1018c2927dbb8/kedro/io/memory_dataset.py#L83 Would it make sense to somehow modify the behaviour of `_infer_copy_mode`? In the issue it was mentioned as a possibility but discarded because it’s too “heavy”, but I think adding one additional branch to the already-existing pandas check could be worth it for incorporating Ibis functionality.
    👀 1
    👍 2
  • Andreas_Kokolantonakis (06/06/2023, 2:03 PM)
    Hello, I am currently using Kedro-Viz to visualize a pipeline, and I am noticing that the intermediate outputs between the nodes show up on their own when not expanded. Is there an easy way to hide them? Thank you in advance!
  • fmfreeze (06/07/2023, 9:48 AM)
    I have a problem with experiment tracking. I set it up (locally on Windows) as described in the docs and everything worked fine. I pushed my repository without the session.db. Then on another machine (Linux), I pulled the changes and `kedro run` showed this error (attached screenshot). How can I "reset" Kedro's experiment tracking?
  • fmfreeze (06/07/2023, 10:29 AM)
    Is it possible to load a "versioned" pipeline run (versioned with experiment tracking)? I have a couple of `MemoryDataSet`s flowing around a pipeline, and I want to inspect them for individual tracked pipeline runs after the run (e.g. load them again as with `session.run(to_outputs=...)`, but for a specific experiment run from the past).
  • Manilson António Lussati (06/07/2023, 11:33 AM)
    Hello, has anyone here tried to develop a test for kedro-mlflow?
  • Julius Hetzel (06/08/2023, 6:30 AM)
    Hi everyone, I am running a Kedro pipeline on AWS Step Functions with Lambda. I use S3 for the data. Everything works fine. However, whenever I add torch:
    ```text
    torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
    torchvision==0.15.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
    ```
    the Lambda is not able to access S3 and fails with `Install s3fs to access S3`. If I install everything locally on my Linux machine and run `kedro run`, it runs fine. Has anyone come across this problem or has an idea of how to fix it?
  • Hannes (06/08/2023, 1:53 PM)
    Hi Everyone, I am trying to load a file from an SFTP server and am facing the following error:
    ```text
    DataSetError: Failed while loading data from data set CSVDataSet(filepath=/home/foo/dev.csv, load_args={}, protocol=sftp, save_args={'index': False}).
    <urlopen error unknown url type: sftp>
    ```
    The file is referenced in `conf\base\catalog.yml` using the following syntax:
    ```yaml
    input_data:
      type: pandas.CSVDataSet
      filepath: "sftp:///home/foo/dev.csv"
      credentials: cluster_credentials
    ```
    Where the `cluster_credentials` are as follows in my `conf\local\credentials.yml` file:
    ```yaml
    cluster_credentials:
      username: username
      host: localhost
      port: 22
      password: password
    ```
    I am running Kedro version 0.18.8 and have Paramiko version 3.2.0 installed, running on a Windows machine. I have followed the instructions in the data catalog docs. I would greatly appreciate any insights or suggestions on how to debug and resolve this issue. Thank you in advance for your help! Best Regards Hannes
  • Iñigo Hidalgo (06/08/2023, 5:01 PM)
    Has something been done around type checking in kedro pipelines? Could be an interesting option for ensuring data correctness
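    As a hedged sketch of one direction a project could take today: a small hook that validates selected node inputs at runtime. The dataset-to-type mapping below is an assumption; pandera or similar would give richer checks.
    ```python
    # hooks.py (sketch): fail fast if a listed dataset arrives with the wrong type
    import pandas as pd
    from kedro.framework.hooks import hook_impl

    EXPECTED_TYPES = {"model_input_table": pd.DataFrame}  # hypothetical dataset name


    class TypeCheckHooks:
        @hook_impl
        def before_node_run(self, node, inputs):
            for name, value in inputs.items():
                expected = EXPECTED_TYPES.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(
                        f"Node {node.name!r}: {name} expected {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
    ```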
  • Melvin Kok (06/09/2023, 7:14 AM)
    Hi team, Kedro 0.18.10 doesn’t work with starters?
    ```text
    > kedro new --starter=spaceflights
    kedro.framework.cli.utils.KedroCliError: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
    - astro-airflow-iris
    - astro-iris
    - pandas-iris
    - pyspark
    - pyspark-iris
    - spaceflights
    - standalone-datacatalog

    Run with --verbose to see the full exception
    Error: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
    - astro-airflow-iris
    - astro-iris
    - pandas-iris
    - pyspark
    - pyspark-iris
    - spaceflights
    - standalone-datacatalog
    ```
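    A hedged workaround, assuming the 0.18.10 starter tags simply had not been published yet at that point: pin the starter to the newest tag that does exist with `--checkout`.
    ```bash
    kedro new --starter=spaceflights --checkout=0.18.9
    ```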
  • Sebastian Cardona Lozano (06/10/2023, 12:31 AM)
    Hi all. I'm using the Annoy library to perform a nearest-neighbours search. To save the index created by the algorithm I need to build a custom dataset. I tried to follow the custom-dataset example in the docs and a recommendation-system example. My code is this:
    ```python
    import fsspec
    from pathlib import PurePosixPath
    from typing import Any, Dict
    from annoy import AnnoyIndex
    from kedro.io import AbstractDataSet
    from kedro.io.core import get_filepath_str, get_protocol_and_path


    class AnnoyIndexDataSet(AbstractDataSet[AnnoyIndex, AnnoyIndex]):
        """``AnnoyIndexDataSet`` loads / saves an Annoy index from a given filepath."""

        def __init__(self, filepath: str, dimension: int, metric: str):
            """Creates a new instance of AnnoyIndexDataSet to load / save an Annoy
            index at the given filepath.

            Args:
                filepath (str): The path to the file where the index will be saved
                    or loaded from.
                dimension (int): The length of the vectors that will be indexed.
                metric (str): The distance metric to use. One of "angular",
                    "euclidean", "manhattan", "hamming", or "dot".
            """
            # parse the path and protocol (e.g. file, http, s3, etc.)
            protocol, path = get_protocol_and_path(filepath)

            self._protocol = protocol
            self._filepath = PurePosixPath(path)
            self._fs = fsspec.filesystem(self._protocol)

            self.dimension = dimension
            self.metric = metric

        def _load(self) -> AnnoyIndex:
            """Load the index from the file.

            Returns:
                An instance of AnnoyIndex.
            """
            # get_filepath_str ensures the protocol and path are joined correctly
            # for different filesystems
            load_path = get_filepath_str(self._filepath, self._protocol)

            annoy_index = AnnoyIndex(self.dimension, self.metric)
            annoy_index.load(load_path)
            return annoy_index

        def _save(self, annoy_index: AnnoyIndex) -> None:
            """Save the index to the file.

            Args:
                annoy_index: An instance of AnnoyIndex.
            """
            save_path = get_filepath_str(self._filepath, self._protocol)

            annoy_index.save(save_path)

        def _describe(self) -> Dict[str, Any]:
            """Return a dict describing the dataset.

            Returns:
                A dict with keys "filepath", "dimension", and "metric".
            """
            return {
                "filepath": self._filepath,
                "dimension": self.dimension,
                "metric": self.metric,
            }
    ```
    And in the data catalog I have this:
    ```yaml
    annoy_index:
      type: pricing.extras.datasets.annoy_dataset.AnnoyIndexDataSet
      dimension: 1026
      metric: angular
      filepath: /data/06_models/products_index.ann
      layer: model_input
    ```
    My goal is to save the .ann file in Google Cloud Storage or in a local folder, but I get the following error when running the node that saves the file:
    ```text
    DataSetError: Failed while saving data to data set AnnoyIndexDataSet(dimension=1026,
    filepath=/data/06_models/products_index.ann, metric=angular).
    Unable to open: No such file or directory (2)
    ```
    Please help. Thanks!!
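    Two hedged observations on that error: Annoy's `save()` writes to a local path, so saving to GCS would need an extra upload step via `fsspec`, and `/data/06_models/...` with a leading slash points at the filesystem root rather than the project's `data/` folder, so the parent directory likely does not exist. A minimal sketch of a `_save` that creates the parent directory first (local paths only):
    ```python
    # Sketch of _save for the AnnoyIndexDataSet above: ensure the target folder
    # exists before Annoy tries to open the file (local/posix paths only).
    from pathlib import Path

    def _save(self, annoy_index: AnnoyIndex) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
        annoy_index.save(save_path)
    ```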