# questions
  • Muhammad Ghazalli (05/31/2023, 3:04 AM)
    Hi, I'm trying to migrate from kedro 0.17.* to kedro 0.18.* and having difficulty with `credentials.yml`. How do I put a connection string from an environment variable into the `con` variable in the yml files? I get: `DataSetError: Could not parse SQLAlchemy URL from string '${dummy_con}'`.
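    A sketch of one way to do this on recent 0.18.x releases, assuming `OmegaConfigLoader` is set as the `CONFIG_LOADER_CLASS` in `settings.py` (the `db_credentials` and `DUMMY_CON` names below are illustrative); its `oc.env` resolver is permitted in `credentials.yml`:
    ```yaml
    # conf/local/credentials.yml (sketch): oc.env reads the environment variable at load time
    db_credentials:
      con: ${oc.env:DUMMY_CON}
    ```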
  • Lucas Hattori (05/31/2023, 1:52 PM)
    Hi Kedro community 🙂 If I have two modular pipelines and one has a dependency on the other (e.g. `pipe2` depends on an output from `pipe1`), will they always run in the correct order, i.e. `pipe1` -> `pipe2`? If they were regular pipelines I know they would, but I'm not sure about modular pipelines (though I can't imagine why they would be different). Mock code below:
    ```python
    pipe1 = modular_pipeline.pipeline(
        pipe=func,
        namespace="pipe1_ns",
        inputs={"input": "pipe1_ns.input"},
    )
    pipe2 = modular_pipeline.pipeline(
        pipe=func,
        namespace="pipe2_ns",
        inputs={"input": "pipe1_ns.output"},
    )
    ```
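    For reference, a minimal sketch assuming the dataset names line up as in the mock above: Kedro derives execution order from dataset dependencies, not from how the pipelines are combined, so summing them is enough.
    ```python
    # The node in pipe2 consumes "pipe1_ns.output", so it can only run after the
    # node in pipe1 that produces it; the runner resolves this from the dataset names.
    full_pipeline = pipe1 + pipe2

    # Pipeline.nodes is topologically sorted, which makes the order easy to inspect.
    print([n.name for n in full_pipeline.nodes])
    ```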
  • Nan Dong (05/31/2023, 3:20 PM)
    Hello everyone! Long-time Kedro user here; it has completely revolutionized how our data team writes pipelines. We are trying to choose an orchestrator: can anyone provide some insight into the pros and cons of Airflow vs Prefect? We are considering Airflow because the team has experience with it, but some have found it cumbersome to repackage the Kedro project each time the code changes (containerized with Docker). We also have experience with Prefect 1.0, using the script included in the Kedro deployment docs, but are not sure how the migration to Prefect 2.0 will work with Kedro. Thank you all in advance!
    🔥 4
  • Gabriel Bandeira (05/31/2023, 5:12 PM)
    [solved] Hi, team! I'm trying to run `kedro jupyter notebook` but it's failing with `Error: No such command 'jupyter'.` How can I make it work? Package versions are in the thread.
  • Ezekiel Day (05/31/2023, 7:49 PM)
    Hi team, I'm trying to run Kedro with `spark.SparkDataSet`s on Databricks. When running in a notebook, it looks like there is a conflict between the notebook's Spark session and the one the project is trying to create. Can someone assist with resolving this conflict?
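    A hedged sketch of one common approach, assuming the `SparkHooks` pattern from the PySpark starter in the project's own `hooks.py`: reuse the session Databricks already started instead of configuring a competing one.
    ```python
    # hooks.py (sketch): getOrCreate() attaches to the notebook's running Spark
    # session on Databricks rather than building a new one with conflicting config.
    from kedro.framework.hooks import hook_impl
    from pyspark.sql import SparkSession


    class SparkHooks:
        @hook_impl
        def after_context_created(self, context) -> None:
            self._spark_session = SparkSession.builder.getOrCreate()
    ```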
  • Manilson António Lussati (05/31/2023, 8:13 PM)
    Hello everyone, a question: how can I write pytest tests for the Hooks I created? For example, I have an mlflow hook and I want to create a `test_hook_mlflow` that validates the hook is working via pytest, for test coverage.
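    One hedged sketch: hooks are plain Python classes, so the simplest unit test constructs the hook and calls its hook method directly with fakes or mocks. The `MLflowHook` import path and its `before_pipeline_run` signature below are assumptions; adapt them to the real hook.
    ```python
    # tests/test_hooks.py (sketch): exercises the hook method directly under pytest
    from unittest import mock

    from my_project.hooks import MLflowHook  # hypothetical import path


    def test_before_pipeline_run_starts_an_mlflow_run():
        hook = MLflowHook()
        with mock.patch("mlflow.start_run") as start_run:
            hook.before_pipeline_run(run_params={"run_id": "test-run"})
            start_run.assert_called_once()
    ```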
  • Ezekiel Day (06/01/2023, 8:07 PM)
    Hi team, a question regarding a DeprecationWarning. I get a warning that `kedro.extras.datasets` will be deprecated in Kedro 0.19 and that I should install `kedro-datasets`. I have done this, but I still get the warning (even though none of my datasets depends on it). I have one custom dataset which I use in a lot of places. I believe the warning crops up because `kedro.io.core` polls through `_DEFAULT_PACKAGES` to check whether any of those classes exist and only falls back to custom datasets last (the `""` entry), so even though I'm not using `kedro.extras.datasets` it still polls it and produces the warning. It's not breaking anything, but is there anything to be done?
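    If the goal is only to quieten it until 0.19, one hedged workaround is a filter in `settings.py`; the warning category and message pattern below are assumptions, so match them to the exact warning you see.
    ```python
    # settings.py (sketch): suppress only the kedro.extras.datasets deprecation notice
    import warnings

    warnings.filterwarnings(
        "ignore",
        message=".*kedro.extras.datasets.*",
        category=DeprecationWarning,
    )
    ```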
  • Richard Purvis (06/01/2023, 8:18 PM)
    Hello, if I want to set an environment variable before running a pipeline or importing any node libraries, what would be the best way to do that? I've tried setting it via `os.environ` before the import statements in my node scripts, to no avail. Edit: I resolved this by putting the `os.environ["VAR"] = "value"` call at the top of my `settings.py` file. I don't know if this is the best solution, but since this is a workaround until a bugfix lands in one of the project dependencies, I'm happy to leave it there.
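    For anyone searching later, a minimal sketch of the workaround described above, with `VAR` and `value` as in the message:
    ```python
    # settings.py (sketch): runs before pipelines and node modules are imported
    import os

    os.environ["VAR"] = "value"
    ```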
  • Ofir (06/01/2023, 9:08 PM)
    How do you typically deploy Kedro with Kubernetes? We currently use Prefect deployments to host our Kedro DS pipelines on our Kubernetes cluster, but we would love to learn which ways of running Kedro on Kubernetes are popular with other people. One of the issues we are having difficulty with is updating MLflow deployments from the host, i.e. model registry / model deployment.
  • Tomás Rojas (06/02/2023, 6:15 AM)
    Hi everyone. I am trying to use the modular pipeline module but I am getting the error `ModularPipelineError: Failed to map datasets and/or parameters:` followed by a list of datasets that do exist in the catalog. This is the code:
    ```python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    from .nodes import select_columns


    def create_pipeline(**kwargs) -> Pipeline:
        nominal_template = pipeline(
            [
                node(
                    func=select_columns,
                    inputs=["nominal_raw_data_normalized", "params:columns"],
                    outputs="nominal_raw_data_features",
                    name="extracting_columns_nominal_data",
                    namespace="data_preprocessing",
                )
            ]
        )

        faulty_template = pipeline(
            [
                node(
                    func=select_columns,
                    inputs=[f"fault_{i}_raw_data_normalized", "params:columns"],
                    outputs=f"fault_{i}_raw_data_features",
                    name=f"extracting_columns_fault_{i}",
                    namespace="data_preprocessing",
                )
                for i in range(1, 29)
            ]
        )

        reactor_nominal = pipeline(
            pipe=nominal_template,
            inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
            parameters={"params:columns": "params:reactor_columns"},
            namespace="reactor",
        )

        reactor_faulty = pipeline(
            pipe=faulty_template,
            inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
            parameters={"params:columns": "params:reactor_columns"},
            namespace="reactors",
        )

        reactor = reactor_nominal + reactor_faulty

        return reactor
    ```
    Any ideas on what the error is? Maybe I am not using the module correctly. Thanks in advance 🙂
  • Riley Brady (06/02/2023, 4:44 PM)
    I’m having trouble combining multiple tags with AND logic to run a subset of a pipeline. We have a large pipeline with ~700 nodes and I only want to run 20 or so. Each node has tags like:
    ```python
    # PIPELINE1
    # node1
    tags=[
        "task1",
        variable,
        model,
        region,
    ]

    # node2
    tags=[
        "task2",
        variable,
        model,
        region,
    ]
    ```
    I want to run all `node1`s under `PIPELINE1` for a certain variable and model, but over all regions (we work with geospatial data). We run from the kedro CLI, launching AWS Batch jobs. I found in the docs that I could run jobs from a config spec, so I set up the following `config.yml`:
    ```yaml
    run:
      tags: task1, temperature, GFDL-ESM4  # don't declare region so all regions are run
      pipeline: PIPELINE1
      env: dev
      runner: cra_data_pipelines.runner.BatchRunner
    ```
    Then I run `kedro run --config=config.yml`. RESULT: it ends up launching all 700 jobs from PIPELINE1 without any distinction for the tags listed above. I of course just want the 20 or so that meet the AND conditions of those three tags. I recall having this issue back in the fall and asking about it, and at the time I don’t think there was any way to run tags with AND logic. I was told that recent versions of kedro updated this, and since the config page lists multiple tags, I assumed that’s how it should work. Any help would be great here! I would prefer a simple solution like this rather than looping through each node manually in a shell script.
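    One hedged workaround sketch done in `pipeline_registry.py` rather than via CLI tags: each `only_nodes_with_tags` call keeps the nodes carrying that tag, so chaining the calls gives AND semantics across the three tags. The `build_pipeline1` helper and registry keys are placeholders.
    ```python
    # pipeline_registry.py (sketch): registers a pre-filtered subset that can be
    # run with `kedro run --pipeline=PIPELINE1_subset`
    def register_pipelines():
        pipeline1 = build_pipeline1()  # placeholder for however PIPELINE1 is built
        subset = (
            pipeline1
            .only_nodes_with_tags("task1")
            .only_nodes_with_tags("temperature")
            .only_nodes_with_tags("GFDL-ESM4")
        )
        return {
            "PIPELINE1": pipeline1,
            "PIPELINE1_subset": subset,
            "__default__": pipeline1,
        }
    ```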
  • Tomás Rojas (06/03/2023, 5:02 AM)
    Hi, I am running `kedro jupyter lab` on a project and it seems to run OK, but sometimes I get an error, it crashes, and the cell returns `ERROR! Session/line number was not unique in database. History logging moved to new session 668`. Any idea what the issue could be?
  • Artur Janik (06/04/2023, 8:03 PM)
    Hello, https://github.com/kedro-org/kedro/issues/1457 appears to break custom datasets in kedro 0.18.*, contrary to https://docs.kedro.org/en/stable/extend_kedro/custom_datasets.html. I've tested some datasets in my project, and `kedro ipython` appears to accept and tolerate the old way of doing things, with the extras folder, while `kedro run` and `kedro viz` do not and cannot find the dataset definitions. https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets doesn't appear to provide any new advice on how to declare datasets that are not in the extras folder. What is the correct way to declare custom datasets in kedro 0.18.*?
  • Artur Janik (06/05/2023, 12:55 AM)
    Yeah, I don't know. I did the whole playing-around thing of pip installing with -e, and `kedro ipython` is still happy with it, while neither `kedro viz` nor `kedro jupyter` is.
  • Dan Knott (06/05/2023, 1:18 PM)
    Hello kedroids! Does Kedro have any datasets for 3D geometrical data (.stl, .vtk files, etc.)? I had a quick look at the docs but couldn’t see anything! Thanks
  • Nok Lam Chan (06/05/2023, 1:56 PM)
    https://github.com/kedro-org/kedro/issues/2639
  • Joseph Mehltretter (06/05/2023, 5:36 PM)
    Hello!! Is there any way during node runtime to access what version the data catalog will use to save the outputs?
  • Zhe An (06/05/2023, 10:47 PM)
    Hi team, quick question: how do I add this test case? (I am not sure how to load the catalog in the tests folder.) E.g. I have defined a node:
    ```python
    node(
        func=create_class_code_list,
        inputs=[
            "full_raw_data_dump",
            "params:feature_engineering.primary_policy_key",
            "params:feature_engineering.class_code_col",
        ],
        outputs="full_data_with_agg_features",
    )
    ```
    I want to test the inputs: 1. `full_raw_data_dump` is a dataframe from catalog.yaml; I want to test the keys in this df. 2. `params:feature_engineering.primary_policy_key` is a str (a parameter); I want to test the string against a keyword pattern.
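    A hedged sketch of one way to reach the real catalog and parameters from a pytest test, via a `KedroSession`; the project-path arithmetic and the asserted keys and pattern are assumptions to adapt.
    ```python
    # tests/test_node_inputs.py (sketch): loads the project's catalog and params in pytest
    import re
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project


    def test_node_inputs():
        project_path = Path(__file__).resolve().parents[1]  # adjust to your layout
        bootstrap_project(project_path)
        with KedroSession.create(project_path=project_path) as session:
            context = session.load_context()

            df = context.catalog.load("full_raw_data_dump")
            assert {"policy_id", "class_code"}.issubset(df.columns)  # hypothetical keys

            key = context.params["feature_engineering"]["primary_policy_key"]
            assert re.match(r"^policy", key)  # hypothetical keyword pattern
    ```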
  • charles (06/05/2023, 11:15 PM)
    hey folks - I've got a python module named `database` sitting in the same repo as my kedro project. I can't seem to get it recognized; I tried using hooks and altering the PYTHONPATH, but nothing, just `ModuleNotFoundError`. Any idea how I can overcome this? Layout: `database/` and `kdr_project/src/project/pipelines/mypipeline/` (the file in there is where I am importing `database`).
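    A hedged workaround sketch assuming the layout above: have `settings.py` put the repo root (the folder containing `database/`) on `sys.path` before pipelines are imported. Packaging `database` and installing it into the environment (e.g. `pip install -e`) is usually the cleaner long-term fix.
    ```python
    # src/project/settings.py (sketch): parents[0] is src/project, parents[1] is src,
    # parents[2] is kdr_project, parents[3] is the repo root holding database/;
    # adjust the index if the real layout differs.
    import sys
    from pathlib import Path

    REPO_ROOT = Path(__file__).resolve().parents[3]
    sys.path.insert(0, str(REPO_ROOT))
    ```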
  • Iñigo Hidalgo (06/06/2023, 7:40 AM)
    Hi, I would like to set a default copy_mode for datasets of a certain type: Ibis Tables should always be passed through as "assign". I would like to build a query on an Ibis table over multiple nodes, which would imply creating lots of MemoryDatasets, and I would like to avoid specifying a catalog entry for each one just to set its copy_mode. https://github.com/kedro-org/kedro/blob/39f2168b81c550873c685eea42f1018c2927dbb8/kedro/io/memory_dataset.py#L83 Would it make sense to somehow modify the behaviour of `_infer_copy_mode`? In the issue it was mentioned as a possibility but discarded because it’s too “heavy”, but I think adding one additional branch to the already-existing pandas check could be worth it for incorporating Ibis functionality.
    👀 1
    👍 2
  • Andreas_Kokolantonakis (06/06/2023, 2:03 PM)
    Hello, I am currently using Kedro-Viz to visualize a pipeline, and I am noticing that the intermediate outputs between the nodes show up on their own when not expanded. Is there an easy way to hide them? Thank you in advance!
  • fmfreeze (06/07/2023, 9:48 AM)
    I have a problem with experiment tracking. I set it up (locally on Windows) as described in the docs and everything worked fine. I pushed my repository without the session.db. Then on another machine (Linux), I pulled the changes and `kedro run` showed this error (attached screenshot). How can I "reset" Kedro's experiment tracking?
  • fmfreeze (06/07/2023, 10:29 AM)
    Is it possible to load a "versioned" pipeline run (versioned with experiment tracking)? I have a couple of `MemoryDataSet`s flowing around a pipeline, and I want to inspect them for individual tracked pipeline runs after the run (e.g. load them again as with `session.run(to_outputs=...)`, but for a specific experiment run from the past).
  • Manilson António Lussati (06/07/2023, 11:33 AM)
    Hello, has anyone here tried to develop a test for kedro-mlflow?
  • Julius Hetzel (06/08/2023, 6:30 AM)
    Hi everyone, I am running a Kedro pipeline on AWS Step Functions with Lambda. I use S3 for the data. Everything works fine. However, whenever I add torch:
    ```text
    torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
    torchvision==0.15.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
    ```
    the Lambda is not able to access S3 and fails with `Install s3fs to access S3`. If I install everything locally on my Linux machine and run `kedro run`, it runs fine. Has anyone come across this problem or has an idea of how to fix it?
  • Hannes (06/08/2023, 1:53 PM)
    Hi Everyone, I am trying to load a file from an SFTP server and am facing the following error:
    ```text
    DataSetError: Failed while loading data from data set CSVDataSet(filepath=/home/foo/dev.csv, load_args={}, protocol=sftp, save_args={'index': False}).
    <urlopen error unknown url type: sftp>
    ```
    The file is referenced in `conf\base\catalog.yml` using the following syntax:
    ```yaml
    input_data:
      type: pandas.CSVDataSet
      filepath: "sftp:///home/foo/dev.csv"
      credentials: cluster_credentials
    ```
    Where the `cluster_credentials` are as follows in my `conf\local\credentials.yml` file:
    ```yaml
    cluster_credentials:
      username: username
      host: localhost
      port: 22
      password: password
    ```
    I am running Kedro version 0.18.8 and have Paramiko version 3.2.0 installed, running on a Windows machine. I have followed the instructions in the data catalog docs. I would greatly appreciate any insights or suggestions on how to debug and resolve this issue. Thank you in advance for your help! Best Regards Hannes
  • Iñigo Hidalgo (06/08/2023, 5:01 PM)
    Has something been done around type checking in kedro pipelines? Could be an interesting option for ensuring data correctness
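    As a hedged sketch of one direction a project could take today: a small hook that validates selected node inputs at runtime. The dataset-to-type mapping below is an assumption; pandera or similar would give richer checks.
    ```python
    # hooks.py (sketch): fail fast if a listed dataset arrives with the wrong type
    import pandas as pd
    from kedro.framework.hooks import hook_impl

    EXPECTED_TYPES = {"model_input_table": pd.DataFrame}  # hypothetical dataset name


    class TypeCheckHooks:
        @hook_impl
        def before_node_run(self, node, inputs):
            for name, value in inputs.items():
                expected = EXPECTED_TYPES.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(
                        f"Node {node.name!r}: {name} expected {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
    ```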
  • Melvin Kok (06/09/2023, 7:14 AM)
    Hi team, Kedro 0.18.10 doesn’t work with starters?
    ```text
    > kedro new --starter=spaceflights
    kedro.framework.cli.utils.KedroCliError: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
    - astro-airflow-iris
    - astro-iris
    - pandas-iris
    - pyspark
    - pyspark-iris
    - spaceflights
    - standalone-datacatalog

    Run with --verbose to see the full exception
    Error: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
    - astro-airflow-iris
    - astro-iris
    - pandas-iris
    - pyspark
    - pyspark-iris
    - spaceflights
    - standalone-datacatalog
    ```
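    A hedged workaround, assuming the 0.18.10 starter tags simply had not been published yet at that point: pin the starter to the newest tag that does exist with `--checkout`.
    ```bash
    kedro new --starter=spaceflights --checkout=0.18.9
    ```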
  • Sebastian Cardona Lozano (06/10/2023, 12:31 AM)
    Hi all. I'm using the Annoy library to perform a nearest-neighbours search. To save the index created by the algorithm I need to build a custom dataset. I tried to follow the custom-dataset example in the docs and a recommendation-system example. My code is this:
    ```python
    import fsspec
    from pathlib import PurePosixPath
    from typing import Any, Dict
    from annoy import AnnoyIndex
    from kedro.io import AbstractDataSet
    from kedro.io.core import get_filepath_str, get_protocol_and_path


    class AnnoyIndexDataSet(AbstractDataSet[AnnoyIndex, AnnoyIndex]):
        """``AnnoyIndexDataSet`` loads / saves an Annoy index from a given filepath."""

        def __init__(self, filepath: str, dimension: int, metric: str):
            """Creates a new instance of AnnoyIndexDataSet to load / save an Annoy
            index at the given filepath.

            Args:
                filepath (str): The path to the file where the index will be saved
                    or loaded from.
                dimension (int): The length of the vectors that will be indexed.
                metric (str): The distance metric to use. One of "angular",
                    "euclidean", "manhattan", "hamming", or "dot".
            """
            # parse the path and protocol (e.g. file, http, s3, etc.)
            protocol, path = get_protocol_and_path(filepath)

            self._protocol = protocol
            self._filepath = PurePosixPath(path)
            self._fs = fsspec.filesystem(self._protocol)

            self.dimension = dimension
            self.metric = metric

        def _load(self) -> AnnoyIndex:
            """Load the index from the file.

            Returns:
                An instance of AnnoyIndex.
            """
            # get_filepath_str ensures the protocol and path are joined correctly
            # for different filesystems
            load_path = get_filepath_str(self._filepath, self._protocol)

            annoy_index = AnnoyIndex(self.dimension, self.metric)
            annoy_index.load(load_path)
            return annoy_index

        def _save(self, annoy_index: AnnoyIndex) -> None:
            """Save the index to the file.

            Args:
                annoy_index: An instance of AnnoyIndex.
            """
            save_path = get_filepath_str(self._filepath, self._protocol)

            annoy_index.save(save_path)

        def _describe(self) -> Dict[str, Any]:
            """Return a dict describing the dataset.

            Returns:
                A dict with keys "filepath", "dimension", and "metric".
            """
            return {
                "filepath": self._filepath,
                "dimension": self.dimension,
                "metric": self.metric,
            }
    ```
    And in the data catalog I have this:
    ```yaml
    annoy_index:
      type: pricing.extras.datasets.annoy_dataset.AnnoyIndexDataSet
      dimension: 1026
      metric: angular
      filepath: /data/06_models/products_index.ann
      layer: model_input
    ```
    My goal is to save the .ann file in Google Cloud Storage or in a local folder, but I get the following error when running the node that saves the file:
    ```text
    DataSetError: Failed while saving data to data set AnnoyIndexDataSet(dimension=1026,
    filepath=/data/06_models/products_index.ann, metric=angular).
    Unable to open: No such file or directory (2)
    ```
    Please help. Thanks!!
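    Two hedged observations on that error: Annoy's `save()` writes to a local path, so saving to GCS would need an extra upload step via `fsspec`, and `/data/06_models/...` with a leading slash points at the filesystem root rather than the project's `data/` folder, so the parent directory likely does not exist. A minimal sketch of a `_save` that creates the parent directory first (local paths only):
    ```python
    # Sketch of _save for the AnnoyIndexDataSet above: ensure the target folder
    # exists before Annoy tries to open the file (local/posix paths only).
    from pathlib import Path

    def _save(self, annoy_index: AnnoyIndex) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
        annoy_index.save(save_path)
    ```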