# questions

    Afaque Ahmad

    06/12/2023, 8:14 AM
Hi folks, I'm working with reference to this feature on kedro-plugins. How should I set up my local development environment? I cannot find a `requirements.txt` file.

    Abhishek Bhatia

    06/12/2023, 1:06 PM
Hi Team, I am developing a Kedro pipeline in which I pass around a `MemoryDataSet` between nodes. By default, Kedro deep-copies the memory dataset, which leads to loss of information, so I created a catalog entry with `copy_mode` set to `assign`. This solves our basic problem of objects being retained as-is, but it messes up the DAG order displayed in Kedro-Viz. Any solutions?
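
For reference, a minimal sketch of the `copy_mode` setting being described, expressed in Python rather than YAML (the dataset name is hypothetical):

```python
from kedro.io import DataCatalog, MemoryDataSet

# copy_mode="assign" passes the object by reference instead of deep-copying
# it between nodes, which is what preserves the information mentioned above.
catalog = DataCatalog({"fitted_model": MemoryDataSet(copy_mode="assign")})
```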

    Jose Nuñez

    06/12/2023, 3:32 PM
Hello fellow Kedroids 🤖! I'm having a very strange issue when saving a file to parquet. I'm getting this error:

```
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}). Duplicate column names found: ['timestamp', 'lims_BFIL CO3', 'lims_BFIL Ca %', ...]
```

It's basically showing all the columns inside the dataframe (here I'm showing only 3 of them). My catalog entry looks like this:

```yaml
data_sql:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/data_sql.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
  layer: III
```

I'm using kedro==0.18.8, pandas==2.0.1, pyarrow==12.0.0. The problem is quite similar to this issue from 2022: https://github.com/kedro-org/kedro/discussions/1286, but in my case removing the load and save args as the OP mentions doesn't solve the problem. This is quite puzzling, since I did a `df.to_clipboard()` inside the node before returning my output, opened it in a Jupyter notebook, and saw no problems with the dataframe; I can even save it to parquet without any issues. So that makes me think the problem comes from Kedro(?). Anyway, as a workaround I'm saving the dataframe as CSV and it's working just fine, but I'd like to find a way to make parquet work again, since this is a huge file. Thanks in advance 🦜!
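
One way to narrow this down, sketched under the assumption that pyarrow (not pandas) is rejecting non-unique column names at write time:

```python
import pandas as pd

# Hypothetical helper: call this inside the node before returning the
# dataframe to see exactly which column names collide.
def report_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    dupes = df.columns[df.columns.duplicated()].tolist()
    if dupes:
        raise ValueError(f"Non-unique column names: {sorted(set(dupes))}")
    return df
```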

    Trevor

    06/12/2023, 4:52 PM
Is it possible to import an already packaged Kedro pipeline in a separate script and assign node return values to new variables for use later in the script? I've been trying to get people on our team on board with Kedro, and a couple of us would be really interested in being able to use the `MemoryDataSet` returned by nodes as pieces of larger scripts. Up until now, I've only needed to import `main` and that has worked for our purposes so far.
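
A rough sketch of one way to do this with a Kedro session rather than `main`; `session.run()` returns the pipeline's free (non-persisted) outputs as a dict, though the exact bootstrap call depends on whether the project is packaged (paths and dataset names below are hypothetical):

```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# For a packaged project you would call configure_project(package_name)
# instead of bootstrap_project.
bootstrap_project("/path/to/project")
with KedroSession.create(project_path="/path/to/project") as session:
    outputs = session.run()  # dict of MemoryDataSet outputs keyed by name

trained_model = outputs["trained_model"]  # hypothetical dataset name
```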

    Jared T

    06/12/2023, 4:54 PM
Hi all, I am having an issue defining a pipeline using namespaces in multiple modular pipelines. I am following the structure of the spaceflights tutorial and I am getting this error:

```
ValueError: Duplicate keys found in <project repo>/conf/base/parameters/prepare.yml and:
- <project repo>/conf/base/parameters/ingest.yml: train_pipeline
```

I have the `train_pipeline` namespace in both the ingest and prepare modular sub-pipelines; here are the respective YAMLs:

```yaml
    # The following is a list of parameters for ingest pipeline for each namespace (train, inference)
    
    
    # Parameters for train namespace
    train_pipeline:
      ingestion_options:
        #Portfolio to use
        portfolio_name: has_meds_portfolio.HasMedsPortfolio
        # Feature store sub-pipes, only one for now.
        feature_store_subpipe_name: BasicFeaturePipeline
        # Expected output columns
        expected_columns:
          datetime: datetime64[ns]
          patient_id: int64
          age_days: int64
          Male: int64
          binary_smoking_status: object
          overall_censorship_time: datetime64[ns]
          months_until_overall_censorship: int64
          death_date: datetime64[ns]
    
    # Parameters for inference namespace
    # currently same as train but this will change
# first updated to Nightly Portfolio then to
    # an api call to the valuation queue.
    inference_pipeline:
      ingestion_options:
        #Portfolio to use
        portfolio_name: has_meds_portfolio.HasMedsPortfolio
        # Feature store sub-pipes, only one for now.
        feature_store_subpipe_name: BasicFeaturePipeline
        # Expected output columns
        expected_columns:
          datetime: datetime64[ns]
          patient_id: int64
          age_days: int64
          Male: int64
          binary_smoking_status: object
          overall_censorship_time: datetime64[ns]
          months_until_overall_censorship: int64
          death_date: datetime64[ns]
```

```yaml
    # all parameters for prepare pipeline are in train_pipeline namespace
    train_pipeline:
      preparation_options:
        # target params
        target_death_buffer_months: 2
        
        # split params 
        splitter: TimeSeriesSplit
        holdout_size: 0.3
```

Am I not allowed to use the same namespace in multiple modular pipelines?

    CHIRAG WADHWA

    06/13/2023, 4:34 AM
Hi all, I have recently come across this error: `kedro-datasets 1.4.0 does not provide the extra 'pickle.pickledataset'`. Does kedro-datasets not support pickle datasets? Context: I'm removing `kedro.extras` datasets from our asset codebase and using kedro-datasets instead.

    Abhishek Bhatia

    06/13/2023, 10:21 AM
Hi Team, I have a basic doubt about using `PartitionedDataSet`. In the pipeline below, I have a node which returns a dictionary whose values are pandas dataframes, so I define a `PartitionedDataSet` catalog entry for it. If I run the nodes only up to this node, the files do get saved in the correct location but the output is an empty dictionary. If I add an identity node, the correct key-value pairs are returned. Is this the desired behaviour?
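
This may be related to the documented lazy-loading behaviour: a downstream node receives a dict mapping partition id to a load callable, not the saved data itself. A minimal sketch of consuming one:

```python
import pandas as pd

# partitions maps partition id -> a zero-argument callable that loads it;
# calling each one materializes the underlying dataframe.
def concat_partitions(partitions: dict) -> pd.DataFrame:
    return pd.concat(load() for load in partitions.values())
```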

    Jose Nuñez

    06/13/2023, 1:39 PM
Hi Kedroids 🦜🤖! I updated my Kedro-Viz to the latest version, but now I'm unable to preview datasets as in the previous version... I got used to that feature 😄! Is there any way to get that back? I was checking the settings but there is nothing there either. Thanks in advance!

    Jeremi DeBlois-Beaucage

    06/13/2023, 4:32 PM
Hi team, did anyone use Kedro in a multi-GPU training setup? I would love to ask a few questions on how best to set up the repo. We are using Databricks and MLflow, and are trying to assess whether Kedro can handle multi-GPU training in a straightforward way. Thanks!

    Andreas_Kokolantonakis

    06/14/2023, 12:19 PM
Hello everyone, I am using Kedro-Docker and I am running into an issue where Docker cannot find the globals I am specifying for my environments, e.g. I want to run `kedro run --env=dev` from Docker and I am getting `ValueError: Failed to format pattern '${s3_root_path}': no config value found, no default provided`. What's the best way to fix it? Thank you in advance!

    Rafał Nowak

    06/14/2023, 4:49 PM
Hello, I am using Kedro with DVC for data version control. DVC is based on `gto`, which depends on `semver >= 3`. Unfortunately I cannot install `kedro-viz`, since `kedro-viz 6.3.0` depends on `semver < 3`. Is there any reason why `kedro-viz` is limited to `semver < 3`? The current `semver` release is `3.0.1`. Could anyone from the kedro-viz team relax this dependency limitation?

    Alexandre Ouellet

    06/14/2023, 7:07 PM
I believe I have found a bug when running the same pipeline with different parameters. For instance, I have the following pipeline: function X -> versioned dataset -> function Y. If I start this pipeline twice and the 2nd pipeline's X node finishes earlier, I don't get the expected dataset.

    Khangjrakpam Arjun

    06/15/2023, 12:08 PM
Hi team, I am trying to save a plotly figure to HTML for reporting purposes. Is there a way to save a plotly figure as an HTML plot in the kedro catalog? I tried using the following class:

```yaml
type: kedro.extras.datasets.pandas.HTMLDataSet
```

On using the above class I am getting this error:

```
kedro.io.core.DataSetError: An exception occurred when parsing config for DataSet 'boxplot_figures_cfa':
Class 'kedro.extras.datasets.pandas.HTMLDataSet' not found or one of its dependencies has not been installed.
```

Does this class even exist?
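
A hedged workaround sketch: `pandas.HTMLDataSet` does not appear to exist in `kedro.extras`, but a node can render the figure to an HTML string itself, and the catalog entry can then be a plain `text.TextDataSet` pointed at a `.html` filepath:

```python
import plotly.graph_objects as go

# Hypothetical node: turn the figure into a self-contained HTML string;
# register its output as text.TextDataSet in the catalog.
def figure_to_html(fig: go.Figure) -> str:
    return fig.to_html(include_plotlyjs="cdn")
```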

    Javier del Villar

    06/15/2023, 6:51 PM
Hi all! I was trying the collaborative experiment tracking feature https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1686144020499809 Is it possible that "Notes" are not being shared? I should be seeing a note a coworker left me; I can see everything else.

    Georgi Iliev

    06/16/2023, 7:56 AM
Hi team! I need advice on using ONNX files and uploading them to S3 automatically using "only" the catalog definition. Broadly speaking, the main flow of what we're trying to build is the following: 1. There is a process that trains and creates some files (PCA, scaler, some K-Means models, etc.) and saves them as Pickle to use them between different nodes. 2. Once the main pipeline is done, we're ready to distribute the model to our services. 3. We're using ONNX because our services are not built in Python and the ONNX libraries we use are a bit faster. 4. So taking this into account, we now have a publish pipeline that takes these Pickle files, converts them to ONNX using `convert_sklearn`, and then uploads them to S3. So, my main question here is: is there a way to implement this so the transformation and the S3 upload are done automatically? • I know that we can specify an S3 path in the catalog, but I didn't see how to set the `.onnx` file type.
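
One possible shape for this, sketched loosely on the custom-dataset pattern from the Kedro docs (the class and paths are hypothetical); with an `s3://` filepath in the catalog, fsspec would handle the upload:

```python
from pathlib import PurePosixPath

import fsspec
import onnx
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path

class ONNXDataSet(AbstractDataSet):
    """Loads/saves an onnx.ModelProto via fsspec, so filepath can be s3://..."""

    def __init__(self, filepath: str):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(protocol)

    def _load(self) -> onnx.ModelProto:
        with self._fs.open(get_filepath_str(self._filepath, self._protocol), "rb") as f:
            return onnx.load(f)

    def _save(self, model: onnx.ModelProto) -> None:
        with self._fs.open(get_filepath_str(self._filepath, self._protocol), "wb") as f:
            f.write(model.SerializeToString())

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath), "protocol": self._protocol}
```

The node that calls `convert_sklearn` would then just return the ONNX model, and a catalog entry pointing this dataset at `s3://...` would take care of the rest.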

    Khangjrakpam Arjun

    06/16/2023, 8:23 AM
Hi Team, is there a way to save a plot as a PDF/PNG/JPEG in the kedro catalog? I tried using the `kedro.extras.datasets.matplotlib.MatplotlibWriter` class to save a figure object as a .png file in the kedro catalog and I got the below error:

```
'Figure' object has no attribute 'save'
```

Is there a way to use the `savefig` method instead of the `save` method to save a figure object in the kedro catalog?
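
If the figure is a matplotlib one, no `save`/`savefig` switch should be needed: as far as I know, `MatplotlibWriter` calls the figure's own `savefig()` internally, so the error above suggests something other than a matplotlib `Figure` (e.g. a plotly figure) is being passed. A minimal sketch:

```python
import matplotlib.pyplot as plt

# Hypothetical node: return a matplotlib Figure and register the output as
# kedro.extras.datasets.matplotlib.MatplotlibWriter with a .png filepath.
def make_boxplot(df) -> plt.Figure:
    fig, ax = plt.subplots()
    df.plot.box(ax=ax)
    return fig
```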

    Camilo López

    06/16/2023, 12:18 PM
Hi Team, I'm deploying Kedro with Databricks Workflows. We have a way to break down each node of the Kedro pipeline into a task of a Databricks Workflows job. The issue is that each task takes ~10 seconds to create the Kedro session, which generates a lot of overhead for the pipeline. Is there a way to create the Kedro session faster, or a recommendation for avoiding these 10 additional seconds per node?

    Guilherme Parreira

    06/16/2023, 12:28 PM
Hi guys. I am using Kedro with Python `3.10.6` (photo attached). For `auto-sklearn` I will need to downgrade my Kedro project to Python `3.9`. I already installed Python `3.9.16` with `pyenv`. What would be my next steps? (Do I need to change the Python version in the `Pipfile` to `3.9` and individually change the kernel of the notebook?) If I change the kernel version of my notebook manually, it is not recognized as being part of the project (second photo attached). Thanks in advance!

    Vici

    06/16/2023, 1:08 PM
Hi everybody! I'm currently analyzing a large number of signals (N≈100), where the source data is organized as a partitioned data set. I want to make a plotly plot of each of these signals, such that I can explore the plots nicely, e.g. in `kedro viz`. I saved the plots as follows:

```yaml
plots:
  type: PartitionedDataSet
  path: data/08_reporting/plots
  dataset:
    type: plotly.JSONDataSet
  filename_suffix: '.json'
```

Saving all the plots worked just fine (and I was able to load and show individual JSONs via `fig = plotly.io.read_json(file); fig.show()`). But it turns out that when you save plots in bulk like this, they cannot be displayed in kedro viz. Is there a way to allow accessing bulk-saved plots from kedro-viz (e.g., clicking the partitioned dataset in kedro viz, then having the option to select a specific plot), without forcing me to literally have a hundred JSONDataSets cluttering kedro viz? Thank you so much 😊 Edit: I'm also open to other (non-kedronic) ideas regarding the exploration of a large bulk of plotly plots.
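
On the non-kedronic side, one sketch (assuming each saved figure holds a single trace) is to merge everything into one figure with a dropdown, so a single `plotly.JSONDataSet` stays previewable:

```python
import plotly.graph_objects as go

def combine_with_dropdown(figures: dict[str, go.Figure]) -> go.Figure:
    combined = go.Figure()
    names = list(figures)
    for i, name in enumerate(names):
        for trace in figures[name].data:
            trace.visible = i == 0  # show only the first signal initially
            combined.add_trace(trace)
    # one dropdown button per source figure, toggling trace visibility
    buttons = [
        dict(label=name, method="update",
             args=[{"visible": [j == i for j in range(len(names))]}])
        for i, name in enumerate(names)
    ]
    combined.update_layout(updatemenus=[dict(buttons=buttons)])
    return combined
```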

    Sebastian Cardona Lozano

    06/16/2023, 2:14 PM
Hi Kedroids. In my pipeline, I have this logic for 2 nodes: node 1 reads a table and executes a data process only on new items that are not in the table; node 2 executes transformations on those new items and appends them to the same table. I'm getting this error:

```
CircularDependencyError: Circular dependencies exist among these items: [node1 ...., node2]
```

Yes, the output of node 2 is an input for node 1. My goal is to not process all the items every time I run the pipeline, but only the new items not in that table. How can I do this? Thanks!! 🙂
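
One common way to break this kind of cycle is to register the same file under two catalog entries, one that node 1 reads and one that node 2 writes; the DAG stays acyclic while state round-trips via disk. A hedged sketch with hypothetical names:

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

# Two entries, same file: "items" is node 1's input and "items_updated"
# is node 2's output. Kedro treats them as distinct datasets.
catalog = DataCatalog({
    "items": ParquetDataSet(filepath="data/02_intermediate/items.parquet"),
    "items_updated": ParquetDataSet(filepath="data/02_intermediate/items.parquet"),
})
```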

    Nok Lam Chan

    06/17/2023, 10:33 AM
Hi, I wonder if anyone has experience using Kedro with Prefect 2.0? How different is it from Prefect 1?

    Abhishek Bhatia

    06/17/2023, 1:15 PM
Hi Team, is there a way to have multiple nested partitions in `PartitionedDataSet`? It seems Kedro assumes the keys to be flat strings, so neither specifying tuples as keys nor a nested dictionary specification works.

    Abhishek Bhatia

    06/19/2023, 7:46 AM
Hi folks! I have a `PartitionedDataSet` laid out like this:

```
    scenario_x/
    ├── iter_1/
    │   ├── run_1.csv
    │   ├── run_2.csv
    │   └── run_3.csv
    └── iter_2/
        ├── run_1.csv
        ├── run_2.csv
        └── run_3.csv
    scenario_y/
    ├── iter_1/
    │   ├── run_1.csv
    │   ├── run_2.csv
    │   └── run_3.csv
    └── iter_2/
        ├── run_1.csv
        ├── run_2.csv
        └── run_3.csv
```

The catalog entry is like this:

```yaml
    _partitioned_csvs: &_partitioned_csvs
      type: PartitionedDataSet
      dataset:
        type: pandas.CSVDataSet
        load_args:
          index_col: 0
        save_args:
          index: true
      overwrite: true
      filename_suffix: ".csv"
    
    _partitioned_jsons: &_partitioned_jsons
      type: PartitionedDataSet
      dataset:
        type: json.JSONDataSet
      filename_suffix: ".json"
    
    my_csv_part_ds:
      path: data/07_model_output/my_csv_part_ds
      <<: *_partitioned_csvs
    
    my_json_part_ds:
      path: data/07_model_output/my_json_part_ds
      <<: *_partitioned_jsons
```

When I run the pipeline, the CSV partitioned dataset gets deleted first and then the new one gets written, but the JSON partitioned dataset remains and new ones get added. I need custom behaviour wherein the 2nd level of the partition gets overwritten, not the first-level partition; i.e. in the node which produces the partitioned CSV, the return value is like this:

```python
def node_that_generates_part_ds(scenario, **kwargs):
    res = {"scenario_x/iter_1/run_1": df1, "scenario_x/iter_1/run_2": df2}  # ... and so on
    return res
```

so when the returned `res` keys contain scenario_x, scenario_y should NOT get deleted. Can anyone guide me on how I can achieve this? Thanks! 🙂
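
One hedged option is a small `PartitionedDataSet` subclass that deletes only the first-level prefixes it is about to rewrite instead of wiping the whole path. It leans on private attributes (`_filesystem`, `_normalized_path`, `_overwrite`), so treat it strictly as a sketch:

```python
from kedro.io import PartitionedDataSet

class PrefixOverwritePartitionedDataSet(PartitionedDataSet):
    def _save(self, data: dict) -> None:
        # Delete only the scenarios present in `data` (e.g. "scenario_x"),
        # leaving sibling scenarios untouched.
        if self._overwrite:
            for prefix in {key.split("/", 1)[0] for key in data}:
                path = f"{self._normalized_path}/{prefix}"
                if self._filesystem.exists(path):
                    self._filesystem.rm(path, recursive=True)
        # Temporarily disable the parent's own overwrite-everything step.
        saved = self._overwrite
        self._overwrite = False
        try:
            super()._save(data)
        finally:
            self._overwrite = saved
```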

    marrrcin

    06/19/2023, 7:49 AM
[Custom starters] Is there a way to make sure that some of the prompts from a starter's cookiecutter prompts will be actual `bool`? We experience an issue where all values from the interactive prompts are cast to `str`, which is really inconvenient for `true`/`false` values, because they enforce syntax such as `{%- if cookiecutter.my_flag != "False" %}`.

    Juan Luis

    06/20/2023, 10:42 AM
Just helped a colleague get Kedro-Viz working. Two observations: • Kedro-Viz launched, but `127.0.0.1` was not working; I suspect it's because they were using an SSH connection to a Linux machine on AWS. `localhost` worked perfectly. Any reason to use the IP directly? (user was on Windows) • Their pipelines were huuuuuuge. He asked me about a way to group sub-pipelines visually, but I'm not versed enough. Is there any way to do it?
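
On the grouping question, the usual kedronic answer is namespaces: wrapping a sub-pipeline in a namespace makes Kedro-Viz render it as one collapsible box. A sketch with hypothetical names:

```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline

def make_table(companies, shuttles):  # hypothetical processing step
    return {"rows": len(companies) + len(shuttles)}

data_processing = Pipeline(
    [node(make_table, ["companies", "shuttles"], "model_input_table")]
)

# inputs/outputs listed here keep their global dataset names instead of
# being prefixed by the namespace.
grouped = modular_pipeline(
    data_processing,
    namespace="data_processing",
    inputs={"companies", "shuttles"},
    outputs={"model_input_table"},
)
```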

    Pranav Khurana

    06/20/2023, 11:32 AM
Hi folks, I'm trying to create a custom Kedro dataset (inherited from `AbstractVersionedDataSet`). I have to write a few tests similar to the existing ones for `CSVDataSet`; however, a few of my tests are failing. I need some advice on this, and I'm happy to hop on a call to discuss the details.

    Kevin Mills

    06/20/2023, 7:32 PM
Hi all. I am new to using Kedro. I went through the spaceflights tutorial and other parts of the documentation. Is there a tutorial on how to use the API, by chance?

    Idris Benkhelil

    06/21/2023, 6:02 AM
Hello, thank you for this great library. I am a DS working in France. I have a question: I want to make my pipeline dynamic, i.e.:

```
[etape 1] > [etape 2] > [if score_etape2 <  X] > [etape4]
                        [if score_etape2 >= X] > [etape5]
```

Do you have any indication of how I can do this, or an example of code already implemented? Thanks in advance. Idris
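
A Kedro DAG is static, so conditional routing is usually done inside a node rather than between nodes. A minimal sketch (all names hypothetical):

```python
def etape4():  # placeholder for the real step
    return "etape4 result"

def etape5():  # placeholder for the real step
    return "etape5 result"

def etape3_router(score_etape2: float, x: float):
    # The DAG stays static; the branch happens inside this single node.
    return etape4() if score_etape2 < x else etape5()
```

Hooks, or separate pipelines selected at run time (`kedro run --pipeline=...`), are other common patterns.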

    Marc Gris

    06/21/2023, 7:10 AM
Hi everyone, I was experimenting with `@singledispatchmethod` from the `functools` library to refactor my code and create "per-model-type" implementations of `fit()`, `predict()`, etc. Unfortunately, this results in a `ValueError: Invalid Node definition: first argument must be a function, not 'singledispatchmethod'.` And indeed, in kedro/pipeline/node.py:72:

```python
if not callable(func):
    raise ValueError(
        _node_error_message(
            f"first argument must be a function, not '{type(func).__name__}'."
        )
    )
```

Is this "rejection" of functools.singledispatchmethod an unintended collateral of this check (in which case I could make a pull request to handle it), or are there some things "down the line" that would justify not allowing the use of functools & co? 🙂 Thx
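
If helpful as a stopgap: a module-level `functools.singledispatch` function is a plain callable, so it passes the `callable(func)` check quoted above, unlike the descriptor object that `singledispatchmethod` leaves on a class. A sketch with a hypothetical model type:

```python
from functools import singledispatch

from sklearn.linear_model import LinearRegression

@singledispatch
def fit(model, data):
    raise NotImplementedError(f"No fit() registered for {type(model).__name__}")

@fit.register
def _(model: LinearRegression, data):
    # per-model-type implementation, dispatched on the first argument;
    # assumes `data` is a dataframe with a "y" target column
    return model.fit(data.drop(columns="y"), data["y"])
```

`node(fit, ...)` would then be accepted as a regular function.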

    Marc Gris

    06/21/2023, 9:31 AM
CONFIG CONSOLIDATION / INTERPOLATION: in conf/config.yml I have `random_state: 42`, and in conf/model_training.yml `random_state: ${random_state}`.

```
kedro run
>>> [...]
TypeError: Cannot cast scalar from dtype('<U15') to dtype('int64') according to the rule 'safe'
```

If I get this correctly, the consolidation/interpolation process resulted in `random_state` being assigned the value `"42"` instead of `42`. Granted, I could easily circumvent this issue with `int(params['random_state'])`, but I'm curious and would like to know whether this is expected behavior, and whether there is a more robust/elegant way of handling it. Thx in advance, M
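
For what it's worth: if the templating mechanism substitutes text, the int becomes a string, whereas OmegaConf-style interpolation (used by `OmegaConfigLoader` in recent 0.18.x releases) resolves to the referenced value's type. A small sketch of the latter, assuming that loader is an option here:

```python
from omegaconf import OmegaConf

conf = OmegaConf.create(
    {"random_state": 42, "model_training": {"random_state": "${random_state}"}}
)
# Interpolation resolves to the int 42, not the string "42".
assert conf.model_training.random_state == 42
assert isinstance(conf.model_training.random_state, int)
```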