# questions
  • noam (05/04/2023, 2:55 PM)
    Hi Kedro users, does anyone know how one would implement versioning for a PartitionedDataSet? In other words, does anyone have a convenient solution for enabling versioning (i.e. setting "versioned: True" in the data catalog) for a PartitionedDataSet the same way one can for a PickleDataSet? The following is my code in conf/data/local/catalog.yml:
    ```yaml
    validation_data:
      type: kedro.io.PartitionedDataSet
      path: data/03_primary/validation_data/
      dataset: pickle.PickleDataSet
      filename_suffix: ".df"
      versioned: True
    ```
    Thanks in advance!
  • marrrcin (05/05/2023, 8:45 AM)
    [Kedro Starters] I'm wondering whether custom Kedro starters (especially from plugins) should have the same behaviour as official ones w.r.t. tags. Kedro starters work with `kedro new --starter=spaceflights`, but when we developed our own starter for Kedro Snowflake, it additionally requires the `--checkout=` flag, because of the default mechanism in Kedro:
    ```
    Error: Kedro project template not found at git+https://github.com/getindata/kedro-snowflake. Specified tag 0.18.8. The following tags are available: 0.0.1, 0.1.0, 0.1.1.
    ```
    Is there a way (or if not, I think there should be) for the plugin to specify the default tag to use? The versioning of Kedro should not affect the versioning of custom starters/plugins 🤔
    👀 2  👍 1  👍🏼 1
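    Until starters can declare a default tag, the workaround implied by the error message is to pin one of the starter's own tags via the flag it mentions; the starter alias below is an assumption about how the plugin registers itself:
    ```
    kedro new --starter=kedro-snowflake --checkout=0.1.1
    ```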
  • noam (05/05/2023, 12:01 PM)
    Thank you @Antony Milne and @Deepyaman Datta for responding to my question about data versioning with a PartitionedDataSet (one cannot use the `versioned: True` argument in the data catalog for this kind of dataset). Perhaps it is better that I explain the root issue/challenge, in case there are solutions I am missing.
    The Problem: By default, Kedro overwrites data objects with each run, using the paths set in the data catalog.
    The Question: What is a convenient solution/tech stack for enabling the execution of multiple parallel ML experiments in my Kedro pipeline, while maintaining that…
    1. Each experiment triggers the data to be versioned effectively. Ideally…
       a. When there are changes to the data, the data is copied and assigned a unique ID (sha, md5, timestamp), perhaps with metadata regarding the parameters that were used to generate the data. In this case, it is important that the data is stored in a sensible, organized manner.
       b. When there have been no changes to the data, the same unique ID (and metadata) are used and can be extracted.
    2. The unique IDs (and metadata) for each dataset relevant to the ML run can be extracted and stored alongside the (presumably lighter) results of the experiment.
    3. Given 1. and 2. above, the results are reproducible (they offer point-in-time correctness).
    The solutions I have thus far come across are problematic:
    • Writing a class to set dynamic dataset filepaths
      ◦ The main issue with this approach is that it is incredibly high-maintenance. It requires continuous, careful attention to the parameters used to define the dynamic filepaths.
        ▪ For example, if I set the filepath to `training_data_{a}_{b}` using parameters `a` and `b`, and I then change parameter `c`, changing the composition of the data, the new dataset will overwrite the previous one. If I had wanted to keep them both, I would have had to remember to update the filepath parameters to include `c`. Of course, with many different data-defining parameters, this becomes problematic rather quickly.
    • Use Kedro versioning: the `versioned: True` argument in the catalog underneath the datasets you want versioned
      ◦ The first issue with this approach is that it appears to version all of the data with every new run, presenting a massive storage issue and the need for a custom retention policy to clear useless/outdated data.
      ◦ The second issue is that this doesn't work with PartitionedDataSet datasets.
    Are there any effective solutions I am missing?
    ๐Ÿ‘ 1
    ๐Ÿ‘€ 1
    1000 1
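    One lightweight way to get the "unique ID per parameter set" behaviour sketched in 1a/1b, without hand-maintaining filepath templates, is to fingerprint the entire parameters dictionary, so a change to any parameter (including the forgotten `c`) produces a new ID automatically. A minimal sketch; the function name and the way the ID is spliced into a filepath are illustrative:
    ```python
    import hashlib
    import json

    def params_fingerprint(params: dict) -> str:
        """Stable short ID derived from *all* parameters."""
        blob = json.dumps(params, sort_keys=True, default=str).encode()
        return hashlib.md5(blob).hexdigest()[:8]

    # e.g. params_fingerprint({"a": 1, "b": 2, "c": 3}) -> an 8-char hex ID,
    # which could key a partition or folder: data/03_primary/training_{id}/
    ```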
  • Ofir (05/05/2023, 2:22 PM)
    Does Kedro work with DVC or any Data Version Control solution?
  • Ofir (05/05/2023, 3:04 PM)
    Does Kedro allow dumping experiment outputs (results) into different folders without changing the YAML files? We would like to run the same code (pipeline) but with different input datasets. What's the best way to achieve that without having to copy and paste the Kedro folders and then manually modify the YAML files? I looked at the documentation of `kedro run`, but it doesn't accept output dirs / a workspace as a parameter; it assumes you have the configuration files already in place. Perhaps I need to come up with my own `kedro` wrapper? (to auto-generate the configuration files and then call `kedro run`)
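    One way to do this without copying the project, for what it's worth: Kedro configuration environments let a second `conf/` environment override only the catalog entries that differ, selected at run time. The dataset and environment names below are illustrative:
    ```yaml
    # conf/experiment_b/catalog.yml -- overrides entries from conf/base/catalog.yml
    model_input:
      type: pandas.CSVDataSet
      filepath: data/01_raw/inputs_b.csv

    results:
      type: pickle.PickleDataSet
      filepath: data/07_model_output/experiment_b/results.pkl
    ```
    `kedro run --env=experiment_b` then uses these overrides while everything else still comes from `conf/base`.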
  • Ofir (05/05/2023, 3:11 PM)
    It's like the concept of "virtually forking/cloning" an existing experiment, with slight changes in the I/O.
  • Ofir (05/05/2023, 3:16 PM)
    Please correct me if I'm wrong, but it looks like Kedro's implementation has slightly overlooked the input dataset as a differentiating factor for an experiment. That is, Kedro doesn't consider a different input dataset to be a different session/experiment run.
  • Adrien (05/05/2023, 3:38 PM)
    Has anyone succeeded in deploying a Kedro pipeline on OVH?
  • Jose Nuñez (05/05/2023, 7:51 PM)
    Hi team, I have always had this question but never asked it before. I have a very short & simple pipeline (see the picture), and I have noticed that the Run Command of every function starts with `None.`; for instance, in the picture you can read `kedro run --to-nodes=None.clean_mdt`. Why is this? My pipeline executes just fine without any issues after a regular `kedro run`. If I do a `kedro run --to-nodes=None.clean_mdt` I'll get an error, so I manually need to erase the `None.` before running. Running `kedro run --to-nodes=clean_mdt` instead works just fine.
  • Rob (05/06/2023, 3:59 AM)
    Hi everyone, why are nodes in a namespace collapsed by default when deploying `kedro viz` with host `'0.0.0.0'`? I want to show them expanded by default. Here's an example of the behavior I'm seeing: https://brawlstars-retention-pipeline-6u27jcczha-uw.a.run.app/, and this doesn't happen with localhost. Can someone suggest a way to modify the code used in this thread to show the nodes expanded? Thanks 🙂
  • Dawid Bugajny (05/08/2023, 10:34 AM)
    Hello! I would like to ask if there is any way to create multiple pipelines in a single directory. I want to have two pipelines with the same nodes but different outputs. If I create a function that is not called `create_pipeline` but, for example, `create_pipeline_2` in the same directory, it will not be found by the `find_pipelines()` function (kedro.framework.project). I have seen modular pipelines, but I would still have to create a new directory for the pipelines (correct me if I am wrong).
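    Autodiscovery via `find_pipelines()` indeed only picks up functions named `create_pipeline`, but it can be combined with manual registration. A sketch of `pipeline_registry.py`, assuming a second factory `create_pipeline_2` in the same pipeline package (module and pipeline names are illustrative):
    ```python
    from kedro.framework.project import find_pipelines

    from my_package.pipelines.my_pipe import create_pipeline_2  # hypothetical module

    def register_pipelines() -> dict:
        pipelines = find_pipelines()                  # all create_pipeline() factories
        pipelines["my_pipe_2"] = create_pipeline_2()  # register the extra one by hand
        pipelines["__default__"] = sum(pipelines.values())
        return pipelines
    ```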
  • Andreas_Kokolantonakis (05/09/2023, 2:51 PM)
    Hi everyone, I am facing the following issue when trying to read a CSV with Spark. With pandas it works fine, but with Spark it seems I need some extra configuration. Could you please point me in the right direction? Thank you in advance!
  • Juan Luis (05/09/2023, 3:38 PM)
    Would anybody with a Windows machine be willing to lend me a hand with this? https://github.com/kedro-org/kedro/pull/2568#issuecomment-1539949117 There seems to be a test using `python -m build` that is failing, but I can't reproduce it locally.
  • Elena Mironova (05/09/2023, 3:43 PM)
    Hi team, a question on `kedro micropkg pull` and sdists here. Would anyone know how best to deal with the multiple egg-info error in the screenshot? Context: with `kedro==0.18.3` on a Mac, I used `python -m build --sdist path/to/package` to create the tar.gz inside our Git repo (I know an alternative is available to create the sdist through `kedro micropkg package`, but I can't do it from inside a kedro project). I see the archive, but when doing `kedro micropkg pull` (from the kedro project root), the following error comes up. This may or may not be related to the existing issue.
  • Javier del Villar (05/10/2023, 1:45 PM)
    Hello everybody, I have a problem: I'm unable to load a CSV folder from local storage using `SparkDataSet`. More details in thread.
  • Richard Bownes (05/10/2023, 2:46 PM)
    If I wanted to get the total run time for each node in a pipeline, what's the best method?
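    A common approach is a pair of node hooks that record wall-clock time per node; a minimal sketch (register an instance in `settings.py` under `HOOKS`):
    ```python
    import time

    from kedro.framework.hooks import hook_impl

    class NodeTimerHooks:
        """Log the run time of every node via before/after node hooks."""

        def __init__(self):
            self._starts = {}

        @hook_impl
        def before_node_run(self, node):
            self._starts[node.name] = time.perf_counter()

        @hook_impl
        def after_node_run(self, node):
            elapsed = time.perf_counter() - self._starts.pop(node.name)
            print(f"{node.name}: {elapsed:.2f}s")
    ```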
  • Mate Scharnitzky (05/10/2023, 3:33 PM)
    Hi everyone, we're in the process of relaxing our Kedro dependencies in our repository. The current Kedro version is `==0.18.3`. When we relax it to `~=0.18.3`, pip installs `0.18.8` while compiling, and we get the below error for some of our pipelines:
    ```
    KeyError: 'logging'
    ```
    Can you provide some pointers on what could be the reason behind this? `0.18.*` should have no breaking changes based on the RELEASE notes, so I'm not sure what could explain this. Thanks for the help!
  • Panos P (05/10/2023, 5:08 PM)
    Hello Kedro folks, I have a question on dynamic generation of nodes via parameters in my catalog. I have this parameters.yml file:
    ```yaml
    params:
     - p1
     - p2

    p1: value1
    p2: value2
    ```
    I want to create a pipeline with two nodes, where each node takes one of these params as input, e.g.:
    ```python
    nodes = [node(func, f"params:{p}", f"output_{p}") for p in params]
    pipeline(nodes)
    ```
    Is that possible, and how?
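    The comprehension is nearly right (note the closing parenthesis added above); the catch is that parameters are not injected into `create_pipeline`, so the list has to be loaded from config at pipeline-construction time. A sketch assuming kedro 0.18's `ConfigLoader` (`func` stands in for your own node function):
    ```python
    from kedro.config import ConfigLoader
    from kedro.pipeline import Pipeline, node

    from .nodes import func  # hypothetical node function

    def create_pipeline(**kwargs) -> Pipeline:
        # Parameters are not passed into create_pipeline, so read them directly.
        params = ConfigLoader(conf_source="conf").get("parameters*")["params"]
        return Pipeline(
            [
                node(func, inputs=f"params:{p}", outputs=f"output_{p}", name=f"make_{p}")
                for p in params
            ]
        )
    ```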
  • Brandon Meek (05/10/2023, 11:58 PM)
    Hey everyone, I have a modular pipeline that I'm running several times, the only difference being a small parameter change, but the dataset is really large. The issue I'm running into is that Kedro runs the first node of each modular pipeline first, causing an OOM issue. Is there a way to make it finish running one modular pipeline before beginning the next?
  • Melvin Kok (05/11/2023, 1:21 AM)
    Hi team, is there a way to access the Kedro catalog at some arbitrary point in a run? Context: I'm using hooks to run Great Expectations and need access to the Kedro context and catalog during an Action in Great Expectations. However, I can't pass the catalog from the hook into the Action, because I'm not calling the Action's `run` function directly (Great Expectations is doing it), and the config only allows passing in serializable objects.
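    One workaround is to capture the catalog in an `after_catalog_created` hook and expose it from a module that the Great Expectations Action can import, rather than serialize; a minimal sketch (module and dataset names are illustrative):
    ```python
    from kedro.framework.hooks import hook_impl

    class CatalogStore:
        """Stash the catalog when Kedro creates it, for later lookup."""

        catalog = None

        @hook_impl
        def after_catalog_created(self, catalog):
            CatalogStore.catalog = catalog

    # later, e.g. inside the GE Action (import it, don't pass it through config):
    # from my_package.hooks import CatalogStore
    # df = CatalogStore.catalog.load("some_dataset")
    ```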
  • Artur Dobrogowski (05/11/2023, 1:35 PM)
    Hello, where can I find a list of the available data catalog types? I'm looking for something that parses YAML, but I can't find it in the docs.
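    The available types are listed in the kedro-datasets (formerly `kedro.extras.datasets`) API docs; for parsing a YAML file specifically there is `yaml.YAMLDataSet`, e.g. (entry name and path are illustrative):
    ```yaml
    my_config:
      type: yaml.YAMLDataSet
      filepath: data/01_raw/config.yml
    ```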
  • Erwin (05/11/2023, 1:38 PM)
    Hi team! What are the options to orchestrate jobs using a Docker image in the Azure cloud? I have a Docker image with a Kedro project. I'm looking for options to run this on a daily basis, with observability, alerts, retries, etc. I worked with Data Factory in the past, but I'm not sure whether there is an activity to execute containers from the container registry.
  • Giuseppe Ughi (05/11/2023, 3:34 PM)
    Hi team! I'm currently trying to create a dynamic catalog in Kedro with Jinja2, but I'm struggling to import the list on which to build the catalog entries. I'm new to Jinja, so the solution might be trivial. For reproducibility's sake, I'm basing the following examples on the jinja-example 🙂. When considering a hard-coded list in the file conf/base/catalog.yml as follows:
    ```yaml
    {% for region in ['parasubicular', 'parainsular'] %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {{ region }}.data_right_output:
        type: pandas.CSVDataSet
        filepath: data/03_primary/{{ region }}_output.csv

    {% endfor %}
    ```
    everything works fine. However, I need to iterate over a list that is not practical to hard-code, so I was hoping to have something like the following:
    ```yaml
    regions:
      - 'parasubicular'
      - 'parainsular'

    {% for region in regions %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {{ region }}.data_right_output:
        type: pandas.CSVDataSet
        filepath: data/03_primary/{{ region }}_output.csv

    {% endfor %}
    ```
    but no matter where I define the `regions` list (I tried defining it in different `.yml` files), I stumble on the same error, screenshotted below. Do you by chance know whether I have to save the Jinja pattern in a different file, whether there is a specific place where I have to save the list that I want to read, or whether I have to change the parsing somehow? Thank you in advance!!
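    For what it's worth, the second snippet mixes a YAML key (`regions:`) into the Jinja template, but the template is rendered before the YAML is parsed, so a YAML-defined list is never visible to `{% for %}`. Defining the list as a Jinja variable inside the same template should work:
    ```yaml
    {% set regions = ['parasubicular', 'parainsular'] %}
    {% for region in regions %}

    {{ region }}.data_right:
        type: PartitionedDataSet
        path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
        dataset: pandas.CSVDataSet
        filename_suffix: /{{ region }}_R_T1.nii.gz

    {% endfor %}
    ```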
  • Toni - TomTom - Madrid (05/12/2023, 7:52 AM)
    Good morning! Has anyone tried to run Databricks Mosaic with Kedro? Running it locally is challenging: https://stackoverflow.com/questions/72289894/how-to-run-mosaic-locally-outside-databricks This solution did not work, but maybe that's because I am not tuning it properly in Kedro :S. Any help is appreciated! Thanks
    📢 1
  • Nitin Soni (05/12/2023, 12:23 PM)
    Hi, I'm Nitin Soni. I tried to create conditional pipelines; how can I do this? Please suggest something, as I'm totally new to Kedro.
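    Kedro has no built-in conditional nodes; a common pattern is to register the alternative flows as separate pipelines and pick one at run time with `kedro run --pipeline <name>`. A sketch (module and pipeline names are illustrative):
    ```python
    # pipeline_registry.py
    from my_package.pipelines import base, extra  # hypothetical pipeline packages

    def register_pipelines() -> dict:
        base_pipe = base.create_pipeline()
        extra_pipe = extra.create_pipeline()
        return {
            "with_extra": base_pipe + extra_pipe,  # run when the condition holds
            "without_extra": base_pipe,           # run otherwise
            "__default__": base_pipe,
        }
    ```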
  • Ofir (05/14/2023, 8:43 AM)
    A colleague of mine is trying to join Kedro's Slack, but it seems like Kedro's Slack has exceeded its member capacity. Can anyone help with that? Moderators? Admins?
  • Chengjun Jin (05/14/2023, 8:55 PM)
    Hi there, a question regarding Kedro + Great Expectations. I am following the example at https://docs.kedro.org/en/stable/hooks/examples.html#add-data-validation (V3). This is my checkpoint yaml:
    ```yaml
    name: pm_stat_checkpoint
    config_version: 1.0
    template_name:
    module_name: great_expectations.checkpoint
    class_name: Checkpoint
    run_name_template:
    expectation_suite_name:
    batch_request: {}
    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: store_evaluation_params
        action:
          class_name: StoreEvaluationParametersAction
      - name: update_data_docs
        action:
          class_name: UpdateDataDocsAction
          site_names: []
    evaluation_parameters: {}
    runtime_configuration: {}
    validations:
      - batch_request:
          datasource_name: default_pandas_datasource
          data_asset_name: my_runtime_asset_name
          data_connector_name: default_runtime_data_connector_name
        expectation_suite_name: pm_expectation_suite
    profilers: []
    ge_cloud_id:
    expectation_suite_ge_cloud_id:
    ```
    This is the output I get:
    ```
    INFO - Loading data from 'pm_sales_raw' (ParquetDataSet)...
    INFO - FileDataContext loading fluent config
    INFO - Loading 'datasources' ->
    [{'name': 'default_pandas_datasource', 'type': 'pandas'}]
    INFO - Loaded 'datasources' ->
    []
    INFO - Of 1 entries, no 'datasources' could be loaded
    ...
    ...
    DatasourceError: Cannot initialize datasource default_pandas_datasource, error: The given datasource
    could not be retrieved from the DataContext; please confirm that your configuration is accurate.
    ```
    It seems that there is a problem in loading the data source. Did I miss some steps? Thank you
  • Mate Scharnitzky (05/15/2023, 12:03 PM)
    `kedro-datasets` dependencies: Hi team, where do you define dependencies for the `kedro-datasets` package? We ran into some pip resolver issues, and it turned out that `kedro-datasets==1.0.0` and above require kedro to be `kedro~=0.18.4`. We can verify this with `pip install kedro-datasets==0.0.7 --dry-run`, but we can't find where this dependency is actually defined; in `setup.py` it's not mentioned. Thank you! @Kasper Janehag
  • Andrew Doherty (05/15/2023, 2:27 PM)
    Hi all, I am developing a namespace pipeline and have run into an issue when using Neptune to track my experiments. When I pass `"neptune_run"` as an input to a pipeline node, I get the following error:
    ```
    ValueError: Pipeline input(s) {'NAMESPACE.neptune_run'} not found in the DataCatalog
    ```
    where `"NAMESPACE"` is my namespace pipeline name. Is there a way to use Neptune along with namespace pipelines? Thanks again.
    👀 1
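    The usual cause of this error: the modular `pipeline()` wrapper prefixes every input with the namespace unless it is declared as an external input. Listing `neptune_run` under `inputs` keeps the catalog name un-namespaced; a sketch (`base_pipeline` is an illustrative name for the wrapped pipeline):
    ```python
    from kedro.pipeline import pipeline

    namespaced = pipeline(
        base_pipeline,
        namespace="NAMESPACE",
        inputs={"neptune_run"},  # resolves to "neptune_run", not "NAMESPACE.neptune_run"
    )
    ```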
  • Amanda Locatelli (05/15/2023, 2:52 PM)
    Hey everyone, I have to enforce a timeout argument for my Postgres connection in a Kedro project. I have included it in catalog.yml, but I keep receiving this error:
    ```
    lib/python3.7/site-packages/kedro/io/core.py", line 191, in load
        raise DataSetError(message) from exc
    kedro.io.core.DataSetError: Failed while loading data from data set SparkPostgresJDBCDataSet(load_args={'properties': {'connectTimeout': 300, 'driver': org.postgresql.Driver}}, option_args=True, save_args={'properties': {'driver': org.postgresql.Driver}}, table= , url=jdbc:postgresql:).
    An error occurred while calling o49135.setProperty. Trace:
    py4j.Py4JException: Method setProperty([class java.lang.String, class java.lang.Integer]) does not exist
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
            at py4j.Gateway.invoke(Gateway.java:274)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
            at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
            at java.lang.Thread.run(Thread.java:750)
    ```
    Does anyone know what I am doing wrong and how to fix it? Or which files am I supposed to update with this timeout argument?
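    The Py4J trace hints at the cause: `setProperty(String, Integer) does not exist`, i.e. JDBC connection properties must be strings. Quoting the value in catalog.yml, so YAML does not parse it as an integer, may be all that is needed; a sketch of the reported entry (the dataset type path and URL placeholders are assumptions based on the error message):
    ```yaml
    my_table:
      type: my_project.datasets.SparkPostgresJDBCDataSet
      table: my_table
      url: jdbc:postgresql://<host>/<db>
      load_args:
        properties:
          connectTimeout: "300"   # quoted -> java.lang.String, not java.lang.Integer
          driver: org.postgresql.Driver
    ```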