# questions
  • Elior Cohen (11/09/2022, 3:01 PM)
    Question regarding using kedro on managed notebooks. I'm working in an AzureML workspace, which, much like Databricks, has a managed Jupyter instance. I have a working kedro project; by working I mean I can access the data with
    %> kedro ipython
    >>> catalog.load('companies')
    The same kernel is also registered in the managed Jupyter instance. In the first cell I run
    %load_ext kedro.ipython
    %reload_kedro path/to/my/project
    Executing this results in the attached stack trace. What am I doing wrong?
    stacktrace.txt
  • Hervé Lauwerier (11/09/2022, 3:25 PM)
    Hello everyone! Is it possible to use a Jinja-generated catalog in order to generate all the nodes in a pipeline? I haven’t found a way yet to generate every node that would save every table to GCP with an identity function. The catalog contains around 150 tables to back up and I would like to avoid declaring every node by hand.
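    One possible approach, sketched below with made-up table names and a made-up dataset naming convention: build the node list programmatically in create_pipeline from the same list of tables that drives the Jinja template.
    from kedro.pipeline import Pipeline, node


    def identity(df):
        # pass-through function so each node simply re-saves its input dataset
        return df


    def create_pipeline(**kwargs) -> Pipeline:
        # assumption: the same ~150 table names that the Jinja template renders into the catalog
        table_names = ["table_a", "table_b"]
        return Pipeline(
            [
                node(
                    identity,
                    inputs=name,                # source catalog entry
                    outputs=f"{name}_backup",   # GCS-backed catalog entry
                    name=f"backup_{name}",
                )
                for name in table_names
            ]
        )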
  • AVallarino (11/09/2022, 6:27 PM)
    Hello everyone! I posted this question on Discord a few days ago, but I still can’t figure it out. I’m trying the starter Iris project, reading the raw .csv files from S3, but I’m having issues with versions (kedro, boto3, s3fs) using Virtualenv.
  • Zihao Xu (11/09/2022, 9:48 PM)
    Hi team, I am following the tutorial for kedro viz experiment tracking: https://kedro.readthedocs.io/en/stable/tutorial/set_up_experiment_tracking.html#set-up-your-nodes-and-pipelines-to-log-metrics. But I keep getting the error “You don’t have any experiments” within the kedro viz view. Here are a few observations:
    1. Within the spaceflights starter, I do not see the file src/settings.py, so I had to create it myself and paste in the specified content for SQLiteStore.
    2. The tutorial mentions “proceed to set up the tracking datasets or set up your nodes and pipelines to log metrics; these two activities are interchangeable.” But I had to implement both steps to make it work.
    3. After a few kedro runs, I do not see session_store.db appearing within the 09_tracking folder, which could be the reason for my error?
    Any insights would be greatly appreciated! Thanks team!
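    For reference, the settings.py content the tutorial refers to looks roughly like this minimal sketch (the file location and the data-folder path are assumptions based on the 0.18.x starters, where the file usually lives at src/<package_name>/settings.py rather than src/settings.py):
    from pathlib import Path

    from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

    SESSION_STORE_CLASS = SQLiteStore
    # keep the session database under the project's data folder so kedro-viz can pick it up
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}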
  • Amala (11/09/2022, 10:29 PM)
    Hi team, I’m looking for a programmatic way to rerun the failed nodes in a kedro pipeline. The pipeline is triggered by Airflow from the outside, so we were taking the approach of using the kedro-airflow plugin and doing the node/task retry by passing an argument. Looking for ways to do it without the plugin. Any help/pointers? Thanks in advance!
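    One plugin-free sketch, assuming the orchestrator can tell which node failed (the node and pipeline names below are hypothetical): drive the run through KedroSession and resume with from_nodes.
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())
    with KedroSession.create(project_path=Path.cwd()) as session:
        # resume the pipeline from the node that failed in the previous attempt
        session.run(pipeline_name="my_pipeline", from_nodes=["failed_node_name"])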
  • Cyril Verluise (11/10/2022, 1:49 PM)
    Hello there, hope you are doing great! Kedro issue: I'm trying to persist a dict as a YAML file. The related catalog entry is the following:
    optimisation_programme:
      type: yaml.YAMLDataSet
      filepath: data/test/05_model_input/optimisation_programme.yaml
      layer: model_input
    However, it fails with the following error:
    DataSetError: Failed while saving data to data set YAMLDataSet(filepath=/Users/cyril_verluise/Documents/GitHub/ClimaTeX/dist/apps/alhambra/rendered/alhambra/data/test/05_model_input/optimisation_programme.yaml, protocol=file,
    save_args={'default_flow_style': False}).
    'str' object has no attribute '__name__'
    Expected behaviour: from the docs, I understood that the save function is just a wrapper around yaml.dump, which should work with my `optimisation_programme` (a dict). Kedro version: 0.18.3. Any idea?
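    A quick way to narrow this down is to exercise the dataset directly with a plain dict (a sketch; the dict content is made up). If this saves fine, the error likely comes from what the upstream node actually returns, e.g. a string instead of a dict.
    from kedro.extras.datasets.yaml import YAMLDataSet

    ds = YAMLDataSet(filepath="data/test/05_model_input/optimisation_programme.yaml")
    ds.save({"steps": ["a", "b"], "budget": 10})  # a plain dict goes straight through yaml.dump
    print(ds.load())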
  • Zihao Xu (11/10/2022, 2:32 PM)
    Hi team, I have a quick question regarding kedro viz. When I try to re-open kedro viz a second time, after using “^z” to stop the first run, I often receive an error message like the following. Is there anything I should do differently to prevent this from happening (e.g., a different way to terminate)?
    [Errno 48] error while attempting to bind on address ('127.0.0.1', 4141):        server.py:156
                                 address already in use
  • Hervé Lauwerier (11/10/2022, 4:51 PM)
    Hello Team, how do I use a generated catalog with create_pipeline? My catalog is generated by a function and I would like my pipeline to be aware of it, but when I run it, all my datasets are reported as
    not found in the DataCatalog
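    One way to keep the generated catalog and the pipeline in sync, sketched with hypothetical names: register the generated datasets in an after_catalog_created hook driven by the same helper that create_pipeline uses for its inputs and outputs.
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.framework.hooks import hook_impl


    def generate_table_names():
        # hypothetical helper shared between this hook and create_pipeline
        return ["table_a", "table_b"]


    class GeneratedCatalogHooks:
        @hook_impl
        def after_catalog_created(self, catalog):
            # add one dataset per generated name so the pipeline's inputs resolve
            for name in generate_table_names():
                catalog.add(name, CSVDataSet(filepath=f"data/01_raw/{name}.csv"))
    The hook class still needs to be registered in settings.py via the HOOKS tuple.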
  • user (11/11/2022, 9:18 AM)
    Kedro - Getting path to item in the data catalog: I'm training an NLP model using spaCy. I have the preprocessing steps all written as a pipeline, and now I need to do the training. According to spaCy's documentation I need to run the following command: python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy. The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command...
  • Ian Whalen (11/11/2022, 3:19 PM)
    Question on using ParallelRunner: is there a way to supply the number of processes to use from the command line? Couldn’t find anything here.
  • Andrew Stewart (11/11/2022, 4:47 PM)
    Anyone here ever use kedro to run glue jobs? I know someone was talking about a glue runner plugin at one point.
  • Alicja Fras (11/14/2022, 11:05 AM)
    Question on Kedro with Jupyter Notebook. We have pipelines already built and we want to prepare notebooks as a guide for CSTs and other data scientists. Ideally we would like to have a notebook running nodes one by one, showing significant outputs along the way, step by step. What I found in the Kedro resources were rather notebook-to-Kedro-pipeline guides, and what I need is rather the other way around. https://kedro.readthedocs.io/en/stable/tools_integration/ipython.html#use-kedro-with-ipython-and-jupyter Another big limitation is that I can only run the pipeline once if I do not want to restart the notebook kernel, so it is impossible to run the first node, then pause, look at outputs, run the second node, etc. The only option is to run the full pipeline beforehand and only show intermediate outputs afterwards, without the corresponding snippets of code running. All things considered, it seems to me that the best solution here would be to import the nodes and, using access to the catalog, run them one by one as functions, providing all arguments in the notebook. Please let me know if you had a similar challenge and if you found a better solution. 🙂
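    The "import the node functions and drive them via the catalog" route mentioned above can look like this minimal sketch after %load_ext kedro.ipython and %reload_kedro (module, function and dataset names are hypothetical):
    from my_project.pipelines.data_processing.nodes import preprocess_companies  # hypothetical import

    companies = catalog.load("companies")           # inspect the raw input
    preprocessed = preprocess_companies(companies)  # run a single node's function
    preprocessed.head()                             # show the intermediate output in the notebook
    catalog.save("preprocessed_companies", preprocessed)  # optionally persist via the catalog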
  • Sasha Collin (11/15/2022, 6:35 PM)
    Hey 🙂 Is it possible to version a PartitionedDataSet?
  • Zemeio (11/16/2022, 5:11 AM)
    Hello everyone. Does anyone know if there is documentation on how to use Google Cloud Storage files instead of Amazon S3?
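    Since the bundled datasets are fsspec-based, moving from S3 to Google Cloud Storage is usually just a gs:// filepath plus gcsfs-style credentials; a sketch with a made-up bucket and a service-account key (assumes gcsfs is installed):
    from kedro.extras.datasets.pandas import CSVDataSet

    companies = CSVDataSet(
        filepath="gs://my-bucket/01_raw/companies.csv",
        credentials={"token": "conf/local/service-account.json"},  # passed through to gcsfs
    )
    df = companies.load()
    In catalog.yml this is the same entry with a gs:// filepath and a credentials key defined in credentials.yml.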
  • Safouane Chergui (11/16/2022, 10:07 AM)
    Hello everyone, is there a way to force a pipeline to run its nodes in the order they are declared within the pipeline? I know that I can create a dummy output just to force one node to run after another, but I’d like to know if there is a better way to accomplish this. Thanks
  • Sean Westgate (11/16/2022, 10:24 AM)
    Hi Team, is there a way to add the kedro-viz interactive graph to a static website? I found the kedro-static-viz plugin, but it doesn't work with 0.18.3. Looking at the kedro-viz readme, it suggests using a standalone React component with the pipeline.json as a prop. I tried it, but couldn't get it to work - I am not familiar with React. Do you have a simple example using a static website, or do you need a proper React environment? Many thanks
  • Mate Scharnitzky (11/16/2022, 10:33 AM)
    Hi Team, we have developed a kedro project locally and the team we’re supporting has an Amazon EMR Spark cluster environment. We would like to be able to run this kedro project on EMR, but we’re struggling to create a virtual environment, install requirements into that env, etc. given the security protocols. Do you have any recommendations/pointers on what steps we need to take to be able to run this kedro project on EMR? I didn’t find a supporting document in the official documentation.
  • Safouane Chergui (11/16/2022, 3:37 PM)
    Hello, I’d like to know how kedro processes the regex specified in globals_pattern. I’d like to do something like the snippet below, but it doesn’t work:
    TemplatedConfigLoader(
                conf_paths,
                globals_pattern="(globals*)|(another_parameters_file*)",
                globals_dict={"param1": "pandas.CSVDataSet"}
            )
    How can I accomplish this?
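    As far as I can tell from the 0.18 docs, globals_pattern is a glob-style pattern rather than a regex, so | alternation will not match anything; one workaround is a file-naming convention that a single glob covers, sketched here with hypothetical file names:
    from kedro.config import TemplatedConfigLoader

    # matches e.g. globals.yml and globals_extra.yml in the conf environments
    config_loader = TemplatedConfigLoader(
        "conf",
        globals_pattern="globals*",
        globals_dict={"param1": "pandas.CSVDataSet"},
    )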
  • Filip Panovski (11/17/2022, 10:03 AM)
    Hello! Is the yaml Loader part of the ConfigLoader configurable in any meaningful way? Or does kedro implement its own yaml parsing mechanism? We're trying to use some custom filtering that gets passed to the kedro.extras.datasets.dask.ParquetDataSet load_args. Specifically, we want to be able to do something like:
    # catalog.yml
    raw_data:
      type: dask.ParquetDataSet
      filepath: 's3://...'
      load_args:
        filters:
          - !!python/tuple ['year', '=', '2022']
          - !!python/tuple ['day', '=', '3']
          - !!python/tuple ['id', '=', 'someVal']
    dask (via filters, see docs) supports row-filtering on loaded data this way, and yaml (via tuple support in .yml files) supports the above definition. However, yaml unfortunately only supports this using either the non-default FullLoader or the UnsafeLoader (for controlled environments, see here). Is it possible to configure the ConfigLoader to use either of these? An example use case would be to filter only the rows belonging to all day = 3 partitions of any month in year = 2022. I could alternatively write a DataSet that parses this logic from plain string lists, but I was wondering if there's any existing support for something like this.
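    On the plain-string-list fallback mentioned at the end, a sketch of a thin wrapper dataset that turns YAML lists into the tuples dask/pyarrow expect, so the catalog stays standard YAML (class name and module placement are up to you):
    from kedro.extras.datasets.dask import ParquetDataSet


    class TupleFilterParquetDataSet(ParquetDataSet):
        """ParquetDataSet that accepts `filters` written as plain YAML lists."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            filters = self._load_args.get("filters")
            if filters:
                # [['year', '=', '2022'], ...] -> [('year', '=', '2022'), ...]
                self._load_args["filters"] = [tuple(f) for f in filters]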
  • Safouane Chergui (11/17/2022, 10:19 AM)
    Hello everyone, is there a way to use parameters that are passed to kedro run as an input to a node? Here is a quick example:
    • Kedro run command: kedro run --pipeline my_pipeline --params first_param:first_value
    • I’d like to use first_param as an input to a node without having to put it in parameters.yml just to use it as an input to my node. If not, is there a way to use it directly in code?
    Pipeline([
        node(
            do_something,
            inputs="first_param",
            outputs="some_output"
        )
    ])
    Thanks 👍
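    For what it's worth, values passed via --params are merged into the run's parameters, so they can usually be consumed with the params: prefix even when they are not in parameters.yml; a sketch mirroring the example above (do_something is a stand-in):
    from kedro.pipeline import Pipeline, node


    def do_something(first_param):
        # stand-in for the real node function
        return first_param


    pipeline = Pipeline(
        [
            node(
                do_something,
                inputs="params:first_param",  # resolves to the value given via --params first_param:first_value
                outputs="some_output",
            )
        ]
    )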
  • Debanjan Banerjee (11/17/2022, 11:12 AM)
    Hi Team, I created a CustomDataSet and, strangely enough, it works when I invoke kedro run --xxxxx from the terminal, but when I try to do catalog.load(xxxx) in ipython or kedro jupyter, it fails and raises the famous DataSet error: Dataset is not installed. Here is my catalog definition:
    ft_spine_prd:
      type: project_name.extras.datasets.dataset_filename.DataSetClass
      dataset_args:
        arg1
        ....
  • Debanjan Banerjee (11/17/2022, 11:13 AM)
    What would I be doing wrong?
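    One quick check worth trying: confirm the interactive kernel can import the class named under type, since this error is typically raised when that import fails rather than when the entry itself is wrong (dotted path below copied from the catalog definition above):
    import importlib

    module = importlib.import_module("project_name.extras.datasets.dataset_filename")
    print(module.DataSetClass)  # if this raises, the kernel is missing the package or one of its dependencies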
  • Debanjan Banerjee (11/17/2022, 1:32 PM)
    Hi Team, can we make the dataset loads for a kedro node parallel? I have 19 datasets that I need to read in a single function, but from the logs it takes about 2 minutes per dataset, i.e. ~40 minutes of read time. Is there any way we can make the reads parallel? I don't think they conflict in any way, so I don't see why the loading needs to be sequential.
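    The runner loads a node's inputs one after another, so one possible workaround, sketched here under the assumption that the files are CSVs, is to bundle the reads into a single custom dataset that loads them in a thread pool (I/O-bound reads usually thread well):
    from concurrent.futures import ThreadPoolExecutor

    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import AbstractDataSet


    class ConcurrentCSVBundle(AbstractDataSet):
        """Hypothetical dataset returning a dict of DataFrames, loaded concurrently."""

        def __init__(self, filepaths: dict, max_workers: int = 8):
            self._datasets = {name: CSVDataSet(filepath=path) for name, path in filepaths.items()}
            self._max_workers = max_workers

        def _load(self):
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                frames = pool.map(lambda ds: ds.load(), self._datasets.values())
            return dict(zip(self._datasets, frames))

        def _save(self, data):
            raise NotImplementedError("read-only sketch")

        def _describe(self):
            return {"datasets": list(self._datasets)}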
  • Panos P (11/17/2022, 11:17 PM)
    Hello kedro experts, is it possible to create dynamic pipelines based on params? For example, my parameters.yml file:
    pipelines:
     - test1
     - test2
    I want to return 2 pipelines like this:
    pipes = []
    for pipeline in pipelines:
       pipes.append(Pipeline([
          node(do_something, [f"params:{pipeline}"], [f"output_{pipeline}"], tags=pipeline)
       ]))
    In older versions of Kedro I was able to get the params before the creation of the pipelines and then work from there, like this:
    import os
    from typing import Any, Dict

    from kedro.config import ConfigLoader
    from kedro.framework.session import get_current_session


    def get_kedro_env() -> str:
        """Get the kedro --env parameter or default to local.

        Returns:
            The kedro --env parameter
        """
        return os.getenv("KEDRO_ENV", "local")


    def _get_config() -> ConfigLoader:
        """Get the kedro configuration context.

        Returns:
            The kedro configuration context
        """
        try:
            return get_current_session().load_context().config_loader
        except Exception:  # NOQA
            env = get_kedro_env()
            return ConfigLoader(["./conf/base", f"./conf/{env}"])


    def get_params() -> Dict[str, Any]:
        """Get all the parameter values from parameters.yml as a dictionary.

        Returns:
            The parameter values
        """
        return _get_config().get("parameters*", "parameters*/**", "**/parameters*")
    It seems like in kedro 0.18.3 there is no more load_context(). Any thoughts?
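    A minimal 0.18.x version of the same helper, assuming the default conf/ layout: ConfigLoader now takes a conf_source directory instead of a list of paths, and can be instantiated directly since the old session helpers are gone.
    import os
    from pathlib import Path
    from typing import Any, Dict, Optional

    from kedro.config import ConfigLoader


    def get_params(env: Optional[str] = None) -> Dict[str, Any]:
        # fall back to the KEDRO_ENV variable or "local", as in the old helper
        env = env or os.getenv("KEDRO_ENV", "local")
        conf_loader = ConfigLoader(conf_source=str(Path.cwd() / "conf"), env=env)
        return conf_loader.get("parameters*", "parameters*/**", "**/parameters*")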
  • Luis Cano (11/18/2022, 4:53 AM)
    Hi team, beginner question: is it possible to create a partitioned dataset where you store many pickles with models, i.e.:
    s3_path/train_outputs/
                     ├── 202201.pkl
                     ├── 202202.pkl
                     ├── 202203.pkl
                     ├── 202204.pkl
    ...
    and so on. Is there a way of saving pickles this way? Or maybe what would be a better way of doing this? Any thoughts?
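    This is roughly what PartitionedDataSet is designed for; a sketch with a made-up bucket, using pickle.PickleDataSet as the underlying dataset so each dictionary key becomes one .pkl file:
    from kedro.extras.datasets.pickle import PickleDataSet
    from kedro.io import PartitionedDataSet

    models = PartitionedDataSet(
        path="s3://my-bucket/train_outputs",
        dataset=PickleDataSet,
        filename_suffix=".pkl",
    )

    # keys become the file names: 202201.pkl, 202202.pkl, ...
    models.save({"202201": {"model": "a"}, "202202": {"model": "b"}})
    partitions = models.load()     # dict of partition id -> callable that loads that pickle
    first = partitions["202201"]()
    In catalog.yml the equivalent entry uses type: PartitionedDataSet with dataset: pickle.PickleDataSet.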
  • Jo Stichbury (11/18/2022, 10:12 AM)
    Hi all! A request from me to improve the docs 🤓 Has anyone found/made/published any great examples of converting a Jupyter notebook to a Kedro project? I know there's one from DataEngineerOne on YouTube using the natural history archives, but I was looking for something a bit simpler, without it being completely pointless. Something like an Iris-dataset example level of detail.
  • Vaibhav (11/18/2022, 12:05 PM)
    Hey, a small question on the mandatory attributes of [tools.kedro] in pyproject.toml. I noticed that project_name refers to the name of the project where kedro is being used, whereas project_version refers to the version of kedro being used, rather than the version of the project named in project_name; kedro validates a version match as part of its checks. Is it by design that the two attributes refer to different project entities?
  • MarioFeynman (11/18/2022, 1:28 PM)
    Hey! Quick question: is there any easy way to read a Delta table in kedro so that the final object is a pandas dataframe? Or do I have to write a custom DataSet to make this happen? I would like to avoid using the .toPandas() function inside every node for every input, or having to add a decorator to every function to achieve this. The main goal is to manage this with the kedro catalog only.
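    Kedro 0.18.3 ships a Spark-based Delta dataset but, as far as I know, no pandas-flavoured one, so one option is a small custom dataset built on the deltalake (delta-rs) package, sketched here as read-only (package availability and the catalog wiring are assumptions):
    import pandas as pd
    from deltalake import DeltaTable  # assumes the delta-rs Python bindings are installed

    from kedro.io import AbstractDataSet


    class PandasDeltaDataSet(AbstractDataSet):
        """Hypothetical read-only dataset returning a Delta table as a pandas DataFrame."""

        def __init__(self, path: str):
            self._path = path

        def _load(self) -> pd.DataFrame:
            return DeltaTable(self._path).to_pandas()

        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("read-only sketch")

        def _describe(self):
            return {"path": self._path}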
  • Debanjan Banerjee (11/18/2022, 1:49 PM)
    Hey Team, we were recently playing with a remote-connection dataset (it needed credentials.yml from /local) and the --env argument that we can specify to dictate the directionality of the pipeline. We know that if we specify --env, kedro "ignores" /base and /local and overrides them with whatever we pass to --env. Shouldn't kedro search for **credentials.yml separately (not together with parameters.yml/globals.yml)? Every user will have their own credentials, so shouldn't kedro check for credentials.yml separately, say only in /local and not in the --env environment?
  • Doug Warner (11/20/2022, 12:45 PM)
    Hi Team. Loving the look of kedro as a way to enforce modularity in work done by my actuarial team. I've done the spaceflights tutorial, and now I'm implementing my first real pipeline and I have two general "advice" questions. My current aim is modularity and best practice in a data processing pipeline (but with a lower learning curve for data scientists than full-on ETL pipelines). We're talking about "small data", on the order of 100k rows and 50 columns. I'm not so worried about the model building part yet. Questions are:
    1. I'd like to split the data prep pipeline into modular steps (some generic, i.e. "format all Boolean fields", "replace nulls", and some quite specific edge cases, "fix this particular named field"). My instinct is that each of these steps should be a node, but I note in the tutorials that quite often "prep this whole dataset" is one node. Is that just to keep the tutorial simple, or am I going down a bad path?
    2. I have tended in recent times to be a "no pandas in-place operations" stickler (i.e. use "assign" to create a new column rather than df['new_column_name']). I'm seeing in-place operations in the tutorials - is that just to keep things simple, or is there either a reason not to worry, or a performance-overhead reason why I should actually be using in-place operations? (Bearing in mind my "small data" use case.)
    Sorry to ask basic questions - but I'm super keen on what I think of as the "kedro mindset" of "do it neat and tidy first time to avoid problems later on" :-)
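    On both points, small single-purpose nodes and non-mutating pandas work well together in kedro; a toy sketch with invented column and dataset names:
    import pandas as pd

    from kedro.pipeline import Pipeline, node


    def format_booleans(df: pd.DataFrame) -> pd.DataFrame:
        # generic step: coerce Y/N flags to real booleans without mutating the input
        return df.assign(is_active=df["is_active"].eq("Y"))


    def replace_nulls(df: pd.DataFrame) -> pd.DataFrame:
        return df.fillna({"premium": 0.0})


    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(format_booleans, "raw_policies", "policies_boolified"),
                node(replace_nulls, "policies_boolified", "policies_clean"),
            ]
        )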