# questions

    Dotun O

    04/11/2023, 7:27 PM
    Hi team, is there a quick way to get all the node names (across multiple groups) for a pipeline run programmatically?
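    (A minimal sketch of one way to do this, assuming the pipelines are registered as usual; "__default__" stands in for whichever registered pipeline you run:)
        from kedro.framework.project import pipelines

        # Node names, including any namespace prefixes, for the chosen pipeline
        node_names = [node.name for node in pipelines["__default__"].nodes]
        print(node_names)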

    Елена Сидорова

    04/12/2023, 5:15 AM
    Hi everyone - I have a single output from a node which is a dictionary. I get an error:
    "ValueError: Failed to save outputs of node parallel_get_temp_data([dates_to_download]) -> [downloaded_dates].
    The node definition contains a list of outputs ['downloaded_dates'], whereas the node function returned a 'dict'."
    Could you, please, help me to make it work? Thank you in advance!
    Copy code
    import logging
    from typing import Dict
    import typing
    import pandas as pd
    from sg_api.nodes.kedro_temperature_nodes import get_temp_data, choose_station
    from kedro.pipeline import Pipeline, node
    from multiprocessing.dummy import Pool
    
    
    def generate_date_range(
            start_date: str,
            end_date: str,
    ):
        dates_to_download = {
            str(dt.date()): True
            for dt in pd.date_range(start_date, end_date)
        }
        return dates_to_download
    
    
    def parallel_get_temp_data(dates_to_download: Dict[str, bool]) -> Dict[str, Dict]:
        """
    
        Args:
            dates_to_download (object):
        """
        logger = logging.getLogger('parallel_get_temp_data')
    
        def _get_temp_data(dt):
            logger.info(f"Start  Download {dt}")
            try:
                date_data = get_temp_data(dt)
            except KeyboardInterrupt:
                raise
            except Exception as e:
                logger.error(f"Failed Download {e}")
                date_data = None
            logger.info(f"Finish Download {dt}")
            return dt, date_data
    
        with Pool(10) as p:
            downloaded_data = p.map(_get_temp_data, dates_to_download.keys())
            downloaded_data = filter(lambda x: x[1] is not None, downloaded_data)  # drop dates whose download failed (each x is a (date, data) tuple)
    
        downloaded_data_dict = dict(downloaded_data)
        return downloaded_data_dict
    
    
    def parallel_choose_station(
            downloaded_data_dict: Dict,
            station_id: str,
    ):
        logger = logging.getLogger('parallel_choose_station')
    
        def _choose_station(item):
            dt = item[0]
            dt_data = item[1]
            logger.info(f"Start Choose Station {dt}")
            station_data = choose_station(dt_data, station_id)
            logger.info(f"Finish Choose Station {dt}")
            return dt, station_data
    
        with Pool(10) as p:
            downloaded_station_data = p.map(_choose_station, downloaded_data_dict.items())
    
        return dict(downloaded_station_data)
    
    
    def create_pipeline():
    
        return Pipeline([
            node(
                generate_date_range,
                inputs=['params:start_date', 'params:end_date'],
                outputs='dates_to_download'
            ),
            node(
                parallel_get_temp_data,
                inputs=['dates_to_download'],
                outputs=['downloaded_dates'],
            ),
            node(
                parallel_choose_station,
                inputs=['downloaded_dates', 'params:station_id'],
                outputs=['downloaded_station_data'],
            )
        ])
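    (The mismatch is between outputs=['downloaded_dates'], which tells Kedro to expect a list/tuple of return values, and the dict the function actually returns. A minimal sketch of the likely fix, assuming the whole dict should be stored as one dataset: declare the output as a plain string; the same applies to the third node.)
        from kedro.pipeline import node

        fixed = node(
            parallel_get_temp_data,
            inputs="dates_to_download",
            outputs="downloaded_dates",  # single dataset name, so the returned dict is saved as-is
        )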

    viveca

    04/12/2023, 3:45 PM
    Hi all, I have a question possibly more about s3fs than kedro, but at least it’s related 🙂 I’m trying to export a plotly image to s3 using
    fig.write_html()
    . So I made a simple custom dataset
    Copy code
    class PlotlyHTMLDataSet(JSONDataSet):
        """Export plotly figure to html"""
    
        def _save(self, data: go.Figure) -> None:
            save_path = get_filepath_str(self._get_save_path(), self._protocol)
    
            with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
                data.write_html(fs_file, **self._save_args)
    
            self._invalidate_cache()
    This worked fine… except the content-type of the html file on s3 “ends up” being “binary/octet-stream”, but should be “text/html”. This becomes a problem when trying to display this in a browser. Anyone got experience of args you could pass here to manually set the content type? Not my area of expertise. Thanks, Viveca
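    (One thing that may be worth trying, untested: s3fs accepts s3_additional_kwargs and forwards them to the underlying S3 calls, so ContentType should end up on the upload. Kedro passes fs_args, minus the open_args_* keys, to the fsspec filesystem constructor, so a catalog entry along these lines might do it; the type path and bucket below are placeholders.)
        my_plot:
          type: my_project.extras.datasets.plotly_html_dataset.PlotlyHTMLDataSet  # hypothetical module path
          filepath: s3://my-bucket/reports/figure.html                            # placeholder location
          fs_args:
            s3_additional_kwargs:
              ContentType: text/html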

    Filip Wójcik

    04/13/2023, 7:51 AM
    Hi guys, our team is wondering whether it is possible to have collapsible pipelines in Kedro-Viz without applying a namespace to every dataset belonging to that pipeline? Sometimes it seems to over-complicate things if your only use case is to have a nice, collapsible look 🙂 Thanks in advance, Filip
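    (As far as I know the collapsible grouping in Kedro-Viz is driven by namespaces, but you don't have to rename every dataset: anything listed in the inputs/outputs of the pipeline() wrapper keeps its original, un-prefixed name. A sketch with illustrative names:)
        from kedro.pipeline import pipeline

        collapsible = pipeline(
            base_pipeline,                 # an existing Pipeline object
            namespace="data_science",      # what Kedro-Viz collapses on
            inputs={"model_input_table"},  # datasets listed here keep their original names
            outputs={"model_metrics"},
        )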

    marrrcin

    04/13/2023, 10:41 AM
    Is there any specific reason why
    .get
    method of the config loader behaves differently between
    ConfigLoader
    and
    OmegaConfigLoader
    ?
    Copy code
    cl = ConfigLoader("conf")
    cl.get("custom*")
    {'test': ':)'}
    
    cl = OmegaConfigLoader("conf")
    cl.get("custom*")
    None
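    (If I remember correctly, OmegaConfigLoader doesn't override dict.get, so .get("custom*") just looks up the literal key "custom*" and falls back to None, whereas ConfigLoader still ships the old glob-based get(). With OmegaConfigLoader the pattern is registered up front and accessed by key; a sketch, assuming a recent 0.18.x release where config_patterns is accepted:)
        from kedro.config import OmegaConfigLoader

        cl = OmegaConfigLoader(
            "conf",
            config_patterns={"custom": ["custom*", "custom*/**", "**/custom*"]},
        )
        print(cl["custom"])  # loads every file matching the registered patterns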

    FlorianGD

    04/13/2023, 12:57 PM
    Hi, I have a question regarding the
    APIDataSet
    . Would you accept a PR to add a
    _save
    method so we can save data with a POST request to an endpoint ? We sometimes have to update data on a django app (or any rest api), and we create a custom dataset, but it feels like this could be integrated to the generic
    APIDataSet

    Juan Diego

    04/13/2023, 3:07 PM
    Hi all! Kedro 0.18.7: We’re trying to modify pipelines “extra params” on the fly, and we guessed that a
    before_pipeline_run
    hook is the way to go. Can you advise us on the best way to achieve this? What we tried so far is condensed in this code.
    Copy code
    class ParamsHook:
        @hook_impl
        def before_pipeline_run(
            self, run_params: Dict[str, Any], pipeline: Any, catalog: DataCatalog
        ) -> None:
            catalog.add_feed_dict({"params:country": MemoryDataSet("ESP")}, replace=True)
    In the hook: 1. In
    run_params
    we can see:
    'extra_params': {'country': 'USA'}
    2. In
    catalog.list()
    this entry:
    'params:country'
    ,before and after invoking
    add_feed_dict
    But when the params are printed in the node, the value persists with the original value parsed by:
    kedro run --params country=USA
    Many thanks in advance! NOTE: The objective here is to be able to parse a list from the CLI, let’s say:
    --params countries="ESP<>USA"
    and do the split in the hook.

    Vladislav Stepanov

    04/13/2023, 3:46 PM
    hello all! I'm trying to run my project in Databricks and I'm initiating the session with: from kedro.framework.session import KedroSession; from kedro.framework.startup import bootstrap_project; bootstrap_project(project_root); with KedroSession.create(project_path=project_root, env="databricks") as session: session.run(). bootstrap_project works and gives me info which points correctly to the folder where my project is stored in Databricks
    Copy code
    ProjectMetadata(
        config_file=PosixPath('/Workspace/Repos/.../.../.../modeling/.../.../scalability/pyproject.toml'),
        package_name='scalability',
        project_name='scalability',
        project_path=PosixPath('/Workspace/Repos/.../.../.../modeling/.../.../scalability'),
        project_version='0.18.4',
        source_dir=PosixPath('/Workspace/Repos/.../.../.../modeling/.../.../scalability/src')
    )
    but when session.run() is executed it gives me an error: ModuleNotFoundError: No module named 'scalability.pipeline_registry', and I have this file under /Workspace/Repos/.../.../.../modeling/.../.../scalability/src/scalability/pipeline_registry. Even though source_dir from ProjectMetadata shows the correct path, why does it give me this error? Thanks in advance!
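    (A common workaround in Databricks Repos, offered as a hedged sketch rather than a confirmed fix: make sure the package's src directory is on sys.path before bootstrapping, since the notebook's working directory is not necessarily the project root. The path below keeps the original elision and is only a placeholder.)
        import sys
        from pathlib import Path

        project_root = Path("/Workspace/Repos/.../.../.../modeling/.../.../scalability")  # placeholder path
        sys.path.insert(0, str(project_root / "src"))  # makes scalability.* importable

        from kedro.framework.startup import bootstrap_project
        from kedro.framework.session import KedroSession

        bootstrap_project(project_root)
        with KedroSession.create(project_path=project_root, env="databricks") as session:
            session.run()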

    Suhas Kotaki

    04/14/2023, 7:29 AM
    Hello All! I have an issue with regard to error handling: the pipeline in Airflow would run showing all of the tasks (each task is a Kedro pipeline) as successful despite errors. Currently I am trying to run Kedro pipelines as tasks in Airflow, orchestrated to run in sequence. Given the definition of the DAG, the success status of one task triggers the next task. All of the tasks are defined as Notebook tasks, where each task executes the code in the cells of a Jupyter notebook. Once the kedro module is imported, the error display with traceback has the output type “display_data” and not “error”. As a consequence, the task reports success despite the failure to execute the code. Our current understanding is that the “rich” library is formatting the output type of the cell for the error, hence the task does not register the failure, assumes success, and the next task gets triggered in the pipeline running on Airflow. What is the right version of the rich library that can avoid this issue? Is there a way we could turn off the traceback module that is changing the output type from “error” to “display_data”? Note: try/except and re-raising the exception was tried and the error output type is still “display_data” or INFO and not error. CC: @Bernardo Sandi

    divas verma

    04/14/2023, 11:17 AM
    Hi team, seeing this error while trying to load a parquet file using
    catalog.load
    in kedro 0.18.4
    Copy code
    Py4JJavaError: An error occurred while calling o186.load.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
    org.apache.hadoop.fs.s3a.S3AFileSystem not found
    any thoughts on what could be going on here?
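    (org.apache.hadoop.fs.s3a.S3AFileSystem comes from the hadoop-aws package, so the Spark session needs that jar, and its matching AWS SDK, on its classpath. If your project builds the SparkSession from conf/base/spark.yml as in the Kedro Spark docs, something like the entry below usually fixes it; the version is illustrative and must match your Hadoop build:)
        # conf/base/spark.yml
        spark.jars.packages: org.apache.hadoop:hadoop-aws:3.3.2   # placeholder version; match your Hadoop distribution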

    Rafael Hoffmann Fallgatter

    04/14/2023, 11:41 AM
    Hey guys! Can I edit the globals params using the CLI? For example, change one parameter of the globals during kedro run

    Ben Levy

    04/14/2023, 1:14 PM
    Hey fellow kedroids 🤖! Apologies if this has already been answered, but is there a standard or semi-standard approach to CI/CD for kedro projects? I.e., some example config files for CircleCI or GH actions, maybe an orb or a GH action that can be used off-the-shelf? Thanks friends!
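    (There's no official orb or action that I'm aware of, but a plain workflow that installs the project and runs the test suite covers the basics. A minimal GitHub Actions sketch; the file path, Python version and requirements location are just examples:)
        # .github/workflows/ci.yml
        name: ci
        on: [push, pull_request]
        jobs:
          test:
            runs-on: ubuntu-latest
            steps:
              - uses: actions/checkout@v3
              - uses: actions/setup-python@v4
                with:
                  python-version: "3.10"
              - run: pip install -r src/requirements.txt
              - run: pip install pytest
              - run: pytest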

    Andrew Stewart

    04/14/2023, 6:10 PM
    Anyone ever encounter being unable to import tensorflow in a kedro project even when any other module installed in the project can be imported? (this is all in a fresh docker build)

    Andrew Doherty

    04/15/2023, 11:31 AM
    Hi all, I am developing a modular namespace data science pipeline very similar to the ML example shown in this documentation: https://docs.kedro.org/en/0.18.2/tutorial/namespace_pipelines.html. In the latest documentation this is not covered. For 0.18.7 is this no longer a recommended design pattern? Also, when using namespace pipelines is it possible to have global parameters that apply to all pipelines? For example if I had several ML pipelines that I wanted to train with the same
    train_start
    and
    train_end
    time I could set a global parameter rather than duplicating these for each namespace? If not, is there a way to replicate this functionality? Thanks a lot for your time, Andrew
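    (On the shared-parameters point: the pipeline() wrapper's parameters argument lets each namespaced instance keep pointing at the same top-level parameters instead of namespaced copies. A sketch with illustrative names; train_model stands in for your training node function:)
        from kedro.pipeline import Pipeline, node, pipeline

        def _ml_pipeline() -> Pipeline:
            return pipeline([
                node(
                    train_model,  # placeholder for your training node function
                    inputs=["model_input", "params:train_start", "params:train_end"],
                    outputs="model",
                ),
            ])

        # Both instances read the same global train_start/train_end parameters.
        shared = {"params:train_start": "params:train_start",
                  "params:train_end": "params:train_end"}
        pipe = (
            pipeline(_ml_pipeline(), namespace="model_a", parameters=shared)
            + pipeline(_ml_pipeline(), namespace="model_b", parameters=shared)
        )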

    Rob

    04/16/2023, 2:48 AM
    Hi everyone, QQ - Is there a way to save a dictionary with multiple
    'plotly.JSONDataSet'
    s, save them into the catalog as a single item and plot all of them in
    kedro-viz
    ? Any other suggestion is welcome 🙂 Thanks in advance!

    Massinissa Saïdi

    04/17/2023, 9:03 AM
    Hello kedroids, is it possible to use a whole pipeline as a function inside an API endpoint? Thx
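    (One rough sketch of doing that, not an official pattern: create a KedroSession inside the endpoint and run a named pipeline; session.run() returns the outputs that are not persisted in the catalog. FastAPI, the pipeline name and the project path below are placeholders.)
        from pathlib import Path
        from fastapi import FastAPI
        from kedro.framework.session import KedroSession
        from kedro.framework.startup import bootstrap_project

        PROJECT_PATH = Path(__file__).resolve().parents[1]  # placeholder: your project root
        bootstrap_project(PROJECT_PATH)
        app = FastAPI()

        @app.post("/score")
        def score():
            # Each request runs the registered "inference" pipeline (hypothetical name);
            # serialise the returned outputs as appropriate for your response model.
            with KedroSession.create(project_path=PROJECT_PATH) as session:
                return session.run(pipeline_name="inference")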

    Elielson Silva

    04/17/2023, 12:28 PM
    Hello folks! Should Kedro replace global parameters when specifying parameters at runtime? Ex: kedro run --params="base_path=./new_source/data"

    Matheus Sampaio

    04/17/2023, 2:22 PM
    Hi folks! Does the command
    kedro package
    support
    pyproject.toml
    ? documentation

    Gary McCormack

    04/17/2023, 3:56 PM
    Hi All, I want my logs directory to have sub directories inside that segment them by date. So logs that are generated by daily pipeline runs will be in their own individual folder. I assume from what I read that this is not available using Kedro's
    conf/base/logging.yaml
    by default (please correct me if I'm wrong), so I would have thought that I would need to create my own custom logger. If I remove the logging conf file however then this raises an exception (
    ValueError: dictionary doesn't specify a version
    ). Any suggestions or advice on how to set up a custom logger to run on Kedro would be great!
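    (Kedro hands that file to logging.config.dictConfig, which is why you see the "doesn't specify a version" error when it goes missing: the config must at least contain version: 1. If date-stamped files are enough, rather than true per-day subfolders, which would need a small custom Handler subclass, a TimedRotatingFileHandler keeps things declarative. A sketch of the logging config file:)
        version: 1
        disable_existing_loggers: False
        handlers:
          console:
            class: logging.StreamHandler
            level: INFO
          daily_file:
            class: logging.handlers.TimedRotatingFileHandler
            level: INFO
            filename: logs/run.log   # rotated copies get a date suffix, e.g. run.log.2023-04-17
            when: midnight
            backupCount: 30
        root:
          level: INFO
          handlers: [console, daily_file]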

    Jordan

    04/17/2023, 4:05 PM
    I'm experimenting with building docs. I haven't done this in Kedro before, and I don't have any experience in web development. After I build the docs and open the resulting html file in the browser, I would have expected the content to fill the browser window, rather than occupy only a certain width. Is this the case for everyone, or have I done something wrong? If this is the default behaviour, can it be changed?
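    (If the docs use sphinx_rtd_theme, which I believe is what the Kedro template sets up, the fixed content width comes from the theme rather than anything you did wrong, and a small CSS override removes it. A sketch using the usual Sphinx conventions:)
        # docs/source/conf.py
        html_static_path = ["_static"]
        html_css_files = ["custom.css"]
        # docs/source/_static/custom.css then only needs:
        #   .wy-nav-content { max-width: none; }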

    Andrew Doherty

    04/17/2023, 4:36 PM
    Hi all, another question 🙂 . I am creating a modular namespace pipeline for ingesting multiple files. The
    parameters.yml
    file looks something like this:
    Copy code
    namespace1:
        raw_data:
            - datasource1
            - datasource2

    namespace2:
        raw_data:
            - datasource1
            - datasource3
    I have configured a catalog with aligned naming that looks like the following:
    Copy code
    namespace1.datasource1:
        filepath: data/namespace1/source1.csv
        type: pandas.CSVDataSet
    
    namespace1.datasource2:
        filepath: data/namespace1/source2.csv
        type: pandas.CSVDataSet
    
    namespace2.datasource1:
        filepath: data/namespace2/source1.csv
        type: pandas.CSVDataSet

    namespace2.datasource3:
        filepath: data/namespace2/source3.csv
        type: pandas.CSVDataSet
    I have many more datasources than shown here, which is where the challenge lies. I was wondering if I could create a node that would loop round all of the datasources and then dynamically save to the correct locations, like:
    Copy code
    nodes = [
        node(
            func=get_data, # this would loop through all the raw_data entries and download data into a list of df's
            inputs="params:raw_data", # passing the list ["datasource1", "datasource2"]
            outputs="params:raw_data" # passing the list ["datasource1", "datasource2"] as catalog entries
        )
    ]
    This would mean that the inputs and outputs would be dynamic based on the
    parameters.yml
    and if any additional datasources are added/removed this would be reflected. This method does not work, as the string "params:raw_data" is passed rather than the parameters for the outputs. Does anyone have a suggestion for how to make this dynamic and avoid creating a node per data source with hard-coded inputs and outputs, or modifying the structure of my parameters file? Thanks again
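    (Outputs can't be resolved from params: at run time, but the same effect is available at pipeline-construction time: read the structure from a plain dict, or the config loader, inside create_pipeline and build one node per namespace, so the output list is expanded before Kedro sees it. A sketch reusing the names above; get_data is assumed to return one dataframe per source, in order:)
        from kedro.pipeline import Pipeline, node, pipeline

        RAW_DATA = {  # mirror of parameters.yml; could equally be loaded via the config loader
            "namespace1": ["datasource1", "datasource2"],
            "namespace2": ["datasource1", "datasource3"],
        }

        def create_pipeline(**kwargs) -> Pipeline:
            nodes = [
                node(
                    func=get_data,
                    inputs=f"params:{ns}.raw_data",
                    outputs=[f"{ns}.{source}" for source in sources],  # expanded here, not at run time
                    name=f"get_raw_data_{ns}",
                )
                for ns, sources in RAW_DATA.items()
            ]
            return pipeline(nodes)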

    charles

    04/18/2023, 4:10 PM
    Hi folks - I am trying to do something very simple -> load a JSON from S3, and I am seeing the following issue:
    Copy code
    DataSetError: Failed while loading data from data set JSONDataSet().
    module 'aiobotocore' has no attribute 'AioSession'
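    (That error usually means s3fs and aiobotocore/botocore have drifted out of sync in the environment; recent s3fs releases pin a compatible aiobotocore, so upgrading them together tends to clear it. A hedged suggestion, exact versions depend on the rest of your pins:)
        pip install -U "s3fs>=2022.1" "aiobotocore>=2.1"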

    Aaditya

    04/19/2023, 8:56 AM
    Hello guys, I am trying to pass arguments to my pipeline from the command line, as the pipeline is supposed to load different datasets from S3 every day. Is there a way to do this? Currently my pipeline is just generating the S3 URL from different variables in parameters.yml.
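    (Runtime parameters are the built-in route for this: anything passed with --params overrides parameters.yml for that run, so the node that builds the S3 URL can take the date from params. A sketch; the parameter and bucket names are illustrative:)
        # kedro run --params "run_date=2023-04-19"
        def build_s3_url(run_date: str, bucket: str) -> str:
            # wired up with inputs=["params:run_date", "params:bucket"] on the node
            return f"s3://{bucket}/daily/{run_date}/data.parquet"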

    Dharmesh Soni

    04/19/2023, 11:15 AM
    Hey team, I'm trying to get all the pipelines into a test setup. I'm using pytest.fixture in conftest.py, to initiate custom_kedro_config and custom_kedro_context.

    Dharmesh Soni

    04/19/2023, 11:20 AM
    In the custom_kedro_context class, I've got setup_env_variables, spark_session, configure_imports, import_pipelines. Only
    import_pipelines
    is returning a dictionary. When I try to import
    import_pipelines
    into the test, it is not producing any results. I tried to print the results within the class and it prints, but when I try to print in the test it does not. Can someone help here?

    Leo Cunha

    04/19/2023, 11:38 AM
    Hello, I am running a kedro run with a custom CLI arg, like
    kedro run --custom-flag
    . I wanted to pass this flag into
    KedroContext
    . Does anyone know how I could do that? p.s If I pass to
    extra_params
    it will complain that this param is not in the config files (I didn't want it to be in the config files)

    Luca Disse

    04/19/2023, 2:32 PM
    Hey team, I have a problem when converting a Spark-written parquet file into a pandas parquet file. When reading the file with @spark it works, but when reading the catalog entry with @pandas I get the error "No such file or directory: '/dbfs/…'". Any hints to resolve this issue?

    Yinghao Dai

    04/19/2023, 3:58 PM
    Hi all, is there a possibility to change display properties (e.g., font size, or maybe even colors) in Kedro viz? Would be helpful to make things easier to read / understand when putting it onto a slide. Many thanks!

    Mate Scharnitzky

    04/20/2023, 8:29 AM
    Kedro Datasets:
    pandas
    dependencies
    Hi All, What is the recommended way to handle dependencies for Kedro datasets together with other dependencies in a repo? • either specifying them through kedro, e.g.,
    kedro[pandas.ExcelDataSet]
    • or using
    kedro_datasets
    ? Context • We’re in the process to upgrade our Python env from
    3.7
    to
    3.9
    • Our current kedro version is
    0.18.3
    • When upgrading our branch to
    Python 3.9
    and keeping all other things intact, we get a requirement compilation error for
    pandas
    . In our repo, we consistently pin pandas to
    ~=1.3.0
    which should be aligned with kedro’s pin
    ~=1.3
    defined in the form of
    kedro[pandas.ExcelDataSet]==0.18.3
    . Interestingly and surprisingly, if we remove
    kedro[pandas.ExcelDataSet]==0.18.3
    , the compilation error disappears, while
    openpyxl
    is missing (this latter is expected). • We’re thinking to change the way we load kedro datasets dependencies and use
    kedro_datasets
    instead, but we would like to get your guidance on the recommended way of handling Kedro dataset dependencies, especially from a maintenance point of view. Thank you!

    Ana Man

    04/20/2023, 9:20 AM
    Hi all! I wanted to ask what is the expected behaviour in the following example: I have created a hook (B) that implements a hook specification (e.g
    before_pipeline_run
    ) that is a subclass of another hook (A) that implements a different hook specification (e.g
    after_context_created
    ) that is defined in a plugin to kedro (Plugin A). If I disable Plugin A and register hook B in my settings.py file, would the hook spec defined in hook A still run? I am finding that it does but was unsure if that was what was supposed to happen. Hope that makes sense @Nok Lam Chan