# questions
  • Filip Panovski
    02/06/2023, 3:55 PM
    Hi everyone. I'm not sure if this is the right place to ask, but does anybody have experience with using Airflow vs Prefect >= 2.0.0 to run Kedro pipelines? Currently, only Prefect 1.x is documented to work with Kedro 0.18.x, which is making us hesitate a bit on that end. We're currently evaluating both as a higher-level orchestration platform for our Kedro pipelines, and both seem great for generic workflows, so some community feedback would be much appreciated.
  • Zoran
    02/06/2023, 5:26 PM
    Hi everyone, is it possible to change global parameters (conf/<env>/globals.yml) dynamically at runtime, the way --params does for regular parameters?
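    A minimal sketch of one workaround, assuming Kedro 0.18.x with TemplatedConfigLoader: extra global values can be injected at startup (for example from an environment variable) via globals_dict in settings.py. The KEDRO_GLOBALS_OVERRIDES variable name below is purely hypothetical, not a built-in Kedro feature.

    # settings.py -- a sketch; values from a (hypothetical) environment variable
    # are merged with whatever conf/<env>/globals.yml provides.
    import json
    import os

    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        "globals_pattern": "*globals.yml",
        # e.g. KEDRO_GLOBALS_OVERRIDES='{"bucket": "my-dev-bucket"}'
        "globals_dict": json.loads(os.environ.get("KEDRO_GLOBALS_OVERRIDES", "{}")),
    }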
  • MarioFeynman
    02/07/2023, 3:00 AM
    Hi everyone! If I would like to use Delta tables for update, delete or merge operations, should I do that inside the node? Or is there something I can use for this goal using only catalog entries?
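    For reference, a minimal sketch of the in-node approach, assuming a Spark session and the delta-spark package are available; the table path and join key below are placeholders, not anything from the original question.

    # Node that merges new records into an existing Delta table.
    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame, SparkSession


    def upsert_customers(new_data: DataFrame) -> None:
        spark = SparkSession.builder.getOrCreate()
        target = DeltaTable.forPath(spark, "/data/03_primary/customers_delta")  # placeholder path
        (
            target.alias("t")
            .merge(new_data.alias("s"), "t.customer_id = s.customer_id")  # placeholder key
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )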
  • JOEL WILSON
    02/07/2023, 7:15 AM
    Hi everyone! This might not be a pure Kedro issue, but I'm looking for some input around the Kedro SparkDataSet save method. I'm getting this error on running a Kedro pipeline; I think it has to do with the dependencies / environment variables. Let me know your thoughts. Windows machine, Python 3.7, pyarrow==0.14.0
    java version "1.8.0_341"
    Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
          /_/
    
    Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_341
    Branch HEAD
    Compiled by user yumwang on 2022-10-15T09:47:01Z
    Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
    Url https://github.com/apache/spark
  • Vassilis Kalofolias
    02/07/2023, 3:15 PM
    Hello, thanks for a great framework! After having set up my pipelines I am trying to develop new features in Jupyter (and create new pipelines from there). In that case, from what I understand, it is more convenient to run pipelines manually using SequentialRunner instead of using the Jupyter session. For example, I would like to run the same pipeline in a loop with different partitions of a PartitionedDataSet, and I find it weird to call %reload_ext kedro.ipython in a loop. Is this discouraged practice? What is the benefit of having a session in Jupyter if you develop interactively? (related but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329) Thanks a lot!
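    A minimal sketch of running a pipeline repeatedly from a notebook with SequentialRunner, assuming Kedro 0.18.x; the pipeline name, dataset names and free-input name are hypothetical.

    # Run the same pipeline once per partition, swapping the input in memory each time.
    from kedro.framework.project import pipelines
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.io import MemoryDataSet
    from kedro.runner import SequentialRunner

    bootstrap_project(".")  # project root

    with KedroSession.create(project_path=".") as session:
        context = session.load_context()
        catalog = context.catalog
        partitions = catalog.load("my_partitioned_data")  # dict: partition id -> load callable

        for partition_id, load_partition in partitions.items():
            catalog.add("current_partition", MemoryDataSet(load_partition()), replace=True)
            SequentialRunner().run(pipelines["my_pipeline"], catalog)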
  • Lawrence Shaban
    02/07/2023, 7:32 PM
    Hello everyone, I am having a little problem with the logger. I thought I could print out logger.debug values by updating the project-side logging file (conf/base/logging.yml), setting the console handler level to DEBUG, but it doesn't seem to change anything. I'm trying to output logs from pipeline nodes.
    handlers:
        console:
            class: logging.StreamHandler
            level: DEBUG
            formatter: simple
            stream: ext://sys.stdout
    import logging
    logger = logging.getLogger(__name__)

    def example_node(input):
        logger.debug(input)
        output = input + 1
        return output
    I might just be doing something simple wrong, but any help would be appreciated! It works for info, so I'm just using that for now, but it would be good to have the option of debug! 🙂
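    One likely cause, hedged since it depends on the full logging.yml: the handler level is only a floor, and the logger that emits the record also filters by level. In the default Kedro 0.18.x config the project package logger is set to INFO, so DEBUG records are dropped before they ever reach the console handler. A sketch of the extra section, assuming the package is called my_project (a placeholder):

    loggers:
        my_project:          # hypothetical package name; match your src/<package> folder
            level: DEBUG     # let DEBUG records through to the handlers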
  • Dustin
    02/08/2023, 12:58 AM
    hi team, just a quick question. Let's say I have output O1 from node1, with the associated catalog entry configured so that the content of O1 is saved to CSV. node2 uses O1 as input. The current behaviour is that node2 reloads the data from the O1 file instead of from memory (this is expected, I assume, due to the catalog configuration). Is there any way I could still have O1 saved as CSV (easier for business people to check data quality) while having O1 passed to node2 through memory (faster, and no need to deal with CSV save/load tricks)? Thanks
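    A minimal sketch of one way to get both behaviours, assuming Kedro 0.18.x's CachedDataSet (the dataset name and filepath are placeholders): the wrapped CSV is written on save, but the in-memory copy is reused for subsequent loads within the same run.

    o1:
      type: CachedDataSet
      dataset:
        type: pandas.CSVDataSet
        filepath: data/02_intermediate/o1.csv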
  • Afaque Ahmad
    02/08/2023, 4:46 AM
    Hi folks, I'm trying to run kedro on EMR. The run fails because it is not able to find the conf folder. Is there a way to package the conf folder together when doing kedro package?
  • user
    02/08/2023, 8:28 AM
    kedro dynamic catalog creation only for specific nodes before their run: I have several thousands of files of different types to be processed. I am using dynamic catalog creation with hooks. I first used the after_catalog_created hook, but it is too early and I need those entries only for specific nodes. My attempt is with before_node_run for specific node tags, returning a dictionary with just the dynamically created entries. The node function is **kwargs only. It works, as I can see that the node gets the updated inputs, but the problem is that I need to provide for the node...
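    For reference, a minimal sketch of the pattern being described, assuming Kedro 0.18.x hook specs; the tag, dataset name and file path below are hypothetical. before_node_run may return a dict that overwrites node inputs for that run.

    # Overwrite selected inputs on the fly for nodes carrying a given tag only.
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet


    class DynamicInputsHook:
        @hook_impl
        def before_node_run(self, node, catalog, inputs):
            if "bulk_files" not in node.tags:  # hypothetical tag
                return None
            # Keys must match the node's declared input names; values replace
            # whatever the catalog would have loaded. The path is a placeholder.
            return {
                "raw_files": CSVDataSet(filepath="data/01_raw/batch_0.csv").load(),
            }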
  • David Pérez
    02/08/2023, 10:41 AM
    Hi team, quick question: when doing kedro viz, if we select our main pipeline and expand it, all the modular pipelines appear nested. However, when we collapse it, one of the pipelines is no longer within the main pipeline; it appears isolated on the side. Do you know why this might be happening?
  • Szymon Czop
    02/08/2023, 10:52 AM
    Hi guys, I'm having a problem following the "set up experiment tracking" tutorial for Kedro-Viz. I added the entries to catalog.yml and changed the code in data_science/node.py, but after running kedro run there is no data stored in the 09_tracking folder. Visualisation of nodes and the pipeline is working; everything is set up, but no data is stored. Am I missing something? Some extra package? Please let me know. With regards, Szymon
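    For comparison, a hedged sketch of the two pieces the tutorial relies on, assuming Kedro 0.18.x with kedro-viz installed (the dataset name, metric and paths are placeholders): the tracking dataset must be declared as a node output for anything to land in 09_tracking, and the session store in settings.py must be kedro-viz's SQLiteStore for the experiment tracking UI.

    # catalog.yml
    metrics:
      type: tracking.MetricsDataSet
      filepath: data/09_tracking/metrics.json

    # settings.py
    from pathlib import Path
    from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

    SESSION_STORE_CLASS = SQLiteStore
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}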
  • Massinissa Saïdi
    02/08/2023, 1:59 PM
    Hello kedroids! I have a question about the read priority of credentials files. Suppose I have a conf/base and a conf/prod environment, and my credentials.yml file is in conf/local. If I run kedro run, will conf/local/credentials.yml overwrite conf/base/credentials.yml? And if I run kedro run --env prod, which credentials file will be used? I have the impression that it is the local file that is always used? Thank you
  • Oscar Villa
    02/08/2023, 9:42 PM
    Hi, guys. Maybe somebody knows what the pattern is when you have very long queries? I'm getting data from BigQuery through pandas.GBQQueryDataSet, but the queries are so long that they make catalog.yml look dirty. Is that the right way, or should I store the queries in files and load them from there, or store the queries as views in BigQuery? What do you do? Any suggestion is appreciated. Thanks in advance.
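    A minimal sketch of the file-based option, assuming Kedro 0.18.x's pandas.GBQQueryDataSet, which accepts a filepath to a .sql file instead of an inline sql string; the entry name, path and project are placeholders.

    long_query_data:
      type: pandas.GBQQueryDataSet
      filepath: queries/long_query.sql   # the SQL lives in a file instead of the catalog
      project: my-gcp-project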
  • Ankar Yadav
    02/09/2023, 12:18 PM
    Hi team, I am trying to run kedro on Windows, and when I start my pipeline I get the following error:
    keyerror: "logging"
    I immediately get this message as soon as I run the pipeline, any idea why this is happening?
  • user
    02/09/2023, 2:18 PM
    Parametrize input datasets in kedro: I'm trying to move my project into a kedro pipeline, but I'm struggling with the following step: my prediction pipeline is run by a scheduler. The scheduler supplies all the necessary parameters (dates, country codes etc.). Up until now I had a CLI which would take input parameters such as: python predict --date 2022-01-03 --country UK. The code would then read the input dataset for the given date and country, so the query would be something like: SELECT * FROM...
  • Jorge sendino
    02/09/2023, 5:15 PM
    Hello everyone, is there a way to modify ConfigLoader to namespace catalog and parameter entries using the folder structure inside conf? For example, I have:
    conf/
        catalog/
           ns1/
           ns2/
        parameters/
           ns1/
           ns2/
    Ideally I would modify ConfigLoader to automatically add ns1 and ns2 as namespaces for all entries in the catalog and parameters below that folder. Is this possible?
  • Sebastian Pehle
    02/10/2023, 12:06 AM
    Let's say I have the following: a source (a CSV REST API with time series data and a 'duration to pull' parameter) and a task (weekly preparation of a dataset of historic and recent data, to be used by a BI tool for visualization). What would be the kedroic way to implement this? My guess: define a 'first run / update run' parameter in conf/parameters.yml. If it is a first run, pull all the data there is (duration to pull in last weeks = nan) and save it as a partitioned dataset into 01_raw (yearweek as partition key). If it is an update run, determine the number of weeks to pull by checking what has already been downloaded (the difference between begin (= most recent yearweek folder name in the partitioned dataset) and end (= current yearweek)) and save into the same partitioned dataset (in fact I guess it would happen inside the same node as the 'first run', the only difference being the computed 'duration to pull' parameter). In another node, the report dataset would be prepared (concat all data, save as a multi-sheet xlsx) and saved into 08_reporting. Any advice is appreciated!
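    One thing worth a look, hedged since it may not cover the REST-pull part: Kedro's IncrementalDataSet wraps a partitioned dataset and keeps a checkpoint of the last processed partition, which handles the "what has already been downloaded" bookkeeping. A sketch of a catalog entry, with placeholder names and paths:

    weekly_pulls:
      type: IncrementalDataSet
      path: data/01_raw/weekly_pulls
      dataset: pandas.CSVDataSet
      filename_suffix: ".csv"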
  • Andrew Stewart
    02/10/2023, 5:10 AM
    Anyone happen to get poetry + kedro + jupyter to work in VSCode's notebook UI?
  • Wojciech Szenic
    02/10/2023, 6:37 AM
    Hey guys! I'm trying to avoid doing some ugly, non-kedronic solutions, so perhaps you could help me with my problem. I would like kedro to take in command line arguments such as date or country and then do processing based on these arguments. So for example, I have a trained machine learning model, and a predict pipeline can output predictions. Ideally, this predict pipeline could be run as kedro run --pipeline=predict --date=2023-01-05, and this would ingest the dataset for the 5th of Jan 2023 and run the prediction on it. I'm wondering how I can pass the CLI argument into the dataset catalog?
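    A minimal sketch of one common pattern, assuming Kedro 0.18.x: pass the value through the existing kedro run --params option and swap the input dataset in a before_pipeline_run hook registered in settings.py. The dataset name, file path and parameter key below are hypothetical.

    # Invoked as: kedro run --pipeline=predict --params "date:2023-01-05"
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet


    class RuntimeDateHook:
        @hook_impl
        def before_pipeline_run(self, run_params, pipeline, catalog):
            date = (run_params.get("extra_params") or {}).get("date")
            if date:
                catalog.add(
                    "predict_input",  # the name the predict pipeline already reads
                    CSVDataSet(filepath=f"data/01_raw/input_{date}.csv"),
                    replace=True,
                )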
  • Jong Hyeok Lee
    02/10/2023, 9:32 AM
    Hello everyone! Does anyone know how to pass a list of dataframes as an input to a pipeline node in Kedro? I have a function that takes in a list of dataframes, but it doesn't seem straightforward to implement.
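    A minimal sketch of the usual workaround, assuming each dataframe is its own catalog entry (the names below are hypothetical): list the entries as separate node inputs and collect them into a list inside a small wrapper.

    import pandas as pd
    from kedro.pipeline import node


    def combine(*dfs: pd.DataFrame) -> pd.DataFrame:
        # stand-in for the existing function that expects a list of dataframes
        return pd.concat(list(dfs))


    combine_node = node(
        func=combine,
        inputs=["sales_q1", "sales_q2", "sales_q3"],  # hypothetical catalog entries
        outputs="sales_all",
    )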
  • Sergei Benkovich
    02/12/2023, 8:35 PM
    Is there a way to save a kedro project template? There are the usual things I change and add when setting up a new project. Is it possible to save this as a template and, instead of kedro new, do something like kedro load?
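    Possibly relevant, hedged: Kedro supports custom starters, which are cookiecutter templates that kedro new can pull from a local path or a git repository, so a pre-customised project skeleton can be reused. The repository URL below is hypothetical.

    kedro new --starter=https://github.com/my-org/my-kedro-starter.git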
  • Olivia Lihn
    02/13/2023, 1:49 PM
    Hi everyone! I'm trying to create a hook to overwrite some parameters when the scoring pipeline runs, but it does not seem to be working (the parameters don't get written if not present, nor overwritten if present). The code I'm using is the following:
    def before_pipeline_run(self, run_params, catalog: DataCatalog) -> None:
        """Change feature inclusion parameters for the scoring pipeline."""
        if run_params["pipeline_name"] == "scoring":
            # retrieve feature_list from catalog
            feature_list_df = catalog.load("modeling.feature_selection_report")
            feature_list = list(feature_list_df[feature_list_df.selected == True].feature.unique())

            # get list of feature engineering pipelines
            params = catalog.load("parameters")
            feateng_pipes = [fteng_name for fteng_name in params.keys() if fteng_name.endswith("_fteng")]

            # overwrite parameters
            for pipeline in feateng_pipes:
                catalog.add_all(
                    {
                        f"params:{pipeline}.feature_inclusion_params.feature_list": feature_list,
                        f"params:{pipeline}.feature_inclusion_params.enable_regex": True,
                    },
                    replace=True,
                )
    I also tried using run_params["params"] without any luck, and tried returning the catalog, but no luck. The hook runs (tested with print statements), so my guess is I'm missing something. Thanks!
  • Rob
    02/13/2023, 4:52 PM
    Hi everyone, is there a way to dynamically set the name of an output without manually defining the same outputs with variations in the catalog? Context: I have a pipeline that saves 15 different outputs that are defined in my catalog, but now I need to save each one of them by category, as {category}_output_1.parquet, {category}_output_2.parquet and so on... Any alternative suggestion is welcome 🙂
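    A minimal sketch of one alternative, assuming Kedro 0.18.x: replace each catalog output with a PartitionedDataSet and have the node return a dict keyed by the desired file name, so the categories become partitions rather than separate catalog entries. The entry name and path are placeholders.

    output_1_by_category:
      type: PartitionedDataSet
      path: data/07_model_output/output_1
      dataset: pandas.ParquetDataSet
      filename_suffix: ".parquet"

    The node would then return something like {"retail_output_1": df_retail, "wholesale_output_1": df_wholesale}, producing retail_output_1.parquet and wholesale_output_1.parquet under that path.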
  • Akshay
    02/14/2023, 5:03 AM
    Hello everyone, I am seeing an issue with a PartitionedDataSet not being found in a Kedro pipeline when running on an Azure Databricks notebook. It throws the error: DataSetError: No partitions found in '/mnt/testmount/data/05_model_input/partitions'. ADLS has been mounted to /mnt/testmount/ and the partitions are getting created at /mnt/testmount/data/05_model_input/partitions. Details: I am running Kedro pipelines on an Azure Databricks notebook. There are 4 pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformation and write the data back to ADLS. The third pipeline, 'optimize', has a Spark dataset as input and generates 2 outputs: a PartitionedDataSet and a transformed pandas DataFrame.
    Optimize.partition@spark:
      type: kedro.io.PartitionedDataSet
      dataset:
        <<: *spark_parquet_partitioned
      load_args:
        maxdepth: 1
        withdirs: True
      layer: Data Transformation
      path: /mnt/testmount/data/05_model_input/partitions

    model_input@pandas:
      type: kedro.io.PartitionedDataSet
      dataset:
        <<: *pandas_parquet_partitioned
      load_args:
        maxdepth: 1
        withdirs: True
      layer: Data Transformation
      path: /mnt/testmount/data/05_model_input/model_data
    Note: the pipeline works fine when run in the local environment. Kedro = 0.18.3, Python = 3.8.10, Cluster = Spark 3.2.1
  • Filip Wójcik
    02/14/2023, 9:40 AM
    Hello all! I'm wrapping my head around the following problem/use case, so far with no luck. Imagine you have a data pipeline where you run, e.g., a web scraper every day, so it saves some amount of data (a couple of hundred records, so no big data case) every day. Can we configure a dataset so that we can append to it? I was trying with pandas.CSVDataSet with save_args: mode: "a" and with PartitionedDataSet, but every time the dataset is overwritten. I cannot find any such case in the docs. Should I create my own implementation, deriving from AbstractDataSet? I've heard from many fellow DS Kedro users that a similar use case happens from time to time, so probably I'm not alone. Thanks in advance, and best regards, Kedro is an absolute blast!
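    A minimal sketch of one approach, assuming Kedro 0.18.x: keep the PartitionedDataSet, but have the scraper node return a single new partition keyed by the run date. With the default overwrite: False save behaviour, existing partitions are left in place, so the folder grows by one file per day. Names and paths are placeholders.

    # nodes.py -- the scraper returns {partition_id: dataframe}
    from datetime import date
    import pandas as pd


    def scrape_daily() -> dict:
        records = pd.DataFrame({"value": [1, 2, 3]})  # stand-in for scraped records
        return {date.today().isoformat(): records}

    # catalog.yml
    scraped_data:
      type: PartitionedDataSet
      path: data/01_raw/scraped
      dataset: pandas.CSVDataSet
      filename_suffix: ".csv"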
  • Filip Panovski
    02/14/2023, 12:58 PM
    Can anyone explain to me why Kedro attempts to load all catalog definitions, even if running only a specific pipeline that uses a subset of the catalog? For example, let's say I have a catalog with input, output and wrong entries. wrong has a configuration problem (e.g. no credentials could be found), but I'm running a pipeline mypipeline which only uses input and output. Why does kedro run --pipeline mypipeline fail if wrong is configured improperly in this case? I get that you usually want to be able to view the entire catalog, but is --pipeline <...> not enough information to let Kedro know that I potentially don't want that?
  • Zirui Xu
    02/14/2023, 3:01 PM
    Why is setuptools a Kedro dependency? It gets ignored when I pip-compile a requirements.in that contains kedro, because setuptools is "considered to be unsafe in a requirements file".
  • FlorianGD
    02/14/2023, 4:32 PM
    Hello, is there a reason why pandas.ParquetDataSet does not use pandas all the time? I would like to use it for partitioned data, and I want to use the filters argument that pandas.read_parquet provides, but it is not available for pyarrow.parquet.ParquetDataset.read. Doing a quick test and using pd.read_parquet every time seems to work ok, even though it does not behave exactly the same.