# questions

    Nikola Shahpazov

    03/15/2023, 1:17 PM
    Hi guys, quick question: is there a way to interpolate a SQLQueryDataSet query in catalog.yml, passing some argument/parameter? Example:
    Copy code
    yaml
    person:
      type: pandas.SQLQueryDataSet
      sql: "SELECT * FROM public.people WHERE id = ${id};"
      credentials: db_credentials
    Thanks in advance!
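    One way to do this in Kedro 0.18 is catalog templating: a minimal sketch using TemplatedConfigLoader, where ${id} is filled in from conf/base/globals.yml (e.g. id: 42) when the configuration is loaded, i.e. once per run rather than per node:
    Copy code
    python
    # settings.py
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    # Pick up template values from any globals.yml in the conf folders
    CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}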

    Nikola Shahpazov

    03/15/2023, 2:43 PM
    Another question from me 😅 What would be the proper way to describe a dataset in the catalog with a MongoDB source? I can see there are pandas SQL datasets, but is there something similar for MongoDB?
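    There is no built-in MongoDB dataset in kedro.extras.datasets, so a common workaround is to query MongoDB inside a node (or wrap the same logic in a custom dataset). A rough sketch using pymongo, with the database and collection names as placeholders:
    Copy code
    python
    import pandas as pd
    from pymongo import MongoClient

    def load_people(mongo_uri: str) -> pd.DataFrame:
        # Pull a whole collection into a DataFrame
        client = MongoClient(mongo_uri)
        records = client["my_db"]["people"].find()
        return pd.DataFrame(list(records))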

    Olivier Ho

    03/16/2023, 3:00 PM
    Small question: is there any dataset that is glob-compatible? For example, if I have a folder of images.
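    PartitionedDataSet covers this: it treats every file under a path (optionally filtered by suffix) as one partition. A sketch for a folder of PNGs, using the pillow.ImageDataSet that ships with Kedro:
    Copy code
    yaml
    images:
      type: PartitionedDataSet
      path: data/01_raw/images
      dataset: pillow.ImageDataSet
      filename_suffix: ".png"
    Loading this returns a dictionary mapping each partition id to a callable that loads that file.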

    Ricardo Araújo

    03/16/2023, 11:31 PM
    Hey! Should OmegaConf work for data catalog entries? It works fine for parameters, but interpolation keys in the data catalog fail to resolve (InterpolationKeyError: Interpolation key 'temp' not found).

    Andrew Stewart

    03/16/2023, 11:49 PM
    Is kedro-docker intended more to facilitate a local interactive environment, as opposed to packaging a self-contained image artifact for distribution to something like an ECS compute cluster?

    Olivier Ho

    03/17/2023, 9:16 AM
    hello, is there a dataset type to read a file as raw bytes?
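    There is no dedicated bytes dataset among the built-ins, but a minimal custom dataset is only a few lines. A sketch using fsspec, so any supported filesystem works:
    Copy code
    python
    import fsspec
    from kedro.io import AbstractDataSet

    class BytesDataSet(AbstractDataSet):
        """Read/write a file as raw bytes (illustrative sketch)."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> bytes:
            with fsspec.open(self._filepath, mode="rb") as f:
                return f.read()

        def _save(self, data: bytes) -> None:
            with fsspec.open(self._filepath, mode="wb") as f:
                f.write(data)

        def _describe(self) -> dict:
            return {"filepath": self._filepath}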

    Olivier Ho

    03/17/2023, 11:06 AM
    If you use yield in a node to produce an iterable, how can you store it in a partitioned dataset, given that the partitioned dataset requires a dictionary?
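    One way around this is to have the node collect the chunks into the dictionary that PartitionedDataSet expects, keyed by partition id. A sketch (the chunk size and key format are arbitrary):
    Copy code
    python
    def split_into_partitions(df):
        # Each key becomes a partition filename; each value is saved as one partition
        chunks = (df[i : i + 1000] for i in range(0, len(df), 1000))
        return {f"part_{n:04d}": chunk for n, chunk in enumerate(chunks)}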

    Abhishek Gupta

    03/17/2023, 2:52 PM
    Hi Everyone! Getting this error while executing a pipeline.

    Ricardo Araújo

    03/17/2023, 6:15 PM
    This is sort of a follow-up on a previous question that was solved. Say I have a project that ingests a large dataset, but I only process part of it -- it is a big time series and I want to process a specific month. I want to pass a CLI argument to do that, which I currently can by overriding a parameter. However, I'd also like the output to be written to different places depending on the argument (that is, I want e.g. the filename to be prefixed with the CLI argument).
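    One workaround sketch, similar to the TemplatedConfigLoader idea further up but sourcing the value from an environment variable instead of a --params flag (the MONTH variable name is arbitrary):
    Copy code
    python
    # settings.py
    import os
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        # e.g. run as: MONTH=2023-01 kedro run
        "globals_dict": {"month": os.environ.get("MONTH", "all")},
    }
    The catalog entry can then use a templated path such as filepath: data/07_model_output/${month}_output.csv.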

    Andrew Stewart

    03/18/2023, 2:46 AM
    Where in the Kedro project structure do most folks manage their SQL files?
    • data seems like it could be appropriate
    • maybe src?
    • maybe a separate sql dir?

    Andrej Zachar

    03/20/2023, 1:36 AM
    Hello, I would like to know how I can pass exactly the same output from multiple nodes living in different namespaces / tags, so it can be reused later
    Copy code
    node(
        first_namespace_fn,
        inputs=["some_input"],
        outputs="shared_name_so_it_can_reused_somewhere_else",
        namespace="first"
    ),
    
    node(
        second_namespace_fn,
        inputs=None,
        outputs="shared_name_so_it_can_reused_somewhere_else",
        namespace="second"
    ),
    
    node(
        third_common_fn,
        inputs='shared_name_so_it_can_reused_somewhere_else',
        outputs="final_output",
    ),
    Thank you!

    Andrej Zachar

    03/20/2023, 1:40 AM
    PS: It is exactly the opposite problem as described here - https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html?highlight=namespace#using-a-modular-pipeline-multiple-times.

    Chew Lee

    03/20/2023, 8:11 AM
    Hi all, does anyone have experience using Kedro with GCS? My user is able to use gsutil to read and write files to the bucket. kedro run also successfully reads/writes files from/to GCS. But when trying to load a dataset from the catalog in a Jupyter notebook, I get a 401 access denied. I have a credentials.yml file set up with
    Copy code
    my_gcp_credentials:
      client_id: <REDACTED>
      client_secret: <REDACTED>
      refresh_token: <REDACTED>
      type: <REDACTED>
    which was obtained using
    Copy code
    gcloud auth login
    gcloud auth application-default login
    and copying the contents of the resulting JSON file

    Armen Paronikyan

    03/20/2023, 11:25 AM
    Hi guys, I would like to know whether I can access the data in the credentials.yml file from Kedro hooks?
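    One way is to load them through the context's config loader from a hook. A sketch using the after_context_created hook, assuming the default ConfigLoader (register the class in HOOKS in settings.py); the patterns are the ones Kedro itself uses to pick up credentials files:
    Copy code
    python
    from kedro.framework.hooks import hook_impl

    class CredentialsAwareHook:
        @hook_impl
        def after_context_created(self, context):
            # Parsed contents of credentials.yml (and any matching files)
            self._credentials = context.config_loader.get("credentials*", "credentials*/**")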

    AK

    03/20/2023, 1:54 PM
    Hi all, can someone share the installation guide for Kedro?
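    For reference, installation is a one-liner, with the full guide in the documentation at https://docs.kedro.org:
    Copy code
    pip install kedro
    kedro info  # verify the installation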

    Javier del Villar

    03/20/2023, 4:01 PM
    Hi everybody, is there something equivalent to pandas.SQLQueryDataSet in Spark? Can I get the same functionality in Spark? I cannot run queries with spark.SparkJDBCDataSet; am I missing something? Thanks in advance!
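    SparkJDBCDataSet wraps spark.read.jdbc, which reads a whole table rather than an arbitrary query. One workaround sketch is to run the query in a node via the JDBC source's query option (available since Spark 2.4); the URL and credentials below are placeholders:
    Copy code
    python
    from pyspark.sql import DataFrame, SparkSession

    def query_people() -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return (
            spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://host:5432/mydb")
            .option("query", "SELECT * FROM public.people WHERE id = 42")
            .option("user", "user")
            .option("password", "password")
            .load()
        )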

    Cyril Verluise

    03/20/2023, 9:53 PM
    Hello there, I hope this finds you well. Thanks for the awesome work! Sorry to come with an issue.
    Issue: I'm trying to set up kedro pipeline tests as part of my CI/CD using GH Actions. Everything goes well until I receive a git-related error:
    Copy code
    TypeError: HEAD is a detached symbolic reference as it points to 
    'dc15ea87ce9d917bafb09d5d7bddb2aaf44f5989'
    Full error log and GH Action config in thread.
    What I tried: I tried to check out with fetch depth 0, but this did not fix the issue (I had a similar issue when building docs from a GH Action, which was fixed using the above trick).
    Environment: kedro version 0.18.6, OS: ubuntu-latest.
    Any ideas?

    sujdurai

    03/21/2023, 2:39 AM
    Team, wondering if there is a way to control node execution order in kedro, or an option to wait before executing another node. Context: I have a node that is used in two pipelines. They use the same input tables, but I expect the node in the second pipeline to run only after my first pipeline, because the input files for the node in the second pipeline are updated as part of the first pipeline run. Because I have registered both pipelines to run as default in the registry, the node from the second pipeline runs sooner than I expect - I don't want that.
    Copy code
    # Pipeline A
    Input X, Y --> node1 + node2 + node3 --> Output X (i.e. Input X after update)
    
    # Pipeline B 
    Input X (after update from Pipeline A), Y --> node1 + node4 + node5 --> Output Z
    
    Order of execution (node_Pipelinename)
    node1_A
    node1_B
    node3_A
    node2_A
    node4_B
    node5_B
    
    Expected order of execution
    node1_A
    node3_A
    node2_A
    node1_B
    node4_B
    node5_B
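    Kedro derives execution order purely from the dataset dependency graph, not from registration order, so the usual fix is to make the dependency explicit: have the second pipeline's node declare the first pipeline's output as one of its inputs. A sketch where update_x and process are placeholder functions:
    Copy code
    python
    # Pipeline A: write the updated table under a new catalog name
    node(update_x, inputs=["input_x", "input_y"], outputs="input_x_updated")

    # Pipeline B: consume the updated name, so this node can only run after A
    node(process, inputs=["input_x_updated", "input_y"], outputs="output_z")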

    Dotun O

    03/21/2023, 1:28 PM
    Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog in the pipeline run? I see that kedro.io provides DataCatalog, but I'm not sure how to get the specific catalog context.
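    Nodes normally never touch the catalog directly (the runner injects loaded data as function arguments), but for interactive inspection you can build a session and take the catalog from its context. A sketch, assuming it is run from the project root and that "my_dataset" is a hypothetical catalog entry:
    Copy code
    python
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(".")
    with KedroSession.create(project_path=".") as session:
        catalog = session.load_context().catalog
        df = catalog.load("my_dataset")
    Inside kedro jupyter notebook or kedro ipython, a ready-made catalog variable is already available.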

    R P

    03/21/2023, 7:09 PM
    Hi everyone, I'm using Kedro with two main configuration envs: "conf/base" and "conf/test", and I'm running kedro run --env=test when I need to run a quick pipeline check. However, I have some code in my "settings.py" file that I must not run when I'm using the "conf/test" env, but I'm not managing to get this environment information in the "settings.py" code so I can write a simple if/else condition. What is the best way to do this? Thanks for this awesome open-source tool!
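    settings.py is imported before the --env flag is parsed, so the flag is not visible there. One workaround sketch: set the environment through the KEDRO_ENV environment variable (which Kedro also honours) instead of the CLI flag, and read it in settings.py:
    Copy code
    python
    # settings.py
    import os

    if os.environ.get("KEDRO_ENV") != "test":
        # code that must not run for the "conf/test" environment
        ...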

    Javier del Villar

    03/21/2023, 7:35 PM
    Hi everybody, I'm not new to kedro, but I'm new to using kedro, pyspark and databricks at the same time. The logs appear only after all the jobs have been completed; is there a way to see the logs as they occur? I think this is more of a Databricks question. Thanks in advance!

    Anjali Datta

    03/22/2023, 1:00 AM
    I'm inexperienced, so this is a basic question. I'm trying to add datasets programmatically. I've made a catalog.py file that contains:
    Copy code
    python
    from kedro.io import DataCatalog, PartitionedDataSet
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.config import ConfigLoader

    conf_paths = ['conf/base', 'conf/local']
    conf_loader = ConfigLoader(conf_paths)
    atlas_regions = conf_loader.get('atlas_regions*')  # a .yml file consisting of regions with names

    catalog_dictionary = {}
    for region in atlas_regions['regions']:
        name = region['name']
        # catalog_dictionary[f'{name}_data_right'] = PartitionedDataSet(
        #     path='../ClinicalDTI/R_VIM/',
        #     dataset='programmatic_datasets.io.nifti.NIfTIDataSet',
        #     filename_suffix=f'seedmasks/{name}_R_T1.nii.gz',
        # )
        catalog_dictionary[f'{name}_data_right'] = CSVDataSet(filepath="../data/01_raw/iris.csv")
        # catalog_dictionary[f'{name}_data_right_output'] = CSVDataSet(filepath="../data/01_raw/iris.csv")

    io = DataCatalog(catalog_dictionary)
    print(io.list())
    (Kedro version 0.17.7) Running catalog.py prints the expected list of datasets. But what do I need to do to be able to use these datasets in a pipeline?
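    One way to make such datasets visible to a pipeline run is to add them to the catalog from a hook instead of a standalone script. A sketch using the after_catalog_created hook (register an instance in HOOKS in settings.py; in practice the names would come from atlas_regions as above):
    Copy code
    python
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet

    class AtlasRegionsHook:
        @hook_impl
        def after_catalog_created(self, catalog):
            for name in ["region_a", "region_b"]:  # placeholder region names
                catalog.add(f"{name}_data_right", CSVDataSet(filepath="data/01_raw/iris.csv"))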

    Balachandran Ponnusamy

    03/22/2023, 2:51 PM
    Hi Kedro team... Getting the attached error when we submit a job to a Dataproc cluster to run a Data Engineering pipeline; we have a data file in ".txt.gz" format. If we run it with .master(local[*]) it works fine, but it fails when we submit with spark.master: yarn and spark.submit.deployMode: client. Any idea where it is going wrong?

    Stephane Durfort

    03/22/2023, 4:22 PM
    Hello, while playing with the OmegaConfigLoader to eventually replace the TemplatedConfigLoader in my pipeline, I noticed that:
    • variable interpolation does not seem to be applied to nested parameters (as in the model_options example mentioned in the documentation)
    • using kedro run --params only updates the parameters themselves but does not propagate to references of these parameters in the configuration
    Am I doing something wrong?
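    For reference, interpolation within a single parameters file, including into nested keys, looks like this with OmegaConf syntax (a minimal example):
    Copy code
    yaml
    # conf/base/parameters.yml
    base_path: data/01_raw
    model_options:
      input_file: "${base_path}/input.csv"
    The second observation is expected behaviour as far as I can tell: interpolations are resolved when the configuration is loaded, and kedro run --params overrides are merged in afterwards, so references to an overridden key do not re-resolve.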

    Priyanka Patil

    03/22/2023, 5:43 PM
    Hello team, I have the following catalog entry in my YAML file. The columns load argument below is not working. Am I missing something here? Thank you in advance!
    Copy code
    raw_dataset:
      type: spark.SparkDataSet
      filepath: "/data/01_raw/data.csv"
      file_format: csv
      load_args:
        header: True
        inferSchema: True
        index: False
        columns: ["a", "b", "c"]
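    header and inferSchema are genuine Spark CSV reader options, but columns and index are pandas-style arguments that the Spark reader does not understand. One option is to select the columns in the first node instead, e.g.:
    Copy code
    python
    from pyspark.sql import DataFrame

    def select_columns(raw: DataFrame) -> DataFrame:
        # spark.read.csv has no "columns" option; project after loading
        return raw.select("a", "b", "c")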

    Valentin Martinez Gama

    03/22/2023, 9:42 PM
    Hello team. I have created a custom class that inherits from sklearn's BaseEstimator and TransformerMixin, i.e.
    class CustomClass(BaseEstimator, TransformerMixin)
    I have created an object of that class and saved it to my Kedro catalog as a pickle object. Now the problem is when I try using
    catalog.load()
    on a pipeline to load that object I get the following error:
    DataSetError: Failed while loading data from data set PickleDataSet(backend=pickle,
    filepath=……./data/06_models/custom_model_V1.pkl,
    load_args={}, protocol=file, save_args={}).
    Can't get attribute 'CustomClass' on <module '__main__' from '……venv/bin/kedro'>
    I was able to make it work in a notebook by first importing the class from the .py file where it was defined:
    from custom_classes import CustomClass
    But when running a kedro pipeline that uses this object as an input loaded from the catalog, adding the import at the top of the pipeline file did not fix it. Any suggestions on how to fix this?
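    The error arises because pickle stores the class's qualified module path; if CustomClass was defined in __main__ (a notebook or script) when the object was saved, kedro run cannot resolve it at load time. The usual fix is to define the class inside the project's package and import it from there both when saving and when loading (the module path below is illustrative):
    Copy code
    python
    # src/my_project/custom_classes.py
    from sklearn.base import BaseEstimator, TransformerMixin

    class CustomClass(BaseEstimator, TransformerMixin):
        ...
    Then re-create and re-save the pickle from code that imports CustomClass from that module, so the stored reference points at my_project.custom_classes.CustomClass rather than __main__.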

    Kenny B

    03/23/2023, 10:27 PM
    hello, I'm trying to see if the following functionality exists for versioned datasets: 1. list all available versions of a catalog item 2. limit the number of versions kept for a dataset, i.e. with a limit of 10, clean up the oldest version when I save an 11th
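    As of 0.18 there is no public API for either, but versioned datasets use a predictable layout (<filepath>/<timestamp-version>/<filename>), so both can be scripted. A rough sketch for a local filesystem:
    Copy code
    python
    import shutil
    from pathlib import Path
    from typing import List

    def list_versions(filepath: str) -> List[str]:
        # Each version lives in its own timestamped subdirectory
        return sorted(p.name for p in Path(filepath).iterdir() if p.is_dir())

    def prune_versions(filepath: str, keep: int = 10) -> None:
        # Delete everything except the newest `keep` versions
        for version in list_versions(filepath)[:-keep]:
            shutil.rmtree(Path(filepath) / version)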

    Maxime Steinmetz

    03/23/2023, 11:20 PM
    How can a predictive modelling project be designed for easy switching between steps, such as missing values imputation methods, class balancing methods, model types and so on? Should nodes be used to dispatch data to different implementations based on parameters, or should nodes containing the concrete logic be used? Alternatively, would a pipeline factory that produces a pipeline made of concrete nodes be more suitable?
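    A pipeline factory is one common answer: keep each concrete implementation as its own node function and let an argument (or a parameter) pick which node gets wired in. A sketch with hypothetical node functions:
    Copy code
    python
    from kedro.pipeline import Pipeline, node

    from .nodes import impute_knn, impute_mean, train_model  # hypothetical node functions

    def create_pipeline(imputation: str = "mean") -> Pipeline:
        imputers = {
            "mean": node(impute_mean, "raw_data", "imputed_data"),
            "knn": node(impute_knn, "raw_data", "imputed_data"),
        }
        return Pipeline([
            imputers[imputation],  # swap implementations without touching downstream nodes
            node(train_model, "imputed_data", "model"),
        ])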