# questions
  • e

    Elior Cohen

    01/22/2023, 11:45 AM
Is there a way to create templates (like starters) for pipelines? I'd imagine something like
    kedro pipeline create my_pipeline --template my_awesome_template
    which will include template code for the pipeline
  • m

    Massinissa Saïdi

    01/23/2023, 9:26 AM
Hello, it's me again 🙂 To write a file (CSV, pandas or other) with the kedro dataset API, such as MatplotlibWriter (or another), we should specify the credentials. In the documentation, credentials should be written like this:
Copy code
credentials: Credentials required to get access to the underlying filesystem.
                E.g. for ``S3FileSystem`` it should look like:
                `{'key': '<id>', 'secret': '<key>'}`
But is it possible to add the endpoint_url, like this:
{'key': '<id>', 'secret': '<key>', 'client_kwargs': {'endpoint_url': 'http://myurl:9000'}}
? When I use the API directly it doesn't work, but when I use the catalog it works.
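(For context: Kedro forwards a dataset's credentials dict to the underlying fsspec filesystem, so nesting client_kwargs inside the credentials entry usually works for S3-compatible stores. A sketch of conf/local/credentials.yml; the entry name and endpoint below are hypothetical:)

```yaml
# conf/local/credentials.yml -- sketch; assumes an S3-compatible store
# (e.g. MinIO) reachable at the hypothetical endpoint below
minio_creds:
  key: <id>
  secret: <key>
  client_kwargs:
    endpoint_url: http://myurl:9000
```

A catalog entry would then reference it with credentials: minio_creds; when constructing a dataset directly in code, the same dict can be passed as the credentials argument.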
  • p

    Prachi Jain

    01/23/2023, 12:56 PM
Hi team, I am new to kedro. I was looking at the kedro spaceflights tutorial project. I updated the nodes.py file and pipeline.py file as per the tutorial, but when I run
kedro run
it gives me an error saying
Pipeline contains no nodes after applying all provided filters
Can someone help here? I am using the latest version of kedro.
  • s

    Safouane Chergui

    01/23/2023, 2:06 PM
Hello, I’d like to know if there is a way to have Kedro return None instead of raising an exception when loading an entry from the data catalog (catalog.yml) fails. Thanks
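(A common workaround, since Kedro itself raises on a failed load, is a small wrapper around catalog.load() in the calling code; a minimal sketch:)

```python
def safe_load(catalog, name):
    """Load `name` from a Kedro-style catalog, returning None on failure.

    A workaround sketch: rather than changing catalog.yml, the caller
    wraps catalog.load(). In a real project you would catch
    kedro.io.DataSetError instead of the broad Exception used here.
    """
    try:
        return catalog.load(name)
    except Exception:
        return None
```
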
  • r

    Rob

    01/23/2023, 3:43 PM
    Hi everyone, I'm using Kedro 0.17.4 and I'm having this issue:
  • b

    Brandon Meek

    01/23/2023, 7:41 PM
Hey everyone, I'm looking for the "Kedro" way of doing a Monte Carlo simulation. I have a very large dataset in Presto, and I want to repeatedly pull samples from it, run each group of samples through a pipeline, and then roll up all of the pipeline results. Currently I'm thinking of calling the pipeline from outside the Kedro project.
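(One way the outside-the-project orchestration could look, as a sketch; draw_sample and run_pipeline are hypothetical stand-ins for the Presto query and the per-sample Kedro run:)

```python
def monte_carlo(draw_sample, run_pipeline, n_runs):
    """Repeatedly sample, run the pipeline per sample, collect results.

    draw_sample and run_pipeline are hypothetical stand-ins: in practice
    draw_sample would query Presto, and run_pipeline would invoke a
    KedroSession.run() (or `kedro run` via subprocess) for one sample.
    """
    results = []
    for i in range(n_runs):
        sample = draw_sample(i)
        results.append(run_pipeline(sample))
    return results  # roll these up however the analysis requires
```
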
  • m

    MarioFeynman

    01/23/2023, 8:39 PM
Hi! Is there any reason why Kedro doesn't have a 1.x.x version?
  • a

    Alex Ofori-Boahen

    01/23/2023, 9:18 PM
Hi there, after packaging my app for deployment using kedro package and running pip install, I see the module is installed when I do a pip list check. However, when I run python -m (package-name), it says no module named (package name). How can I resolve this issue?
  • i

    Ivan Danov

    01/24/2023, 11:34 AM
    Has anyone used Kedro with Apache Beam or Google Cloud Dataflow?
  • d

    Dustin

    01/25/2023, 12:26 AM
Hi team, I have been trying to play with hooks. I followed your docs to implement both the memory profile and pipeline timing hooks (I just copied your scripts from the docs) and registered them in settings.py, but no hook-related information is shown in the console log with
kedro run
(no error, but the same console output as without hooks). Just wondering, do I need to do something to 'reload' settings?
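(For reference, the registration the docs describe is just instantiating the hook classes in HOOKS in settings.py; the module path below is hypothetical. If HOOKS is already set like this and the console stays silent, a common next thing to check is the project's logging config, since the example hooks log at INFO level:)

```python
# src/<package_name>/settings.py -- sketch; assumes the docs' example hook
# classes were copied into src/<package_name>/hooks.py (hypothetical path)
from <package_name>.hooks import MemoryProfilingHooks, ProjectHooks

HOOKS = (MemoryProfilingHooks(), ProjectHooks())
```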
  • j

    Joel Ramirez

    01/25/2023, 2:33 PM
    Hello
  • j

    Joel Ramirez

    01/25/2023, 2:33 PM
I am getting this error when I try to run the data science pipeline
  • j

    Joel Ramirez

    01/25/2023, 2:33 PM
    Failed to find the pipeline named 'data_science'. It needs to be generated and returned by the 'register_pipelines' function.
  • j

    Joel Ramirez

    01/25/2023, 2:33 PM
Does someone know how to fix this?
  • m

    Miguel Angel Ortiz Marin

    01/25/2023, 7:46 PM
Hi! I'm loading a plotly JSONDataSet, but it's not loading a plotly fig; it's loading a Python dictionary. A simple example from the docs below gives an error. Could it be related to the plotly version?
  • j

    Jong Hyeok Lee

    01/26/2023, 5:57 AM
Hello! Has anyone tried to ZIP an entire Kedro pipeline and use it on AWS Glue? Also, would there be a way to do CI/CD with this approach?
  • s

    Sergei Benkovich

    01/26/2023, 11:33 AM
Is it possible to supply the same catalog entry for both inputs and outputs? Or how would you handle a situation where I want to extract new data based on existing data and append the newly extracted data to the existing data? I don't want separate catalog entries for the two datasets.
  • u

    user

    01/26/2023, 12:48 PM
Kedro catalog fails when overwriting a GeoJSON dataset even though the driver is supported. I have the following catalog item in my kedro project:
Copy code
suggested_routes_table@geopandas:
  type: geopandas.GeoJSONDataSet
  filepath: data/04_feature/routes_suggestions_table.geojson
  load_args:
    driver: "GeoJSON"
    mode: "a"
The keyword argument mode: "a" stands for append, meaning that every time the node is run, it should append new rows to the GeoJSON instead of overwriting the file at the path. As stated in <a...
  • s

    Sergei Benkovich

    01/26/2023, 1:20 PM
Is it possible to make the versioned results be saved in the same folder? I produce reports, and I want all the reports from one run to be in the same folder. Currently versioned: true just places each figure in a separate folder named with the timestamp at which it ran, rather than one folder for the whole pipeline run.
  • a

    Andrew Stewart

    01/27/2023, 4:55 AM
    So just throwing this out there - but does anyone happen to have a solid example of using kedro w/ poetry +
    kedro-docker
    ?
  • p

    Paul Mora

    01/27/2023, 8:44 AM
    Hey guys - I am currently trying to save/load pyspark ml objects through the catalog. The documentation states the following: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#use-memorydataset-with-copy-mode-assign-for-non-dataframe-spark-objects and the recommendation to use
    MemoryDataSets
for those non-DataFrame instances. That is all fine and well, though of course not being able to save any transformers becomes quite tedious at some point. Is there any guidance/development on that front?
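(For reference, the pattern the linked docs describe is a catalog entry like the one below; the dataset name is hypothetical. copy_mode: assign keeps the Spark object in memory by reference instead of deep-copying it:)

```yaml
# catalog.yml -- keep a non-DataFrame Spark object (e.g. a fitted
# transformer) in memory without copying it between nodes
trained_transformer:
  type: MemoryDataSet
  copy_mode: assign
```

For actual persistence, one common workaround is a small custom dataset whose _save/_load delegate to the Spark model's own .save()/.load(); whether anything official is planned is a question for the maintainers.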
  • m

    Massinissa Saïdi

    01/27/2023, 11:42 AM
    Hello kedroids! I have an error that I can't understand:
    Copy code
    DataSetError: 
    botocore.session.session.create_client() got multiple values for keyword 
    argument 'aws_access_key_id'.
    DataSet 'dataset' must only contain valid arguments for the 
    constructor of 'kedro.extras.datasets.pandas.csv_dataset.CSVDataSet'.
    I run my code from a
    docker-compose
    with only one container (for now), I write files in s3. I specified the credentials this way:
    Copy code
    aws_credentials:
        aws_access_key_id: XXXXXXX
        aws_secret_access_key: XXXXXXX
    and my dataframe in
    catalog.yml
    this way:
    Copy code
    dataset:
      type: pandas.CSVDataSet
      filepath: ${s3.path}/data/dataset.csv
      credentials: aws_credentials
    docker-compose.yml
    Copy code
version: '3.7'

services:
  kedro:
    build:
      context: .
      args:
        PIP_USERNAME: ${PIP_USERNAME}
        PIP_PASSWORD: ${PIP_PASSWORD}
        PIP_REPO: ${PIP_REPO}
      dockerfile: dockerfile.kedro
      cache_from:
        - ia-churn
    image: ia-churn
    command: kedro run --env prod --pipeline data-processing
    volumes:
      - .:/usr/src/app/
      - ./data/01_raw/:/usr/src/app/data/01_raw
In a
conda
environment everything works. Does someone have an idea, please? More information: I'm using kedro v0.18.4 and Python 3.10.
  • p

    Patrick Deutschmann

    01/27/2023, 1:19 PM
    Hey everyone! I’m new to Kedro, and I first want to thank all the contributors. You’ve genuinely built a fantastic tool! Is it possible to save outputs to multiple data sets? For instance, I’d like to write my feature data both to the local file system and to, say, an Azure blob storage. Thanks 😊
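(One common workaround is a node that returns the same object twice, so it can be bound to two catalog entries pointing at different storage backends; a sketch with hypothetical dataset names:)

```python
def fan_out(features):
    """Return the same object twice so two catalog entries can persist it
    (e.g. a local CSVDataSet and an Azure blob dataset).

    In pipeline.py this might be wired up as (names hypothetical):
        node(fan_out, inputs="features",
             outputs=["features_local", "features_azure"])
    """
    return features, features
```
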
  • m

    Massinissa Saïdi

    01/27/2023, 4:12 PM
Hello! I'm using kedro with SageMaker, following this kedro-tutorial, and I have a question: is it possible to use functions created in nodes inside the
sagemaker_entry_point.py
script? For example:
    Copy code
    ...
    from pipelines.ml_model.model import train_model
    
    ...
    
    def main():
        ....
        regressor = train_model(...)
        ...
    
    if __name__ == "__main__":
        # SageMaker will run this script as the main program
        main()
    Because I have this error:
    ModuleNotFoundError: No module named 'pipelines'
    Thanks for your help 🙂
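(The ModuleNotFoundError usually means the entry-point script can't see the project's src/ layout on sys.path. A hedged sketch of the top of the entry point; the relative location of src/ is an assumption. After this, importing via the full package path, e.g. from <package_name>.pipelines.ml_model.model import train_model, tends to be more reliable than a bare pipelines import:)

```python
import sys
from pathlib import Path

# Assumes the standard Kedro src/<package_name>/ layout sits next to the
# entry-point script; adjust SRC to wherever the source tree actually lives.
SRC = Path("src").resolve()
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))
```
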
  • a

    Alexandra Lorenzo

    01/27/2023, 5:49 PM
Hello, how can I read specific files (images) based on the filename prefix (for example)? I'm using PartitionedDataSet to read and write images with a specific extra dataset. My folder is organized as follows, with more than 120,000 images:
Department 1
|-> Zone 1
|---> IMG_00001.tif
|---> MSK_00001.tif
I need to read IMG_*****.tif first, then MSK_*****.tif. Is that possible? Thanks for your help
    Copy code
    raw_images:
      type: PartitionedDataSet
      dataset:
        type: flair_ign.extras.datasets.satellite_image.SatelliteImageDataSet
      path: /home/ubuntu/train
      filename_suffix: .tif
      layer: raw
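(Since a PartitionedDataSet hands the node a dict of partition id -> load callable, the prefix filtering and ordering can happen inside the node itself; a sketch with plain dicts, where the dataset types are assumptions and only the prefix logic is the point:)

```python
def partitions_by_prefix(partitions, prefix):
    """Select partitions whose basename starts with `prefix`.

    `partitions` is the dict a PartitionedDataSet passes to a node:
    partition id -> load callable. Sorting keeps pairs aligned
    (IMG_00001 with MSK_00001, and so on).
    """
    return {
        pid: loader
        for pid, loader in sorted(partitions.items())
        if pid.rsplit("/", 1)[-1].startswith(prefix)
    }
```

A node could then iterate partitions_by_prefix(parts, "IMG_") first and partitions_by_prefix(parts, "MSK_") second.
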
  • a

    Andrew Stewart

    01/28/2023, 12:15 AM
    Anyone else happen to be using Athena as inputs for Kedro? Found this: https://github.com/atsangarides/kedroio but wondering if anyone is doing anything different
  • r

    Rob

    01/28/2023, 6:00 PM
    Is there a way to set a
    main.py
instead of using the CLI commands to run all the pipelines? (If so, any docs or examples would be great.) (Using
    kedro==0.17.7
    )
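(In the 0.17.x line, a programmatic run generally goes through KedroSession; a sketch of what main.py could look like, where the package name is hypothetical and the exact create() signature should be checked against the installed version:)

```python
def run_project(package_name, project_path=".", env=None, pipeline_name=None):
    """Programmatic equivalent of `kedro run` -- sketch for Kedro 0.17.x.

    The import is kept inside the function so this file can be read
    without Kedro installed; verify KedroSession.create()'s signature
    against your installed version before relying on it.
    """
    from kedro.framework.session import KedroSession

    with KedroSession.create(
        package_name, project_path=project_path, env=env
    ) as session:
        return session.run(pipeline_name=pipeline_name)
```

Usage would be something like run_project("my_package") from a main() entry point, with "my_package" standing in for the project's actual package name.
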
  • o

    Ofir

    01/28/2023, 6:26 PM
How do Git and Kedro play ball together? We have a classification data science pipeline written in Python and hosted in a GitHub repository. While I get the concept of a Kedro project and having a workspace per data model, I don't get how to sync the code across projects/workspaces/experiments. Should Kedro tasks (and pipelines) be thin wrappers that import my existing Python code, or not? What are the best practices if you already have an existing code base and Git repository with your code? Thanks!
  • o

    Ofir

    01/28/2023, 6:33 PM
I guess what I'm missing is how Kedro is integrated as part of a real-world application, and not just data science in a vacuum. Is there, say, a kedro folder in Git with a per-experiment folder and relative Python imports for the core code? Pointers to a real-world application on GitHub that uses Kedro across different experiments would be useful.
  • s

    Sergei Benkovich

    01/29/2023, 9:12 AM
    in globals.yaml i try to use something like:
    Copy code
    split_folder: "split_1"
    
    folders:
      raw: "{split_folder}/01_raw"
but it doesn't work; I just get a new folder literally called {split_folder}/01_raw. Is there any way to accomplish this? I'm running several versions one after the other, and I want each one in a different folder, but I don't want to have to change the paths for all the subdirectories I defined...
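(As far as the TemplatedConfigLoader docs describe, values from globals.yml are substituted into the other config files via ${...}, but globals.yml itself is not re-interpolated, which would explain the literal {split_folder} folder. One hedged workaround is to do the templating only in the files that consume the value; the keys and paths below are hypothetical:)

```yaml
# conf/base/globals.yml -- plain values only; no ${...} references here
split_folder: split_1

# conf/base/catalog.yml -- TemplatedConfigLoader substitutes ${...} here
my_raw_dataset:
  type: pandas.CSVDataSet
  filepath: data/${split_folder}/01_raw/data.csv
```

Switching split_folder between runs then moves every templated path at once.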