# questions
  • e

    Eduardo Romero López

    07/12/2023, 12:37 PM
    and my dataframe contains those column names
  • e

    Eduardo Romero López

    07/12/2023, 12:38 PM
  • e

    Eduardo Romero López

    07/12/2023, 12:39 PM
    any suggestions?? please
  • j

    J. Camilo V. Tieck

    07/12/2023, 9:32 PM
    hi everyone, going back to this discussion with @Deepyaman Datta and @Juan Luis, I wanted to ask if anyone has experience with this. In summary, I have three Kedro pipelines:
    • data_processing
    • model_training
    • inference (gets input either from a file in batch, or from a user request on demand)
    I want to run the inference pipeline in a scheduled/batch job, and on demand.
    1. How would you deploy the inference pipeline?
    2. What AWS services do you recommend for the scheduled job? ECR + ECS? AWS Batch? The input is a file in an S3 location, and the output also goes to an S3 location.
    3. For the user-facing API I’m using a Lambda function and I’m loading the model.pkl from s3:/data/06_model_output/. I would like to use the inference pipeline here instead. How can I pass the request input to the pipeline? It is a dataframe, but it is not in the catalog. The output also has to be returned from the Lambda, with the dataframe converted to JSON.
    Sorry if this sounds trivial, but I’m having a hard time figuring out the architecture for this. Thanks in advance!
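On question 3, a hedged sketch of one possible Lambda shape: turn the request payload into a dataframe, run it through an inference entry point, and return JSON. Here `run_inference` is a hypothetical stand-in for invoking the Kedro inference pipeline (e.g. via a session with the request dataframe fed in as an in-memory dataset), and the `records` event shape is also an assumption:

```python
import json

import pandas as pd


def run_inference(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the Kedro inference pipeline; in a real project this
    # would run the pipeline with `df` injected as an in-memory dataset.
    return df.assign(score=1.0)


def handler(event, context=None):
    # Request JSON -> dataframe -> pipeline -> JSON response.
    df = pd.DataFrame(event["records"])
    out = run_inference(df)
    return {"statusCode": 200, "body": out.to_json(orient="records")}
```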
  • l

    Leslie Wu

    07/13/2023, 12:51 PM
    Hey everyone! Facing an issue where I am trying to LOAD Excel data (.xlsx) stored in an S3 bucket; writing and saving to the same S3 bucket is OK. Error message:
    kedro.io.core.DataSetError: Failed while loading data from data set ExcelDataSet(filepath=my/s3/path/file.xlsx, load_args={'engine': openpyxl, 'sheet_name': Sheet1}, protocol=s3, save_args={'index': False}, writer_args={'engine': xlsxwriter}).
    my/s3/path/file.xlsx
    I have no issues with other formats (parquet / csv / PDF). Has anyone seen this before, or have insights into where I am going wrong? FYI, I am using kedro==0.17.7.
  • m

    Michel van den Berg

    07/13/2023, 1:00 PM
    Does a Kedro node always have one output dataset?
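For context, nodes are not limited to a single output: the wrapped function can return a tuple, and Kedro maps each element to one output dataset. A minimal sketch (the `node(...)` call is shown as a comment and assumes the usual `kedro.pipeline.node` signature):

```python
# A node function returning two values; passing a list to `outputs`
# in the node definition maps them to two datasets.
def split(rows):
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]  # two return values -> two datasets

# In a pipeline definition (not executed here):
# node(split, inputs="raw_rows", outputs=["train_rows", "test_rows"])
```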
  • m

    Merel

    07/13/2023, 4:09 PM
    Is it possible to directly load an .xlsx file into a Kedro SparkDataSet?
  • r

    Rachid Cherqaoui

    07/14/2023, 5:30 PM
    Hi, I'm writing some unit tests for my Kedro project and I'm trying to record some outputs to test against, but I got this error and I don't understand how to fix it. Could someone please help me? Thanks in advance.
  • n

    Nelson Zambrano

    07/15/2023, 11:02 PM
    Are posts older than 90 days archived somewhere?
  • d

    Dawid Bugajny

    07/17/2023, 9:14 AM
    Hello! I have created an endpoint using FastAPI. Every request creates a KedroSession, runs a specific pipeline, and returns the results:
    with KedroSession.create(...) as session:
        context = session.load_context()
        cat = context.catalog
        return SequentialRunner().run(catalog=cat, pipeline=pipeline)[...]
    I have just discovered that my API is single-threaded, and new requests have to wait until previous requests finish. Does anybody have a solution for this problem and know how to make the API multithreaded?
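One hedged sketch of a workaround: FastAPI runs `async def` endpoints on the event loop, so a blocking Kedro run inside one stalls every other request; pushing the blocking call onto a thread pool keeps the loop free. `run_pipeline` below is a stand-in for the `KedroSession` block in the message (whether concurrent `KedroSession` runs are safe in one process is a separate question to verify):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def run_pipeline(payload):
    # Stand-in for: with KedroSession.create(...) as session: ...
    return {"result": payload * 2}


async def handle_request(payload):
    # Offload the blocking run to a worker thread so the event loop
    # can keep accepting other requests meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_pipeline, payload)
```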
  • e

    Eduardo Romero López

    07/17/2023, 9:49 AM
    hello!!! Is it possible to show a dataset in Kedro-Viz that is not linked to any pipeline? I would like to show it for teaching the team. Thanks 🙂
  • j

    Jo Stichbury

    07/17/2023, 3:41 PM
    Hi everyone! I wanted first to thank all the participants of the recent Kedro documentation survey, and second, to apologise 🙇‍♀️. We are still waiting for our rebranded Kedro merch, so we have yet to send anything out to those who responded, but I am following up on that. 👕
    Can I ask a follow-up question to everyone on this channel? It's still about docs, specifically the Spaceflights tutorial. We received feedback that you'd like more examples of how to extend Spaceflights for other, more advanced scenarios, e.g. adding S3 as a filestore, deployment options, etc. We wouldn't extend the starter, but we would add extra example code and how-to sections in the docs. The question is: what do you think we should add to extend Spaceflights for common tasks and scenarios? Please leave me a comment in the 🧵
    And, finally, if you're interested in the outcome of the documentation user research, there's now a milestone on GitHub with some of the major activities we have planned. It's all work in progress, but I'm sharing it here for transparency. We always appreciate feedback and suggestions!
  • h

    Higor Carmanini

    07/17/2023, 7:26 PM
    Hello Kedro people! I wonder if anyone has been able to resolve this annoying issue of VSCode's pylance incorrectly inferring that the pipeline function (as imported from kedro.pipeline) is actually a module. It gets in the way of showing the proper documentation for kedro.pipeline.modular_pipeline.pipeline(), and I figure it could turn some less Kedro-savvy devs away by making them think they're doing it wrong (me, a while back 🙃).
  • r

    Rachid Cherqaoui

    07/17/2023, 9:04 PM
    Hi, I'm doing some unit tests on my project and I'm using the PartitionedDataSet class from kedro.io to load data, but I've just seen that it doesn't take the delimiter into account. How can I solve this? (I'm working on CSV files locally; here is the code used:)
    data_set = PartitionedDataSet(
        path="data/01_raw/Tableaux",
        dataset=CSVDataSet,
        filename_suffix=".csv",
        load_args={"delimiter": ";", "header": 0, "encoding": "utf-8"},
    )
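One possible explanation, offered as an assumption: for `PartitionedDataSet`, a top-level `load_args` is passed to the filesystem layer, while arguments intended for the underlying `CSVDataSet` belong inside the `dataset` definition itself. A catalog-style sketch of that nesting (the dataset name is hypothetical; the paths and arguments are from the message):

```yaml
tableaux:
  type: PartitionedDataSet
  path: data/01_raw/Tableaux
  filename_suffix: ".csv"
  dataset:
    type: pandas.CSVDataSet
    load_args:
      delimiter: ";"
      header: 0
      encoding: "utf-8"
```

The Python-API equivalent would pass a dict, e.g. `dataset={"type": CSVDataSet, "load_args": {...}}`, instead of the bare class.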
  • m

    Marc Gris

    07/18/2023, 4:47 AM
    Hi everyone, Is there a way to have “truly global params”, i.e params that are immune to “namespace-ing” ? So far, it seems using the namespace feature, involves creating as many duplicated params as there are namespaces… This is quite an unfortunate behavior since (and I guess that I’m not the only one in this situation) a fairly big portion of my config is immutable across namespaces and is uselessly duplicated… Granted, I could use templating or anchors / aliases, but this feels a bit “hacky”. Is there a “cleaner” / more “elegant” way ? Thanks M
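For reference, the anchor/alias workaround mentioned in the message looks roughly like this (the parameter names are made up); it keeps a single definition in `parameters.yml`, although each namespace still receives its own copy at load time:

```yaml
# conf/base/parameters.yml — hypothetical shared values
_shared: &shared_params
  test_size: 0.2
  random_state: 42

namespace_a:
  <<: *shared_params

namespace_b:
  <<: *shared_params
```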
  • j

    Jackson

    07/18/2023, 6:54 AM
    Hi, I am curious about the best practice for using Kedro. Currently, my application involves initializing vector stores and adding documents with their corresponding embeddings into them, which isn't a standalone function that can be written in nodes.py. The following is how the code is written:
    import chromadb

    class VectorStore:
        def __init__(
                self,
                client_path,
                embedding_func) -> None:
            self.collections = None
            self.client = chromadb.PersistentClient(path=client_path)
            self.embedding_func = embedding_func

        def create_collections(self, collection_name):
            self.collections = self.client.create_collection(collection_name, self.embedding_func)
            return self.collections

        def add_docs(
                self,
                collections,
                embeddings,
                metadatas,
                ids):
            collections.add(
                embeddings=embeddings,
                metadatas=metadatas,
                ids=ids
            )
    However, putting this inside nodes.py doesn't seem ideal, because I still have other classes (like a model class) and I believe mixing everything inside nodes is an anti-pattern. But writing a standalone function in nodes.py like the one below seems redundant:
    def create_collections(collections, collections_name):
        collections.create_collections(collections_name)
    So my question is: what is the best way to separate classes and nodes, while avoiding redundant code at the same time?
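One commonly suggested pattern, sketched here without chromadb so it stays self-contained: keep stateful classes in their own module and let `nodes.py` hold only thin, stateless wrappers that receive the object as an input. The class and function names below are illustrative, not Kedro API:

```python
# vector_store.py (hypothetical module outside nodes.py)
class VectorStore:
    """Owns the client state; nothing Kedro-specific lives here."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name):
        self._collections[name] = []
        return self._collections[name]

    def add_docs(self, name, docs):
        self._collections[name].extend(docs)
        return len(self._collections[name])


# nodes.py: thin wrappers, so the pipeline stays declarative and the
# class remains independently testable
def create_collection(store, name):
    store.create_collection(name)
    return store


def add_docs(store, name, docs):
    return store.add_docs(name, docs)
```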
  • d

    Daniel Lee

    07/18/2023, 8:42 AM
    Hi team, under DataCatalog I would like pandas.ParquetDataSet to partition by the date in the dataset and save into different folders by date in parquet, like we can do with spark.SparkDataSet. Is there a way we could partition using pandas?
  • z

    Zemeio

    07/18/2023, 9:26 AM
    Hey guys, I'm trying to do a "for loop" in the catalog, so I am using TemplatedConfig so I can use Jinja. I am trying the following, and I am not sure why it is not working. Catalog:
    {%- for item in mylist %}
    out.blind_predictions_{{ item-}}:
        type: pandas.CSVDataSet
        filepath: ${filepath1}_{{ item-}}.csv
        layer: out
    {% endfor %}
    Globals:
    mystli:
        - item1
        - item2
    (For obvious reasons I removed the actual names from the text here.) Does anyone know how to accomplish this (a for loop here)?
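One thing worth checking, offered as a guess since the real names were redacted: the template iterates over `mylist`, while the globals key is spelled `mystli`, and Jinja can only see a list whose globals key matches the variable name used in the template. A sketch of the globals side with matching names:

```yaml
# globals.yml — key name must match the variable the template loops over
mylist:
  - item1
  - item2
```

`catalog.yml` would then iterate `{% for item in mylist %} … {% endfor %}` as in the original snippet.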
  • m

    Marc Gris

    07/18/2023, 1:12 PM
    Hi again! Is there a way to template / interpolate in catalog.yml values that are defined in parameters.yml? E.g. in conf/base/parameters.yml:
    tenant_id: xyz
    and in conf/base/catalog.yml:
    _tenant_id: ${tenant_id}
    Thx in advance
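As context, a hedged sketch: `TemplatedConfigLoader` resolves `${...}` in `catalog.yml` from its globals (via `globals_pattern` or `globals_dict`), not from `parameters.yml`, so one option is to move the shared value into a globals file:

```yaml
# conf/base/globals.yml — picked up if the project's config loader is
# configured with globals_pattern="*globals.yml" (an assumption here)
tenant_id: xyz
```

`catalog.yml` can then use `_tenant_id: ${tenant_id}` exactly as in the message.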
  • r

    Rachid Cherqaoui

    07/18/2023, 3:00 PM
    How can I reduce the execution time of a kedro project? Is there anything I should be looking at?
  • r

    Rachid Cherqaoui

    07/19/2023, 9:29 AM
    Hello, I have a problem with the execution time of my Kedro project: when I execute it with kedro run --async, it takes significantly less time than when I use KedroSession.create().run() with FastAPI (knowing that I made my POST function an async def). My question is: how can I use the async argument with KedroSession? Is it at the level of hooks, or somewhere else? Thank you in advance.
  • m

    Marc Gris

    07/19/2023, 10:01 AM
    DATASET FACTORY Hi everyone, I’m experimenting with dataset factories. I had managed to make it work 30 min ago, and now I can’t… 😕 go figure 😅 If any of you can spot what I’m doing wrong, please let me know 🙂 Thx
  • m

    Marc Gris

    07/19/2023, 11:19 AM
    regarding the above ⬆️ , I’ve just checked with the debugger and indeed… :
  • m

    Marc Gris

    07/19/2023, 2:11 PM
    Hi everyone, I’m having a bit of a hard time understanding what is meant by the following error message:
    ModularPipelineError: Inputs should be free inputs to the pipeline
    Could someone kindly unpack / explain it? Thx
  • c

    Cyril Verluise

    07/19/2023, 5:55 PM
    Hey there! Hope you are all doing well. Some open-ended questions for you 🤗
    ℹ️ Context: let's say that the input/output of a pipeline is a collection of files (e.g. images) and that I want to apply the same function over all these files. I don't want to load/dump all the files at once, for memory reasons. My understanding is that this does not fit the off-the-shelf Kedro approach; feel free to correct me if I'm wrong.
    📄 Best practice for multi-file data input/output: what's the best practice for handling such cases in Kedro? I have seen workarounds in the past with internal functions loading/dumping blobs and faking input/output for Kedro with an empty file. I'm wondering if there is anything you would recommend.
    🤖 Kedro distributed approach: is there any recommended approach if I want to distribute the above processing over multiple machines? I was considering Argo Workflows, but I see that the Kedro doc on Argo is deprecated. Does that mean it is not the recommended approach? If so, what would be recommended?
    Thanks a lot in advance!
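On the multi-file question, one relevant building block: `PartitionedDataSet` loads partitions lazily, handing the node a dict of partition id to zero-argument load callable, so only one file needs to be in memory at a time. A self-contained sketch of the node side (`transform` and the partition contents are stand-ins):

```python
def transform(data):
    # Stand-in for the real per-file function (e.g. image processing).
    return [x * 2 for x in data]


def process_partitions(partitions):
    # `partitions` maps partition id -> zero-argument load callable,
    # which is the shape a PartitionedDataSet passes to a node.
    results = {}
    for pid, load in partitions.items():
        data = load()  # only this partition is in memory now
        results[pid] = transform(data)
    return results
```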
  • v

    Vincent Liagre

    07/20/2023, 11:44 AM
    Hello team; I have installed my module (within src/my_module) with pip install -e src; now kedro is looking for data within the my_module folder from the root for some reason. Any clue what's going on here and how I can solve this? Happy to provide more details if required 🙂
  • m

    Marc Gris

    07/20/2023, 2:47 PM
    Hi everyone, local/catalog.yml does not override base/catalog.yml… Any idea what could cause this behavior? Thx M.
  • c

    Christos Malliopoulos

    07/21/2023, 11:18 AM
    Hello all! I’m a newcomer. Do you know if there’s a process/channel for submitting bugs/issues?
  • n

    Nok Lam Chan

    07/21/2023, 3:32 PM
    [Not a Kedro Question] What’s the most efficient way to find out some basic statistics of a CSV (file size, number of rows and columns)? Requirements:
    • Memory efficient (cannot load the full CSV into a dataframe and do df.describe())
    • Needs to work on Windows and Linux, so wc is not an option
    • Needs to be fast
    • Bonus: is it possible to generalise this to the Excel filetype?