# questions
  • e

    Eduardo Romero López

    07/12/2023, 12:37 PM
    and my dataframe contains those column names
  • e

    Eduardo Romero López

    07/12/2023, 12:38 PM
  • e

    Eduardo Romero López

    07/12/2023, 12:39 PM
    any suggestions?? please
  • j

    J. Camilo V. Tieck

    07/12/2023, 9:32 PM
    hi everyone, going back to this discussion with @Deepyaman Datta and @Juan Luis, I wanted to ask if anyone has experience with this. In summary, I have three Kedro pipelines:
    • data_processing
    • model_training
    • inference (gets input either from a file in batch, or from a user request on demand)
    I want to run the inference pipeline in a scheduled/batch job, and on demand.
    1. How would you deploy the inference pipeline?
    2. What AWS services do you recommend for the scheduled job? ECR + ECS? AWS Batch? The input is a file in an S3 location, and the output also goes to an S3 location.
    3. For the user-facing API I’m using a Lambda function and I’m loading the model.pkl from s3:/data/06_model_output/. I would like to use the inference pipeline here instead. How can I pass the request input to the pipeline? It is a dataframe, but it is not in the catalog. The output also has to be returned from the Lambda, with the dataframe converted to JSON.
    Sorry if this sounds trivial, but I’m having a hard time figuring out the architecture for this. Thanks in advance!
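On question 3, a hedged sketch of one possible Lambda shape: turn the request payload into a dataframe, run it through an inference entry point, and return JSON. Here `run_inference` is a hypothetical stand-in for invoking the Kedro inference pipeline (e.g. via a session with the request dataframe fed in as an in-memory dataset), and the `records` event shape is also an assumption:

```python
import json

import pandas as pd


def run_inference(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the Kedro inference pipeline; in a real project this
    # would run the pipeline with `df` injected as an in-memory dataset.
    return df.assign(score=1.0)


def handler(event, context=None):
    # Request JSON -> dataframe -> pipeline -> JSON response.
    df = pd.DataFrame(event["records"])
    out = run_inference(df)
    return {"statusCode": 200, "body": out.to_json(orient="records")}
```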
  • l

    Leslie Wu

    07/13/2023, 12:51 PM
    Hey everyone! Facing an issue where I am trying to LOAD Excel data (.xlsx) stored in an S3 bucket; writing and saving to the same S3 bucket is OK. Error message:
    kedro.io.core.DataSetError: Failed while loading data from data set ExcelDataSet(filepath=my/s3/path/file.xlsx, load_args={'engine': openpyxl, 'sheet_name': Sheet1}, protocol=s3, save_args={'index': False}, writer_args={'engine': xlsxwriter}).
    my/s3/path/file.xlsx
    I have no issues with other formats (parquet / csv / PDF). Has anyone seen this before, or have insights into where I am going wrong? FYI, I am using kedro==0.17.7.
  • m

    Michel van den Berg

    07/13/2023, 1:00 PM
    Does a Kedro node always have one output dataset?
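For context, nodes are not limited to a single output: the wrapped function can return a tuple, and Kedro maps each element to one output dataset. A minimal sketch (the `node(...)` call is shown as a comment and assumes the usual `kedro.pipeline.node` signature):

```python
# A node function returning two values; passing a list to `outputs`
# in the node definition maps them to two datasets.
def split(rows):
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]  # two return values -> two datasets

# In a pipeline definition (not executed here):
# node(split, inputs="raw_rows", outputs=["train_rows", "test_rows"])
```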
  • m

    Merel

    07/13/2023, 4:09 PM
    Is it possible to directly load an .xlsx file into a Kedro SparkDataSet?
  • r

    Rachid Cherqaoui

    07/14/2023, 5:30 PM
    Hi, I'm writing some unit tests for my Kedro project and I'm trying to record some outputs to test against, but I got this error and I don't understand how to fix it. Could someone please help me? Thanks in advance.
  • n

    Nelson Zambrano

    07/15/2023, 11:02 PM
    Are posts older than 90 days archived somewhere?
  • d

    Dawid Bugajny

    07/17/2023, 9:14 AM
    Hello! I have created an endpoint using FastAPI. Every request creates a KedroSession, runs a specific pipeline, and returns the results:
    with KedroSession.create(...) as session:
        context = session.load_context()
        cat = context.catalog
        return SequentialRunner().run(catalog=cat, pipeline=pipeline)[...]
    I have just discovered that my API is single-threaded, and new requests have to wait until previous requests finish. Does anybody have a solution for this problem and know how to make the API multithreaded?
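One hedged sketch of a workaround: FastAPI runs `async def` endpoints on the event loop, so a blocking Kedro run inside one stalls every other request; pushing the blocking call onto a thread pool keeps the loop free. `run_pipeline` below is a stand-in for the `KedroSession` block in the message (whether concurrent `KedroSession` runs are safe in one process is a separate question to verify):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def run_pipeline(payload):
    # Stand-in for: with KedroSession.create(...) as session: ...
    return {"result": payload * 2}


async def handle_request(payload):
    # Offload the blocking run to a worker thread so the event loop
    # can keep accepting other requests meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_pipeline, payload)
```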
  • e

    Eduardo Romero López

    07/17/2023, 9:49 AM
    hello!!! Is it possible to show a dataset in Kedro-Viz that is not linked to any pipeline? I would like to show it for teaching the team. Thanks 🙂
  • j

    Jo Stichbury

    07/17/2023, 3:41 PM
    Hi everyone! I wanted first to thank all the participants of the recent Kedro documentation survey, and second, to apologise 🙇‍♀️. We are still waiting for our rebranded Kedro merch, so we have yet to send anything out to those who responded, but I am following up on that. 👕
    Can I ask a follow-up question to everyone on this channel? It's still about docs, specifically the Spaceflights tutorial. We received feedback that you'd like more examples of how to extend Spaceflights for other, more advanced scenarios, e.g. adding S3 as a filestore, deployment options, etc. We wouldn't extend the starter, but we would add extra example code and how-to sections in the docs. The question is: what do you think we should add to extend Spaceflights for common tasks and scenarios? Please leave me a comment in the 🧵
    And, finally, if you're interested in the outcome of the documentation user research, there's now a milestone on GitHub with some of the major activities we have planned. It's all work in progress, but I'm sharing it here for transparency. We always appreciate feedback and suggestions!
  • h

    Higor Carmanini

    07/17/2023, 7:26 PM
    Hello Kedro people! I wonder if anyone has been able to resolve this annoying issue of VSCode's pylance incorrectly inferring that the pipeline function (as imported from kedro.pipeline) is actually a module. It gets in the way of showing the proper documentation for kedro.pipeline.modular_pipeline.pipeline(), and I figure it could turn some less Kedro-savvy devs away by making them think they're doing it wrong (me, a while back 🙃).
  • r

    Rachid Cherqaoui

    07/17/2023, 9:04 PM
    Hi, I'm doing some unit tests on my project and I'm using the PartitionedDataSet class from kedro.io to load data, but I've just seen that it doesn't take the delimiter into account. How can I solve this? (I'm working on CSV files locally; here is the code used:)
    data_set = PartitionedDataSet(
        path="data/01_raw/Tableaux",
        dataset=CSVDataSet,
        filename_suffix=".csv",
        load_args={"delimiter": ";", "header": 0, "encoding": "utf-8"},
    )
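One possible explanation, offered as an assumption: for `PartitionedDataSet`, a top-level `load_args` is passed to the filesystem layer, while arguments intended for the underlying `CSVDataSet` belong inside the `dataset` definition itself. A catalog-style sketch of that nesting (the dataset name is hypothetical; the paths and arguments are from the message):

```yaml
tableaux:
  type: PartitionedDataSet
  path: data/01_raw/Tableaux
  filename_suffix: ".csv"
  dataset:
    type: pandas.CSVDataSet
    load_args:
      delimiter: ";"
      header: 0
      encoding: "utf-8"
```

The Python-API equivalent would pass a dict, e.g. `dataset={"type": CSVDataSet, "load_args": {...}}`, instead of the bare class.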
  • m

    Marc Gris

    07/18/2023, 4:47 AM
    Hi everyone, Is there a way to have “truly global params”, i.e params that are immune to “namespace-ing” ? So far, it seems using the namespace feature, involves creating as many duplicated params as there are namespaces… This is quite an unfortunate behavior since (and I guess that I’m not the only one in this situation) a fairly big portion of my config is immutable across namespaces and is uselessly duplicated… Granted, I could use templating or anchors / aliases, but this feels a bit “hacky”. Is there a “cleaner” / more “elegant” way ? Thanks M
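For reference, the anchor/alias workaround mentioned in the message looks roughly like this (the parameter names are made up); it keeps a single definition in `parameters.yml`, although each namespace still receives its own copy at load time:

```yaml
# conf/base/parameters.yml — hypothetical shared values
_shared: &shared_params
  test_size: 0.2
  random_state: 42

namespace_a:
  <<: *shared_params

namespace_b:
  <<: *shared_params
```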
  • j

    Jackson

    07/18/2023, 6:54 AM
    Hi, I am curious about the best practice for using Kedro. Currently, my application involves initializing vector stores and adding documents with their corresponding embeddings into them, which isn't a standalone function that can be written in nodes.py. The following is how the code is written:
    import chromadb

    class VectorStore:
        def __init__(
                self,
                client_path,
                embedding_func) -> None:
            self.collections = None
            self.client = chromadb.PersistentClient(path=client_path)
            self.embedding_func = embedding_func

        def create_collections(self, collection_name):
            self.collections = self.client.create_collection(collection_name, self.embedding_func)
            return self.collections

        def add_docs(
                self,
                collections,
                embeddings,
                metadatas,
                ids):
            collections.add(
                embeddings=embeddings,
                metadatas=metadatas,
                ids=ids
            )
    However, putting this inside nodes.py doesn't seem ideal, because I still have other classes (like a model class) and I believe mixing everything inside nodes is an anti-pattern. But writing a standalone function in nodes.py like the one below seems redundant:
    def create_collections(collections, collections_name):
        collections.create_collections(collections_name)
    So my question is: what is the best way to separate classes and nodes, while avoiding redundant code at the same time?
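One commonly suggested pattern, sketched here without chromadb so it stays self-contained: keep stateful classes in their own module and let `nodes.py` hold only thin, stateless wrappers that receive the object as an input. The class and function names below are illustrative, not Kedro API:

```python
# vector_store.py (hypothetical module outside nodes.py)
class VectorStore:
    """Owns the client state; nothing Kedro-specific lives here."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name):
        self._collections[name] = []
        return self._collections[name]

    def add_docs(self, name, docs):
        self._collections[name].extend(docs)
        return len(self._collections[name])


# nodes.py: thin wrappers, so the pipeline stays declarative and the
# class remains independently testable
def create_collection(store, name):
    store.create_collection(name)
    return store


def add_docs(store, name, docs):
    return store.add_docs(name, docs)
```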
  • d

    Daniel Lee

    07/18/2023, 8:42 AM
    Hi team, under DataCatalog I would like pandas.ParquetDataSet to partition by the date in the dataset and save into different folders by date in parquet, like we can do with spark.SparkDataSet. Is there a way we could partition using pandas?
  • z

    Zemeio

    07/18/2023, 9:26 AM
    Hey guys, I'm trying to do a "for loop" in the catalog, so I am using TemplatedConfig so I can use Jinja. I am trying the following, and I am not sure why it is not working. Catalog:
    {%- for item in mylist %}
    out.blind_predictions_{{ item-}}:
        type: pandas.CSVDataSet
        filepath: ${filepath1}_{{ item-}}.csv
        layer: out
    {% endfor %}
    Globals:
    mystli:
        - item1
        - item2
    (For obvious reasons I removed the actual names from the text here.) Does anyone know how to accomplish this (a for loop here)?
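One thing worth checking, offered as a guess since the real names were redacted: the template iterates over `mylist`, while the globals key is spelled `mystli`, and Jinja can only see a list whose globals key matches the variable name used in the template. A sketch of the globals side with matching names:

```yaml
# globals.yml — key name must match the variable the template loops over
mylist:
  - item1
  - item2
```

`catalog.yml` would then iterate `{% for item in mylist %} … {% endfor %}` as in the original snippet.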
  • m

    Marc Gris

    07/18/2023, 1:12 PM
    Hi again! Is there a way to template / interpolate in catalog.yml values that are defined in parameters.yml? E.g. in conf/base/parameters.yml:
    tenant_id: xyz
    and in conf/base/catalog.yml:
    _tenant_id: ${tenant_id}
    Thx in advance
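As context, a hedged sketch: `TemplatedConfigLoader` resolves `${...}` in `catalog.yml` from its globals (via `globals_pattern` or `globals_dict`), not from `parameters.yml`, so one option is to move the shared value into a globals file:

```yaml
# conf/base/globals.yml — picked up if the project's config loader is
# configured with globals_pattern="*globals.yml" (an assumption here)
tenant_id: xyz
```

`catalog.yml` can then use `_tenant_id: ${tenant_id}` exactly as in the message.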
  • r

    Rachid Cherqaoui

    07/18/2023, 3:00 PM
    How can I reduce the execution time of a kedro project? Is there anything I should be looking at?
  • r

    Rachid Cherqaoui

    07/19/2023, 9:29 AM
    Hello, I have a problem with the execution time of my Kedro project: when I execute it with kedro run --async, it takes significantly less time than when I use KedroSession.create().run() with FastAPI (knowing that I made my POST function an async def). My question is: how can I use the async argument with KedroSession? Is it at the level of hooks, or somewhere else? Thank you in advance.
  • m

    Marc Gris

    07/19/2023, 10:01 AM
    DATASET FACTORY Hi everyone, I’m experimenting with dataset factories. I had managed to make it work 30 min ago, and now I can’t… 😕 go figure 😅 If any of you can spot what I’m doing wrong, please let me know 🙂 Thx
  • m

    Marc Gris

    07/19/2023, 11:19 AM
    regarding the above ⬆️ , I’ve just checked with the debugger and indeed… :
  • m

    Marc Gris

    07/19/2023, 2:11 PM
    Hi everyone, I’m having a bit of a hard time understanding what is meant by the following error message:
    ModularPipelineError: Inputs should be free inputs to the pipeline
    Could someone kindly unpack / explain it? Thx
  • c

    Cyril Verluise

    07/19/2023, 5:55 PM
    Hey there! Hope you are all doing well. Some open-ended questions for you 🤗
    ℹ️ Context: let's say that the input/output of a pipeline is a collection of files (e.g. images) and that I want to apply the same function over all these files. I don't want to load/dump all the files at once, for memory reasons. My understanding is that this does not fit the off-the-shelf Kedro approach; feel free to correct me if I'm wrong.
    📄 Best practice for multi-file data input/output: what's the best practice for handling such cases in Kedro? I have seen workarounds in the past with internal functions loading/dumping blobs and faking input/output for Kedro with an empty file. I'm wondering if there is anything you would recommend.
    🤖 Kedro distributed approach: is there any recommended approach if I want to distribute the above processing over multiple machines? I was considering Argo Workflows, but I see that the Kedro doc on Argo is deprecated. Does that mean it is not the recommended approach? If so, what would be recommended?
    Thanks a lot in advance!
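On the multi-file question, one relevant building block: `PartitionedDataSet` loads partitions lazily, handing the node a dict of partition id to zero-argument load callable, so only one file needs to be in memory at a time. A self-contained sketch of the node side (`transform` and the partition contents are stand-ins):

```python
def transform(data):
    # Stand-in for the real per-file function (e.g. image processing).
    return [x * 2 for x in data]


def process_partitions(partitions):
    # `partitions` maps partition id -> zero-argument load callable,
    # which is the shape a PartitionedDataSet passes to a node.
    results = {}
    for pid, load in partitions.items():
        data = load()  # only this partition is in memory now
        results[pid] = transform(data)
    return results
```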
  • v

    Vincent Liagre

    07/20/2023, 11:44 AM
    Hello team; I have installed my module (within src/my_module) with pip install -e src; now kedro is looking for data within the my_module folder from the root for some reason. Any clue what's going on here and how I can solve this? Happy to provide more details if required 🙂
  • m

    Marc Gris

    07/20/2023, 2:47 PM
    Hi everyone, local/catalog.yml does not override base/catalog.yml… Any idea what could cause this behavior? Thx M.
  • c

    Christos Malliopoulos

    07/21/2023, 11:18 AM
    Hello all! I’m a newcomer. Do you know if there’s a process/channel for submitting bugs/issues?
  • n

    Nok Lam Chan

    07/21/2023, 3:32 PM
    [Not a Kedro Question] What’s the most efficient way to find out some basic statistics of a CSV (file size, number of rows and columns)? Requirements:
    • Memory efficient (cannot load the full CSV into a dataframe and do df.describe())
    • Needs to work on Windows and Linux, so wc is not an option
    • Needs to be fast
    • Bonus: is it possible to generalise this to the Excel filetype?