# questions
• **Bernardo Branco** (05/15/2023, 7:36 PM)
Hey everyone, I'm building tests for a Kedro PySpark pipeline and I would like to pass a specific Spark configuration needed for the tests. I have tried various things but nothing works. What is the best way to pass the Spark configuration typically found in `spark.yml` into tests? Thank you in advance!
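A minimal sketch of one approach: a pytest fixture that builds the test `SparkSession` from the project's `spark.yml`. It assumes `conf/base/spark.yml` holds flat `spark.*` key/value options; adjust the path and fixture scope to your project.

```python
# conftest.py (sketch) -- build a SparkSession from conf/base/spark.yml
from pathlib import Path

import pytest
import yaml
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    spark_options = yaml.safe_load(Path("conf/base/spark.yml").read_text())
    builder = SparkSession.builder.appName("tests")
    for key, value in spark_options.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    yield spark
    spark.stop()
```

Tests that need Spark can then simply take `spark_session` as an argument.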
• **Juan Luis** (05/16/2023, 7:53 AM)
hi folks, are custom omegaconf resolvers supposed to work for the catalog? I'm trying to define one but the raw `${name:value}` strings are passed to the dataset. I can elaborate more if needed.
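For reference, a sketch of registering a resolver globally so it exists when the config is parsed; `name` is a hypothetical resolver used as `${name:value}`. Whether the catalog loader picks it up depends on the Kedro version, so treat this as something to experiment with.

```python
# settings.py (sketch) -- register the resolver before config loading happens
from omegaconf import OmegaConf

if not OmegaConf.has_resolver("name"):
    OmegaConf.register_new_resolver("name", lambda value: f"resolved-{value}")
```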
• **Dotun O** (05/16/2023, 8:18 PM)
Hey team, can we set a way for the pipeline not to fail if a catalog entry does not exist? Is there a way to set a default `None` value if the catalog does not return an entry, and would this be set in the `hooks.py` file?
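One possible sketch, assuming the optional entry names are known up front: an `after_catalog_created` hook that injects a `MemoryDataSet` preloaded with `None` for any expected-but-missing entry. The entry name below is hypothetical.

```python
# hooks.py (sketch) -- "maybe_missing_input" is a hypothetical dataset name
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataSet

OPTIONAL_ENTRIES = ["maybe_missing_input"]


class DefaultNoneHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        for name in OPTIONAL_ENTRIES:
            if name not in catalog.list():
                # Nodes reading this entry now receive None instead of failing.
                catalog.add(name, MemoryDataSet(data=None))
```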
• **Afaque Ahmad** (05/17/2023, 3:41 AM)
Hi Kedro Team, I'm using Kedro `0.16.6` and `load_context` to get `params` and `"credentials*", "credentials*/**"`. We're upgrading Kedro to `0.18.8` and it seems `load_context` is no longer accessible. How can I replicate the same functionality in Kedro 0.18.8?
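A sketch of the 0.18.x equivalent: create a `KedroSession` and take `params` and credentials from the context it loads (run from inside the project directory).

```python
# Sketch: replacement for the old load_context() in Kedro 0.18.x
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()
bootstrap_project(project_path)

with KedroSession.create(project_path=project_path) as session:
    context = session.load_context()
    params = context.params
    credentials = context.config_loader.get("credentials*", "credentials*/**")
```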
• **Afaque Ahmad** (05/18/2023, 11:35 AM)
Hi Kedro Team, I'm using a `spark.SparkDataSet` to load and save datasets to a Delta lake. I need to save incremental numeric versions of the data, e.g. 1, 2, ..., as opposed to the current timestamp. Is there a way to do that in the current implementation, and to specify the version number while loading?
    👍 1
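Not an answer to writing custom version numbers, but worth noting: Delta itself assigns incremental numeric versions (0, 1, 2, ...) on every write, and `SparkDataSet` forwards `load_args` to the Spark reader, so time travel to a given version may work like this sketch (the path and version number are hypothetical):

```python
# Sketch: read a specific Delta version via Spark's versionAsOf option
from kedro.extras.datasets.spark import SparkDataSet

dataset = SparkDataSet(
    filepath="data/08_reporting/my_table",
    file_format="delta",
    load_args={"versionAsOf": 2},  # Delta time travel to version 2
)
df = dataset.load()
```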
• **Andrej Zachar** (05/18/2023, 3:59 PM)
Hello Kedro Team, I am currently working on a pipeline that trains a model, and I'm using versioning for that model. In the subsequent steps of the pipeline, I would like to use a specific version of the model to make predictions. However, it seems that Kedro node inputs don't support the use of versions. Below is a snippet of the problematic code:
```python
node(
    predict,
    inputs=["classifier_flaml:<version_ideally_from_params>", "X_src"],
    outputs=["y_src_pred", "y_src_pred_proba"],
),
```
    Can you provide guidance on how to accomplish this task? Thank you.
    ✅ 1
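For context, a sketch of what is possible today: node inputs cannot carry a version, but a load version can be pinned per run from the CLI (`kedro run --load-version "classifier_flaml:<timestamp>"` on 0.18.x), or in code via `kedro.io.Version`. The filepath and timestamp below are hypothetical.

```python
# Sketch: load a pinned version of a versioned dataset outside the node graph
from kedro.extras.datasets.pickle import PickleDataSet
from kedro.io import Version

dataset = PickleDataSet(
    filepath="data/06_models/classifier_flaml.pkl",
    version=Version(load="2023-05-18T15.59.00.000Z", save=None),
)
model = dataset.load()
```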
• **Jose Nuñez** (05/18/2023, 11:31 PM)
Hi everyone! Quick question: in one of my nodes I have a function `f` that takes a dataframe as input, does some stuff, and outputs a Python `dict`. Is there any way to save that dict in the data catalog? As a workaround I was saving it as a pandas CSV and later transforming it back to a dict, but I'm tired of doing that. Thanks in advance 😄
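A sketch of the usual fix: persist the dict with a JSON (or pickle) dataset instead of round-tripping through CSV. The filepath is hypothetical; the equivalent `catalog.yml` entry would use `type: json.JSONDataSet`.

```python
# Sketch: save/load a plain dict without the CSV detour
from kedro.extras.datasets.json import JSONDataSet

dataset = JSONDataSet(filepath="data/03_primary/my_dict.json")
dataset.save({"a": 1, "b": 2})
loaded = dataset.load()  # -> {'a': 1, 'b': 2}
```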
• **Muhammad Ghazalli** (05/19/2023, 2:33 AM)
Hi Kedro Team, I'm currently deploying Kedro on Kubernetes, scheduling it via Kubernetes CronJobs, and running it with `kedro run -p <pipeline_name>`. I'm facing an issue: when there's an error inside the pipeline (data not available or similar, which is fine), it continuously retries. I want to add an exit code to my Kedro pipeline so that if there's an error it exits immediately. Where do I put it? I can't find it in the docs. Thank you.
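Worth checking first: `kedro run` already exits with a non-zero status when the pipeline fails, so the endless retries are likely driven by the CronJob's `restartPolicy`/`backoffLimit`. That said, a hook can force an explicit immediate exit; a sketch:

```python
# hooks.py (sketch) -- exit with a non-zero code as soon as the pipeline errors
import sys

from kedro.framework.hooks import hook_impl


class ExitOnErrorHooks:
    @hook_impl
    def on_pipeline_error(self, error: Exception) -> None:
        print(f"Pipeline failed: {error}", file=sys.stderr)
        sys.exit(1)
```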
• **Matthias Roels** (05/19/2023, 6:59 AM)
What's the difference between the pandas generic dataset and, say, the pandas CSV dataset classes? From what I can tell, they offer the same functionality for reading CSV files. Is one a legacy version that was supposed to be replaced by the other?
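A sketch of the overlap: both of these read a CSV with pandas, but `GenericDataSet` dispatches dynamically to `pd.read_<file_format>`, so it also covers formats that have no dedicated dataset class; as far as I can tell it is not a legacy duplicate of `CSVDataSet`. The path is hypothetical.

```python
# Sketch: two ways to read the same CSV
from kedro.extras.datasets.pandas import CSVDataSet, GenericDataSet

csv_ds = CSVDataSet(filepath="data/01_raw/data.csv")
generic_ds = GenericDataSet(filepath="data/01_raw/data.csv", file_format="csv")
```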
• **Luis Cano** (05/19/2023, 3:29 PM)
Hello everyone! Quick question: what is the correct way of defining an optional input in a Kedro pipeline? Is it possible? The function takes some DataFrame inputs as optional, but I would also want that feature in the pipeline so I don't have to edit it every time one input is not available. Thanks!
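A sketch of one workaround, since Kedro nodes have no optional inputs: build the node's input list conditionally and let the function keep its Python default. `combine`, the dataset names, and the flag are hypothetical.

```python
# Sketch: conditionally wire an optional input
from kedro.pipeline import Pipeline, node, pipeline


def combine(main_df, extra_df=None):
    return main_df if extra_df is None else main_df.join(extra_df)


def create_pipeline(has_extra: bool = False, **kwargs) -> Pipeline:
    inputs = ["main_df", "extra_df"] if has_extra else ["main_df"]
    return pipeline([node(combine, inputs=inputs, outputs="combined_df")])
```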
• **Sneha Kumari** (05/19/2023, 6:11 PM)
Hello everyone, I am following the documentation to use kedro-mlflow with my pipeline registry, and it gives me the following error when running `kedro mlflow ui`. Any inputs are helpful. Thanks!
```
/opt/anaconda3/envs/frontline/lib/python3.8/site-packages/kedro_mlflow/framework/cli/cli.py:161 in ui

   158     ) as session:
   159
   160         context = session.load_context()
 ❱ 161         host = host or context.mlflow.ui.host
   162         port = port or context.mlflow.ui.port
   163
   164         if context.mlflow.server.mlflow_tracking_uri.startswith("http"):

AttributeError: 'KedroContext' object has no attribute 'mlflow'
```

Python: 3.8.16, Kedro: 0.18.6, kedro-mlflow: 0.11.8
• **noam** (05/21/2023, 2:02 PM)
Hi Kedro community! My team and I are trying to create an optimal setup for running experiments in parallel. Concerningly, it appears that if we change the contents of a parameters file (i.e. `conf/local/parameters.yml`) during a run, the results of the run may be affected. For example, let's say I set `hyper_tune: False` in `parameters.yml` and run `kedro run` in the terminal. If I change `parameters.yml` to `hyper_tune: True` (for example, while setting up the parameters for my next experiment) before the "training" node begins executing, it appears that Kedro will then read `hyper_tune: True`. In this example, that would mean Kedro executes hyperparameter tuning despite being instructed not to at the beginning of the run. Am I missing something? Is the answer as simple as passing all parameters to the pipeline once as a whole (i.e. using a `before_pipeline_run` hook) rather than to each node?
• **fmfreeze** (05/22/2023, 12:51 PM)
I hope this is a simple question and I am just missing a basic configuration. When I write a simple pipeline like:
```python
from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(func=do_stuff, inputs=[], outputs='MyMemDS'),
        node(func=do_more_stuff, inputs=['MyMemDS'], outputs='SecondMemDS')
    ])
```
I thought my `conf/base/catalog.yml` needs the entries:
```yaml
MyMemDS:
  type: MemoryDataSet
SecondMemDS:
  type: MemoryDataSet
```
But when I run the pipeline (which works, also with kedro-viz) it does not use the `catalog.yml` entries at all. The output of my first node is an empty `{}` dictionary, and if I rename or delete the entries in `catalog.yml` it "works" like before: the first node still returns an empty dictionary. Do I need to register the catalog anywhere? I simply want to access the object returned by my `do_stuff()` function. What am I missing?
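Two things seem to be going on, stated here as assumptions: datasets not declared in `catalog.yml` default to `MemoryDataSet` anyway, so those entries change nothing; and the empty `{}` is simply whatever `do_stuff()` returns. Memory data is also released once the run finishes. A sketch for grabbing the output in code; with no catalog entry for the terminal dataset, it comes back from `session.run()`:

```python
# Sketch: run the pipeline in code and capture unpersisted terminal outputs
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    outputs = session.run()  # e.g. {"SecondMemDS": ...}
```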
• **Juan Luis** (05/23/2023, 6:00 AM)
hi folks, I'm noticing a difference between `ConfigLoader` and `OmegaConfigLoader`. While following the standalone-datacatalog starter, I notice that `ConfigLoader("conf").get("catalog.yml")` works, but `OmegaConfigLoader("conf").get("catalog.yml")` returns `None`. On the other hand, `OmegaConfigLoader("conf").get("catalog")` seems to work (notice no `.yml` extension), and `OmegaConfigLoader("conf")["catalog"]` works consistently for both config loaders. Is this intentional? Compare for example https://github.com/kedro-org/kedro/blob/41f03d9/tests/config/test_config.py#L116 with https://github.com/kedro-org/kedro/blob/41f03d9/tests/config/test_omegaconf_config.py#L149
• **Richard Bownes** (05/23/2023, 8:05 AM)
If I have an established project and I want to integrate MLflow into it, what's the most straightforward pathway?
• **Afaque Ahmad** (05/23/2023, 9:12 AM)
Hi Kedro folks, I have two hooks, `PipelineHooks` and `MLFlowHooks`, and both implement `before_pipeline_run`. I need the `before_pipeline_run` defined in `PipelineHooks` to run before the one in `MLFlowHooks`. I've specified this order below in `settings.py`, but it doesn't work:

```python
HOOKS = (
    PipelineHooks(),
    MLFlowHooks(),
)
```

Is there any way to enforce an order of execution?
    👀 1
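A sketch of what should help: pluggy, which Kedro uses under the hood, calls hook implementations in LIFO order of registration, so reversing the tuple makes `PipelineHooks.before_pipeline_run` fire first. Worth verifying on your Kedro version.

```python
# settings.py (sketch) -- LIFO: last registered runs first
HOOKS = (
    MLFlowHooks(),    # registered first -> runs last
    PipelineHooks(),  # registered last  -> runs first
)
```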
• **Debanjan Banerjee** (05/23/2023, 2:43 PM)
Hi Team, I'm on Kedro 0.18.8. I see that on a fresh installation it gives me this; any solutions?
• **Guilherme Parreira** (05/24/2023, 12:15 PM)
Hi guys! I am trying to load Kedro functionality in a Jupyter notebook via `%load_ext kedro.ipython`, but it gives me the following error:

```
RuntimeError: Missing required keys ['project_version'] from 'pyproject.toml'.
```

Kedro was working fine for the last two weeks and I didn't update `kedro`. In `requirements` I have `kedro~=0.18.6`; in `Pipfile.lock` I have `kedro==0.18.4`. In `pyproject.toml` I have:

```toml
[tool.kedro]
package_name = "cashflow_ml"
project_name = "cashflow-ml"
kedro_init_version = "0.18.6"

[tool.isort]
profile = "black"

[tool.pytest.ini_options]
addopts = """
--cov-report term-missing \
--cov src/cashflow_ml -ra"""

[tool.coverage.report]
fail_under = 0
show_missing = true
exclude_lines = ["pragma: no cover", "raise NotImplementedError"]
```

I tried changing `kedro_init_version` to `0.18.4`, but I still get the same error. Does anyone have a clue?
• **Guilherme Parreira** (05/24/2023, 1:00 PM)
It worked, bro! I don't know why it happened. I installed the `prophet` package last night, but it shouldn't modify my `pyproject.toml`. Thank you so much, you saved my day.
    🙌🏼 1
• **Andreas_Kokolantonakis** (05/25/2023, 1:23 PM)
Hello everyone! Does anyone have an example of using `partitionBy` when saving Parquet files via the Kedro catalog? Thank you very much in advance.
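A sketch with Spark, assuming a Spark DataFrame: `SparkDataSet` passes `save_args` to `DataFrameWriter.save`, which accepts `partitionBy`. The path and columns are hypothetical; in `catalog.yml` the same options would sit under `save_args`.

```python
# Sketch: write partitioned parquet via save_args
from kedro.extras.datasets.spark import SparkDataSet

dataset = SparkDataSet(
    filepath="data/07_model_output/sales",
    file_format="parquet",
    save_args={"mode": "overwrite", "partitionBy": ["year", "month"]},
)
```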
• **Hugo Evers** (05/25/2023, 2:10 PM)
Would it be a good idea to add a "concatenated pandas pipeline" option to a pipeline, which lets you run it through a pandas `.pipe` chain instead of the traditional pipeline construct with separate in-memory I/O, when for example a run flag is supplied? My use case is as follows: there is a long text-preprocessing pipeline we use, which looks kind of like this:
```python
return pipeline(
        [
            node(
                func=rename_columns,
                inputs="pretraining_set",
                outputs="renamed_df",
                name="rename_columns",
            ),
            node(
                func=truncate_description,
                inputs="renamed_df",
                outputs="truncated_df",
                name="truncate_description",
            ),
            node(
                func=drop_duplicates,
                inputs="truncated_df",
                outputs="deduped_df",
                name="drop_duplicates",
            ),
            node(
                func=pad_zeros,
                inputs="deduped_df",
                outputs="padded_df",
                name="pad_zeros",
            ),
            node(
                func=filter_0000,
                inputs="padded_df",
                outputs="filtered_df",
                name="filter_0000",
            ),
            node(
                func=clean_description,
                inputs="filtered_df",
                outputs="cleaned_df",
                name="clean_description",
            ),
            node(
                func=concat_title_description,
                inputs="cleaned_df",
                outputs="concatenated_df",
                name="concat_title_description",
            ),
        ]
    )
```
However, on AWS Batch these nodes run in separate containers. I currently use the cloudpickle dataset to facilitate this, but it is actually not necessary when I use something like Dask. I could instead run this pipeline like this:
```python
return (
        df.pipe(rename_columns)
        .pipe(truncate_description)
        .pipe(drop_duplicates)
        .pipe(pad_zeros)
        .pipe(filter_0000)
        .pipe(clean_description)
        .pipe(concat_title_description)
    )
```
The aforementioned pipeline has tags and filtering in a modular pipeline depending on pre-training, tuning, which language, etc. The flattened pipeline would be nice to use in a case like `kedro run runner=... concat_pipeline=true`, or something like that. Is this idea worth exploring? It is really not essential and I can work around it, but the ability to have pipelines that can "fold" like this is quite appealing.
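For what it's worth, a sketch of the "folded" variant outside Kedro, assuming every node is a single-input/single-output pandas step as in the snippet above:

```python
# Sketch: compose the node functions into one pandas .pipe chain
from functools import reduce

STEPS = [
    rename_columns,
    truncate_description,
    drop_duplicates,
    pad_zeros,
    filter_0000,
    clean_description,
    concat_title_description,
]


def run_folded(df):
    # Equivalent to chaining df.pipe(step) for each step in order.
    return reduce(lambda acc, step: acc.pipe(step), STEPS, df)
```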
• **Hadeel Mustafa** (05/25/2023, 4:25 PM)
Hey everyone! Has anyone used `redshift-spark` in Kedro before? I'd appreciate the help if someone could show me an example of how this can be done, specifically the driver used for Redshift. Thanks in advance!
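A sketch of the generic JDBC route; the URL format and driver class (`com.amazon.redshift.jdbc42.Driver`) are assumptions based on the Amazon Redshift JDBC driver, and the driver jar must be on the Spark classpath (e.g. via `spark.jars`):

```python
# Sketch: read a Redshift table through Spark JDBC
from kedro.extras.datasets.spark import SparkJDBCDataSet

dataset = SparkJDBCDataSet(
    url="jdbc:redshift://my-cluster.example.com:5439/dev",  # hypothetical
    table="public.my_table",
    load_args={"properties": {"driver": "com.amazon.redshift.jdbc42.Driver"}},
)
```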
• **Higor Carmanini** (05/25/2023, 10:31 PM)
I have an issue with Kedro and Spark datasets. I am using a `PartitionedDataSet` to read many CSVs into Spark DataFrames. I just found an issue where, apparently, Spark appends the column position to the column name (as read from the header) to create the actual final name; see the example in the image. As this is sometimes done for deduplication, I investigated whether something similar was happening, and sure enough there is another dataset in this same `PartitionedDataSet` that reads another column of the same name. This could "explain" Spark's funky behavior of thinking it is a duplicate; of course, though, these are two separate DataFrames. Has anyone stumbled upon this issue before? I can't find any references online. Thank you! EDIT: Solved! It was due to Spark's default case insensitivity.
    ✅ 1
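The fix described above, as a one-liner (it could equally go into `spark.yml` as `spark.sql.caseSensitive: true`):

```python
# Sketch: make Spark treat column names as case-sensitive
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")
```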
• **Sidharth Jahagirdar** (05/25/2023, 11:06 PM)
Hey team! Can someone please share the documentation for kedro glass?
• **Rebecca Solcia** (05/26/2023, 8:13 AM)
    Good morning! Has anybody ever tried to access Databricks tables from a local Kedro project? I would need help on this topic! Thank you 🙂
    🧱 1
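One possible sketch using the `databricks-sql-connector` package to pull a table into pandas from a local machine; the hostname, HTTP path, token, and table name are all placeholders:

```python
# Sketch: query a Databricks table locally via the SQL connector
import pandas as pd
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi...",
) as conn:
    df = pd.read_sql("SELECT * FROM my_catalog.my_schema.my_table", conn)
```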
• **fmfreeze** (05/26/2023, 12:24 PM)
Hi Kedronistas. When I define an `AbstractDataSet`, kedro-viz does not display the Dataset Type and the File Path property in the details section for that dataset. How can I make them show up?
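A sketch under the assumption that kedro-viz populates the details panel from the dataset's `_describe()` output: return the filepath there in your custom dataset.

```python
# Sketch: a minimal custom dataset exposing its filepath to kedro-viz
from pathlib import Path
from typing import Any, Dict

from kedro.io import AbstractDataSet


class MyTextDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> str:
        return self._filepath.read_text()

    def _save(self, data: str) -> None:
        self._filepath.write_text(data)

    def _describe(self) -> Dict[str, Any]:
        # The details shown in kedro-viz come from this dict (assumption).
        return {"filepath": str(self._filepath)}
```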
• **fmfreeze** (05/26/2023, 12:47 PM)
And another interesting one: is it possible for a `node` to have a (dynamic) `parameter` as output? E.g. I have multiple "normal" parameters defined which serve as input to a `process_params` node. That node should, depending on the normal parameter inputs, output a single parameter which might serve as input to other nodes. Currently, by simply outputting that "parameter" value, it is by default a `MemoryDataSet`.
    ✅ 1
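For reference, a sketch of the pattern described above, which is indeed the idiomatic shape: the derived value is just a regular (memory) dataset output that downstream nodes consume like any other input. All names here are hypothetical.

```python
# Sketch: a node that derives a "parameter" for downstream nodes
from kedro.pipeline import node, pipeline


def process_params(alpha: float, beta: float) -> float:
    return alpha * beta


def train(derived_param: float):
    ...


def create_pipeline(**kwargs):
    return pipeline([
        node(process_params, ["params:alpha", "params:beta"], "derived_param"),
        node(train, ["derived_param"], None),
    ])
```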
• **Artur Dobrogowski** (05/30/2023, 8:15 AM)
Hi, has anyone found issues using `OmegaConfigLoader` from the new Kedro version? For me, when I enabled it and used env templating in a file, the config validator started raising issues for correct lists in YAML. The error looks like this:

```
ValidationError: 1 validation error for KedroMlflowConfig
tracking -> disable_tracking -> pipelines
  value is not a valid list (type=type_error.list)
```

While the config looks like this:

```yaml
tracking:
  disable_tracking:
    pipelines: []
```
• **Artur Dobrogowski** (05/30/2023, 8:24 AM)
Also a related question: any ideas how to start debugging this? I'm not very familiar with debugging in Kedro. https://docs.kedro.org/en/stable/development/debugging.html is not very helpful, since the bug does not occur in a pipeline or a node but in config loading.
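A sketch for poking at config loading outside a run: build the loader by hand and inspect what it resolves. The `mlflow` pattern is an assumption mirroring what kedro-mlflow registers; adjust `conf_source`/`env` to your project.

```python
# Sketch: reproduce config loading in isolation
from kedro.config import OmegaConfigLoader

loader = OmegaConfigLoader(
    conf_source="conf",
    env="local",
    config_patterns={"mlflow": ["mlflow*", "mlflow*/**", "**/mlflow*"]},
)
print(loader["mlflow"])  # what kedro-mlflow's validator would receive
```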
• **Florian d** (05/30/2023, 1:48 PM)
Does anyone know if there is a reason why we could not pass the context to the `before_pipeline_run` hook? In some cases it would be good to have access to the loaded config at that point.
    ✅ 1
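A sketch of the usual workaround, assuming `after_context_created` (available since Kedro 0.18.1) fires before `before_pipeline_run`: stash the context on the hook instance and reuse it.

```python
# hooks.py (sketch) -- capture the context for later hooks
from kedro.framework.hooks import hook_impl


class ContextAwareHooks:
    def __init__(self):
        self._context = None

    @hook_impl
    def after_context_created(self, context) -> None:
        self._context = context

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog) -> None:
        config_loader = self._context.config_loader  # loaded config is at hand
```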