# questions
  • charles

    04/20/2023, 12:35 PM
    Another probably 🤦‍♂️ question. Could someone help me understand why the env parameter in my catalog isn't being injected into the catalog from my local/parameters.yml file? Catalog entry:
    parsed_documents:  # Just one document for now.
      type: json.JSONDataSet
      filepath: s3://mybucket/${env}/myjson.json
    local/parameters.yml entry:
    env: "main"
    In kedro ipython, trying to load it I am getting:
    DataSetError: Failed while loading data from data set JSONDataSet(filepath=mybucket/${env}/myjson.json, protocol=s3, save_args={'indent': 2}).
    mybucket/${env}/myjson.json
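A likely cause, for reference (this is one reading, not confirmed in the thread): in Kedro 0.18, `${...}` placeholders in catalog.yml are resolved by the TemplatedConfigLoader from the globals file, not from parameters.yml. A minimal sketch, assuming TemplatedConfigLoader is enabled in settings.py with `CONFIG_LOADER_CLASS = TemplatedConfigLoader` and `CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}`:

```yaml
# conf/base/globals.yml (or conf/local/globals.yml to vary it per environment)
env: "main"

# conf/base/catalog.yml -- ${env} is now filled in from globals, not parameters
parsed_documents:
  type: json.JSONDataSet
  filepath: s3://mybucket/${env}/myjson.json
```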
  • Leo Cunha

    04/20/2023, 12:56 PM
    Hello! Is there a way I can add a flag to kedro run using the plugin framework, without having to override the whole cli.py?
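For reference, a minimal sketch of the Kedro plugin mechanism (module and command names here are illustrative, not from the thread): a click group exposed through the `kedro.project_commands` entry point is merged into the `kedro` CLI, so extra commands and flags can be added without touching the project's cli.py.

```python
# my_plugin/cli.py -- a hypothetical plugin module.
# Register it in setup.py / pyproject.toml, e.g.:
#   entry_points={"kedro.project_commands": ["my_plugin = my_plugin.cli:commands"]}
import click


@click.group(name="my_plugin")
def commands():
    """Command group discovered by Kedro's plugin framework."""


@commands.command(name="myrun")
@click.option("--my-flag", is_flag=True, help="Illustrative extra flag.")
def myrun(my_flag):
    """A thin run-style command; the body could create a KedroSession and run."""
    click.echo(f"my-flag={my_flag}")
```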
  • Merel

    04/20/2023, 3:17 PM
    Can anyone with Spark skills lend a hand and help fix the Kedro pyspark-iris starter? pyspark 3.4.0 was released on the 13th of April and has broken our pyspark-iris starter. I've written up my findings so far in an issue: https://github.com/kedro-org/kedro-starters/issues/123, but it could be that I've been approaching this all wrong, and I've now reached the point where I could really use some help figuring out what is going on 🙏
  • Beltra909

    04/21/2023, 7:06 AM
    Hello, first-time Kedro user here. I have started experimenting with my own data sources and I am facing some issues. I have a pandas DataFrame that I would like to save as a parquet file in my NetApp StorageGRID S3. Everything goes smoothly until the next node in the pipeline tries to load the file from S3. I can see the file is present in the bucket. However, I get this exception:
    DataSetError: Failed while loading data from data set
    ParquetDataSet(filepath=<my file_path>,
    load_args={'engine': pyarrow}, protocol=s3, save_args={'engine': pyarrow}).
    AioSession.__init__() got an unexpected keyword argument 'target_options'
    I have tried different versions of fsspec, s3fs, kedro and Python and I get the same issue. Here is what I am using currently: Python 3.10.10, Kedro 0.18.7, s3fs 2023.3.0, fsspec 2023.3.0, aiobotocore 2.4.2, pandas 1.5.3. pip check does not show any broken requirements. Has anyone experienced this problem before? Extensive googling didn't show any results...
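For reference (an educated guess, not a confirmed diagnosis): unexpected-keyword errors in AioSession usually mean some option is being forwarded from the catalog entry through s3fs into aiobotocore. Everything under `fs_args`/`credentials` is passed to `S3FileSystem`, so a stray key there can surface exactly like this. A typical S3 ParquetDataSet entry for a custom endpoint such as StorageGRID looks roughly like this (bucket, endpoint and keys are placeholders):

```yaml
# conf/base/catalog.yml
my_table:
  type: pandas.ParquetDataSet
  filepath: s3://my-bucket/path/my_table.parquet
  credentials: my_s3_creds
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow

# conf/local/credentials.yml
my_s3_creds:
  key: MY_ACCESS_KEY
  secret: MY_SECRET_KEY
  client_kwargs:
    endpoint_url: https://storagegrid.example.com
```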
  • Si Yan

    04/21/2023, 8:11 PM
    Hi all, I am new to Kedro. I need to load data from Snowflake in Kedro. I searched some previous posts and found that a Snowflake dataset is now available in Kedro 0.18.7, but I can't find any documentation showing how to use it. Can I write a SQL query like with SQLQueryDataSet? How do I define the credentials? Could someone give an example? Thanks!
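One way to query Snowflake that does not depend on a dedicated Snowflake dataset (a sketch, assuming the snowflake-sqlalchemy package is installed; account, names and credentials below are placeholders) is `pandas.SQLQueryDataSet` with a Snowflake SQLAlchemy connection string supplied through credentials:

```yaml
# conf/base/catalog.yml
snowflake_orders:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM MY_DB.MY_SCHEMA.ORDERS
  credentials: snowflake_creds

# conf/local/credentials.yml
snowflake_creds:
  con: snowflake://MY_USER:MY_PASSWORD@MY_ACCOUNT/MY_DB/MY_SCHEMA?warehouse=MY_WH&role=MY_ROLE
```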
  • Rob

    04/22/2023, 6:09 PM
    Hi everyone, I'm trying to use Jinja2 syntax on Kedro 0.18.4 to dynamically define the variable storage_type; this is how my globals YAML looks:
    storage_mode: "local"

    storage:
      local: "data/"
      gcp: "gs://my-bucket/data/"

    data:
      {% if storage_mode == 'local' %}
      storage_type: ${storage.local}
      {% elif storage_mode == 'gcp' %}
      storage_type: ${storage.gcp}
      {% endif %}
      player_tags: ${storage_type}/01_player_tags
      raw_battlelogs: ${storage_type}/02_raw_battlelogs
      raw_metadata: ${storage_type}/03_raw_metadata
      enriched_data: ${storage_type}/04_enriched_data
      curated_data: ${storage_type}/05_curated_data
      viz_data: ${storage_type}/06_viz_data
      feature_store: ${storage_type}/07_feature_store
      model_registry: ${storage_type}/08_model_registry
    I'm not familiar with this type of syntax, and I'm getting a ScannerError.
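For context: the globals file read by TemplatedConfigLoader is parsed as plain YAML, so raw Jinja2 blocks like `{% if %}` make it invalid YAML, which is what a ScannerError points at. One hedged workaround that avoids Jinja2 altogether is to move the switch into per-environment globals files (the environment name below is illustrative):

```yaml
# conf/base/globals.yml -- default: local storage
storage_type: "data/"

# conf/gcp/globals.yml -- selected with `kedro run --env gcp`
storage_type: "gs://my-bucket/data/"
```

The `${storage_type}` references in the data section then resolve without any conditional logic.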
  • Jason

    04/24/2023, 1:33 PM
    Hi everyone, I have a Kedro pipeline and want to run it on multiple datasets (the raw input data are different but follow the same structure; I also want to keep the outputs in the same folder structure). What is the best practice in Kedro for dealing with this kind of problem?
    dataset1
    |--01_raw
    |--02_intermediate
    |--03_primary
    |--...
    dataset2
    |--01_raw
    |--02_intermediate
    |--03_primary
    |--...
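One common pattern for this (a sketch of Kedro's modular-pipelines feature, not the only option) is to instantiate the same pipeline once per dataset with a namespace, e.g. `pipeline(base_pipeline, namespace="dataset1")`, so every dataset name is prefixed; the catalog then maps each prefixed name into the matching folder tree (entry names and paths below are illustrative):

```yaml
# conf/base/catalog.yml
dataset1.raw_data:
  type: pandas.CSVDataSet
  filepath: data/dataset1/01_raw/data.csv

dataset1.primary_table:
  type: pandas.ParquetDataSet
  filepath: data/dataset1/03_primary/table.parquet

dataset2.raw_data:
  type: pandas.CSVDataSet
  filepath: data/dataset2/01_raw/data.csv

dataset2.primary_table:
  type: pandas.ParquetDataSet
  filepath: data/dataset2/03_primary/table.parquet
```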
  • Giulio Morina

    04/25/2023, 10:51 AM
    Hello everyone! Is there a line magic or something similar to load a kedro-viz visualisation inside a jupyter notebook?
  • Balazs Konig

    04/25/2023, 4:49 PM
    Hi Team 🦜 I have a quite complex Kedro pipeline that spends several minutes getting through config loading when it starts to run. In itself this is fine, but I'm struggling to spin up the Kedro kernel in a Jupyter notebook or JupyterLab, because it times out. Is there a way to increase the timeout in the CLI or a config file I missed? Also, is my assumption wrong that this could cause timeout errors? (I'm guessing that because other pipelines with less config-loader lead time can spin up their kernels in an otherwise identical environment.) Thanks!
  • Claire BAUDIER

    04/26/2023, 8:47 AM
    Hello everyone, I have a question concerning parameters. In a project I'm working on we are using the Kedro framework. We are developing several pipelines and would like to create different parameter files for simplicity: as we are using a lot of different parameters for different pipelines, the parameters file can quickly become messy. I was wondering if there was a way to keep using Kedro's parameter system for calling parameters with "params:", but using a file different from the default parameters.yml file. Here is what I have in mind, based on one of the documentation examples:
    from kedro.config import ConfigLoader
    from kedro.framework.project import settings

    conf_path = str(project_path / settings.CONF_SOURCE)
    conf_loader = ConfigLoader(conf_source=conf_path, env="local")

    params = conf_loader.get("other_parameters_file.yml")
    
    # in node definition
    def increase_volume(volume, step):
      return volume + step
    
    # in pipeline definition
    node(
      func=increase_volume,
      inputs=["input_volume", "params:step_size"],
      outputs="output_volume",
    )
    And the parameter step_size would be in other_parameters_file.yml. Is this feasible with Kedro, and if so, how should it be done? Thanks a lot for your help!
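For reference: in Kedro 0.18 no custom ConfigLoader code is needed for this; the default loader already merges every file matching `parameters*.yml` (including files under a `conf/base/parameters/` folder) into the single params namespace, so `params:step_size` keeps working unchanged. A sketch (file names are illustrative):

```yaml
# conf/base/parameters/data_processing.yml
step_size: 1

# conf/base/parameters/modelling.yml
learning_rate: 0.01
```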
  • Iñigo Hidalgo

    04/26/2023, 3:16 PM
    Hi 🙂 I am running a simple pipeline which has the following config in a YAML file:
    simple_conn_pt_model_filter_predict:
        date_column: date
        window_length: 0d
        gap: 0d
        check_groups: null
        continue_if_missing: true
    I am trying to edit the parameter gap through kedro run --pipeline ... --params=..., but it seems I need to overwrite the whole dictionary.
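For context (check your Kedro version's docs, as this changed across releases): `kedro run --params` accepts dot-separated keys such as `--params "simple_conn_pt_model_filter_predict.gap:7d"`, which should update just that nested key. Conceptually the override is a recursive merge rather than a replacement, something like this illustrative sketch:

```python
def nested_update(base: dict, overrides: dict) -> dict:
    """Recursively merge overrides into base, keeping untouched keys."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = nested_update(merged[key], value)
        else:
            merged[key] = value
    return merged


params = {
    "simple_conn_pt_model_filter_predict": {
        "date_column": "date",
        "window_length": "0d",
        "gap": "0d",
        "check_groups": None,
        "continue_if_missing": True,
    }
}

# "simple_conn_pt_model_filter_predict.gap:7d" parses into this override:
override = {"simple_conn_pt_model_filter_predict": {"gap": "7d"}}

merged = nested_update(params, override)
print(merged["simple_conn_pt_model_filter_predict"]["gap"])          # 7d
print(merged["simple_conn_pt_model_filter_predict"]["date_column"])  # date
```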
  • Juan Diego

    04/26/2023, 3:42 PM
    Hi team! Any suggestions on how to extract the Kedro version that was used to build a wheel via kedro package? It would be useful for raising an error when the version doesn't meet the one expected by a launcher.
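One hedged approach at runtime (a sketch; it assumes the launcher can import the packaged environment): read the installed kedro distribution's version via `importlib.metadata` and compare it to the expected one. `parse_version` below is a deliberately naive "X.Y.Z" parser, not a full PEP 440 implementation:

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(v: str) -> tuple:
    """Naive parser for 'X.Y.Z'-style versions (no pre-release handling)."""
    return tuple(int(part) for part in v.split("."))


def check_kedro_version(expected: str) -> None:
    """Raise if the installed kedro version differs from the expected one."""
    try:
        installed = version("kedro")
    except PackageNotFoundError:
        raise RuntimeError("kedro is not installed")
    if parse_version(installed) != parse_version(expected):
        raise RuntimeError(f"Expected kedro {expected}, found {installed}")
```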
  • Agnaldo Luiz

    04/27/2023, 12:04 PM
    Hi team, quick question: how do I use parameters from my local/credentials.yml file in my base/catalog.yml file? For example,
    #credentials.yml
    win_user: 'user01'
    #catalog.yml
    data:
      type: pandas.ExcelDataSet
      filepath: C:\Users\${win_user}\data.xlsx
  • Rishabh Kasat

    04/27/2023, 2:08 PM
    Hi, when I try to run the kedro viz command I get the error below. Any idea how to resolve it? There is no pyspark_llap module on pip either:
    kedro.framework.cli.utils.KedroCliError: No module named 'pyspark_llap'
    Run with --verbose to see the full exception
    Error: No module named 'pyspark_llap'
  • Season Yang

    04/27/2023, 4:03 PM
    Hi team, we are encountering a package dependency conflict between kedro and kedro-starters for ipython and would love to get help from the team. Under the same release 0.18.7 for both kedro and kedro-starters with Python 3.8, kedro provides ipython~=8.1 (https://github.com/kedro-org/kedro/blob/main/test_requirements.txt#L22) while kedro-starters' pyspark starter restricts ipython>=7.31.1, <8.0 (https://github.com/kedro-org/kedro-starters/blob/main/pyspark/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/requirements.txt#L3). Would really appreciate any help on this! Thank you in advance!
  • Kelsey Sorrels

    04/27/2023, 10:56 PM
    Hi, I've been using the Kedro+Grafana example hook, but I want to extend it to capture not only node timings but also (in certain cases) operations/sec. This of course depends on forming a notion of how many "operations" occurred during the execution of a node. I can think of a bunch of wrong ways to approach this, but I'm interested in hearing folks' thoughts on a "right" way to capture operation counts inside nodes so they can be used by the hook after nodes are executed.
  • Jo Stichbury

    04/28/2023, 4:11 PM
    ❓ What data science/ML articles have you been reading recently? Have any blog posts, tutorials or newsletters "brought you joy"? 👀 Have you watched any useful training videos or listened to any podcasts 🎧 about analytics that you want to share? I'm putting together a regular roundup of what the Kedro community has found online (not just on the Kedro blog). I'd love to share your greatest hits. Feel free to share here or DM me. Thank you 🙏
  • Darshan

    04/29/2023, 5:55 AM
    I am trying to deploy the Kedro package on AWS following the steps provided in the documentation, but when I run the Step Function it fails with an error (attached for reference). The Kedro package was developed on 0.18.7 and the Python environment is 3.10. Can you suggest what the resolution for this error could be?
  • Rob

    04/29/2023, 10:01 PM
    Hello everyone, happy weekend! Does anyone have an example of how to set GCP bucket credentials from the catalog.yml for a parquet of type spark.SparkDataSet? I'm trying to use the .json key file from Google Cloud, but I don't know how to define it in the catalog. Thanks in advance 🙂
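For reference (hedged, since this depends on the cluster setup): spark.SparkDataSet reads gs:// paths through the Hadoop GCS connector, so the service-account .json is usually configured on the SparkSession rather than in credentials.yml. With the pyspark starter's spark.yml hook pattern, that looks roughly like this (the key-file path and bucket are placeholders):

```yaml
# conf/base/spark.yml
spark.hadoop.google.cloud.auth.service.account.enable: "true"
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/service-account.json

# conf/base/catalog.yml
my_parquet:
  type: spark.SparkDataSet
  filepath: gs://my-bucket/data/my_parquet
  file_format: parquet
```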
  • Darshan

    04/30/2023, 6:51 AM
    companies:
      type: pandas.CSVDataSet
      filepath: s3://<your-bucket>/companies.csv
    This is a sample provided by Kedro with the aws step function, might be useful.
  • Sebastian Cardona Lozano

    05/01/2023, 5:03 PM
    Hi all. Is there a way to write the log files externally, to a Google Cloud Storage bucket?
  • Vandana Malik

    05/02/2023, 9:34 AM
    Hi Team, I am using Kedro version 0.17.3. I have created custom hooks; I am able to run the pipeline, but the hooks are not running for me. settings.py:
    HOOKS = (ProjectHooks(), DataValidationHook())
    CONTEXT_CLASS = ProjectContext
    context.py:
    class ProjectContext(KedroContext):
        """Project context.
    
        Users can override the remaining methods from the parent class here,
        or create new ones (e.g. as required by plugins)
        """
    
        hooks = ProjectHooks()
    
        def __init__(
            self,
            package_name: str,
            project_path: Union[Path, str],
            env: str = None,
            extra_params: Dict[str, Any] = None,
        ):
            """Init class."""
            super().__init__(package_name, project_path, env, extra_params)
            self.hooks = DataValidationHook()
            self._spark_session = None
            self._experiment_tracker = None
            self._setup_env_variables()
            self._init_common_env_vars()
            self.init_spark_session()
    Can you guide me on where to look or what to modify in order to work out why the hooks are not running?
  • Jordan

    05/02/2023, 11:16 AM
    I am facing an issue where a MetricsDataSet successfully loads from the catalog in a notebook, where the catalog is created with %load_ext kedro.ipython. However, in a standalone file I am creating the catalog as follows:
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    
    project_path = Path(".").resolve()
    metadata = bootstrap_project(project_path)
    with KedroSession.create(metadata.package_name, project_path) as session:
        context = session.load_context()
        catalog = context.catalog
    
    data = catalog.load("my_metrics")
    I get the following error:
    DataSetError: Loading not supported for 'MetricsDataSet'
    If this is true, why does it load in a notebook?
  • Adrien

    05/02/2023, 11:34 AM
    Hello! I got this error when deploying a pipeline via Vertex AI:
    com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_cpus, cause=null; Failed to create custom job for the task. Task: Project number: 496232377396, Job id: 1189445081858310144, Task id: 6444159035313750016, Task name: preprocess-shuttles-node, Task state: DRIVER_SUCCEEDED, Execution name: projects/496232377396/locations/europe-west1/metadataStores/default/executions/14295685814278275726; Failed to create external task or refresh its state. Task: Project number: 496232377396, Job id: 1189445081858310144, Task id: 6444159035313750016, Task name: preprocess-shuttles-node, Task state: DRIVER_SUCCEEDED, Execution name: projects/496232377396/locations/europe-west1/metadataStores/default/executions/14295685814278275726; Failed to handle the pipeline task. Task: Project number: 496232377396, Job id: 1189445081858310144, Task id: 6444159035313750016, Task name: preprocess-shuttles-node, Task state: DRIVER_SUCCEEDED, Execution name: projects/496232377396/locations/europe-west1/metadataStores/default/executions/14295685814278275726
    I checked the quotas specified, but that's not the problem: the limit is set to 1 and I specify 0.2 CPUs for each node (kedro-vertexai starter guide). I think it comes from GCP, but I don't know which configuration to update. Has anyone faced the same bug or have an explanation? I've been on this issue for days and I can't find the solution...
  • Thaiza

    05/02/2023, 11:54 AM
    Guys, have you ever seen an error like this when running a specific pipeline in Kedro? I just did a normal kedro run --pipeline SA and this error is reproduced. I don't see any significant difference between this pipeline and the others that run normally... Any help is highly appreciated.
  • Afaque Ahmad

    05/02/2023, 11:59 AM
    Hi Kedro folks, I'm migrating from Kedro v0.16.x to 0.18.7. Is there a checklist of steps that I can follow for a smooth migration?
  • fmfreeze

    05/02/2023, 5:22 PM
    Hi kedronistas :) I have a question about customization: in my company we have our own cookiecutter template with its own folder structure and naming conventions, and I am struggling to integrate Kedro capabilities into our existing template. E.g. we don't follow the "src" naming convention but use "<pkg_name>". How can I configure Kedro so it knows about that and looks for e.g. the kedro_cli.py in there? So to wrap up: is it possible, and if yes, what is best practice, to configure Kedro to integrate well into an existing repo structure without losing Kedro functionality?
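One relevant knob, for reference (hedged; verify against your Kedro version's docs): the `[tool.kedro]` section of pyproject.toml accepts a `source_dir` key, so the package can live somewhere other than src/ (values below are illustrative):

```toml
# pyproject.toml at the repo root
[tool.kedro]
package_name = "my_pkg"
project_name = "My Project"
kedro_init_version = "0.18.7"
source_dir = "my_pkg_root"  # instead of the default "src"
```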
  • Flavien

    05/03/2023, 10:26 AM
    Hi fellows, I am running a kedro project on Databricks (and have good hope of convincing my team to go for kedro). The documentation is very well written, thanks for that. Scrolling through the messages in Slack, I did not find a way to directly use the spark object, the SparkSession provided directly in Databricks notebooks. Is there any way to do so?
  • Vandana Malik

    05/03/2023, 10:37 AM
    kedro run is able to run the hooks, but when I trigger the pipeline through the API it runs the nodes and not the hooks. Using Kedro version 0.17.7. __main__.py code:
    import os
    
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.runner import SequentialRunner
    from hooks import ControlTableHooks
    
    if __name__ == "__main__":
        bootstrap_project(os.path.abspath(os.environ.get("PROJECT_PATH")))
        os.chdir(os.environ.get("PROJECT_PATH"))
        with KedroSession.create(env=os.environ.get("kedro_environment")) as session:
            runner = SequentialRunner()
            context = session.load_context()
            pipeline = context.pipelines[os.environ.get("pipeline_name")]
            catalog = context.catalog
            runner.run(pipeline, catalog)
            result_dict = {"message": "Success"}
    any help
  • Pavan Naidu

    05/03/2023, 10:10 PM
    kedro gurus: has anyone encountered this Python interpreter error? I had to re-open VSCode in the project folder, sigh.