# questions
  • Flavien

    07/31/2023, 3:42 PM
    Hi fellows, I followed the documentation for packaging Iris on Databricks and it works really well 👍. I wanted to go a step further, using ManagedTableDataset (which works great too) and running different independent pipelines defined in the same project, but I did not manage to do so. I modified databricks_run.py to account for a --pipeline option, but I think the problem is that packaging the project does not take into account pipelines created through kedro pipeline create, if I am not mistaken (but I probably am). Would you point me towards my mistake? Thanks!
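    (A minimal sketch of what such a --pipeline option can look like in the entry-point script, assuming the standard KedroSession API for a packaged Kedro 0.18.x project; the argument names, package name and paths below are illustrative, not taken from the original databricks_run.py.)
    # databricks_run.py (sketch): select a registered pipeline by name at run time.
    import argparse

    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--pipeline", default="__default__")
        parser.add_argument("--env", default=None)
        parser.add_argument("--conf-source", default=None)
        args = parser.parse_args()

        configure_project("my_package")  # placeholder: the packaged project's package name
        with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
            # Pipelines created with `kedro pipeline create` are available here,
            # provided register_pipelines() returns them.
            session.run(pipeline_name=args.pipeline)


    if __name__ == "__main__":
        main()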
  • Jon Cohen

    07/31/2023, 6:23 PM
    I'm setting up some better monitoring infrastructure for our data pipeline. I've only done observability for web servers before and don't know much about the monitoring ecosystem for data pipelines. Are there any services or systems that people here like using for this purpose?
  • Emilio Gagliardi

    07/31/2023, 8:32 PM
    Has anyone incorporated an LLM pipeline in a kedro project yet? I'd like to try using OpenAI to perform some processing on a collection of JSON documents, and I'd love to see a working example or hear about any lessons learned. Thanks kindly!
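    (Not a full working example, but a minimal sketch of a node that pushes JSON documents through OpenAI, assuming a PartitionedDataSet of JSON files and the pre-1.0 openai package; the model name, prompt and function name are placeholders.)
    # nodes.py (sketch): run each JSON document in a PartitionedDataSet through an LLM.
    from typing import Any, Callable, Dict

    import openai


    def summarise_documents(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, str]:
        # A PartitionedDataSet loads as a mapping of partition id -> callable
        # returning the parsed JSON document.
        summaries = {}
        for name, load in partitions.items():
            document = load()
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",  # placeholder model
                messages=[{"role": "user", "content": f"Summarise this document: {document}"}],
            )
            summaries[name] = response["choices"][0]["message"]["content"]
        return summaries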
  • meharji arumilli

    08/01/2023, 8:46 AM
    Hi, is anyone here running a kedro project in Apache Airflow? I have a question regarding logging. The DAG runs in Airflow; however, the logs we see in the console when a kedro project is run locally are not visible in the Airflow UI. The UI shows only
    *** Found local files:
    ***   * /opt/airflow/logs/dag_id=test-fi/run_id=scheduled__2023-07-02T08:24:20.451204+00:00/task_id=preprocess/attempt=1.log
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [queued]>
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [queued]>
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1308} INFO - Starting attempt 1 of 2
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1327} INFO - Executing <Task(KedroOperator): preprocess> on 2023-07-02 08:24:20.451204+00:00
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:57} INFO - Started process 114 to run task
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:84} INFO - Running: ['***', 'tasks', 'run', 'test-fi', 'preprocess', 'scheduled__2023-07-02T08:24:20.451204+00:00', '--job-id', '486', '--raw', '--subdir', 'DAGS_FOLDER/test_fi_dag.py', '--cfg-path', '/tmp/tmpzsz4yrlp']
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:85} INFO - Job 486: Subtask preprocess
    [2023-08-01, 08:24:21 UTC] {task_command.py:410} INFO - Running <TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [running]> on host 829fb522c236
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='***' AIRFLOW_CTX_DAG_ID='test-fi' AIRFLOW_CTX_TASK_ID='preprocess-rre' AIRFLOW_CTX_EXECUTION_DATE='2023-07-02T08:24:20.451204+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2023-07-02T08:24:20.451204+00:00'
    [2023-08-01, 08:24:21 UTC] {test_fi_dag.py:61} INFO - Executing task preprocess, using model version: 20230801
    [2023-08-01, 08:37:16 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
    Can anyone suggest a configuration that would show the complete process log in the Airflow UI? Thanks!!
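    (One possible direction, assuming a Kedro 0.18-style conf/logging.yml: route Kedro's records to a plain StreamHandler on stdout so the Airflow task runner can capture them. Handler and logger names mirror Kedro's default logging config; levels are illustrative.)
    version: 1
    disable_existing_loggers: False
    formatters:
      simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    handlers:
      console:
        class: logging.StreamHandler
        level: INFO
        formatter: simple
        stream: ext://sys.stdout
    loggers:
      kedro:
        level: INFO
    root:
      handlers: [console]
      level: INFO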
  • Jordan Barlow

    08/01/2023, 9:33 AM
    Hi, a question regarding SQLQueryDataSet: can I point the catalog entry to a .sql file?
    shuttle_id_dataset:
      type: pandas.SQLQueryDataSet
      sql: data/path/to/query.sql
      credentials: db_credentials
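    (For reference, newer releases of pandas.SQLQueryDataSet also accept a filepath argument as an alternative to sql, which, if I recall correctly, is meant for exactly this case; names are reused from the entry above.)
    shuttle_id_dataset:
      type: pandas.SQLQueryDataSet
      filepath: data/path/to/query.sql
      credentials: db_credentials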
  • Elena Mironova

    08/01/2023, 1:24 PM
    Hi team, after yesterday's release of kedro-datasets==1.5.0, our CI started failing during system tests which do a kedro run for a pipeline with Spark (see the screenshot). As far as I can see, SparkDataSet is still defined with the same name as before. When we used kedro-datasets==1.4.2 the same tests were running smoothly. I also couldn't find anything specific in the release notes. Do we have to update our code (maybe some import statements, or how it is specified within the requirements)?
  • Erwin

    08/01/2023, 7:42 PM
    Hi! Anyone using pyspark + OmegaConfigLoader? I have an issue: I cannot even do a kedro run, since _resolve_credentials fails (I don't have any credentials in my project): AttributeError: 'str' object has no attribute 'items'
  • meharji arumilli

    08/02/2023, 9:12 AM
    Hi, I have my config loader as below. It mainly assigns the model_version variable to self.params; the model_version is generated using a timestamp.
    class MyTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env, runtime_params):
            os.environ["model_version"] = datetime.now().strftime('%Y%m%d-%H%M%S')
            self.params = os.environ
            super().__init__(conf_source=conf_source, env=env, runtime_params=runtime_params, globals_dict=self.params)

    CONFIG_LOADER_CLASS = MyTemplatedConfigLoader
    This generates a unique model_version when the project is run with kedro. The model_version is used in the file paths in the catalog to save the outputs from different nodes. However, when this kedro project is packaged and run in Airflow, each node generates a new model_version, which causes the subsequent nodes to fail, as they expect the output (file path with model_version) from the previous node as input. Can anyone working with kedro and Airflow offer a hack to keep the model_version unique across all nodes/tasks in Airflow?
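    (One possible hack, sketched under the assumption that each node runs as its own Airflow task: derive model_version from the Airflow run rather than from datetime.now(), e.g. from the AIRFLOW_CTX_EXECUTION_DATE variable Airflow exports to each task, visible in the log excerpt earlier on this page, so every task of the same DAG run sees the same value.)
    import os
    from datetime import datetime

    from kedro.config import TemplatedConfigLoader


    class MyTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env, runtime_params):
            # Same stamp for every task of a DAG run; fresh timestamp for local runs.
            run_stamp = os.environ.get("AIRFLOW_CTX_EXECUTION_DATE")
            if run_stamp is None:
                run_stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            os.environ["model_version"] = run_stamp.replace(":", "-")
            self.params = os.environ
            super().__init__(
                conf_source=conf_source,
                env=env,
                runtime_params=runtime_params,
                globals_dict=self.params,
            )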
  • Fazil B. Topal

    08/02/2023, 1:14 PM
    Hey everyone, quick question regarding using the data catalog with the Python API. Following this documentation, I have the following questions: • Should catalog.py be in the conf/ folder (same as where catalog.yaml is)? • Does that work the same with nodes when I do kedro run, or do I have to explicitly use this Python object and load the data on my own? • Is it possible to define some sections in the YAML file and other parts in Python? I know I can do something in the hooks, but I wanted to check if there is a way where this catalog variable would be accessible by the user. Thanks in advance! 🙂
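    (For context, a minimal sketch of the DataCatalog Python API side of this, with a made-up dataset name and file path; it lives in ordinary Python code rather than in conf/.)
    from kedro.io import DataCatalog
    from kedro_datasets.pandas import CSVDataSet

    catalog = DataCatalog({"reviews": CSVDataSet(filepath="data/01_raw/reviews.csv")})

    df = catalog.load("reviews")   # load by dataset name
    catalog.save("reviews", df)    # save by dataset name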
  • Trevor

    08/02/2023, 5:15 PM
    Is there a way to dump the parameters to a file or access the parameters of the current run conveniently? If I run my Kedro pipeline and override parameter xyz to be 5 instead of 3 for that run only, is it possible to dump the parameters.yml with the overwritten parameter xyz?
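    (A hedged sketch of one way to do this with a hook: after_context_created receives the context, whose params property already includes any run-time overrides, so it can be dumped to a file. The output path is arbitrary.)
    import yaml
    from kedro.framework.hooks import hook_impl


    class DumpParamsHook:
        @hook_impl
        def after_context_created(self, context):
            # context.params already includes any --params run-time overrides.
            with open("resolved_parameters.yml", "w") as f:
                yaml.safe_dump(dict(context.params), f)
    (Registered via HOOKS = (DumpParamsHook(),) in settings.py.)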
  • Trevor

    08/02/2023, 5:35 PM
    Sorry, thought I was putting those previous messages in a single thread. Fixed. New question, new thread: is there a way to set a parameter in a node? If my first node calls a function date() that simply gets the current date, can I assign that date to a parameter?
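    (A common workaround, sketched with made-up names: parameters are fixed for the duration of a run, but a node can return the value as a dataset that downstream nodes consume.)
    from datetime import date

    from kedro.pipeline import node, pipeline


    def current_date() -> str:
        return date.today().isoformat()


    def train_model(model_input, run_date: str):
        ...  # use run_date like any other input


    pipe = pipeline(
        [
            node(current_date, inputs=None, outputs="run_date"),
            node(train_model, inputs=["model_input", "run_date"], outputs="model"),
        ]
    )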
  • Fazil B. Topal

    08/03/2023, 4:08 PM
    Hey all, I have a slight problem with multiple catalog files. I'm using OmegaConfigLoader and I have the following structure: the catalog in bigquery gets recognized, but the one in s3 does not. Is this the expected behavior? Thanks in advance.
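    (For reference, which files OmegaConfigLoader treats as catalog configuration is driven by its config_patterns; one thing to check is widening them in settings.py. A sketch with illustrative patterns follows.)
    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "config_patterns": {
            "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        }
    }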
  • Ankit Kansal

    08/03/2023, 4:42 PM
    Hey Team,
  • Ankit Kansal

    08/03/2023, 4:43 PM
    What is the latest way of implementing kedro / Databricks in an Azure environment? Is there a standardised approach to setting things up from a development & production standpoint?
  • Daniel Kirel

    08/03/2023, 8:25 PM
    Hey team, two questions on `kedro-mlflow`: 1. Is there a way to log the git commit tag/SHA through kedro-mlflow? 2. Is there a good way to save input datasets without needing to create separate MLflow artifact datasets and a node to read and save datasets? Appreciate any help/guidance on this 🙏
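    (On question 1, a hedged sketch of one option: tag the active MLflow run from a project hook, assuming kedro-mlflow has already started the run by the time before_pipeline_run fires; the tag name is arbitrary.)
    import subprocess

    import mlflow
    from kedro.framework.hooks import hook_impl


    class GitShaHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
            if mlflow.active_run():
                mlflow.set_tag("git_sha", sha)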
  • Sid Shetty

    08/04/2023, 3:29 PM
    Hey team, I am saving a partitioned dataset with pyspark parquet data types; catalog entry:
    cpa_llm.blocking_output@partitions:
      type: PartitionedDataSet
      path: data/cpa_llm/blocking_output
      overwrite: True
      filename_suffix: ".parquet"
      dataset:
        type: spark.SparkDataSet
        file_format: parquet
        save_args:
          mode: overwrite
    When I read the same data back as a spark dataset I get the error "AnalysisException: Unable to infer schema for Parquet. It must be specified manually.", but when I read from one particular partition it infers the schema. Was wondering if there is a step I am missing here, or if you would recommend some other format over parquet to store the files. Appreciate any help here 😄
  • Emilio Gagliardi

    08/04/2023, 5:12 PM
    Hi everyone, I have a basic question about how to save a dataset in a kedro notebook. I understand how to load a dataset, but I'm not clear on how to save one. I have a custom dataset that connects to a Mongo DB, so I need to pass in credentials; I'm not sure if I need to pass in the catalog properties/credentials manually, or how to pass the data. Thanks kindly,
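    (For reference, in a kedro ipython / kedro jupyter session the injected catalog object already has credentials resolved from conf/, so saving is a single call; the dataset name below is a placeholder for the catalog entry.)
    # Inside a `kedro jupyter notebook` / `kedro ipython` session:
    processed = ...  # whatever object the custom MongoDB dataset expects

    catalog.save("my_mongo_dataset", processed)   # credentials come from conf/
    # data = catalog.load("my_mongo_dataset")     # loading works the same way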
  • Emilio Gagliardi

    08/06/2023, 2:52 AM
    I was working with GPT-4 to brainstorm how to connect to an Azure blob container that stores 1-to-many JSON files. The suggestion it provided was not what I expected, and I wonder if someone can comment? I want to create a partitioned dataset where the underlying files are JSON. GPT-4 suggested the following, which references a kedro.contrib.io.azure.JSONBlobDataSet that I can't find in the documentation under 18.12, but can under 15.6. Did something change in the way kedro organizes contrib.io? GPT-4 also said that the built-in kedro JSON dataset doesn't work on Azure. Any guidance is appreciated. Thanks kindly,
    my_partitioned_dataset:
      type: kedro.io.PartitionedDataSet
      path: <your_blob_folder_path>
      credentials: azure_blob_storage
      dataset:
        type: kedro.contrib.io.azure.JSONBlobDataSet <- is this valid?
        container_name: <your_container_name>
        credentials: azure_blob_storage
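    (For what it's worth, kedro.contrib was removed long ago; a present-day equivalent would use json.JSONDataSet from kedro-datasets inside a PartitionedDataSet over an fsspec abfs:// path. The container, path and credential key names below are placeholders.)
    my_partitioned_dataset:
      type: PartitionedDataSet
      path: abfs://my-container/path/to/json_files   # placeholder container/path
      filename_suffix: ".json"
      credentials: azure_blob_storage   # e.g. account_name/account_key in credentials.yml
      dataset:
        type: json.JSONDataSet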
  • Jackson

    08/07/2023, 3:17 AM
    Hi, I am curious about where we should put our folders in a kedro project. For example, I have a dataset folder which stores my defined PyTorch Dataset class and another module called model, and I will need to import the dataset and model classes into my kedro nodes. What are the best practices for storing these modules?
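    (One common layout, sketched with the package name as a placeholder: keep importable code inside the package under src/ so nodes can import it.)
    src/<package_name>/
        datasets/            # e.g. the PyTorch Dataset classes
        models/              # model definitions
        pipelines/
            training/
                nodes.py     # from <package_name>.models import MyModel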
  • Jackson

    08/07/2023, 3:34 AM
    Also, why does it work when I run kedro run, but when I run it with python src/../nodes.py it shows "no module named xxx"?
  • Fazil B. Topal

    08/07/2023, 9:45 AM
    Hey all, I know it's been asked many times, but I have yet to find a solution for kedro node running order. I am building steps which create some tables in BigQuery (since the query is complex, it is done in a multi-stage way: 01-query1.sql, 02-query2.sql, etc.). Each of these is a node in kedro, but since my custom dataset implementation (creating tables in BigQuery) only implements a load method, I define the outputs as None in the node. The question is: how can I create an ordered Pipeline in kedro? I'm willing to hack the Pipeline class a bit, but there is too much going on there, so I'm seeking some help here. Thanks in advance! 🙂
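    (A common pattern, sketched with made-up names: Kedro orders nodes purely by data dependencies, so a small marker output from one node, fed as an input to the next, enforces the sequence without hacking the Pipeline class.)
    from kedro.pipeline import node, pipeline


    def run_query_1() -> bool:
        ...  # create the first BigQuery table
        return True  # marker value (a MemoryDataSet by default)


    def run_query_2(_query_1_done: bool) -> bool:
        ...  # runs only after query 1, because it consumes query 1's output
        return True


    pipe = pipeline(
        [
            node(run_query_1, inputs=None, outputs="query_1_done"),
            node(run_query_2, inputs="query_1_done", outputs="query_2_done"),
        ]
    )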
  • Debanjan Banerjee

    08/07/2023, 10:40 AM
    kedro versioned always points to a new version when writing the data, right? Can we ensure there is a prod version created that the rest of the datasets always read from in production, and which we can change in params or somewhere when we want to? For example, we can do this manually in parameters.yml
    run_date: &run_date 20230101
    
    version: *run_date  # this can also be prod/dev/uat etc.
    catalog.yml
    weather:
      type: spark.SparkDataSet
      filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
      file_format: csv
    but this won't utilise the versioned: True feature. Is there any way we can achieve the above functionality with versioned? That would be much cleaner imo.
  • Thomas Gölles

    08/08/2023, 9:40 AM
    Hi. Is there a way to get the current run name? Like in Kedro viz experiment tracking I get names like "2023-08-08T08.17.05.592Z". I am using mlflow and tensorboard as well at the moment and want to have consistent naming in every tracking tool.
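    (For reference, the name shown by Kedro-Viz experiment tracking is the session id, which hooks receive via run_params, so it can be forwarded to mlflow or tensorboard; only the retrieval is sketched here.)
    from kedro.framework.hooks import hook_impl


    class RunNameHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            run_name = run_params["session_id"]  # e.g. "2023-08-08T08.17.05.592Z"
            # forward run_name to mlflow / tensorboard naming here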
  • Rosana EL-JURDI

    08/08/2023, 9:50 AM
    Hello Everyone
  • Rosana EL-JURDI

    08/08/2023, 9:50 AM
    I hope you are all doing well.
  • Rosana EL-JURDI

    08/08/2023, 9:51 AM
    I am running into an issue with my kedro installation. The installation itself seems to work fine, with the proper version, and kedro info works well,
  • Rosana EL-JURDI

    08/08/2023, 9:52 AM
    but when I try to run kedro ipython I receive the following error message:
    Traceback (most recent call last):
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/shellapp.py", line 282, in init_extensions
        self.shell.extension_manager.load_extension(ext)
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/extensions.py", line 76, in load_extension
        return self._load_extension(module_str)
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/extensions.py", line 91, in _load_extension
        mod = import_module(module_str)
      File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
    ModuleNotFoundError: No module named 'kedro'
  • Rosana EL-JURDI

    08/08/2023, 9:53 AM
    Is anyone familiar with this error?
  • Rosana EL-JURDI

    08/08/2023, 9:53 AM
    Thank you
  • Nok Lam Chan

    08/08/2023, 10:40 AM
    Hello everyone, I have a question regarding the usage of environments in combination with the OmegaConfigLoader. I have a file called catalog_globals.yml in my base/ config folder, and also in my prod/ config folder. When I execute kedro run --env=prod, the settings from the file in base/ are still used.
    cc @Gerrit Schoettler
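    (For context, a sketch of the settings.py knobs involved, with values mirroring the defaults: with OmegaConfigLoader, which environments get merged is controlled here, and entries in the run environment such as prod/ are expected to override identically-named keys from base/.)
    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "base_env": "base",          # environment always loaded first
        "default_run_env": "local",  # replaced by --env, e.g. --env=prod
    }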