# questions
  • n

    Nelson Zambrano

    07/23/2023, 8:25 PM
    Is it possible to disable `_validate_unique_outputs(nodes)` via hooks or by implementing a modified `Pipeline` class?
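    A minimal sketch of one possible workaround, assuming `_validate_unique_outputs` is still a module-level helper in `kedro.pipeline.pipeline` (as in recent 0.18.x releases). Patching a private API like this is unsupported and may break on upgrade:
    Copy code
    # Hypothetical placement: run this before any Pipeline objects are constructed,
    # e.g. near the top of src/<package>/settings.py.
    import kedro.pipeline.pipeline as kedro_pipeline_module

    # Swap the private validator for a no-op so duplicate outputs are no longer rejected.
    kedro_pipeline_module._validate_unique_outputs = lambda nodes: None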
  • b

    Baden Ashford

    07/24/2023, 9:59 AM
    Hi all, Has anyone used Kedro for building pipelines in a repo which also houses non-pipeline code, like lambda functions? I am bringing in Kedro, but also need some way of porting over our existing lambda functions to live in the same repo as our pipelines. Splitting them out into a separate repo is not really feasible due to the common code used by each and the extra work/dependency management that would introduce. We use aws sam as a framework of sorts within each of our lambda functions, so we could just put them in `src/my_repo/lambdas/` next to `src/my_repo/pipelines/` and have a third directory with shared code, `src/my_repo/shared/`, but I thought there may be a different way to go about this! Thanks!
  • a

    Aleksander Jaworski

    07/24/2023, 11:22 AM
    [Kedro version: 0.18.6 currently] Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines. I have a question: would some sort of lazy configuration validation be a useful feature for kedro? I have two reasons for asking:
    1. It feels a bit cumbersome that even a simple hello_world.py takes several seconds to run when the configuration is large enough: first you see all the logs, and all the setup is done for the data catalog etc., none of which would actually end up being used in a hello_world.py.
    2. When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials; in Kedro, all of them need to be filled in right now, as everything is validated at once. In a lazy version, only the dependencies that follow from the pipeline would need to be evaluated.
    Are there any solutions or modifications I could use to improve my approach here? Thanks in advance! :)
    🎉 1
  • s

    Sid Shetty

    07/24/2023, 2:13 PM
    Hello team, I was wondering if there's an approach to break a pandas dataframe into chunks, run a few operations on each, and write each chunk to a parquet file in append mode (without concatenating the chunks back)? So the kedro node would have multiple writes.
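    One pattern that may fit, sketched under the assumption that the node's output is configured in the catalog as a PartitionedDataset of pandas.ParquetDataSet (all names below are hypothetical): return a dict of callables, so each chunk is materialised and written as its own partition and the chunks are never concatenated back together.
    Copy code
    from typing import Callable, Dict

    import pandas as pd


    def process_in_chunks(df: pd.DataFrame, chunk_size: int = 100_000) -> Dict[str, Callable[[], pd.DataFrame]]:
        """Split the frame and return one lazy callable per chunk.

        PartitionedDataset evaluates callable values only at save time, so each
        chunk is processed and written independently.
        """
        partitions = {}
        for i in range(0, len(df), chunk_size):
            chunk = df.iloc[i : i + chunk_size]
            # bind the current chunk; replace .assign(...) with the real operations
            partitions[f"chunk_{i // chunk_size:05d}"] = lambda c=chunk: c.assign(processed=True)
        return partitions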
  • j

    Jon Cohen

    07/24/2023, 3:15 PM
    Hi! My team wants to have separate client data ingestion pipelines which are kept separately from each other. We then want to be able to import our standard data processing pipeline from a central repo. Is it possible to use something like modular pipelines in this way?
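    A minimal sketch of how modular pipelines can be reused this way, assuming the standard processing pipeline is packaged and installed as a library from the central repo (the `standard_processing` package and function names here are hypothetical):
    Copy code
    # pipeline_registry.py of a client-specific Kedro project
    from typing import Dict

    from kedro.pipeline import Pipeline, pipeline

    from standard_processing.pipelines import create_pipeline  # hypothetical installed package


    def register_pipelines() -> Dict[str, Pipeline]:
        shared = create_pipeline()
        # Namespace the shared pipeline so its dataset names don't clash with
        # the client-specific ingestion pipeline's names.
        return {"__default__": pipeline(shared, namespace="client_a")}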
  • j

    Jon Cohen

    07/24/2023, 3:17 PM
    Thank you! Wow, fast response time
    ⏩ 4
  • e

    Emilio Gagliardi

    07/24/2023, 5:40 PM
    hi kedronauts, I'm trying to get my first pipeline working and I'm confused on a few pieces I'm hoping you can correct my thinking on. I have one custom DataSet that connects to an RSS feed. I have another custom DataSet that stores the processed feed items and saves them to a mongo db. I'm confused about how to set up the catalog entries and node functions, i.e. how the catalog values get passed into the DataSets. How do I create a catalog entry that combines with values from credentials.yml? 'mongo_url' contains my username and password, which I stored in credentials.yml. Catalog entries:
    Copy code
    rss_feed_extract:
      type: kedro_workbench.extras.datasets.RSSDataSet.RSSFeedExtract
      url: https://api.msrc.microsoft.com/update-guide/rss
    
    rss_feed_load:
      type: kedro_workbench.extras.datasets.RSSDataSet.RSSFeedLoad
      mongo_url: "mongodb+srv://<username>:<password>@bighatcluster.wamzrdr.mongodb.net/"
      mongo_db: "TBD"
      mongo_collection: "TBD"
      mongo_table: "TBD"
      credentials: mongo_atlas
    nodes.py
    Copy code
    def extract_rss_feed() -> Dict[str, Any]:
        raw_rss_feed = RSSFeedExtract() # Q. how does the catalog 'url' value get passed to the __init__ method?
        raw_rss_feed.load()
        
        return {'key_1':'value_1', 'key_2': 'value_2'}
        
        
    def transform_rss_feed(raw_rss_feed: Dict[str, Any]) -> List[Dict[str, Any]]:
        
        return [{'key_1_T':'value_1_T', 'key_2_T': 'value_2_T'}]
        
        
    def load_rss_feed(prepped_rss_items: List[Dict[str, Any]]) -> None:
        rss_feed_load = RSSFeedLoad(prepped_rss_items) # not clear how to create the custom dataset that takes data from catalog and credentials and the previous node
        rss_feed_load.save()
    pipeline.py
    Copy code
    pipeline([
        node(
            func=extract_rss_feed,
            inputs=None,
            outputs='rss_feed_for_transforming',
            name="extract_rss_feed",
        ),
        node(
            func=transform_rss_feed,
            inputs="rss_feed_for_transforming",
            outputs='rss_for_loading',
            name="transform_rss_items",
        ),
        node(
            func=load_rss_feed,
            inputs="rss_for_loading",
            outputs="rss_feed_load",
            name="load_rss_items",
        ),
    ])
    custom datasets
    Copy code
    class RSSFeedExtract(AbstractDataSet):
        def __init__(self, url: str):
            self._url = url  # note: the lowercase constructor argument, not an undefined URL name
    
    class RSSFeedLoad(AbstractDataSet):
        def __init__(self, mongo_url: str, mongo_db: str, mongo_collection: str, mongo_table: str, credentials: Dict[str, Any], data: Any = None):
            self._data = data # comes from the previous node
            self._mongo_url = mongo_url
            self._mongo_db = mongo_db
            self._mongo_collection = mongo_collection
            self._mongo_table = mongo_table
            self._username = credentials['username']
            self._password = credentials['password']
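    For what it's worth, a rough sketch of the flow, assuming standard DataCatalog behaviour: every key in a catalog entry other than `type` is passed to the dataset's `__init__` as a keyword argument, and `credentials: mongo_atlas` is resolved against credentials.yml and injected as a dict. The nodes themselves never instantiate the datasets; they only receive and return data.
    Copy code
    # conf/local/credentials.yml (hypothetical):
    # mongo_atlas:
    #   username: my_user
    #   password: my_pass

    # Roughly what Kedro does when it builds the catalog entry 'rss_feed_load' (simplified):
    dataset = RSSFeedLoad(
        mongo_url="mongodb+srv://<username>:<password>@bighatcluster.wamzrdr.mongodb.net/",
        mongo_db="TBD",
        mongo_collection="TBD",
        mongo_table="TBD",
        credentials={"username": "my_user", "password": "my_pass"},  # resolved from credentials.yml
    )

    # At run time the runner calls dataset._load() for inputs and dataset._save(data)
    # for outputs, so load_rss_feed only needs to return the prepped items.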
  • j

    Jon Cohen

    07/24/2023, 6:17 PM
    I'm noticing that warnings like SyntaxErrors and type errors are considered "warnings" by Kedro, which continues to try to run the pipeline. Is there a setting to escalate these to Errors so they can abort the pipeline run?
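    If these surface through Python's warnings machinery, one generic way to escalate them is the standard warnings filter; this is a sketch of plain Python behaviour, not a Kedro setting, and it won't help if the failures are reported some other way:
    Copy code
    # Hypothetical placement: src/<package>/settings.py, so it runs before the pipeline starts.
    import warnings

    # Turn every warning into an exception so the run aborts instead of continuing.
    warnings.simplefilter("error")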
  • j

    Jon Cohen

    07/24/2023, 6:25 PM
    I also noticed in Kedro Viz (this is from the modular pipelines part of the tutorial) that two pipelines with the same static structure are rendering differently, which is a little frustrating for visual scanning
  • j

    Jon Cohen

    07/24/2023, 8:10 PM
    More newb questions (sorry). I'm having trouble following the tutorial for running a packaged project. I have made a new directory with a new Kedro project (we expect each client of ours to have their own Kedro project) and have installed the built wheel. However Kedro is looking for nodes and pipelines locally in the new project and can't find the ones in the installed project. Does this mean I have to copy over all of my pipelines manually from the installed Kedro project?
  • v

    VIOLETA MARÍA RIVERA

    07/25/2023, 10:55 PM
    Hello, I am new to kedro so I was doing the spaceflights tutorial. When I try to use kedro viz, a tab in my browser opens up but it's just a white screen, everything is missing. I tried saving the pipeline to a .json file and it isn't empty, so I don't know what is causing this display issue. I'd be grateful for any help. Thanks!
  • s

    Suyash Shrivastava

    07/26/2023, 3:23 PM
    Hi Everyone! Has anyone used matplotlib 2.0.0 with Kedro 0.17.7 before? I am getting an error. I have installed PyQt5 and pyside2 but am still getting the same error. I'd be grateful for any help. Thanks a lot!
    Copy code
    File "/usr/local/lib/python3.6/site-packages/matplotlib/backends/qt_compat.py", line 175, in <module>
        "Matplotlib qt-based backends require an external PyQt4, PyQt5,\n"
    ImportError: Matplotlib qt-based backends require an external PyQt4, PyQt5,
    or PySide package to be installed, but it was not found.
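    One workaround that may apply if the figures only need to be saved to files rather than shown interactively: force a non-Qt backend before pyplot is imported. This is a general matplotlib technique, not a Kedro-specific fix:
    Copy code
    import matplotlib

    matplotlib.use("Agg")  # headless backend, no PyQt4/PyQt5/PySide needed

    import matplotlib.pyplot as plt  # import pyplot only after selecting the backend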
  • s

    Sid Shetty

    07/26/2023, 5:24 PM
    Hello team, when I split a pandas dataframe and store it using a partitioned dataset, loading the partitions back together appears to find schema differences, since a few columns have nulls. Is there any workaround here that avoids me having to add another node to put these partitions together, ideally just reading them as a pandas.ParquetDataSet? Perhaps passing the schema of the original dataframe, or even specifying it explicitly?
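    One possible workaround, sketched on the assumption that the mismatch comes from all-null chunks being inferred with a different parquet type: cast every chunk to the same explicit (nullable) dtypes before it is written, so all partitions carry identical schemas. Column names and dtypes here are hypothetical.
    Copy code
    import pandas as pd

    # Nullable pandas dtypes keep their type information even when a chunk is entirely null.
    SCHEMA = {"customer_id": "Int64", "amount": "Float64", "comment": "string"}


    def normalise_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        """Pin each partition to the same dtypes before saving."""
        return chunk.astype(SCHEMA)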
  • l

    Lim H.

    07/26/2023, 6:51 PM
    Hi everyone, is it possible to pass credentials of the underlying dataset when using it with CachedDataSet? e.g.
    Copy code
    test:
      type: CachedDataset
      versioned: true
      dataset:
        type: pandas.JSONDataSet
        filepath: ...
        load_args:
          lines: True
        credentials: ...
    doesn’t work but this works
    Copy code
    test:
      type: pandas.JSONDataSet
      filepath: ...
      load_args:
        lines: True
      credentials: ...
    I thought this was working at some point? I might be hallucinating though. Just want to double check quickly before I create my own CachedDataSet
    ✅ 1
    👀 1
  • j

    J. Camilo V. Tieck

    07/26/2023, 7:21 PM
    hi everyone, how can I access the current env from python? Is there an env_name variable somewhere? I want to use the env_name as a suffix for loading a file.
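    A minimal sketch of one way to get at it, assuming a recent 0.18.x release where `KedroContext` exposes the run environment as `context.env`; the hook class is hypothetical, and you would register it in settings.py via `HOOKS` and read the stored value wherever the suffix is needed:
    Copy code
    from kedro.framework.hooks import hook_impl


    class EnvCaptureHooks:
        """Stores the active Kedro environment (e.g. 'local' or 'base') for later use."""

        current_env = None

        @hook_impl
        def after_context_created(self, context):
            EnvCaptureHooks.current_env = context.env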
  • e

    Emilio Gagliardi

    07/26/2023, 8:57 PM
    I'm trying to get logging working and was hoping someone could point me in the right direction. When you run Kedro out of the box, it automatically writes node and pipeline details to the console; I'd like to keep that as it is. What I'm not figuring out is how to use a logger inside a module to write log entries to a file and not to the console. I have a large JSON object I want to print to a file so I can look at it. I tried setting up my logging.yml file but I'm not understanding something.
    Copy code
    logging.yml
    handlers:
      ...other built-in kedro handlers...
      debug_file_handler:
        class: logging.handlers.RotatingFileHandler
        level: DEBUG
        formatter: simple
        filename: logs/debug.log
        maxBytes: 10485760 # 10MB
        backupCount: 20
        encoding: utf8
        delay: True
    
    loggers:
      kedro:
        level: INFO
    
      kedro_workbench:
        level: INFO
    
      DataSets:
        level: DEBUG
        handlers: [debug_file_handler]
    
    root:
      handlers: [rich, info_file_handler, error_file_handler]
    In my module I used:
    Copy code
    import logging
    logger = logging.getLogger('DataSets')
    logger.debug(output)
    but when I run the pipeline, the contents of output are still written to the console. What am I missing here? thanks kindly!
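    For reference, one likely culprit under standard Python logging semantics (a guess about this setup, not a confirmed diagnosis): without `propagate: False`, records from the `DataSets` logger also bubble up to the root logger, whose `rich` handler prints them to the console. The same idea expressed in plain Python:
    Copy code
    import logging

    logger = logging.getLogger("DataSets")
    # With propagation off (the code equivalent of `propagate: False` under the
    # DataSets logger in logging.yml), records reach only the handlers attached to
    # "DataSets" itself, e.g. debug_file_handler, and no longer hit the root console handler.
    logger.propagate = False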
  • f

    Fazil B. Topal

    07/27/2023, 8:50 AM
    hey all, Is there some way I can see a high-level overview of how kedro functions? I find hooks nice, but without the high-level order of execution I'm not sure I can do what I want. Context: I am trying to play around with the data versioning to change it a bit, since ideally I would run each node in a different k8s pod. That means the dataset versioning should match across pods. From what I gather the `Session` class has this info, but I'm trying to find a proper way to make sure the same code version + some envs end up using the same data version etc. Any help is appreciated 🙂
  • r

    Rahul Kumar

    07/27/2023, 8:57 AM
    Hi all, Any specific reason why versioning is not supported in PartitionedDataset?
  • m

    meharji arumilli

    07/27/2023, 3:08 PM
    Hello all, I have packaged the kedro project with `kedro package` and created the DAG with `kedro airflow create`. This created the .whl and the DAGs. Then, using the Dockerfile below, I built the docker image for the kedro project:
    FROM apache/airflow:2.6.3-python3.8
    # install project requirements
    WORKDIR /opt/test-fi/
    COPY src/requirements.txt .
    USER root
    RUN chmod -R a+rwx /opt/test-fi/
    # Install necessary packages
    RUN sudo apt-get update && apt-get install -y wget gnupg2 libgomp1 && apt-get -y install git
    USER airflow
    COPY data/ data/
    COPY conf/ conf/
    COPY logs/ logs/
    COPY src/ src/
    COPY output/ output/
    COPY dist/ dist/
    COPY pyproject.toml .
    RUN --mount=type=bind,src=.env,dst=conf/.env . conf/.env && python -m pip install --upgrade pip && python -m pip install -r requirements.txt && python -m pip install dist/test_fi-0.1-py3-none-any.whl
    EXPOSE 8888
    CMD ["kedro", "run"]
    The docker image is built with `docker build -t test_fi .`. Then I installed Airflow using a docker-compose.yml file on an EC2 instance and attached the docker image to the worker and scheduler services. I tested the image test_fi by `docker exec`-ing into the container and running `kedro run`, and the project runs as expected. However, when the DAG is triggered, I get the error below in the Airflow UI, without much information in the logs to debug. The log below was captured with `logging_level = DEBUG`:
    *** Found local files:
    ***   * /opt/airflow/logs/dag_id=test-fi/run_id=scheduled__2023-06-27T14:37:54.602904+00:00/task_id=define-project-parameters/attempt=1.log
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1037} DEBUG - previous_execution_date was called
    [2023-07-27, 14:37:56 UTC] {__init__.py:51} DEBUG - Loading core task runner: StandardTaskRunner
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1037} DEBUG - previous_execution_date was called
    [2023-07-27, 14:37:56 UTC] {base_task_runner.py:68} DEBUG - Planning to run as the  user
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:789} DEBUG - Refreshing TaskInstance <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> from DB
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Trigger Rule' PASSED: True, The task instance did not have any upstream tasks.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Task Instance State' PASSED: True, Task state queued was valid.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Task Instance Not Running' PASSED: True, Task is not in running state.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]>
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Trigger Rule' PASSED: True, The task instance did not have any upstream tasks.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Task Concurrency' PASSED: True, Task concurrency is not set.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1112} DEBUG - <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]> dependency 'Pool Slots Available' PASSED: True, There are enough open slots in default_pool to execute the task
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [queued]>
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1308} INFO - Starting attempt 1 of 2
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1327} INFO - Executing <Task(KedroOperator): define-project-parameters> on 2023-06-27 14:37:54.602904+00:00
    [2023-07-27, 14:37:56 UTC] {standard_task_runner.py:57} INFO - Started process 85 to run task
    [2023-07-27, 14:37:56 UTC] {standard_task_runner.py:84} INFO - Running: ['***', 'tasks', 'run', 'test-fi', 'define-project-parameters', 'scheduled__2023-06-27T14:37:54.602904+00:00', '--job-id', '884', '--raw', '--subdir', 'DAGS_FOLDER/test_fi_dag.py', '--cfg-path', '/tmp/tmpu1fp72mc']
    [2023-07-27, 14:37:56 UTC] {standard_task_runner.py:85} INFO - Job 884: Subtask define-project-parameters
    [2023-07-27, 14:37:56 UTC] {cli_action_loggers.py:65} DEBUG - Calling callbacks: [<function default_action_log at 0x7f4f6b6038b0>]
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1037} DEBUG - previous_execution_date was called
    [2023-07-27, 14:37:56 UTC] {task_command.py:410} INFO - Running <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [running]> on host e1be34e2e4d4
    [2023-07-27, 14:37:56 UTC] {settings.py:353} DEBUG - Disposing DB connection pool (PID 85)
    [2023-07-27, 14:37:56 UTC] {settings.py:212} DEBUG - Setting up DB connection pool (PID 85)
    [2023-07-27, 14:37:56 UTC] {settings.py:285} DEBUG - settings.prepare_engine_args(): Using NullPool
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:789} DEBUG - Refreshing TaskInstance <TaskInstance: test-fi.define-project-parameters scheduled__2023-06-27T14:37:54.602904+00:00 [running]> from DB
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1037} DEBUG - previous_execution_date was called
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:868} DEBUG - Clearing XCom data
    [2023-07-27, 14:37:56 UTC] {retries.py:80} DEBUG - Running RenderedTaskInstanceFields.write with retries. Try 1 of 3
    [2023-07-27, 14:37:56 UTC] {retries.py:80} DEBUG - Running RenderedTaskInstanceFields._do_delete_old_records with retries. Try 1 of 3
    [2023-07-27, 14:37:56 UTC] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='***' AIRFLOW_CTX_DAG_ID='test-fi' AIRFLOW_CTX_TASK_ID='define-project-parameters' AIRFLOW_CTX_EXECUTION_DATE='2023-06-27T14:37:54.602904+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2023-06-27T14:37:54.602904+00:00'
    [2023-07-27, 14:37:56 UTC] {__init__.py:117} DEBUG - Preparing lineage inlets and outlets
    [2023-07-27, 14:37:56 UTC] {__init__.py:158} DEBUG - inlets: [], outlets: []
    [2023-07-27, 14:37:57 UTC] {store.py:32} INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
    [2023-07-27, 14:37:57 UTC] {session.py:50} WARNING - Unable to git describe /opt/test-fi
    [2023-07-27, 14:37:57 UTC] {logging_mixin.py:150} INFO - Model version 20230727-143757
    [2023-07-27, 14:37:57 UTC] {common.py:123} DEBUG - Loading config file: '/opt/test-fi/conf/base/logging.yml'
    [2023-07-27, 14:37:57 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code Negsignal.SIGABRT
    Can anyone offer help to fix this? It seems to be related to the line ``DEBUG - Loading config file: '/opt/test-fi/conf/base/logging.yml'``.
  • j

    jyoti goyal

    07/27/2023, 6:20 PM
    Hi everyone, I am working on a problem that requires conditional data output, i.e. the dataset should be exported only when a parameter is set to True. Is there a way I can incorporate this logic in kedro? Any help is highly appreciated!
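    Kedro has no built-in conditional outputs, but one workaround that may fit, sketched under the assumption that the output can be declared as a PartitionedDataset in the catalog (an empty dict then simply writes nothing); the names and the `params:export_enabled` parameter are hypothetical:
    Copy code
    from typing import Dict

    import pandas as pd


    def maybe_export(df: pd.DataFrame, export_enabled: bool) -> Dict[str, pd.DataFrame]:
        """Return one partition when the flag is on, and nothing otherwise."""
        # With an empty dict, the PartitionedDataset save step has no partitions to write.
        return {"export": df} if export_enabled else {}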
  • e

    Emilio Gagliardi

    07/28/2023, 4:52 AM
    Still trying to wrap my head around custom datasets and how the pipeline works. So I created a custom dataset where the _save() method saves the data to a mongo db. In the pipeline, I define the node so that the inputs equal the data and the outputs equal the custom dataset. The part I don't understand clearly is: if the class handles the actual save process, what do I put in the node function? The function doesn't do anything, so I'm not sure what to do with it.
    Copy code
    pipeline([
        node(
            func=extract_rss_feed,
            inputs='rss_feed_extract',
            outputs='rss_feed_for_transforming',
            name="extract_rss_feed",
        ),
        node(
            func=transform_rss_feed,
            inputs=['rss_feed_for_transforming', 'params:rss_1'],
            outputs='rss_feed_for_loading',
            name="transform_rss_feed",
        ),
        node(
            func=load_rss_feed,
            inputs='rss_feed_for_loading',  # incoming data (in memory)
            outputs='rss_feed_load',  # calls the _save() of the class
            name="load_rss_feed",
        ),
    ])
    nodes.py: if all the save logic is in the class, then there's nothing for the function to do... What am I missing here? What typically goes in the function whose output is a dataset?
    Copy code
    def load_rss_feed(preprocessed_rss_feed):
        pass
    When I try to run the pipeline, I get the following error:
    DatasetError: Saving 'None' to a 'Dataset' is not allowed
    thanks for your thoughts!
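    For reference, a minimal sketch of the usual pattern, assuming standard catalog behaviour: the node just returns the data, and because 'rss_feed_load' is declared as the node's output, the runner then calls the dataset's _save() with that return value. Returning None is exactly what triggers the DatasetError above.
    Copy code
    def load_rss_feed(preprocessed_rss_feed):
        # No I/O here: hand the data back so Kedro passes it to RSSFeedLoad._save().
        return preprocessed_rss_feed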
  • r

    Rachid Cherqaoui

    07/28/2023, 8:54 AM
    Hello everyone, I have a problem with the sqlalchemy connection to MySQL Server: it occurs every morning around 9:00 AM and then does not reproduce. My database exists and everything else is fine. Here is the code used in `catalog.yml`:
    Copy code
    _mysql : &mysql
      type: pandas.SQLQueryDataSet
      credentials: 
          con: mysql+mysqlconnector://${mysql_connect.username}:${mysql_connect.password}@${mysql_connect.host}:${mysql_connect.port}/${mysql_connect.database}
    
    table_insurers: 
      <<: *mysql
      sql: select * from underwriter_insurers
    
    table_ccns: 
      <<: *mysql
      sql: select * from underwriter_ccns
    
    table_departments: 
      <<: *mysql
      sql: select * from underwriter_departments
    and this is the error produced:
    Copy code
    2023-07-26 09:33:48 - src.api.tarificateur_compte - ERROR - An error occurred in tarificateur_compte():
    Traceback (most recent call last):
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
        context = constructor(
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1346, in _init_statement
        self.cursor = self.create_cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1530, in create_cursor
        return self.create_default_cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1533, in create_default_cursor
        return self._dbapi_connection.cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1494, in cursor
        return self.dbapi_connection.cursor(*args, **kwargs)
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/mysql/connector/connection_cext.py", line 678, in cursor
        raise OperationalError("MySQL Connection not available.")
    mysql.connector.errors.OperationalError: MySQL Connection not available.
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/kedro/io/core.py", line 210, in load
        return self._load()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/kedro_datasets/pandas/sql_dataset.py", line 512, in _load
        return pd.read_sql_query(con=engine, **load_args)
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/pandas/io/sql.py", line 467, in read_sql_query
        return pandas_sql.read_query(
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/pandas/io/sql.py", line 1736, in read_query
        result = self.execute(sql, params)
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/pandas/io/sql.py", line 1560, in execute
        return self.con.exec_driver_sql(sql, *args)
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1772, in exec_driver_sql
        ret = self._execute_context(
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1814, in _execute_context
        self._handle_dbapi_exception(
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2326, in _handle_dbapi_exception
        raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
        context = constructor(
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1346, in _init_statement
        self.cursor = self.create_cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1530, in create_cursor
        return self.create_default_cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 1533, in create_default_cursor
        return self._dbapi_connection.cursor()
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1494, in cursor
        return self.dbapi_connection.cursor(*args, **kwargs)
      File "/home/debian/anaconda3/envs/env_tarificateur/lib/python3.10/site-packages/mysql/connector/connection_cext.py", line 678, in cursor
        raise OperationalError("MySQL Connection not available.")
    sqlalchemy.exc.OperationalError: (mysql.connector.errors.OperationalError) MySQL Connection not available.
    Can anyone help me fix this problem? I have tried everything I can but have not managed to solve it. Thank you in advance.
  • m

    Mate Scharnitzky

    07/28/2023, 11:21 AM
    Hi Team, I'm working in a SageMaker notebook which is in the /notebook directory. I'm trying to load some nodes that I created locally, but it doesn't find the path. Two questions:
    • How can I load the kedro context into this notebook?
    • How can I load Python modules developed as part of the kedro project?
    Thank you! I looked into this but somehow I can't make it work: https://docs.kedro.org/en/0.18.11/notebooks_and_ipython/kedro_and_notebooks.html
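    A minimal sketch of one way to do this from a notebook that sits outside the project root, assuming Kedro 0.18.x APIs; the relative path is hypothetical and depends on where the project root actually is. (The `%load_ext kedro.ipython` / `%reload_kedro <project_root>` magics from the linked page are the other route.)
    Copy code
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path.cwd().parent  # notebook lives in <project>/notebook, so go one level up
    bootstrap_project(project_path)   # also puts src/ on sys.path, so project modules become importable

    with KedroSession.create(project_path=project_path) as session:
        context = session.load_context()
        catalog = context.catalog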
  • h

    Hygor Xavier AraĂșjo

    07/28/2023, 5:59 PM
    Hi, everyone. Is it possible to use pandas.CSVDataSet to read a compressed (zip) password protected CSV? It's a local file. The zip is password protected, not the csv inside it
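    As far as I know pandas' own compression handling has no password support, so this would likely need a small custom dataset or a plain helper around the standard library; a minimal sketch (path, member name and password are hypothetical, and zipfile only handles legacy ZipCrypto encryption, not AES zips):
    Copy code
    import io
    import zipfile

    import pandas as pd

    with zipfile.ZipFile("data/01_raw/protected.zip") as zf:
        with zf.open("data.csv", pwd=b"my_password") as fh:
            df = pd.read_csv(io.TextIOWrapper(fh, encoding="utf-8"))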
  • j

    J. Camilo V. Tieck

    07/28/2023, 8:28 PM
    hi all, I have a question regarding kedro docker. how can I build an image for a different platform? I have a mac, and for aws ECS I need to build the image with a different architecture. I use this command to build the images directly with docker:
    docker buildx build --platform=linux/amd64 -t <image-name> .
    Is there a ‘kedro docker’ way of doing this? thanks!
  • e

    Erwin

    07/29/2023, 2:36 AM
    Hello Team, I would like to know the recommended approach for implementing schema evolution in Delta tables within Databricks. Currently, I am encountering an issue with the dataset kedro_datasets.databricks.managed_table_dataset: whenever I attempt to add new columns using the upsert mode, an Exception is raised (there is a check in the dataset implementation). Fortunately, I have control over the schema before performing the upsert operation, so once I approve the schema changes, I expect to be able to use schema evolution during the upsert. In this context, I believe the exception raised on schema changes should be made optional, allowing for a smoother schema evolution process. In my view, accepting or denying schema evolution should be up to the Spark session:
    # Enable automatic schema evolution
    spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
  • d

    Daniel Lee

    07/31/2023, 5:44 AM
    Hello team, I'm currently using an M1 Mac with kedro version 0.18.3 and was trying to run kedro to import the Pick DataSet that needs to use the lightgbm package. However, even after I ran `brew install libomp`, I'm encountering an error that says `(mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))`. Do you know how I can get around this issue if it's related to the architecture? And how is this normally resolved?
  • b

    Baden Ashford

    07/31/2023, 11:16 AM
    Hi team, How can I conduct parallel IO with kedro? I have a larger than memory partitioned dataset. I'd like to run each partition through the node in some parallel fashion. Can I utilise ParallelRunner for this? Thank you 😁
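    One pattern that may help, sketched under the assumption that both the input and the output are PartitionedDatasets, so the node receives a dict of partition-id to load-callable and can return callables as well; the `.assign(...)` call stands in for the real per-partition transform. Each partition is then loaded, processed and saved one at a time, which keeps memory bounded. ParallelRunner parallelises across nodes rather than within one node, so true parallel IO inside the node would need something like a thread pool on top of this.
    Copy code
    from typing import Callable, Dict

    import pandas as pd


    def process_partitions(
        partitions: Dict[str, Callable[[], pd.DataFrame]]
    ) -> Dict[str, Callable[[], pd.DataFrame]]:
        """Lazily process each partition of a larger-than-memory dataset."""
        return {
            name: (lambda load=load: load().assign(processed=True))  # replace with the real transform
            for name, load in partitions.items()
        }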
  • f

    Fazil B. Topal

    07/31/2023, 11:46 AM
    hey all, Is it possible to have this PR code as a plugin to integrate with kedro? I'm not sure how plugins work in general, and I don't know whether it would work as a plugin or would need to be merged into the main repo in order to work.