Jo Stichbury
11/20/2022, 4:11 PM
user
11/20/2022, 9:38 PM
Leo Casarsa
11/21/2022, 1:57 PM
Ahmed Afify
11/21/2022, 3:40 PM
Francisca Grandón
11/21/2022, 7:16 PM
user
11/21/2022, 7:48 PM
Zihao Xu
11/21/2022, 11:16 PM
INFO Loading data from 'modeling.model_best_params_' (JSONDataSet)... data_catalog.py:343
DataSetError: Loading not supported for 'JSONDataSet'
where we have the following catalog entry:
modeling.model_best_params_:
  type: tracking.JSONDataSet
  filepath: "${folders.tracking}/model_best_params.json"
  layer: reporting
The same code runs completely fine locally, but is failing within Databricks.
Could you please help us understand why?
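For context: Kedro's tracking datasets are save-only, so their _load raises exactly this DataSetError by design. The sketch below (with a hypothetical filepath) only shows how the tracked values could be read back with a loadable dataset class; it does not explain why the behaviour differs between local and Databricks runs.

from kedro.extras.datasets.json import JSONDataSet

# Read the tracked JSON back with a dataset class that supports loading.
# The filepath is illustrative; point it at wherever ${folders.tracking} resolves.
readback = JSONDataSet(filepath="data/09_tracking/model_best_params.json")
best_params = readback.load()  # plain dict parsed from the JSON file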
Moinak Ghosal
11/22/2022, 8:30 AM
Ankar Yadav
11/22/2022, 11:49 AM
Ankar Yadav
11/22/2022, 1:04 PM
Andreas Adamides
11/23/2022, 12:09 PM
Kedro now uses the Rich library to format terminal logs and tracebacks.
Is there any way to revert to plain console logging and not use rich logging when running a Kedro pipeline using the Sequential Runner from the API and not via kedro CLI?
runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)
I tried to look for configuration, but I believe you can only add configuration if you are in a Kedro project and intend to run with the Kedro CLI.
Any ideas?
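One thing that may be worth trying (a sketch, not a confirmed fix): since the runner is being driven from plain Python, the handlers on the "kedro" logger can be swapped for a standard StreamHandler before the run, so no Rich formatting is applied to Kedro's log records. Rich tracebacks, if they have already been installed elsewhere, are a separate concern.

import logging

from kedro.runner import SequentialRunner

# Replace whatever handlers are attached to the "kedro" logger (Rich or otherwise)
# with a plain stdlib StreamHandler before running.
plain = logging.StreamHandler()
plain.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

kedro_logger = logging.getLogger("kedro")
kedro_logger.handlers = [plain]
kedro_logger.propagate = False

runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)  # objects from the snippet above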
Afaque Ahmad
11/24/2022, 9:43 AM
cache made available inside the _load method of multiple Kedro datasets. How should we go about it? Can we use hooks, or is there anything simpler?
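Not an official pattern, just a sketch of one way this is sometimes done without hooks: wrap the datasets in question in a thin subclass whose _load checks a shared in-memory cache keyed by filepath. CachedCSVDataSet and the cache dict are hypothetical names; Kedro's built-in kedro.io.CachedDataSet, which wraps another dataset and keeps the loaded data in memory, may also already cover this.

from typing import Any, Dict

from kedro.extras.datasets.pandas import CSVDataSet

# Hypothetical module-level cache shared by every instance of the subclass.
_LOAD_CACHE: Dict[str, Any] = {}


class CachedCSVDataSet(CSVDataSet):
    """CSVDataSet whose _load result is memoised per filepath (sketch only)."""

    def _load(self) -> Any:
        key = str(self._filepath)
        if key not in _LOAD_CACHE:
            _LOAD_CACHE[key] = super()._load()
        return _LOAD_CACHE[key]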
Fabian
11/24/2022, 12:13 PM
Jose Alejandro Montaña Cortes
11/24/2022, 7:40 PM
Afaque Ahmad
11/25/2022, 6:53 AM
get_spark inside the ProjectContext which I need to access in the register_catalog hook. How can I access that function?
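One low-tech way around this (a sketch, with a hypothetical module and package name): move get_spark out of ProjectContext into a plain module and import it from both the context and the hook, so the hook never needs a reference to the context at all.

# src/my_project/spark_utils.py  (hypothetical module)
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return the active SparkSession, creating one if needed."""
    return SparkSession.builder.getOrCreate()


# The hook module can then call it directly instead of going through ProjectContext:
# from my_project.spark_utils import get_spark
# spark = get_spark()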
Elias
11/25/2022, 10:13 AM
kedro.io.core.DataSetError:
__init__() got an unexpected keyword argument 'table_name'.
DataSet 'inspection_output' must only contain arguments valid for the constructor of `kedro.extras.datasets.pandas.sql_dataset.SQLQueryDataSet`.
Elias
11/25/2022, 10:13 AM
catalog.yml:
inspection_output:
  type: pandas.SQLQueryDataSet
  credentials: postgresql_credentials
  table_name: shuttles
  layer: model_output
  save_args:
    index: true
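If it helps, the error is about the constructor arguments: pandas.SQLQueryDataSet expects a sql argument (and is load-only), while table_name and save_args belong to pandas.SQLTableDataSet. A sketch of the equivalent dataset built in Python, assuming the credentials carry a "con" connection string (the one below is a placeholder):

from kedro.extras.datasets.pandas import SQLTableDataSet

# A dataset that reads and writes the "shuttles" table.
inspection_output = SQLTableDataSet(
    table_name="shuttles",
    credentials={"con": "postgresql://user:password@host:5432/dbname"},
    save_args={"index": True},
)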
Elias
11/25/2022, 10:13 AM
Shreyas Nc
11/25/2022, 10:26 AM
from kedro.io import DataCatalog
from kedro.extras.datasets.pillow import ImageDataSet
io = DataCatalog(
    {
        "cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower"),
    }
)
But I don't see this in the catalog, and I get an error when I reference it in the pipeline node saying that the entry doesn't exist in the catalog.
Am I missing something here?
Note: this is on the latest version of Kedro, version 0.18.3.
I just joined the channel, so if I am not using the right format or channel to ask this question, please let me know.
Thanks in advance!
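A couple of checks that might narrow this down (sketch only): DataCatalog.list() shows what the catalog object actually contains, and a programmatically built catalog is only used if it is the one handed to the runner. A run started with kedro run builds its catalog from conf/<env>/catalog.yml, so an entry created only in Python will not be visible to it.

from kedro.extras.datasets.pillow import ImageDataSet
from kedro.io import DataCatalog

io = DataCatalog(
    {
        "cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower"),
    }
)
print(io.list())  # should include "cauliflower"

# A programmatic catalog is only used when passed to the run explicitly, e.g.
# SequentialRunner().run(my_pipeline, io, hook_manager);
# `kedro run` ignores it and reads conf/<env>/catalog.yml instead.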
Anu Arora
11/25/2022, 1:45 PM
When running dbx execute <workflow-name> --cluster-id=<cluster-id>, Kedro is failing with the error below:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f0037269-19cc-4c81-9dc2-43bcd22cd8ff/lib/python3.8/site-packages/kedro/framework/startup.py in _get_project_metadata(project_path)
64
65 if not pyproject_toml.is_file():
---> 66 raise RuntimeError(
67 f"Could not find the project configuration file '{_PYPROJECT}' in {project_path}. "
68 f"If you have created your project with Kedro "
RuntimeError: Could not find the project configuration file 'pyproject.toml' in /databricks/driver.
I can see that the file was never packaged, but I am not sure whether it was supposed to be packaged or not. Plus, it is somehow pointing to the working directory /databricks/driver. Below is the Python file I am running as a spark_python_task:
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession
package_name = "project_comm"
configure_project(package_name)
with KedroSession.create(package_name, env="base") as session:
    session.run()
Any help would be great!!
PS: I have tried with dbx deploy and launch as well and am still facing the same issue.
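One hedged thought: the traceback shows Kedro looking for pyproject.toml relative to the working directory (/databricks/driver on the cluster). If the project sources, including pyproject.toml and conf/, are available somewhere on the cluster, passing an explicit project_path to KedroSession.create keeps the lookup away from the working directory. The DBFS path below is purely illustrative.

from pathlib import Path

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

package_name = "project_comm"
configure_project(package_name)

# Hypothetical location of the project root on the cluster; adjust it to
# wherever the repo (with pyproject.toml and conf/) actually lives.
project_root = Path("/dbfs/tmp/project_comm")

with KedroSession.create(package_name, project_path=project_root, env="base") as session:
    session.run()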
Karl
11/26/2022, 12:27 AM
Fabian
11/26/2022, 1:27 PM
Yousri
11/28/2022, 3:27 PM
python3 -m project_name.run
But I have a question about parameters. When I run the packaged project, I can no longer pass parameters to the project or modify parameters.yml, so my question is: how do I pass arguments when I run a packaged Kedro project?
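One option that avoids the CLI entirely (a sketch; the package name and parameter keys are placeholders): drive the run from a small Python entry script and pass overrides through extra_params on KedroSession.create, which take precedence over the values in parameters.yml.

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("project_name")

runtime_params = {"model_options": {"test_size": 0.3}}  # overrides parameters.yml

with KedroSession.create("project_name", extra_params=runtime_params) as session:
    session.run()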
Afaque Ahmad
11/29/2022, 6:56 AM
catalog dict in the after_node_run hook?
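If the question is how to get at the catalog there: the after_node_run hook spec already receives the run's DataCatalog as an argument, so a hook implementation only needs to declare the catalog parameter. A minimal sketch:

from kedro.framework.hooks import hook_impl


class CatalogInspectionHooks:
    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        # `catalog` is the DataCatalog used for the current run; hook
        # implementations may declare only the arguments they need.
        print(f"{node.name} finished; catalog entries: {catalog.list()}")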
Fabian
11/29/2022, 9:59 AM
Hi Team,
another beginner's question: I have created a pipeline that nicely analyzes my DataFrame. Now I am adding a new level of complexity to my DataFrame and want to execute the pipeline on each level, similar to a function in groupby.apply.
Can I do this without modifying the pipeline itself? E.g., splitting the DataFrame ahead of the pipeline and re-merging it afterwards, while leaving the existing pipeline as it is?
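One pattern that keeps the existing pipeline untouched (a sketch; the level names and dataset names are made up, and create_pipeline() stands for the existing pipeline factory whose output is "report"): instantiate the same pipeline once per level under a namespace, then add a small node that re-merges the per-level results.

import pandas as pd

from kedro.pipeline import Pipeline, node, pipeline


def merge_reports(*reports: pd.DataFrame) -> pd.DataFrame:
    """Re-merge the per-level results."""
    return pd.concat(reports)


levels = ["level_a", "level_b"]  # hypothetical level names

per_level = Pipeline([])
for lvl in levels:
    # Each copy reads/writes namespaced datasets such as "level_a.report"; the
    # namespaced free inputs need their own catalog entries (or an upstream splitting node).
    per_level += pipeline(create_pipeline(), namespace=lvl)

merge = Pipeline(
    [node(merge_reports, [f"{lvl}.report" for lvl in levels], "full_report")]
)

full_pipeline = per_level + merge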
Ankar Yadav
11/29/2022, 11:37 AM
Balazs Konig
11/29/2022, 3:09 PM
conf/dev/)
2. by pipeline (conf/base/data_connectors/xyz/)
Is there a simple way to achieve this double filter without much hacking?
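For what it's worth, a sketch rather than a definitive answer: the default ConfigLoader catalog patterns also match files nested in subfolders of an environment, so a per-pipeline folder inside an environment (e.g. conf/dev/data_connectors/xyz/catalog.yml) can be picked up in a single pass. The snippet below only spells out that lookup; the folder names are taken from the question.

from kedro.config import ConfigLoader

# Load catalog entries for the "dev" environment, including catalog*.yml files
# nested in per-pipeline subfolders such as data_connectors/xyz/.
conf_loader = ConfigLoader(conf_source="conf", env="dev")
catalog_conf = conf_loader.get("catalog*", "catalog*/**", "**/catalog*")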
Jan
11/30/2022, 10:29 AM
conf/base will not be loaded? I would like to do something like kedro run --env=prod, and in the prod env I have a catalog that is prefixed (e.g. file: data/prod/01_raw/file.txt) so that I can keep the prod data separated. I would like to avoid leakage of development data into the prod env. For example, if I add a new step and create a new entry in the data catalog (base) but forget to add this entry in the prod catalog, it will later be used in the prod environment by default because it is not overwritten. Instead I would like to get an error, or implicitly use a MemoryDataset; in other words: don't load conf/base. Does this make sense? 😄
Edit: Just realized that this behaviour would be possible if I just used conf/base as the prod env and always developed in a conf/dev env. However, ideally I would like to use conf/base by default and only work in prod by specifying it explicitly, to avoid mistakenly changing something there 🤔
Qiuyi Chen
11/30/2022, 6:35 PM
from pyspark.sql import DataFrame
import pandas as pd
from typing import Dict


def function_a(params: Dict, *df_lst: DataFrame):
    report = pd.DataFrame()
    for df in df_lst:
        temp = function(df, params)  # `function` is the per-DataFrame helper
        report = pd.concat([report, temp])
    return report
I can run the function like this:
function_a(params, df1, df2, df3)
But in the pipeline, how can I define the node and catalog in this situation? Here is what I did; please let me know where I went wrong.
def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_lst"],
                outputs="report",
            ),
        ]
    )

catalog = DataCatalog(
    data_sets={"df_lst": df1},
    feed_dict={"params": params},
)
I can only run the pipeline when df_lst is just one dataframe, but I want it to be something like “df_lst”: df_1, df_2, df_3, …, df_n (n>3).
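A sketch of how the variadic case is usually wired up (the dataset names below are placeholders and each would need its own catalog entry): list every dataframe as a separate node input; Kedro passes list inputs positionally, so everything after the first entry lands in *df_lst.

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    df_names = ["df_1", "df_2", "df_3"]  # extend to df_n as needed
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", *df_names],  # "params" matches the feed_dict key above
                outputs="report",
            ),
        ]
    )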
Fabian
12/01/2022, 10:59 AM