Jess Mankewitz (they/she)
05/23/2022, 11:03 PM
Eduardo
ploomber build
so once we have a few historical runs, we could flag anomalies like a sudden increase in NAs. I think this could be used for data quality, ML model tracking, etc. Thoughts?
Jess Mankewitz (they/she)
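An editor's aside: a toy sketch of the per-run check Eduardo describes, flagging a sudden jump in the NA rate against historical runs (the function names and the tolerance are made up for illustration; a real implementation would persist the history between builds):

```python
import pandas as pd

def na_rate(df):
    """Fraction of missing cells across the whole frame."""
    return float(df.isna().sum().sum()) / df.size

def flag_na_anomaly(history, current, tolerance=0.05):
    """True if the current run's NA rate exceeds the historical mean by more than tolerance."""
    baseline = sum(history) / len(history)
    return current > baseline + tolerance

history = [0.01, 0.02, 0.01]  # NA rates recorded from previous runs
current = na_rate(pd.DataFrame({"x": [1.0, None, None, 4.0]}))  # 0.5
print(flag_na_anomaly(history, current))  # → True
```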
05/27/2022, 7:45 PM
grid, I’m setting two parameters so that my output is /path/[[parameter]]-sampled.csv
but when I build, I get /path/[[parameter]]-sampled-0.csv
Jess Mankewitz (they/she)
05/27/2022, 7:46 PM
grid, so I’m having trouble programmatically selecting the correct downsampled file…am I just missing something?
Spruha Vashi
06/02/2022, 2:27 PM
Jess Mankewitz (they/she)
06/02/2022, 10:18 PM
- source: modeling/scripts/downsample_corpora.R
  name: downsample-corpora-analysis1
  product:
    nb: modeling/output/notebooks/downsample_corpora.html
    data: modeling/output/data/prepped_data/[[analysis]]-downsampled.csv
  params:
    - analysis: 'analysis1'
but I’m getting the following error:
Error: Failed to initialize NotebookRunner task with source 'modeling/scripts/downsample_corpora.R'.
Params must be initialized with a mapping, got: [{'analysis': 'analysis1'}] ('list')
What am I missing?
Jess Mankewitz (they/she)
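An editor's aside: the error message is literal here. params received a one-element list because of the leading dash, and Ploomber wants a mapping. Writing that section without the dash should fix it (a sketch of just the params block):

```yaml
params:
  analysis: 'analysis1'
```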
06/02/2022, 10:19 PM
Jess Mankewitz (they/she)
06/02/2022, 10:21 PM
Nikhil Reddy
06/03/2022, 7:33 AM
Jakub Bartczuk
06/03/2022, 8:59 PM
ploomber build --partial something.task1
Jakub Bartczuk
06/06/2022, 12:43 PM
Julien Roy
06/07/2022, 5:27 PM
Nikhil Reddy
06/08/2022, 3:01 AM
bicepcurl
06/08/2022, 9:52 AM
bicepcurl
06/08/2022, 9:52 AM
Edward Wang
06/08/2022, 10:33 AM
clients.py and changing the pipeline.yaml file. However, is there a more intuitive way? Reason being, we can read GCS parquet files easily using dd.read_parquet("gs://file/path/*.parquet") without having to set up the GCS client. I think a user would expect writing to be just as easy, using dd.to_parquet() without the client. Or is that already handled and I’m missing something?
2. I’m trying to create a mono repo with multiple data pipelines in it, with each sub-directory being a workflow. Are there any tips on how to structure such a mono repo? Happy to hear them 🙂
3. How can I fully utilise Dask’s distributed computing while using Ploomber? Does it actually work, or is Ploomber itself already distributed?
Julien Roy
06/08/2022, 4:18 PM
Jess Mankewitz (they/she)
06/10/2022, 12:53 AM
Spruha Vashi
06/10/2022, 10:26 PM
Edward Wang
06/13/2022, 10:56 AM
If the soopervisor export command had an option to use buildx to build images for other architectures, that would be great! Right now I'm not running soopervisor export but instead running the docker commands directly. This is slightly cumbersome since I'll have to look at soopervisor's source code to understand what export does under the hood. Of course, this would be solved if we were to write a CI/CD pipeline, but at the very early stages of experimenting, I think users would want to build the image locally and push it to their container registries.
Feel free to push back on this though!
Eduardo
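An editor's aside: until such an option exists, the multi-arch build Edward describes can be done with buildx directly. An illustrative invocation (image name and platform list are placeholders, and this is not a description of what soopervisor export actually runs):

```shell
# one-time: create and select a builder that supports multi-platform builds
docker buildx create --use --name multiarch
# build for both architectures and push to the registry in one step
docker buildx build --platform linux/amd64,linux/arm64 \
  -t registry.example.com/my-pipeline:latest --push .
```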
Jess Mankewitz (they/she)
06/14/2022, 6:54 PM
Jess Mankewitz (they/she)
06/14/2022, 6:55 PM
SHUBHAM AGRAWAL
06/15/2022, 9:13 AM
gaoyang liu
06/16/2022, 6:04 AM
Eduardo Blancas - Develop and deploy a Machine Learning pipeline in 30 minutes with Ploomber - YouTube
ploomber scaffold --conda --empty
to establish a new project.
Modify the pipeline.yaml.
cd demo
ploomber scaffold
But I do not get the product dict in get.py as shown in the video; instead, product = None.
Did I miss something?
gaoyang liu
06/17/2022, 1:45 PM
Spruha Vashi
06/18/2022, 6:23 AM
gaoyang liu
06/19/2022, 9:25 AM
bias_variance_decomp:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from mlxtend.evaluate import bias_variance_decomp

raw_data = fetch_california_housing()
X = pd.DataFrame(raw_data.data[:200], columns=raw_data.feature_names)
y = pd.DataFrame(raw_data.target[:200], columns=['price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

out = pd.DataFrame(columns=["MSE", 'Bias^2', "Variance"])
for min_samples_leaf in list(range(1, 11)):
    model = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
    mse, bias, variance = bias_variance_decomp(
        model,
        X_train.to_numpy(),
        y_test.to_numpy(),
        X_test.to_numpy(),
        y_test.to_numpy(),
        loss="mse"
    )
    print(mse, bias, variance)
It reports an error:
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_19900\3802776880.py in <cell line: 13>()
     13 for min_samples_leaf in list(range(1, 11)):
     14     model = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
---> 15     mse, bias, variance = bias_variance_decomp(
     16         model,
     17         X_train.to_numpy(),

C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py in bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss, num_rounds, random_seed, **fit_params)
    104
    105 for i in range(num_rounds):
--> 106     X_boot, y_boot = _draw_bootstrap_sample(rng, X_train, y_train)
    107
    108 # Keras support

C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py in _draw_bootstrap_sample(rng, X, y)
     14     sample_indices, size=sample_indices.shape[0], replace=True
     15 )
---> 16 return X[bootstrap_indices], y[bootstrap_indices]
     17
     18

IndexError: index 124 is out of bounds for axis 0 with size 40
In PyCharm I can set a breakpoint at line 16 (---> 16 return X[bootstrap_indices], y[bootstrap_indices]) to check and manipulate variables and find out what is wrong. Quite easy and intuitive.
However, in JupyterLab it is much more complicated. I did not even know how to debug it in Jupyter. The newly shipped JupyterLab debugger is slow and cannot step into imported modules (at least in my case).
Recently I found that ipdb may be a solution:
import ipdb
# put this above the entry function
ipdb.set_trace(context=8)
# then in Pdb
ipdb> b C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py:16
ipdb> c
# then I can manipulate variables.
But the process is so cumbersome. Also, ipdb seems hacky to me compared with debugging in PyCharm or VS Code.
I think I am not the only one who has trouble debugging in Jupyter. How do others debug in JupyterLab? Any suggestions? Thanks very much.
gaoyang liu
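An editor's aside on the traceback itself: per the signature shown in the trace, bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, ...), the snippet passes y_test where y_train belongs, so bootstrap indices are drawn for the 160 training rows but then index a 40-row array. A minimal numpy-only reproduction of that mismatch:

```python
import numpy as np

# _draw_bootstrap_sample draws indices based on X's length...
rng = np.random.RandomState(0)
X_train = np.zeros((160, 8))  # 160 training rows
y_wrong = np.zeros(40)        # y_test (40 rows) passed by mistake

sample_indices = np.arange(X_train.shape[0])
bootstrap_indices = rng.choice(sample_indices, size=sample_indices.shape[0], replace=True)

# ...so indexing the 40-element y with indices up to 159 raises IndexError
try:
    _ = X_train[bootstrap_indices], y_wrong[bootstrap_indices]
except IndexError as exc:
    print("IndexError:", exc)
```

So before reaching for a debugger at all, passing y_train.to_numpy() as the third argument should make the original call work.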
06/24/2022, 3:31 AM
Eduardo
"Jupyter notebooks don't scale well to requirements typical for running ML in a large-scale production environment. These requirements include secure and privacy-respecting access to large datasets, reproducibility, high performance, scalability, documentation, and observability (logging, monitoring, debugging)."
I don't get why they said notebooks are not reproducible (they could run them on a CI server on each change). Performance and scalability don't sound like problems with notebooks per se; it's more that what comes out of a notebook is an early prototype. On documentation and observability I agree: notebooks are hardly ever documented, and it's hard to have good observability on them. Thoughts?