Slackbot
06/08/2022, 10:33 AM

Eduardo
dd.read_parquet("gs://file/path/*.parquet")
parquet approach. The benefit of using Ploomber's client is that any product generated by your pipeline gets uploaded to the bucket. Uploading helps in two ways: 1) it lets you keep track of the artifacts generated on each run, and 2) if you run Ploomber distributively (e.g., on Kubernetes), each task executes in a separate pod, and configuring a file client will automatically upload/download products as needed so the whole pipeline executes end to end.
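In case a concrete example helps, here's a minimal sketch of what that client setup could look like (the bucket name, the "outputs" prefix, and the clients.py module name are just placeholders for illustration):

    # clients.py -- returns the client Ploomber uses to upload/download File products
    from ploomber.clients import GCloudStorageClient

    def get_client():
        # 'my-bucket' and the 'outputs' prefix are made-up names
        return GCloudStorageClient(bucket_name='my-bucket', parent='outputs')

    # then, in pipeline.yaml, point File products to this function:
    # clients:
    #     File: clients.get_client

with something like that in place, each product gets copied to the bucket after its task finishes and downloaded on demand when another pod needs it.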
2. If the pipelines are fully isolated (they don't have source code in common), then you can create one folder per workflow and have one pipeline.yaml per folder. But I guess at some point you'll have components that you may want to share, for example some utility functions. In that case, I'd recommend keeping the common code in a directory (say my_package) and creating a Python package (by adding a setup.py file); then you'd be able to import any of those functions or classes in any of your workflows (there's a small sketch after point 3).
3. It depends on how you're using Ploomber. You can run Ploomber locally or in a distributed way (we support Kubernetes, AWS, Airflow, and SLURM as backends); the benefit is that you can run any Python library. We also have a free tier of Ploomber Cloud that works the same way: each task is executed in a container and you can request custom resources for each one (e.g., run some script with GPUs). However, if you already have dask code, you can use Ploomber to orchestrate multiple notebooks/scripts/functions where each component runs dask code (see the sketch below). In this second use case, Ploomber helps you better structure your projects, gives you incremental builds, etc., but dask handles the actual computations.
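On point 2, this is roughly the layout I mean (the package name is just an example):

    # setup.py -- placed next to the my_package/ directory with the shared code
    from setuptools import setup, find_packages

    setup(
        name='my_package',
        version='0.1.0',
        packages=find_packages(),
    )

after running pip install --editable . once, any workflow can do from my_package import some_utility, regardless of which folder its pipeline.yaml lives in.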
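And on point 3, a minimal sketch of a Ploomber function task whose body is plain dask code (the function name, the 'clean' upstream task, and the column name are made up for illustration):

    # tasks.py -- Ploomber passes upstream products and the task's product as arguments
    import dask.dataframe as dd

    def aggregate(upstream, product):
        # 'clean' is a hypothetical upstream task that wrote parquet files
        df = dd.read_parquet(str(upstream['clean']))
        # dask does the actual computation; Ploomber just orchestrates the task
        result = df.groupby('some_column').size().compute()
        result.to_csv(str(product))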
does this help? please keep the feedback coming, it helps us a lot!

Julien Roy
06/08/2022, 4:21 PM

Edward Wang
06/09/2022, 2:34 AM

Eduardo

Edward Wang
06/09/2022, 8:30 AM