# ask-anything
e
thanks so much for your feedback!

1. You have a good point, I think we should update our guide based on your feedback. If you're reading external files (that are not generated by the pipeline), it makes sense to use the `dd.read_parquet("gs://file/path/*.parquet")` approach. The benefit of using Ploomber's client is that any product generated by your pipeline gets uploaded to the bucket. Uploading helps in two ways: 1) it allows you to keep track of generated artifacts on each run, and 2) if you run Ploomber distributively (e.g. on Kubernetes), each task executes in a separate pod; configuring a File client will automatically upload/download products as needed so the whole pipeline executes end to end.
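for reference, a rough sketch of configuring a File client (the bucket and names are placeholders, check the docs for the exact options):

```python
# clients.py (at the project root, next to pipeline.yaml)
#
# then reference it from pipeline.yaml like:
#
#   clients:
#       File: clients.get_client
#
from ploomber.clients import GCloudStorageClient

def get_client():
    # "my-bucket" and "my-pipeline" are placeholders: the GCS bucket
    # to upload products to, and a parent "folder" to store them under
    return GCloudStorageClient(bucket_name="my-bucket", parent="my-pipeline")
```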
2. If the pipelines are fully isolated (they don't have source code in common), you can create one folder per workflow and have one `pipeline.yaml` per folder. But I guess at some point you'll have components you may wanna share, for example some utility functions. In that case, I'd recommend keeping the common code in a directory (say `my_package`) and creating a Python package (by adding a `setup.py` file); then you'd be able to import any functions or classes in any of your workflows.
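for example, a minimal sketch of that layout (the names are just illustrative):

```python
# setup.py (at the repo root)
#
# layout:
#   my_package/
#       __init__.py
#       utils.py           <- shared utility functions
#   workflow_a/pipeline.yaml
#   workflow_b/pipeline.yaml
from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(),
)
```

after `pip install --editable .`, any workflow can do `from my_package import utils`.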
3. It depends on how you're using Ploomber. You can run Ploomber locally or in a distributed way (we support Kubernetes, AWS, Airflow, and SLURM as backends); the benefit is that you can run any Python library. We also have a free tier of Ploomber Cloud that works the same way: each task is executed in a container and you can request custom resources for each (e.g. run some script with GPUs). However, if you already have Dask code, you can use Ploomber to orchestrate multiple notebooks/scripts/functions where each component runs Dask code. In this second use case, Ploomber helps you better structure your projects, provides incremental builds, etc., but Dask handles the actual computations.

does this help? please keep the feedback coming, it helps us a lot!
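ps: to illustrate the Ploomber-orchestrates/Dask-computes setup from point 3, a rough sketch of a function-based task (the function name, the `raw` upstream, and the column are made up):

```python
# tasks.py: Ploomber orchestrates this task; Dask does the computation
import dask.dataframe as dd

def clean_data(product, upstream):
    # read the parquet files produced by an upstream task named "raw"
    df = dd.read_parquet(str(upstream["raw"]))
    # lazy Dask operations; nothing runs until we materialize the result
    cleaned = df[df["value"] > 0]
    # write the product so Ploomber can track it (and upload it if a
    # File client is configured); requires pyarrow for parquet output
    cleaned.compute().to_parquet(str(product))
```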
j
Oh nice, thanks for this question Edward. My team needs me to implement something close to what you want!
e
Thanks guys!! I'd just like to probe into point 3 a little bit more: I'm planning to deploy Ploomber into my own GKE cluster using Soopervisor. Since the underlying code uses Dask, I also want to utilise Dask's distributed cluster. However, since Ploomber is already sort of distributing the tasks across the GKE cluster (by having a pod for each task), is it still possible to redistribute the work at the code level again using Dask's distributed cluster (pardon me, I might be speaking gibberish 😆)? Because if it's not possible, the Dask code will only be running on a single machine (or a node), and we won't be maximising the usage of Dask in this case.
e
yeah, so if you deploy Ploomber to GKE via Soopervisor and your notebooks have Dask code, then Dask is doing the computations and Ploomber is orchestrating the multiple parts. I'm not an expert on Dask, but if you're connecting to the cluster inside your Ploomber pipeline, then it's fine, since you'll be fully utilizing the cluster. Something we want to add to Soopervisor is the ability to run a full pipeline in a single pod; we've seen this use case a few times before, where distributing across pods isn't necessary, and that applies to your use case, since Dask is doing the computations.
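in case it helps, a rough sketch of what "connecting to the cluster inside your pipeline" could look like (the scheduler address assumes you deployed a Dask scheduler as a service in the same GKE cluster, e.g. via the Dask helm chart; adjust to your setup):

```python
from dask.distributed import Client
import dask.dataframe as dd

# placeholder address: a Dask scheduler service reachable from the task's pod
client = Client("tcp://dask-scheduler:8786")

# once connected, Dask computations run on the whole cluster,
# not just on the pod executing this task (reading from GCS needs gcsfs)
df = dd.read_parquet("gs://my-bucket/data/*.parquet")
print(df["value"].mean().compute())
```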
I think as you make progress with the deployment, it'll be easier to assist you in solving the challenges that arise
e
yea sure! thanks for the clarification btw! 🙌
👍 1