# ask-anything
e
thanks so much for your feedback!

1. You have a good point, I think we should update our guide based on your feedback. If you're reading external files (that are not generated by the pipeline), it makes sense to use the `dd.read_parquet("gs://file/path/*.parquet")` approach. The benefit of using Ploomber's client is that any product generated by your pipeline gets uploaded to the bucket. Uploading helps in two ways: 1) it allows you to keep track of generated artifacts on each run, and 2) if you run Ploomber distributively (e.g. on Kubernetes), each task executes in a separate pod; configuring a File client will automatically upload/download products as needed so the whole pipeline executes end to end.
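for reference, a rough sketch of configuring a File client (the bucket and names are placeholders, check the docs for the exact options):

```python
# clients.py (at the project root, next to pipeline.yaml)
#
# then reference it from pipeline.yaml like:
#
#   clients:
#       File: clients.get_client
#
from ploomber.clients import GCloudStorageClient

def get_client():
    # "my-bucket" and "my-pipeline" are placeholders: the GCS bucket
    # to upload products to, and a parent "folder" to store them under
    return GCloudStorageClient(bucket_name="my-bucket", parent="my-pipeline")
```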
2. If the pipelines are fully isolated (they don't have source code in common), you can create one folder per workflow and have one `pipeline.yaml` per folder. But I guess at some point you'll have components you may wanna share, for example some utility functions. In that case, I'd recommend keeping the common code in a directory (say `my_package`) and creating a Python package (by adding a `setup.py` file); then you'd be able to import any functions or classes in any of your workflows.
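for example, a minimal sketch of that layout (the names are just illustrative):

```python
# setup.py (at the repo root)
#
# layout:
#   my_package/
#       __init__.py
#       utils.py           <- shared utility functions
#   workflow_a/pipeline.yaml
#   workflow_b/pipeline.yaml
from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(),
)
```

after `pip install --editable .`, any workflow can do `from my_package import utils`.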
3. It depends on how you're using Ploomber. You can run Ploomber locally or in a distributed way (we support Kubernetes, AWS, Airflow, and SLURM as backends); the benefit is that you can run any Python library. We also have a free tier of Ploomber Cloud that works the same way: each task is executed in a container and you can request custom resources for each (e.g. run some script with GPUs). However, if you already have Dask code, you can use Ploomber to orchestrate multiple notebooks/scripts/functions where each component runs Dask code. In this second use case, Ploomber helps you better structure your projects, provides incremental builds, etc., but Dask handles the actual computations.

does this help? please keep the feedback coming, it helps us a lot!
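ps: to illustrate the Ploomber-orchestrates/Dask-computes setup from point 3, a rough sketch of a function-based task (the function name, the `raw` upstream, and the column are made up):

```python
# tasks.py: Ploomber orchestrates this task; Dask does the computation
import dask.dataframe as dd

def clean_data(product, upstream):
    # read the parquet files produced by an upstream task named "raw"
    df = dd.read_parquet(str(upstream["raw"]))
    # lazy Dask operations; nothing runs until we materialize the result
    cleaned = df[df["value"] > 0]
    # write the product so Ploomber can track it (and upload it if a
    # File client is configured); requires pyarrow for parquet output
    cleaned.compute().to_parquet(str(product))
```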
j
Oh nice, thanks for this question Edward. My team needs me to implement something close to what you want!
e
Thanks guys!! I'd just like to probe into point 3 a little bit more: I'm planning to deploy Ploomber into my own GKE cluster using Soopervisor. Since the underlying code uses Dask, I also want to utilise Dask's distributed cluster. However, since Ploomber is already sort of distributing the tasks across the GKE cluster (by having a pod for each task), is it still possible to redistribute the work at the code level again using Dask's distributed cluster (pardon me, I might be speaking gibberish 😆)? Because if it's not possible, the Dask code will only be running on a single machine (or a node), and we won't be maximising the usage of Dask in this case.
e
yeah, so if you deploy Ploomber to GKE via Soopervisor and your notebooks have Dask code, then Dask is doing the computations and Ploomber is orchestrating the multiple parts. I'm not an expert on Dask, but if you're connecting to the cluster inside your Ploomber pipeline, then it's fine, since you'll be fully utilizing the cluster. Something we want to add to Soopervisor is the ability to run a full pipeline in a single pod; we've seen this use case a few times before, where distributing across pods isn't necessary, and that applies to your use case, since Dask is doing the computations.
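in case it helps, a rough sketch of what "connecting to the cluster inside your pipeline" could look like (the scheduler address assumes you deployed a Dask scheduler as a service in the same GKE cluster, e.g. via the Dask helm chart; adjust to your setup):

```python
from dask.distributed import Client
import dask.dataframe as dd

# placeholder address: a Dask scheduler service reachable from the task's pod
client = Client("tcp://dask-scheduler:8786")

# once connected, Dask computations run on the whole cluster,
# not just on the pod executing this task (reading from GCS needs gcsfs)
df = dd.read_parquet("gs://my-bucket/data/*.parquet")
print(df["value"].mean().compute())
```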
I think as you make progress with the deployment, it'll be easier to assist you in solving the challenges that arise
e
yea sure! thanks for the clarification btw! 🙌
👍 1