# ask-anything
e
Hi! How big is your dataset?
s
Current ones are small (<100 GB, 50k samples), but I also want it to work with datasets that are 10-100 TB in size.
But even with the small datasets, I would love to have them processed in parallel.
e
Ploomber is more like an orchestrator than a distributed computing solution, but if you're doing simple stuff like mapping, you can implement it with the grid feature: https://docs.ploomber.io/en/latest/cookbook/grid.html. Essentially, you define a grid of parameters, and each parameter can be the index range to process; e.g., the first task processes the first 1M rows, the next one another 1M, and so on. Parquet is a great format for this type of thing. However, for more complex operations, I'd recommend going with a distributed computing framework.
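As a rough sketch of what each grid task could do (the function name, paths, and the one-row-group-per-task split are illustrative assumptions, not Ploomber's API), something like this reads only its own slice of a Parquet file:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def process_chunk(input_path: str, output_path: str, row_group: int) -> None:
    """Process one row group of a Parquet file (hypothetical per-task function).

    In a grid setup, `row_group` would come from the grid parameters so each
    generated task handles a different slice of the data.
    """
    pf = pq.ParquetFile(input_path)
    # Read only this task's row group instead of loading the whole file.
    table = pf.read_row_group(row_group)
    df = table.to_pandas()

    # ... apply the actual per-sample transformation here ...

    pq.write_table(pa.Table.from_pandas(df), output_path)
```

The grid would then list one row-group index per task, so the tasks stay independent and can run in parallel if you use a parallel executor.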
s
Thanks, this makes sense. Maybe I could make this type of processing a single step in the Ploomber pipeline.
e
Yeah, that might work. Feel free to ask any other questions!
👍 1