# ask-anything
e
Hi! How big is your dataset?
s
Current ones are small (<100 GB, 50k samples), but I also want it to work with datasets that are 10-100 TB in size.
But even with the small datasets, I would love to have them processed in parallel.
e
Ploomber is more like an orchestrator than a distributed computing solution, but if you're doing simple stuff like mapping, you can implement it with the grid feature: https://docs.ploomber.io/en/latest/cookbook/grid.html. Essentially, you define a grid of parameters, and each parameter can be the index range to process; e.g., the first task processes the first 1M rows, the next one another 1M, and so on. Parquet is a great format for this type of thing. However, for more complex operations, I'd recommend going with a distributed computing framework.
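As a rough sketch of what each grid task could do (the function name, paths, and the one-row-group-per-task split are illustrative assumptions, not Ploomber's API), something like this reads only its own slice of a Parquet file:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def process_chunk(input_path: str, output_path: str, row_group: int) -> None:
    """Process one row group of a Parquet file (hypothetical per-task function).

    In a grid setup, `row_group` would come from the grid parameters so each
    generated task handles a different slice of the data.
    """
    pf = pq.ParquetFile(input_path)
    # Read only this task's row group instead of loading the whole file.
    table = pf.read_row_group(row_group)
    df = table.to_pandas()

    # ... apply the actual per-sample transformation here ...

    pq.write_table(pa.Table.from_pandas(df), output_path)
```

The grid would then list one row-group index per task, so the tasks stay independent and can run in parallel if you use a parallel executor.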
s
Thanks, this makes sense. Maybe I could make this type of processing a single step in the Ploomber pipeline.
e
Yeah, that might work. Feel free to ask any other questions!
👍 1