We can’t really tell without knowing the kind of operations you perform on that data, how often you checkpoint, what kind of checkpoint you use, where the checkpoint is stored, etc. Generally, keyBy is used to shard (or split across task slots) incoming data - so, that could be a blind approach for scale.