# troubleshooting
a
Hi team, is there a way to keep data order in the keyBy operator? For example, if it reads data from an S3 file, I hope the keyBy operator can keep the data order of that specific file.
d
tl;dr: short answer, no. When you `keyBy`, you are grouping data based on the key and sending it to downstream operators. If the parallelism of the downstream operator is >1, you will lose global ordering (the order of the file in S3) but retain it for each key/group. Retaining order across parallel processes does not scale well.
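To make the per-key versus global distinction concrete, here is a minimal plain-Java simulation; it does not use Flink. The class `KeyByOrderSketch`, the `Event` type, and the `partition` method are hypothetical names for illustration. It also assumes a simple `hashCode` modulo for routing, whereas Flink's real assignment goes through key groups, and it assumes all records arrive from a single upstream reader.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation (no Flink dependency) of hash partitioning by key.
final class KeyByOrderSketch {

    // Hypothetical record type: a key plus its position in the source file.
    static final class Event {
        final String key;
        final int seq;

        Event(String key, int seq) {
            this.key = key;
            this.seq = seq;
        }

        @Override
        public String toString() {
            return key + ":" + seq;
        }
    }

    // Route each event to one of `parallelism` partitions by key hash.
    // Assumption: plain hashCode modulo, a simplification of Flink's key groups.
    static List<List<Event>> partition(List<Event> input, int parallelism) {
        List<List<Event>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Event e : input) {
            int target = Math.floorMod(e.key.hashCode(), parallelism);
            // Appending keeps arrival order within each partition.
            partitions.get(target).add(e);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Event> file = new ArrayList<>();
        file.add(new Event("a", 0));
        file.add(new Event("b", 1));
        file.add(new Event("a", 2));
        file.add(new Event("b", 3));

        List<List<Event>> partitions = partition(file, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("subtask " + i + " -> " + partitions.get(i));
        }
    }
}
```

With two partitions, this prints `subtask 0 -> [b:1, b:3]` and `subtask 1 -> [a:0, a:2]`. Each key still sees its records in file order, but nothing orders subtask 0 against subtask 1, so the original interleaving `a:0, b:1, a:2, b:3` cannot be recovered downstream without extra work.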
a
how to retain order for each key/group?
d
use event timestamps and watermarks carefully
avoid shuffles and rearrangements; rebalance, rescale, or broadcast can cause these
set parallelism to 1 for all stages that need it
use debugging and tracing tools to monitor the ordering; switch these on to have complete observability
use logical event time ordering and watermarks carefully
also consider if absolute ordering is a hard requirement for your solution
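One way to read the event-time advice above is to buffer events per key and release them in timestamp order only once the watermark has passed them. The sketch below simulates that idea in plain Java for a single key; it does not use Flink. `EventTimeSorterSketch`, `add`, and `advanceWatermark` are hypothetical names, and dropping late events is an assumed policy. In a real Flink job, similar logic would typically live in a `KeyedProcessFunction` using keyed state and event-time timers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java simulation (no Flink dependency) of watermark-driven reordering for one key.
final class EventTimeSorterSketch {

    // Hypothetical event: timestamp, payload, and an arrival counter used to break ties.
    static final class Event {
        final long timestamp;
        final String payload;
        final long arrival;

        Event(long timestamp, String payload, long arrival) {
            this.timestamp = timestamp;
            this.payload = payload;
            this.arrival = arrival;
        }
    }

    // Min-heap ordered by timestamp, then by arrival order for equal timestamps.
    private final PriorityQueue<Event> buffer = new PriorityQueue<>(
            Comparator.comparingLong((Event e) -> e.timestamp)
                    .thenComparingLong(e -> e.arrival));
    private long watermark = Long.MIN_VALUE;
    private long arrivals = 0;

    // Buffer an event; returns false if it is already behind the watermark.
    // Assumption: late events are dropped rather than emitted out of order.
    boolean add(long timestamp, String payload) {
        if (timestamp <= watermark) {
            return false;
        }
        buffer.add(new Event(timestamp, payload, arrivals++));
        return true;
    }

    // Advance the watermark (it never moves backwards) and release every
    // buffered event with timestamp <= watermark, in timestamp order.
    List<String> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<String> out = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= watermark) {
            out.add(buffer.poll().payload);
        }
        return out;
    }
}
```

For example, adding events with timestamps 3, 1, 2 and then advancing the watermark to 2 releases the events stamped 1 and 2 in that order, while the event stamped 3 stays buffered until a later watermark. The trade-off is latency and memory: output is delayed by however far the watermark lags, which is one reason to check whether absolute ordering is really required.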