In the offline flows configuration, I'm using a sc...
# troubleshooting
s
In the offline flows configuration, I'm using a schedule to run the job; right now in my local test env it's set to run every minute. I'm interested in hearing from anyone with production experience running that job about how often they run it. The docs say "frequently is better, as extra tasks will not be scheduled unless required". We are planning on ingesting ~30GB of data per day, resulting in about 100 segments. Our time window for rollup is 1d, so segments older than 1d will be rolled up and moved to the offline table. So I'm just curious how often we should run that rollup task.
m
Any reason to not use realtime streaming, instead of frequent push?
s
We are using realtime streaming into the realtime table, and we want to use the rollup to greatly reduce the size of the data after a day or two. We love this concept, I'm just curious how often people run that realtime-to-offline job.
m
Oh I see. I think 30min is too frequent. You can try 6hr to 1 day. You may also increase the segment size from 300MB to 1GB to reduce the number of segments (this might need a bit more memory on the minion side).
s
So our bucket time period is 1d, so I think we're good there. I'm talking about the schedule portion here:
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "1d",
      "roundBucketTimePeriod": "1d",
      "mergeType": "rollup",
      "event_count.aggregationType": "sum",
      "maxNumRecordsPerSegment": "10000",
      "schedule": "0 * * * * ?"
    }
  }
}
that's my local setup
m
The number of records seems too small, or does it give you 300MB segments already?
s
Oh sorry, that was for my local setup; it will be more in the 2M range for prod.
In prod we are targeting 300MB segments.
So my focus is on the schedule section. Locally it's at every minute now, just curious what that should be in prod. I'm thinking every 4h maybe.
Or even daily
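For reference, a sketch of what the 4h option could look like, assuming the schedule field takes a Quartz-style cron expression (seconds field first), as the local config above suggests. Only the schedule line would change; the rest of the task config stays as it is:

```json
{
  "schedule": "0 0 0/4 * * ?"
}
```

Under the same assumption, a daily run at midnight would be "0 0 0 * * ?".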
m
It is a bit of a micro-optimization, unless you see sizable improvements in choosing 4h or 1 day.
n
On our side, in production, we set it to hourly, even though the bucket is 1d. It's better to run it more often than the bucket period. With a 1d bucket and an hourly schedule, it'll effectively run once a day (the other 23 runs will be no-ops). But this is safer, because if you hit an exception or transient error, you have all those other runs to recover.
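A minimal sketch of the hourly setup described here, reusing the task config posted earlier in the thread. The cron value assumes Quartz syntax (seconds first), and the maxNumRecordsPerSegment value is only an illustrative stand-in for the "2M range" mentioned for prod, not a tuned number:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "1d",
        "roundBucketTimePeriod": "1d",
        "mergeType": "rollup",
        "event_count.aggregationType": "sum",
        "maxNumRecordsPerSegment": "2000000",
        "schedule": "0 0 * * * ?"
      }
    }
  }
}
```

With this schedule the task is triggered at the top of every hour, and the 1d bucket still determines how much data each successful run moves.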
s
Perfect, thanks!