Can I use upsert and RealtimeToOfflineSegmentsTask...
# general
a
Can I use upsert and RealtimeToOfflineSegmentsTask at the same time?
k
No. Upsert can only be used with realtime-only tables.
a
I mean, can I configure both upsert and RealtimeToOfflineSegmentsTask in the realtime table config?
k
No.
a
Something like the following doesn’t work. We got an error.
I followed a guide from this doc, but it doesn’t work now. Was this method supported in earlier Pinot versions? https://medium.com/@msoni6226/handling-of-duplicate-data-with-apache-pinot-78d35e7c8465
@Kishore G
m
@Alice Upsert feature works on realtime-only tables. RealtimeToOfflineTask implies there is an offline component, and hence upsert functionality won’t work. If you are looking for dedup, you got the pointer to it in the other thread.
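To make the rule above concrete, here is a minimal sketch of a pre-upload check for this combination. It is not Pinot's actual validation code; the helper name `conflicts_with_upsert` and the `orders` config fragment are hypothetical, and the config keys (`upsertConfig.mode`, `task.taskTypeConfigsMap`) are assumed to follow the usual table config layout.

```python
from typing import Any, Dict

R2O_TASK = "RealtimeToOfflineSegmentsTask"


def conflicts_with_upsert(table_config: Dict[str, Any]) -> bool:
    """Return True when a config enables upsert and also configures the R2O task."""
    # A missing upsertConfig is treated the same as mode NONE (assumption).
    upsert_mode = (table_config.get("upsertConfig") or {}).get("mode", "NONE")
    task_map = (table_config.get("task") or {}).get("taskTypeConfigsMap") or {}
    return upsert_mode.upper() != "NONE" and R2O_TASK in task_map


# Hypothetical realtime table config fragment, for illustration only.
orders = {
    "tableName": "orders",
    "tableType": "REALTIME",
    "upsertConfig": {"mode": "FULL"},
    "task": {"taskTypeConfigsMap": {R2O_TASK: {"bucketTimePeriod": "1d"}}},
}

if conflicts_with_upsert(orders):
    print("upsert and RealtimeToOfflineSegmentsTask are both configured")
```

Running a check like this before uploading a config would flag the combination locally instead of waiting for the controller to reject it.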
a
Actually, I introduced RealtimeToOfflineSegmentsTask to reduce the number of segments, and at the same time set mergeType to dedup to achieve deduplication in the offline table. I depend on the upsert/dedup feature to achieve deduplication in the realtime table. I will try to figure out another way to achieve deduplication in both realtime and offline tables.
@Mayank, is it supported to enable both dedup and RealtimeToOfflineSegmentsTask in the realtime table config in Pinot 0.11+?
I’m asking because I’m using Pinot 0.10 now, and it takes some time to set up a new Pinot cluster on 0.11+. 😅
It seems this configuration could work. No error was returned when this table config was uploaded.
n
With dedup on a realtime table, the duplicate rows are not stored in Pinot anyway, so you don’t need to set dedup again in the R2O task.
@saurabh dubey is this (dedup + r2o) combination okay? We disallow upsert + r2o because the offline side keeps no metadata and the results would all be incorrect. But I can’t think of any reason why dedup + r2o wouldn’t work.
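A toy model of why upsert + r2o gives incorrect results, under simplifying assumptions: the "realtime" view keeps only the latest row per primary key, while the "offline" view has no key metadata and returns every stored row. The names `upsert_view`, `plain_view`, and the sample rows are hypothetical, and this is not how Pinot implements either table type internally.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, str]  # (primary key, timestamp, status)


def upsert_view(rows: List[Row]) -> List[Row]:
    """Keep only the latest row per primary key, as an upsert table would present it."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key, ts, _ = row
        if key not in latest or ts >= latest[key][1]:
            latest[key] = row
    return sorted(latest.values())


def plain_view(rows: List[Row]) -> List[Row]:
    """No key metadata: every stored row is visible to queries."""
    return sorted(rows)


rows = [
    ("order-1", 1, "placed"),
    ("order-1", 2, "shipped"),
    ("order-2", 1, "placed"),
]

print(len(upsert_view(rows)))  # rows visible while the data is in the realtime table
print(len(plain_view(rows)))   # rows visible once the same data has moved offline
```

In this model a count over the same data changes once the segments move to the offline side, which is the kind of incorrect result described above.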
s
@Alice Enabling dedup on a realtime table (assuming the stream is partitioned by the PK and all other dedup requirements are met) prohibits ingestion of more than one row with the same primary key. Hence all the segments processed by the r2o job will already effectively have unique rows. This configuration should work. However, if the offline table isn't exclusively populated by the r2o job (that is, you're using r2o as well as pushing data to the offline table via some other route), the offline table may end up with duplicate rows, since dedup as a feature isn't applicable to offline tables.
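A rough sketch of what such a realtime table config fragment might look like, built as a Python dict and printed as JSON. The table name `events` and the period values are hypothetical, the key names are assumed to match the Pinot 0.11+ table config, and the other dedup requirements (primary key columns in the schema, stream partitioning, and so on) are not shown. Since dedup already drops duplicates at ingestion, `mergeType` is set to `concat` here rather than `dedup`.

```python
import json

R2O_TASK = "RealtimeToOfflineSegmentsTask"

# Hypothetical fragment: only the parts relevant to dedup + r2o are included.
realtime_table_config = {
    "tableName": "events",
    "tableType": "REALTIME",
    # Dedup is enabled on the realtime side; no upsertConfig is present.
    "dedupConfig": {"dedupEnabled": True, "hashFunction": "NONE"},
    "task": {
        "taskTypeConfigsMap": {
            R2O_TASK: {
                "bucketTimePeriod": "1d",
                "bufferTimePeriod": "1d",
                # Segments already hold unique rows, so plain concatenation is assumed to be enough.
                "mergeType": "concat",
            }
        }
    },
}

print(json.dumps(realtime_table_config, indent=2))
```

This only holds if the r2o job is the sole writer to the offline table, as noted above.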
a
Excellent. Thanks.👍🌷
I’ll try the dedup feature on the realtime table, since it supports the star-tree index at the same time. Is the dedup feature ready for prod environments?
s
Yes it is 👍
p
@Neha Pawar I see Startree has a recipe for this, assume this is not a recommended approach https://dev.startree.ai/docs/pinot/recipes/upserts-real-time-offline-job. Think there is some config to turn off the table validation for upsert and RealtimeToOffline task, then some custom job to achieve the upsert in offline.
n
hmm, the recipe also calls out that doing this will make you lose the upsert logic, and that you have to backfill your segments
The main thing to keep in mind when combining these features is that upsert functionality only applies to real-time tables.

As soon as those segments are moved to an offline table, the upsert logic is no longer applied at query time. We will need to backfill the offline segments created by the real-time to offline job to achieve upsert-like queries.
@Mark Needham assuming you added this, do you think we should even have this recipe published? we block this at validations itself now
m
@Neha Pawar so are you saying that what's described in that recipe wouldn't work now? It seems I used a 0.10 snapshot version to get it to work
ah yeh, I worked around the validation:
curl -X POST "http://localhost:9000/tables?validationTypesToSkip=All" \
  --data @config/orders_table.json