I'm looking for a way to delete (a row) and change...
# general
c
I'm looking for a way to delete a row and change data in Pinot. For example, if a member withdraws, all data of the member must be deleted immediately.
- I can replace the segments for a full period in an offline table.
- In realtime tables I would use UPSERT mode. I can upsert null values, but then I can't use the star-tree index.
Can I delete without using UPSERT mode? https://docs.pinot.apache.org/basics/data-import/upsert
Is there a way to delete a row from a segment of an offline + realtime table in Pinot?
m
Hey - you can't delete individual rows. As you said, you can only delete at the segment level for both offline and real-time tables.
r
@User I'm not a GDPR expert, but if you get a GDPR request, is it enough to make it impossible to retrieve data for that user?
because if that's enough (I know some frameworks use strategies like encrypting the data and throwing away the encryption key when the request comes in), we could easily build an indexing feature to support this
essentially we could apply a mask to the data meaning "don't read this row", which would keep the offline segment format immutable
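A minimal sketch of the masking idea described above, assuming a segment is just an immutable batch of rows. This is not an existing Pinot feature or API; `MaskedSegment`, `hide_member`, `scan`, and the `memberId` column are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MaskedSegment:
    """Hypothetical model: immutable rows plus a set of hidden row ids."""
    rows: tuple                               # never modified after creation
    hidden: set = field(default_factory=set)  # row ids that readers must skip

    def hide_member(self, member_id):
        # Record which row ids to skip; the stored rows stay untouched.
        for doc_id, row in enumerate(self.rows):
            if row["memberId"] == member_id:
                self.hidden.add(doc_id)

    def scan(self):
        # Readers only ever see rows that are not masked.
        return [row for doc_id, row in enumerate(self.rows)
                if doc_id not in self.hidden]


segment = MaskedSegment(rows=(
    {"memberId": "a", "clicks": 3},
    {"memberId": "b", "clicks": 5},
    {"memberId": "a", "clicks": 1},
))
segment.hide_member("a")
visible = segment.scan()
```

The masked rows are still physically present in `segment.rows`, which is exactly why the question of whether concealment is enough for GDPR matters.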
c
@User Thanks for the answer. How do I replace the segments (consuming and completed) of a realtime table?
@User Thanks for the answer. Are encryption keys managed per user? I'd like to know more. Do you have any documentation on this?
r
sorry I was asking about your requirements, not describing a feature
m
@User The minion purge task can be used for GDPR purging
c
@User I have to comply with the GDPR, and the best way is to delete the individual rows.
m
Yes minion purge task for that @User
c
@User Thanks so much! 😍 https://github.com/apache/pinot/blob/c6ad763a5013825810a0af5448bb4a1d8be0e230/pinot-core/src/test/java/org/apache/pinot/core/minion/SegmentPurgerTest.java#L127
I had looked at Pinot's PurgeTask, but I didn't mention it because I wanted to see if there was any other way. Since it seems PurgeTask is the only way, I have a few questions about it.
Q1) Can PurgeTask also delete individual rows in the committed segment (not concealing) of the realtime table?
Q2) PurgeTask does not seem to delete individual rows by randomly accessing segment files. If PurgeTask downloads, regenerates, and uploads segment files, how is that different from an ingestion job? I am trying to understand this difference. We would prefer an ingestion job over using Minion and implementing task code, because developing an ingestion job is more familiar to us.
Q3) We serve large amounts of data and large numbers of segments. If PurgeTask works by downloading and regenerating, then regenerating and reloading segments will likely affect the cluster or services, regardless of whether PurgeTask or an ingestion job is used. How will the cluster or service be affected?
m
1. Purge task is for offline tables
2. It is smarter: it avoids regenerating a segment if nothing changed. It also takes away the burden of maintaining another ingestion pipeline. But essentially it is the same
3. Download/upload of data should not impact cluster performance. How much data are we talking about?
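The download, regenerate, and upload flow discussed above, including the "skip the segment if nothing changed" behavior, can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not Pinot's PurgeTask code: segments are modeled as in-memory lists of rows, and `purge_segment`, `purge_table`, the segment names, and the `memberId` column are all hypothetical.

```python
def purge_segment(rows, should_purge):
    """Return the rebuilt row list, or None if no row matched."""
    kept = [row for row in rows if not should_purge(row)]
    if len(kept) == len(rows):
        return None  # nothing changed: skip the rebuild and the upload
    return kept


def purge_table(segments, should_purge):
    """Rewrite only the segments that contain purgeable rows."""
    replaced = []
    for name, rows in list(segments.items()):
        rebuilt = purge_segment(rows, should_purge)
        if rebuilt is not None:
            segments[name] = rebuilt  # stands in for "upload and replace"
            replaced.append(name)
    return replaced


withdrawn = {"m42"}  # hypothetical ids of members who asked for deletion
segments = {
    "events_2022-03-01": [{"memberId": "m1"}, {"memberId": "m42"}],
    "events_2022-03-02": [{"memberId": "m7"}],
}
replaced = purge_table(segments, lambda row: row["memberId"] in withdrawn)
```

Only the first segment is rewritten here; the second is left alone because none of its rows match, which is the saving over an ingestion job that rebuilds every segment in the period.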
c
1. I'd like to check this one more time. Isn't the committed segment here a segment of the realtime table? https://apache-pinot.slack.com/archives/CDRCA57FC/p1648678751360119?thread_ts=1648225591.865729&cid=CDRCA57FC
3. We have a couple of cases. One is about 500GB, based on the Elasticsearch index size. The other is years of data stored in daily segments.
@User I'm sorry, 'committed segment' is a bit confusing. Does committed segment mean only segments of the OFFLINE table? Isn't the segment produced when a consuming segment of the REALTIME table is flushed also a committed segment?
m
A consuming segment is one which is still open and ingesting records. When it is sealed, it is committed and no longer open for adding more rows through real-time ingestion (in a real-time table).
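The distinction can be pictured with a small sketch. The class, method names, and the `CONSUMING` / `COMMITTED` labels below are illustrative only and are not Pinot's internal state names or API.

```python
class RealtimeSegment:
    """Hypothetical model of a real-time segment's two phases."""

    def __init__(self):
        self.rows = []
        self.status = "CONSUMING"  # open: still ingesting records

    def ingest(self, row):
        if self.status != "CONSUMING":
            raise RuntimeError("segment is committed; no more rows can be added")
        self.rows.append(row)

    def commit(self):
        # Sealing the segment: from now on it only serves reads.
        self.status = "COMMITTED"


segment = RealtimeSegment()
segment.ingest({"memberId": "m1"})
segment.commit()

try:
    segment.ingest({"memberId": "m2"})
    rejected = False
except RuntimeError:
    rejected = True
```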