Hi, I'm looking into Minion's GDPR support.
I read in the documentation that the Minion framework can be used to meet GDPR requirements. However, the detailed description is "coming soon." I'm confused. Is the ability to use the Minion framework to delete records under certain conditions in the background not yet available? Or is it just that the documentation hasn't been written yet?
In addition, I have some questions about audit, authorization, and DR.
1. Audit at the query level. I need to know not only the table config and schema change log, but also who requested which queries and when (including target tables and conditions). Does Pinot offer auditing? Or is it possible to use Minion to monitor queries in the background and log them?
2. Is Pinot planning to provide authentication and authorization modules? Druid provides a built-in Kerberos authenticator and provides authorization through the Ranger extension. Does Pinot have any similar plans?
3. I want to configure replication between two data centers (not in the cloud). Ideally, if data center 1 fails, we fail over to data center 2, and fail back once data center 1 is healthy again. Suppose I have configured deep storage (HDFS) and a Pinot cluster on k8s in each of the two data centers. Deep storage replication is possible, but what happens to real-time data? I understand that real-time data is held in memory and periodically flushed to disk as segments. If a cluster goes down, will real-time data that has not yet been flushed be lost? I'm not sure how to configure DR on Pinot. Is there an approach you would recommend?
I'm in the process of getting to know Pinot. Thanks in advance for the help.
12/14/2020, 10:45 AM
I think LinkedIn is already using it to delete records in the background for GDPR. @Subbu Subramaniam @Mayank may have more information.
For your questions:
1. For the query part, Pinot logs the query context in the broker logs. At the user level, Pinot doesn't collect anything right now; that should come with AuthN/AuthZ.
2. Yes, it will be supported. Pinot currently also has an ACL interface so users can plug in their own logic.
3. Pinot keeps the start offset of each segment to guarantee no data loss. When a server fails, the data currently being consumed is in memory, so it will be gone. Once Pinot comes back online, it resets the Kafka consumer offset to the saved segment start offset and re-ingests the data.
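To make the recovery in point 3 concrete, here is a rough toy model in Python. It is not Pinot code: `ToyServer`, `ConsumingSegment`, and the list standing in for a Kafka partition are all made up for illustration. It only shows the idea that the start offset is durable while the consumed rows are not.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumingSegment:
    """Toy consuming segment: rows live in memory until the segment is committed."""
    start_offset: int                          # persisted when the segment is created
    rows: list = field(default_factory=list)   # in memory only, lost on a crash


class ToyServer:
    """Hypothetical stand-in for a server consuming one stream partition."""

    def __init__(self, stream):
        self.stream = stream          # a plain list standing in for a Kafka partition
        self.committed = []           # rows from segments already flushed to disk
        self.saved_start_offset = 0   # durable metadata, survives a crash
        self.segment = ConsumingSegment(start_offset=0)

    def consume(self, upto):
        """Read stream records into memory up to (not including) offset `upto`."""
        next_offset = self.segment.start_offset + len(self.segment.rows)
        self.segment.rows.extend(self.stream[next_offset:upto])

    def commit(self):
        """Flush the consuming segment and start a new one at the next offset."""
        self.committed.extend(self.segment.rows)
        self.saved_start_offset = self.segment.start_offset + len(self.segment.rows)
        self.segment = ConsumingSegment(start_offset=self.saved_start_offset)

    def crash_and_restart(self):
        """In-memory rows are gone; resume from the saved segment start offset."""
        self.segment = ConsumingSegment(start_offset=self.saved_start_offset)


stream = [f"event-{i}" for i in range(10)]
server = ToyServer(stream)
server.consume(upto=4)
server.commit()               # offsets 0..3 are now durable
server.consume(upto=7)        # offsets 4..6 are in memory only
server.crash_and_restart()    # those three rows are lost from memory
server.consume(upto=10)       # re-read from offset 4, so nothing is missing
server.commit()
```

The catch is that this only works while the stream itself still retains the records from the saved offset onward.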
12/14/2020, 2:47 PM
Yes, there is a SegmentPurgeTask that can be used to purge records for GDPR.
12/14/2020, 5:06 PM
To clarify some more, Pinot does not replicate the data that it receives from the realtime stream. It is expected that (1) the stream is replicated underneath to a different data center, so that the other data center can serve during the disaster, and (2) all records in the stream are re-ingested into the data center that was down, so that they can be reconsumed by Pinot in the data center that experienced the disaster. The second point can be relaxed a bit if you have a hybrid (as opposed to realtime-only) use case.
As for minion purges, just clarifying that the task operates at a segment level, purging (or modifying) records as necessary. It is expected that the task executor has access to other databases that indicate which records need to be purged.
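A minimal sketch of what segment-level purging amounts to, in plain Python rather than the actual Minion task code. The function names, the `user_id` key, and the `ids_to_purge` set (standing in for the lookup against the other database) are assumptions for illustration only.

```python
def purge_segment(rows, ids_to_purge, key="user_id"):
    """Return (kept_rows, purged_count) for one segment's rows.

    `ids_to_purge` stands in for a lookup against the external database
    that records which records must be removed (an assumption here).
    """
    kept = [row for row in rows if row[key] not in ids_to_purge]
    return kept, len(rows) - len(kept)


def run_purge(segments, ids_to_purge):
    """Rewrite only the segments that actually contain records to purge."""
    rewritten = {}
    for name, rows in segments.items():
        kept, purged = purge_segment(rows, ids_to_purge)
        if purged:                 # untouched segments are left as they are
            rewritten[name] = kept
    return rewritten


segments = {
    "seg_0": [{"user_id": "u1", "clicks": 3}, {"user_id": "u2", "clicks": 5}],
    "seg_1": [{"user_id": "u3", "clicks": 1}],
}
rewritten = run_purge(segments, ids_to_purge={"u2"})
```

Here only `seg_0` would be rewritten, since `seg_1` contains nothing to purge.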
12/14/2020, 7:53 PM
Currently, there's no out-of-the-box task scheduler for purge tasks, and no record purger implementation, in the open source code. We do have all the building blocks. @Laxman Ch is working on the default purger scheduler implementation.