# dev
John
Hi Adarsh, you are correct, this is the way to do it in the current product. You can also facilitate updates this way, as you have probably already guessed. And, using the mixed segment granularity and overshadowing features built into Druid's architecture, you don't necessarily have to replace ALL the data if some time ranges aren't changing ... in this case you can "OVERWRITE WHERE __time ..." to re-ingest a smaller portion of your datasource for better efficiency.
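To make that concrete, here is a rough sketch of the current REPLACE-based workaround, which re-ingests a single week and simply filters out the rows you want gone (the "events" datasource, the timestamps, and the user_id filter are made up for illustration):
REPLACE INTO "events"
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-08'
SELECT *
FROM "events"
WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-08'
  AND NOT (user_id = 'abc')  -- keep everything except the rows being "deleted"
PARTITIONED BY DAY
Only the segments covering that week get rewritten; everything outside the OVERWRITE WHERE interval is left alone.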
Adarsh
Thanks John! The basic delete syntax is working now. I am working on adding a time interval clause as well. I think this would be required for any large dataset (as John suggested). I want to get some feedback on the syntax here. Initially I thought about sticking to the REPLACE syntax for consistency, but it would appear a little bit difficult to understand since both use WHERE.
DELETE FROM datasource
[OVERWRITE WHERE <time>]
WHERE <condition>
PARTITIONED BY <granularity>
[CLUSTERED BY <columns>]
We could use a different clause for one of the two WHEREs, like "DELETE IN TIME RANGE <time> WHERE ..." (just as an example), but this would be inconsistent with either the current REPLACE syntax or the SQL-like DELETE syntax.
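To make the proposal concrete, deleting one user's rows within a single week would read something like this (illustrative only -- this is exactly the syntax being debated, and the datasource/column names are made up):
DELETE FROM "events"
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-08'
WHERE user_id = 'abc'
PARTITIONED BY DAY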
John
Exciting stuff! It would be nice if the OVERWRITE clause wasn't required ... but that would require Druid to figure out from the WHERE predicates what time chunks are relevant to the delete operation ... and unless you have a __time range in your WHERE condition that is AND'ed to any other predicates, Druid would have to query the datasource to figure this out. But, considering this is not a sub-second DML operation anyway, maybe not a bad thing to consider? I don't know that the PARTITIONED BY clause is required -- I would hope that Druid could figure out on its own what granularity to use for the new overlay segments, like it does for compaction.
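As a purely hypothetical illustration (inferring the interval is not supported today, and the names are made up), a statement like this already pins the affected time chunks down to one week through the AND'ed __time range, so in principle no separate OVERWRITE clause would be needed:
DELETE FROM "events"
WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-08'
  AND user_id = 'abc'
PARTITIONED BY DAY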
Adarsh
I haven't fully looked into this, but we have some additional constraints currently on the OVERWRITE clause, like making sure that the time intervals there are aligned with the PARTITIONED BY granularity. The PARTITIONED BY clause could be removed in the future, once there is a way to get the existing partitioning and clustering, perhaps via https://github.com/apache/druid/pull/13686. It is currently present since we don't have a way to access this easily.
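For example (datasource and timestamps made up), the first statement below is fine because the overwrite interval lines up with DAY boundaries, while the second one fails because a 12-hour interval does not align with PARTITIONED BY DAY:
-- aligned with DAY granularity: accepted
REPLACE INTO "events"
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-02'
SELECT * FROM "events"
WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-02'
PARTITIONED BY DAY

-- not aligned with DAY granularity: rejected with an error
REPLACE INTO "events"
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-01 12:00:00'
SELECT * FROM "events"
WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-01 12:00:00'
PARTITIONED BY DAY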
John
You're right -- those two need to be aligned, otherwise you will get an error. So ideally eliminate both of them then? Should it matter if the granularity of the new segments is different from that of the existing segments ... mixed segment granularity and overshadowing are supported by the architecture, and compaction can clean it up afterwards?