Kyle Larose

05/25/2023, 3:09 PM
We're looking to switch our ingestion from Kafka to a new RabbitMQ super-streams-based model (the plugin is being developed right now by my colleague). To avoid losing data when switching sources, we expect there will be a period where we ingest from both Kafka and the new source, which will produce exact duplicates. I'm wondering if there is a simple way to remove those duplicates some time after we turn off the Kafka source, knowing that there is a fairly constrained time window to search. Searching this Slack, I've seen a few posts about using rollup to do it, but I'm not quite sure how to get started. Any pointers would be helpful.
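One possible way to do that kind of dedup, sketched below under assumptions rather than taken from this thread: reindex the affected time window with a Druid SQL-based ingestion (MSQ) task that groups on every column, so exact duplicate rows collapse into one. The datasource name "events", the column names, the time bounds, and the router URL are hypothetical placeholders, and the sketch assumes the table holds raw rows rather than pre-aggregated metrics; a native reindex spec with rollup enabled could achieve a similar effect.

```
# Minimal sketch (assumption-laden): submit a Druid SQL-based ingestion (MSQ)
# task that rewrites one time window of a datasource, grouping on every column
# so exact duplicate rows collapse into a single row.
# "events", the column names, the time bounds, and the router URL are
# hypothetical placeholders.
import requests

DRUID_ROUTER = "http://localhost:8888"  # assumption: default Druid router address

dedup_sql = """
REPLACE INTO "events"
OVERWRITE WHERE __time >= TIMESTAMP '2023-05-01' AND __time < TIMESTAMP '2023-07-01'
SELECT __time, "service", "value"
FROM "events"
WHERE __time >= TIMESTAMP '2023-05-01' AND __time < TIMESTAMP '2023-07-01'
GROUP BY __time, "service", "value"
PARTITIONED BY DAY
"""

# POST to the SQL task endpoint; the response contains the task id to monitor.
resp = requests.post(f"{DRUID_ROUTER}/druid/v2/sql/task", json={"query": dedup_sql}, timeout=30)
resp.raise_for_status()
print(resp.json())
```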

Vadim

05/25/2023, 8:51 PM
How are you planning on getting the rabbitmq data into Druid? What is "rabbitmq super streams based model"? By "plugin is being developed right now by my colleague" are you saying you are writing a rabbitmq supervisor?

Kyle Larose

05/25/2023, 9:06 PM
Yes. RabbitMQ recently released a preview feature called "super streams" which has much of the same functionality as Kafka or Kinesis. In particular, a RabbitMQ stream is an append-only log, so the same ingestion model used by Kafka and Kinesis works here.
So, we're writing a plugin that will ingest data from those streams. We based much of it off the Kinesis one, for what it's worth. We're planning on open-sourcing it. I know my colleague has discussed it, though I'm not sure if it was with the RabbitMQ community or the Druid one.

Vadim

05/25/2023, 9:24 PM
That is spectacular. Let me know how it goes. If you open source it and contribute it to Druid (and make sure that the sampler API works with it - I don't know if that happens automatically or needs a bit of extra work), then ping me and I would be more than happy to make a tile for it in the streaming data loader in the console...
As for your original question: you can only have 1 supervisor per datasource, so I assume you will have another (new) datasource for the rabbitmq data. Does it make sense to let them both run for some time and then just point your app at the new datasource? (You can also do a batch backfill by copying from the old datasource to the new one.)
We recently did something like that where I work. We went from an OSS Kafka cluster to a Confluent Cloud cluster (shoutout to Confluent Cloud!). We did exactly what I said in my comment. In our use case people can only access several weeks of data in the app, so we just let it run until the new datasource had that much data in it. We did not end up needing the backfill; we had discussed it, but natural delays to the schedule from other projects obviated the need.
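A minimal sketch of the batch backfill mentioned above, again with hypothetical names (the datasources "events_kafka" and "events_rabbit", the columns, the cutover timestamp, and the router URL are all placeholders): copy everything older than the moment the new datasource started receiving data from the old datasource into the new one, so the app can then be pointed at the new datasource alone.

```
# Minimal sketch of the batch backfill idea (hypothetical names throughout):
# copy rows from the old Kafka-fed datasource into the new rabbitmq-fed one
# for all time before the new stream started arriving.
import requests

DRUID_ROUTER = "http://localhost:8888"  # assumption: default Druid router address
CUTOVER = "2023-06-01"                  # assumption: when the new datasource began receiving data

backfill_sql = f"""
INSERT INTO "events_rabbit"
SELECT __time, "service", "value"
FROM "events_kafka"
WHERE __time < TIMESTAMP '{CUTOVER}'
PARTITIONED BY DAY
"""

# POST to the SQL task endpoint; the response contains the task id to monitor.
resp = requests.post(f"{DRUID_ROUTER}/druid/v2/sql/task", json={"query": backfill_sql}, timeout=30)
resp.raise_for_status()
print(resp.json())
```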

Kyle Larose

05/25/2023, 9:34 PM
We definitely plan on contributing it to Druid, so that's great to hear. 🙂 I hadn't considered pointing it at a different datasource. So, basically, we could identify when data started arriving at the new datasource and copy in all the old data from before that point. Or keep the old one around and stitch the two together in the app. Unfortunately for us we have a few years' worth of data, though it's not a huge amount, so copying it around is likely feasible.
Thanks for the advice!