# general
j
Hello guys! I'm here to talk about the use case that my team and I are facing right now. We have a realtime data processing platform that provides general data aggregations (sums, counts, averages, etc.) to a transactional fraud detection engine. In summary, we consume data from Kafka topics, increment counters for a given ID (e.g. a credit card hash), and then update them in the database. This solution has really great performance, as we are accessing key-value indexed columns, but we have a really tough time creating new pre-aggregation flows. So we thought: "what if we had a self-service data aggregation engine, where we didn't need to code every step of every pre-aggregation flow?". And here we are! We've been looking at Pinot for a long time now, and we're still not sure if it is going to fit our scenario. The major problems that we have today are:
• It takes us approximately 2 or 3 days to deliver a new pre-aggregation flow.
• Our pre-aggregation algorithm has some imprecision, as we work with a big time window inside our aggregation technique.
I want to give you some metrics, so maybe you can help me think through whether Pinot can be suitable for us:
• Our fraud detection engine runs, at its peak, a throughput of 7~8k transactions per minute.
• For each transaction we make dozens (if not hundreds) of requests to our pre-aggregation platform, which gives us a throughput of ~100k queries per minute.
• Our pre-aggregation latency SLA is 1 second to return ALL queries.
I'm just posting here to discuss similar use cases and understand whether the team's effort in starting something new, versus maintaining what already exists, is worth it. I apologize if this is not the best way to introduce myself and start a discussion here. 😄
👋 3
m
Hello
Are you trying to pre-cube all data and store it in a KV store? With Pinot, you don't need to pre-cube: you can simply ingest the records (with a count of 1), and Pinot can aggregate at read time
j
Yes, that is what we are doing right now. We are not using Pinot yet. We read about the star-tree index, and it looks like exactly what we are looking for.
m
Pinot can ingest at much higher rates than what you currently have. It is also optimized for read-time aggregations. From your description, it seems like a good use case for Pinot.
Happy to chat further to assist
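Just to illustrate the difference being described here — this is not Pinot code, only a toy Python sketch contrasting pre-cubed counters (what the current platform maintains) with keeping raw events and aggregating at query time (what Pinot does at read time):

```python
# Toy sketch: pre-cubing vs. read-time aggregation (illustrative, not Pinot).
from collections import defaultdict

# Raw events as they might arrive from the Kafka topic.
events = [
    {"customer_id": "c1", "status": "REFUSED",  "value": 10.0},
    {"customer_id": "c1", "status": "APPROVED", "value": 25.0},
    {"customer_id": "c2", "status": "REFUSED",  "value": 5.0},
]

# Pre-cubed approach: every new metric needs a new counter maintained
# at ingestion time, i.e. a new coded pre-aggregation flow.
precubed = defaultdict(float)
for e in events:
    precubed[(e["customer_id"], "sum_value")] += e["value"]

# Read-time approach: the same raw events (each with an implicit count
# of 1) can answer any ad-hoc aggregation without new ingestion code.
def sum_value(customer_id):
    return sum(e["value"] for e in events if e["customer_id"] == customer_id)

def count_refused(customer_id):
    return sum(1 for e in events
               if e["customer_id"] == customer_id and e["status"] == "REFUSED")

print(sum_value("c1"))      # 35.0
print(count_refused("c1"))  # 1
```

The point is that the second style moves the "which aggregation?" decision from ingestion time to query time, which is what makes the flows self-service.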
j
Nice, thanks! I think that the best way to answer my questions is building a Proof of Concept, so I can bring the results here and open a new thread if needed.
m
Awesome
j
Given our scenario, is it possible to have an idea of which filesystem would be better for our deep storage? Or is there not much performance difference between them?
m
Folks have used common ones like S3, GCS etc
Deep store is not in the read path, only in the ingestion path
j
Oh, you're right!
Ok, that's enough for today, thank you!
m
👍
k
do you have a rough schema of the topic or the table you plan to create?
j
That's a simplified DTO that we use inside our pre-aggregation system; I think that the table schema would look the same
```kotlin
data class Payment(
    val id: String,
    val orderId: UUID,
    val value: BigDecimal?,
    val status: String,
    val customerId: UUID,
    val createdAt: Instant?,
    val cardCvvHash: String?,
    val cardHash: String?,
    val deviceId: String?
)
```
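For reference, here is a rough sketch of how that DTO might map to a Pinot schema — the field mappings are my assumption (UUIDs and hashes as STRING dimensions, the BigDecimal `value` as a DOUBLE metric, `createdAt` as an epoch-millis time column):

```json
{
  "schemaName": "payments",
  "dimensionFieldSpecs": [
    {"name": "id", "dataType": "STRING"},
    {"name": "orderId", "dataType": "STRING"},
    {"name": "status", "dataType": "STRING"},
    {"name": "customerId", "dataType": "STRING"},
    {"name": "cardCvvHash", "dataType": "STRING"},
    {"name": "cardHash", "dataType": "STRING"},
    {"name": "deviceId", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "value", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "createdAt",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```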
k
and the query patterns
j
All queries have a similar pattern, something like this:
```sql
SELECT SUM(value) FROM payments WHERE customer_id = 'some-id' AND created_at >= NOW() - INTERVAL 1 DAY;

SELECT COUNT(id) FROM payments WHERE customer_id = 'some-id' AND status = 'REFUSED' AND created_at >= NOW() - INTERVAL 1 DAY;
```
k
ok, then make customer_id the sorted column
1
you may not really need star-tree index
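To make that concrete, a minimal `tableIndexConfig` fragment for sorting on customer_id might look like this (a sketch only — table and column names assumed from the DTO above):

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["customer_id"]
  }
}
```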
j
Ok, but then we might combine IDs, like:
```sql
SELECT SUM(value) FROM payments WHERE customer_id = 'some-id' AND device_id = 'other-id' AND created_at >= NOW() - INTERVAL 1 DAY;
```
Or just query on the card hashes:
```sql
SELECT COUNT(id) FROM payments WHERE card_hash = 'some-hash' AND status = 'REFUSED' AND created_at >= NOW() - INTERVAL 1 DAY;
```
m
card_hash can have an inverted index on it, while the data is sorted on customer_id
You can pick the column that appears in most queries (like customer_id) as the sorted column. For other queries where this column does not appear, we can have an inverted index on some of the other columns, which prunes most of the rows.
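A sketch of what the combined index config might look like under those suggestions (column names assumed; whether device_id also warrants an inverted index depends on your actual query mix):

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["customer_id"],
    "invertedIndexColumns": ["card_hash", "device_id"]
  }
}
```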
j
Right, that looks good
k
I actually think this was a GREAT introduction. 😄
r
Hi, guys! Thank you so much for all the support! João and I intend to run some proofs of concept soon, and we will be very happy to keep in touch and share our experiences and struggles!
👍 1
k
👍