# general
j
Hello guys! I'm here to talk about the use case that my team and I are facing right now. We have a realtime data processing platform that provides general data aggregations (sums, counts, averages, etc.) to a transactional fraud detection engine. In summary, we consume data from Kafka topics, increment counters for a given ID (e.g. a credit card hash), and then update them in the database. This solution has really great performance, as we are accessing key-value indexed columns, but we have a really tough time creating new pre-aggregation flows. So we thought: "what if we had a self-service data aggregation engine, where we didn't need to code every step of every pre-aggregation flow?". And here we are! We've been looking at Pinot for a long time now, and we're still not sure if it is going to fit our scenario. The major problems that we have today are:
• It takes us approximately 2 or 3 days to deliver a new pre-aggregation flow.
• Our pre-aggregation algorithm has some imprecision, as we work with a big time window inside our aggregation technique.
I want to give you some metrics, so maybe you can help me think through whether Pinot can be suitable for us:
• Our fraud detection engine runs, at its peak, a throughput of 7~8k transactions per minute.
• For each transaction we make dozens (if not hundreds) of requests to our pre-aggregation platform, which gives us a throughput of ~100k queries per minute.
• Our pre-aggregation latency SLA is 1 second to return ALL queries.
I'm just posting here to discuss similar use cases and understand whether the team's effort in starting something new, versus maintaining what already exists, is worth it. I apologize if this is not the best way to introduce myself and start a discussion here. 😄
👋 3
m
Hello
Are you trying to pre-cube all data and store it in a KV store? With Pinot, you don't need to pre-cube: you can simply ingest the records (with a count of 1), and Pinot can aggregate at read time
j
Yes, that is what we are doing right now. We are not using Pinot yet. We read about the star-tree index, and it looks like exactly what we are looking for.
m
Pinot can ingest at much higher rates than what you currently have. It is also optimized for read-time aggregations. From your description, it seems like a good use case for Pinot.
Happy to chat further to assist
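Just to illustrate the difference being described here — this is not Pinot code, only a toy Python sketch contrasting pre-cubed counters (what the current platform maintains) with keeping raw events and aggregating at query time (what Pinot does at read time):

```python
# Toy sketch: pre-cubing vs. read-time aggregation (illustrative, not Pinot).
from collections import defaultdict

# Raw events as they might arrive from the Kafka topic.
events = [
    {"customer_id": "c1", "status": "REFUSED",  "value": 10.0},
    {"customer_id": "c1", "status": "APPROVED", "value": 25.0},
    {"customer_id": "c2", "status": "REFUSED",  "value": 5.0},
]

# Pre-cubed approach: every new metric needs a new counter maintained
# at ingestion time, i.e. a new coded pre-aggregation flow.
precubed = defaultdict(float)
for e in events:
    precubed[(e["customer_id"], "sum_value")] += e["value"]

# Read-time approach: the same raw events (each with an implicit count
# of 1) can answer any ad-hoc aggregation without new ingestion code.
def sum_value(customer_id):
    return sum(e["value"] for e in events if e["customer_id"] == customer_id)

def count_refused(customer_id):
    return sum(1 for e in events
               if e["customer_id"] == customer_id and e["status"] == "REFUSED")

print(sum_value("c1"))      # 35.0
print(count_refused("c1"))  # 1
```

The point is that the second style moves the "which aggregation?" decision from ingestion time to query time, which is what makes the flows self-service.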
j
Nice, thanks! I think that the best way to answer my questions is building a Proof of Concept, so I can bring the results here and open a new thread if needed.
m
Awesome
j
Given our scenario, is it possible to have an idea of which filesystem would be better for our deep storage? Or is there not much performance difference between them?
m
Folks have used common ones like S3, GCS etc
Deep store is not in the read path, only in the ingestion path
j
Oh, you're right!
Ok, that's enough for today, thank you!
m
👍
k
do you have a rough schema of the topic or the table you plan to create?
j
That's a simplified DTO that we use inside our pre-aggregation system; I think that the table schema would look the same
```kotlin
data class Payment(
    val id: String,
    val orderId: UUID,
    val value: BigDecimal?,
    val status: String,
    val customerId: UUID,
    val createdAt: Instant?,
    val cardCvvHash: String?,
    val cardHash: String?,
    val deviceId: String?
)
```
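For reference, here is a rough sketch of how that DTO might map to a Pinot schema — the field mappings are my assumption (UUIDs and hashes as STRING dimensions, the BigDecimal `value` as a DOUBLE metric, `createdAt` as an epoch-millis time column):

```json
{
  "schemaName": "payments",
  "dimensionFieldSpecs": [
    {"name": "id", "dataType": "STRING"},
    {"name": "orderId", "dataType": "STRING"},
    {"name": "status", "dataType": "STRING"},
    {"name": "customerId", "dataType": "STRING"},
    {"name": "cardCvvHash", "dataType": "STRING"},
    {"name": "cardHash", "dataType": "STRING"},
    {"name": "deviceId", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "value", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "createdAt",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```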
k
and the query patterns
j
All queries have a similar pattern, something like this:
```sql
SELECT SUM(value) FROM payments WHERE customer_id = 'some-id' AND created_at >= NOW() - INTERVAL 1 DAY;

SELECT COUNT(id) FROM payments WHERE customer_id = 'some-id' AND status = 'REFUSED' AND created_at >= NOW() - INTERVAL 1 DAY;
```
k
ok, then make customer_id the sorted column
1
you may not really need star-tree index
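To make that concrete, a minimal `tableIndexConfig` fragment for sorting on customer_id might look like this (a sketch only — table and column names assumed from the DTO above):

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["customer_id"]
  }
}
```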
j
Ok, but then we might combine IDs, like:
```sql
SELECT SUM(value) FROM payments WHERE customer_id = 'some-id' AND device_id = 'other-id' AND created_at >= NOW() - INTERVAL 1 DAY;
```
Or just query on the card hashes:
```sql
SELECT COUNT(id) FROM payments WHERE card_hash = 'some-hash' AND status = 'REFUSED' AND created_at >= NOW() - INTERVAL 1 DAY;
```
m
card_hash can have an inverted index on it, while the data is sorted on customer_id
You can pick the column that appears in most queries (like customer_id) as the sorted column. For other queries where this column does not appear, we can have an inverted index on some of the other columns, which prunes most of the rows.
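A sketch of what the combined index config might look like under those suggestions (column names assumed; whether device_id also warrants an inverted index depends on your actual query mix):

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["customer_id"],
    "invertedIndexColumns": ["card_hash", "device_id"]
  }
}
```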
j
Right, that looks good
k
I actually think this was a GREAT introduction. 😄
r
Hi, guys! Thank you so much for all the support! João and I intend to run some proofs of concept soon, and we will be very happy to keep in touch and share our experiences and struggles!
👍 1
k
👍