Felix GV
09/05/2023, 12:18 AM
3. Would you recommend D2 or "router-based" routing? Would the router become a bottleneck over time?
With D2, requests still go through the router, so the performance is the same. The upside of D2 is that it provides service discovery, so that if some router instances go up or down, the clients can be aware of it. The downside is that D2 does this via ZooKeeper, which at very large scale can be a bottleneck, and is an extra moving part (you could use a single ZK for Venice, Kafka and D2 if you wanted to, or spread these ZK workloads across two or even three different ZK deployments, with different cost/complexity/isolation tradeoffs).

There is also a client lib which skips the routers and goes straight to the servers. We call it the fast-client (as opposed to the thin-client). We use the fast-client in production today but only in a limited way, with some of our most committed internal users that can work with us to debug issues and take patches rapidly. I would recommend starting off with the thin-client, as it is the simplest and by far the most battle-tested. The Da Vinci client is also quite well tested at this point, but it is inherently more complex to operate since it makes the client service stateful. Eventually, we would probably make the fast-client the default client lib, but we’re not there yet…
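For context, a minimal sketch of how a thin-client (the router-based path discussed above) is typically instantiated; the store name and router URL are placeholders, and exact builder methods may vary across Venice versions:

import com.linkedin.venice.client.store.AvroGenericStoreClient;
import com.linkedin.venice.client.store.ClientConfig;
import com.linkedin.venice.client.store.ClientFactory;

public class ThinClientSketch {
  public static void main(String[] args) throws Exception {
    // The thin-client sends reads through the routers, discovered via the URL below (or via D2 if configured).
    AvroGenericStoreClient<String, Object> client = ClientFactory.getAndStartGenericAvroClient(
        ClientConfig.defaultGenericClientConfig("my-store")               // hypothetical store name
            .setVeniceURL("http://venice-router.example.com:7777"));      // hypothetical router address
    try {
      Object value = client.get("some-key").get();                        // reads are async (CompletableFuture)
      System.out.println(value);
    } finally {
      client.close();
    }
  }
}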
Felix GV
09/05/2023, 12:18 AM
4. Could we use Kafka only for informing server nodes about changes while keeping the data in object storage?
That is not supported today, but I think we would be open to exploring that design change.
Felix GV
09/05/2023, 12:19 AM
5. During the last sync session, you mentioned that a push job creates a new topic each time it runs. If we understood that correctly, what's the motivation for it?
Yes, that is correct. There are a few motivations for this design choice:
1. Each dataset version having its own "version topic" simplifies the follower replica and Da Vinci code paths: rehydration requires just scanning the topic of the current version from beginning to end.
2. It also allows us to delete the old version topic once that version is retired (thus reclaiming space faster), rather than relying on time-based retention (which provides much less control).
3. Finally, in a hybrid (batch + stream) store, merging a streaming write on top of whatever was there before can lead to a different outcome if the value that was pushed from batch is different. This complexity is encapsulated in the leader replica, which writes to its version topic.
Hope that makes sense!
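To make the per-push topic model above concrete, a tiny sketch; the naming convention shown (storeName_v&lt;version&gt;) reflects my understanding of Venice's version topics, and the values are made up:

// Each push creates a new version topic; older version topics can be deleted once retired.
String storeName = "my-store";      // hypothetical store
int currentVersion = 3;             // version produced by the latest push
String versionTopic = storeName + "_v" + currentVersion;   // e.g. "my-store_v3"
// A follower replica or Da Vinci client rehydrates by scanning "my-store_v3" from beginning to end;
// once readers have swapped to version 3, "my-store_v2" can be deleted to reclaim space quickly.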
Felix GV
09/05/2023, 12:19 AM
6. How does Venice ensure data redundancy (e.g., in case of a node failure)?
Kafka and Venice are both replicated, and they each have the ability to self-heal if a replica is down for some period of time. Not sure if you want to drill deeper than that into the details…?
Felix GV
09/05/2023, 12:19 AM
7. Are there any recommendations regarding the server upgrades?
The Venice backend is made up of three services: controllers, routers and servers. Typically, we recommend upgrading the controllers first, then the rest. The reason is that servers use versioned protocols to replicate data between themselves, and if there is a change to that protocol (which is not that common… maybe once per year or so), old servers can fetch the schema of the newer protocol from the upgraded controllers. As is usually the case for distributed (replicated, partitioned) systems, we perform upgrades in a rolling fashion, one (or a few) nodes at a time.
Felix GV
09/05/2023, 12:20 AM
8. Are there any recommendations about cluster sizing/capacity planning?
Hmm, good question… I guess it depends on the workload… but for us anyway, the majority of our Venice clusters are storage-bound. The default server config is optimized for that and will work well for a node that has a local SSD mounted on it, and a little RAM to cache the warm part of the data. In this common case, we would leave replication factor at the default of 3.

For workloads that are throughput-bound, we enable RocksDB’s PlainTable format, which works best for a node sized to fit the whole dataset in RAM (in which case the SSD is not necessarily needed). In this case, we may scale the read capacity of the cluster by increasing replication factor.

Regarding throughput capacity per node, that will likely vary a lot based on the number of cores available, workload details and possibly other factors… we can chat about it further if you’re interested.
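As a back-of-the-envelope illustration of the storage-bound case described above (all figures below are hypothetical assumptions, not recommendations):

// Rough sizing sketch for a storage-bound cluster.
long datasetBytes = 2L * 1024 * 1024 * 1024 * 1024;   // assume ~2 TiB for one version of the dataset
int replicationFactor = 3;                             // the default mentioned above
int versionsKept = 2;                                  // e.g. current version plus one backup version
long usableSsdPerNode = 1_500_000_000_000L;            // assume ~1.5 TB usable SSD per server node

long totalFootprint = datasetBytes * replicationFactor * versionsKept;
long minNodes = (totalFootprint + usableSsdPerNode - 1) / usableSsdPerNode;   // ceiling division
System.out.println("Nodes needed for storage alone: " + minNodes);            // throughput may require more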
Dejan Mijic
10/09/2023, 7:12 AM
VenicePushJob to support it. Since we started to work on this, would you be willing to accept this modification to the main codebase, or should we keep it on our fork?
Elijah Grimaldi
11/02/2023, 7:01 PM
[BUG] venice-admin-tool doesn't parse map arguments correctly #647
and was wondering how I could recreate the bug and test my code for JSON argument inputs. On the issue, a command is cited
java -jar venice-admin-tool-all.jar --update-store --url <http://ltx1-app6326.stg.linkedin.com:1576> --cluster mt-0 --store TestGpuJobsDataStoreV1 -storage-view-configs '{"changeCaptureView": {"viewClassName": "com.linkedin.venice.views.ChangeCaptureView","params": {}}}'
to reproduce the bug, but I don't have the same jar in my build.
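If it helps, one way to sanity-check the JSON map argument locally is a small Jackson-based parse; this is just an illustrative test harness, not the venice-admin-tool's actual parsing path:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class StorageViewConfigParseCheck {
  public static void main(String[] args) throws Exception {
    // Same JSON shape as the --storage-view-configs value cited on the issue.
    String json = "{\"changeCaptureView\": {\"viewClassName\": "
        + "\"com.linkedin.venice.views.ChangeCaptureView\", \"params\": {}}}";
    Map<?, ?> parsed = new ObjectMapper().readValue(json, Map.class);
    System.out.println(parsed);  // expect a nested map, not a parse error
  }
}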
Koorous Vargha
12/01/2023, 4:42 AM
// Check the migration status for all source and destination cluster stores
Pattern systemStorePattern = Pattern.compile(storeName);
Stream<String> allSrcStoreNames = srcControllerClient.getClusterStores(srcClusterName).stream().filter(srcStoreName -> systemStorePattern.matcher(srcStoreName).matches());
Stream<String> allDestStoreNames = destControllerClient.getClusterStores(destClusterName).stream().filter(destStoreName -> systemStorePattern.matcher(destStoreName).matches());
allSrcStoreNames.forEach(srcStorename -> printMigrationStatus(srcControllerClient, srcStorename));
allDestStoreNames.forEach(destStorename -> printMigrationStatus(destControllerClient, destStorename));