# general
  • e

    Elijah Grimaldi

    11/11/2023, 6:28 PM
    For the part about different DB options, I guess my question is: are we exposing more predefined options to be changed in the configuration, or are you thinking that it should be fully customizable, if that makes sense?
  • e

    Elijah Grimaldi

    11/15/2023, 5:23 PM
    Or did I get it wrong?
  • k

    Koorous Vargha

    12/01/2023, 4:42 AM
    Hi @Zac Policzer, regarding this issue, what was the regex you had in mind to find all of the system stores within a cluster? Thanks! I have this so far:
    // Check the migration status for all source and destination cluster stores
    Pattern systemStorePattern = Pattern.compile(storeName);
    Stream<String> allSrcStoreNames = srcControllerClient.getClusterStores(srcClusterName).stream()
        .filter(srcStoreName -> systemStorePattern.matcher(srcStoreName).matches());
    Stream<String> allDestStoreNames = destControllerClient.getClusterStores(destClusterName).stream()
        .filter(destStoreName -> systemStorePattern.matcher(destStoreName).matches());

    allSrcStoreNames.forEach(srcStoreName -> printMigrationStatus(srcControllerClient, srcStoreName));
    allDestStoreNames.forEach(destStoreName -> printMigrationStatus(destControllerClient, destStoreName));
  • z

    Zac Policzer

    12/01/2023, 4:44 AM
    Ah right. There should be a constant for it. Something like "venice_system_store_" is the prefix I believe
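Editor's note: given Zac's suggested prefix, the check above can be done without a regex at all. A minimal, self-contained sketch (the prefix string is taken from the message above; the actual constant name and location in the Venice codebase may differ):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SystemStoreFilter {
    // Prefix quoted from the message above; verify against the real constant in the Venice codebase.
    static final String SYSTEM_STORE_PREFIX = "venice_system_store_";

    // Keep only the store names that look like system stores.
    static List<String> systemStores(List<String> storeNames) {
        return storeNames.stream()
                .filter(name -> name.startsWith(SYSTEM_STORE_PREFIX))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> stores = List.of("venice_system_store_meta_store_foo", "userStore");
        // Prints only the system store name.
        System.out.println(systemStores(stores));
    }
}
```

A simple `startsWith` avoids regex pitfalls (e.g. `matches()` requiring a full-string match) for a fixed prefix.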
  • z

    Zac Policzer

    12/01/2023, 4:44 AM
    I'd need to pull out my IDE. I'm away from my keyboard just now
  • f

    Felix GV

    02/24/2024, 2:45 AM
    FYI! LinkedIn is hiring for the Venice team!
    🚀 3
  • z

    Zac Policzer

    05/06/2024, 6:17 PM
    New Venice talk! Come learn about Venice and formal methods!

    https://www.youtube.com/watch?v=Jz0J5N77QKk&list=PLWLcqZLzY8u9d78Ey5KUCMFLgNgYoFgCX&index=3&ab_channel=MarkusKuppe

    🎉 3
  • m

    Moditha Hewasinghage

    05/21/2024, 11:18 AM
    Hi everyone, we have been trying to integrate the Spark push job into our workflow. First of all, thank you for the awesome work. However, we are facing some issues with the Scala 2.12 used in the dependencies, as we are on 2.13. Is there a specific reason to be on 2.12? I see that there is a dependency on the LinkedIn fork of Kafka, which is only published for 2.12, and that might be the problem. It won't be possible to integrate the push job or the relevant Spark writers into our code, as 2.12 and 2.13 are incompatible. It should be easy to change the Spark dependencies to 2.13, but the custom Kafka library is going to be a problem. So is there a way for us to get around this, or could there be a 2.13 build as well?
  • f

    Felix GV

    05/29/2024, 2:02 PM
    New podcast on the Geek Narrator! Check it out!

    https://youtu.be/D6gZKM4Jnk4

    🎉 4
  • f

    Felix GV

    06/07/2024, 7:31 PM
    Hello Venetians! The first ever VeniceCon is right around the corner! Join us on Tuesday June 25th at 1 PM, in Sunnyvale, for an in-person event! RSVP here: https://www.meetup.com/big-data-meetup-linkedin/events/301509265/ There will be 4 talks on the agenda:
    1. Overview of Venice and Surrounding AI Ecosystem, by @Manu and @Zac Policzer
    2. Project Update -- What Venice Shipped in the Last Year, by @Felix GV
    3. How LinkedIn’s Trust & Privacy Filtering Were Re-Architected to Leverage Venice & Da Vinci, by Apurv Kumar
    4. Venice’s Next-Gen Read Path, by @Xun Yin
    Please don’t hesitate to share with your broader network. If you would like to re-share these social media posts (LinkedIn, Twitter), that would be greatly appreciated as well 🙏 @here
    🚀 4
  • f

    Felix GV

    08/16/2024, 12:38 AM
    Congrats to our three Summer interns, @Yanit Gebrewold, @Sebastian Infante Murguia and @tony! Thanks for contributing to Venice, and remember you’ll always be part of the community 😄! Don’t be a stranger, let us know how things go! Cheers!
    🎉 6
  • m

    Moditha Hewasinghage

    09/19/2024, 10:59 AM
    Hi, we (@Dejan Mijic) have been looking into integrating Venice into our existing data processing pipeline, and decided the most reliable way of doing that would be writing to Kafka directly, since we have the capacity to do that within existing Spark jobs. The data is already clean and validated against the schema at this point, and it will be a batch reload, not an update. So could you tell us, or point us to documentation / code snippets on:
    1. First of all, is this a possible solution?
    2. What is the Kafka topic / topic naming scheme we need to write to for a particular store?
    3. Are there any steps that need to be carried out via the Venice API?
        a. before the ingestion
        b. after the ingestion
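Editor's note on the topic-naming question: the thread does not contain the answer. As a hedged illustration only, Venice's per-version store topics are conventionally described as using a `_v<versionNumber>` suffix on the store name; this must be confirmed against the Venice codebase (which has a utility for composing these topic names) before writing to topics directly. A toy helper under that assumption:

```java
public class VeniceTopicName {
    // ASSUMPTION: Venice version topics follow "<storeName>_v<versionNumber>".
    // Verify against the topic-composition utility in the Venice source before relying on this.
    static String composeVersionTopic(String storeName, int versionNumber) {
        return storeName + "_v" + versionNumber;
    }

    public static void main(String[] args) {
        // Illustrative store name and version number, not taken from the thread.
        System.out.println(composeVersionTopic("myStore", 3)); // myStore_v3
    }
}
```

Note also that writing directly to version topics bypasses the controller-driven push workflow (version creation, swap, etc.), which is part of what questions 1 and 3 above are really about.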
  • f

    Felix GV

    10/21/2024, 6:00 PM
    Did you know that Venice offers a client library called Da Vinci? This library eagerly ingests a Venice dataset into a local RocksDB instance, providing lightning-fast queries on this locally cached state. See this talk from Apurv Kumar at this year’s VeniceCon to hear how they re-architected their service to leverage Da Vinci and achieved significant improvements across many dimensions!

    https://www.youtube.com/watch?v=JMULRvcmRyk

  • f

    Felix GV

    03/17/2025, 7:59 PM
    New podcast up! https://podcasts.apple.com/us/podcast/the-infoq-podcast/id1106971805?i=1000699480482 Feel free to re-share these posts with your networks if you'd like! - https://www.linkedin.com/posts/venicedb_building-linkedins-resilient-data-storage-activity-7307489014377234433-Zeau - https://bsky.app/profile/venicedb.org/post/3lklwkraqa223 - https://x.com/VeniceDataBase/status/1901724068753060324
    🎉 5
  • f

    Felix GV

    03/25/2025, 5:08 PM
    FYI @Dmytro Prokhorenkov, I added a dependency section to the main README in this PR (not yet merged). Feel free to provide any feedback if this is not what you would have needed.
  • f

    Felix GV

    03/25/2025, 6:46 PM
    Someone asked me a few questions about Da Vinci (the stateful client option for Venice, which preloads and then queries local state, rather than doing requests across the network). I’ll share the answers below in case anyone else is interested:
    > (1) For the non-partitioned case, where the dataset can fit all in RAM or SSD, what’s a typical size limit you support? For datasets less than 10 GB that may seem straightforward; I assume with larger datasets of 100 GB - 1 TB, bootstrapping the data from the network may take time.
    Your intuition is right, and most of these use cases are <10 GB. We are currently working with a partner team on a roughly 1 TB / node use case, but they are not in prod with this yet, so no concrete experience to share so far.
    On the Venice servers, most clusters have >1 TB / node of state, and so we do need to rebootstrap this much in certain scenarios. But the servers are a more controlled use case, since they are treated as stateful infrastructure, so there is host stickiness and more control in general over maintenance operations such as OS upgrades. Whereas Da Vinci client applications tend to be treated as stateless (even though they’re not), so the host assignments are more volatile and bootstrap can come into play more often.
    In the current architecture of Venice (and by the same token Da Vinci), all writes always come through Kafka, no matter if they’re batch pushes or single-row updates. And while we have optimized the shit out of it, it remains the case that Kafka’s per-message overhead is a bottleneck. To address this, we are currently working on the next generation of this part of the architecture, which we call blob transfer. We are planning several iterations of this, but the first one (which is pretty much code complete and nearing production try-out quite soon) is a fairly narrow scope which we call “Da Vinci P2P blob transfer”.
    As the name implies, Da Vinci instances will serve the raw blobs of the RocksDB databases among peers, thus optionally skipping the Kafka part, then resume subscribing to Kafka for further updates, using the correct checkpoint info which is maintained atomically alongside the data. Eventually, the scope will expand to support server blob transfer, and also to support alternatives to P2P (e.g. putting the blobs on S3/ABS/GCS or whatever). We don’t have numbers to share yet on blob transfer vs Kafka performance for full bootstrap, but we expect it to be a good boost.
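Editor's note: the "checkpoint maintained atomically alongside the data" idea can be illustrated with a toy model. Nothing below is Venice API; a plain map stands in for RocksDB, and a synchronized method stands in for an atomic write batch:

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointedStore {
    // Toy model: the Kafka resume offset is written in the same atomic batch
    // as the data, so a transferred blob always carries a checkpoint that is
    // consistent with its contents. Names here are illustrative only.
    private final Map<String, String> db = new HashMap<>();
    private static final String OFFSET_KEY = "__resume_offset";

    // Apply a batch of records together with the offset they correspond to.
    synchronized void applyBatch(Map<String, String> records, long offset) {
        db.putAll(records);                        // the data...
        db.put(OFFSET_KEY, Long.toString(offset)); // ...and its checkpoint, in one step
    }

    // After receiving the blob, a peer reads this to know where to resume in Kafka.
    synchronized long resumeOffset() {
        return Long.parseLong(db.getOrDefault(OFFSET_KEY, "0"));
    }

    public static void main(String[] args) {
        CheckpointedStore store = new CheckpointedStore();
        store.applyBatch(Map.of("k1", "v1"), 42L);
        System.out.println(store.resumeOffset()); // 42
    }
}
```

Because data and checkpoint move as one unit, a peer that copies the blob never resumes from an offset that disagrees with the state it received.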
  • f

    Felix GV

    03/25/2025, 6:46 PM
    > (2) Have you considered running Venice as a sidecar container? As compared to your current client-library based approach. There are pros and cons, but I would like to understand your experience maintaining that client-facing library, especially since you cannot quite control the binary releases.
    Yes we did… but so far the sidecar story at LinkedIn is a bit immature, so we’re sitting on the sidelines for now. There is work on this (at LI, not in Venice specifically) so we may jump on that bandwagon eventually…
    I will say, however, that the current architecture of Da Vinci is to have the RocksDB database accessed directly from the host application’s process, via JNI, and the overhead of that is quite low (we’ve benchmarked single-digit microseconds including everything: JNI overhead, RocksDB lookup [assuming the PlainTable / all-in-RAM config], and Avro deserialization). Whereas with a sidecar we would hit the network stack (even though it would be on loopback) and it seems like it might be challenging to be quite as fast. Still, the maintainability benefits may warrant pursuing that approach.
    Regarding client releases and uptake, I acknowledge that it has been historically painful, and still is to some extent, but it has gotten better, at least internally at LI… Through a mix of tooling and processes, we now have much better dependency hygiene than before, and our dependents pick up new library versions much more quickly. That is not Venice-specific and does nothing for folks outside LI, who likely have the same challenges, but I just wanted to share that data point nonetheless.
  • f

    Felix GV

    03/25/2025, 6:46 PM
    > (3) For the partitioned case, is that a popular use case? The sharding part is a bit unclear to me from reading the blog, and I imagine it’ll require tight integration to make it work --- the sharding on the data preparation & push side may need to agree with the sharding on the data access layer?
    The partitioned case is definitely a power-user scenario, and in that sense it is “not that popular”. I would say that it only makes sense when used as part of a “framework” or “environment” which already has a notion of partitioning. IOW, it doesn’t make much sense to use partitioned Da Vinci just like that, out of the box, because you wouldn’t know what partition to subscribe to on each host…
    At LI, we’ve had partitioned Da Vinci integrated in our search stack as well as our stream processing stack, both of which have built-in notions of partitioning. In those cases, we configure the relevant Venice stores with a “custom partitioner”, which results in the Venice data being “co-partitioned” with the other system (e.g. search or stream processing), and therefore the other system can subscribe to the correct shard in each of its own hosts (or tasks, or whatever).
    As an aside, this is where the name Da Vinci really shines… did you know that Leonardo Da Vinci left many of his works, including his masterpieces, unfinished? For example, he kept tweaking the Mona Lisa up until his death, many years after having started it. Likewise, while Venice is a full-fledged system that one can use directly without much scaffolding around it, Da Vinci is more of a building block, which you can integrate into other systems.
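Editor's note: the co-partitioning arrangement described above boils down to both sides agreeing on one partition function. A minimal sketch (the hash-mod partitioner is an assumption for illustration; Venice's actual custom-partitioner interface differs):

```java
public class CoPartitioning {
    // Illustrative only: both the push side and the serving side apply the
    // same partitioner, so each host knows which partition holds its keys.
    static int partition(String key, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 8;
        String key = "member:12345"; // hypothetical key
        int writerPartition = partition(key, partitions); // where the push job routes the key
        int readerPartition = partition(key, partitions); // which partition a host subscribes to
        System.out.println(writerPartition == readerPartition); // true: co-partitioned
    }
}
```

The whole point of configuring a custom partitioner on the Venice store is that this function is shared, so the surrounding system (search, stream processing) can subscribe each host to exactly the shard it needs.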
  • j

    Jia

    03/26/2025, 10:06 PM
    Thanks a lot for the detailed reply @Felix GV. A follow-up question: for Da Vinci mode, do you keep a copy of the data both on the client side and on the Venice server side? Do you really need the server-side copy?
  • f

    Felix GV

    04/17/2025, 11:29 PM
    We have a Venice talk lined up at J on the Beach in Spain next month! Who's coming 😁 ? https://www.linkedin.com/posts/j-on-the-beach_jotb25-duckdb-rocksdb-activity-7318664574696632322-nsbo
    🎉 4
    🇪🇸 3
  • g

    Gabriel Drouin

    04/22/2025, 2:28 PM
    I've run into issues while running ./gradlew check --continue with Ubuntu 24.04.1 LTS on WSL2 (Windows 11). In a nutshell: numerous integration/e2e tests would fail due to timeouts. I'll set up the work environment on my MacBook Pro instead. Further details in reply
  • g

    Gabriel Drouin

    04/25/2025, 6:12 PM
    @Koorous Vargha I just noticed dark mode got enabled on the docs 👀 very nice
    🚀 3
  • g

    Gabriel Drouin

    05/07/2025, 1:15 PM
    Hey folks, question regarding tests. In this current PR, I'm adding unit tests where each test takes ~6 seconds due to exponential backoff. When running :internal:venice-common:test --tests "com.linkedin.venice.utils.TestHelixUtils", the tests run sequentially, which takes some time. I was wondering if you knew of any way to run multiple tests in parallel. I remember that ./gradlew check --continue would spawn x threads when running the full suite, and thought perhaps I could do the same thing here. Thanks! EDIT: it doesn't actually take that long, but I thought it might be helpful to know in the future.
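Editor's note on parallel test execution: Gradle's built-in knob for this is the Test task's `maxParallelForks` property, which forks multiple test JVMs and distributes test *classes* among them (it does not parallelize methods within a single class, so TestHelixUtils itself would still run in one fork). A sketch of what that could look like in a build.gradle, assuming the Venice build does not already configure it:

```groovy
// Fork up to half the available cores' worth of test JVMs.
// Parallelism is at the granularity of test classes, not methods.
tasks.withType(Test).configureEach {
    maxParallelForks = Runtime.runtime.availableProcessors().intdiv(2) ?: 1
}
```

For methods within one class, the test framework's own parallel settings (e.g. TestNG's parallel mode) would be needed instead, and tests must be safe to run concurrently.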