https://www.getdaft.io logo
Join Slack
Powered by
# local-readers
  • s

    Sammy Sidhu

    05/01/2024, 6:48 PM
    @Cory Grinstead @Clark Zinzow Adding y'all here to talk about the local readers that Cory is working on. For context, @Clark Zinzow built out the streaming CSV and JSON readers.
    🙌 1
  • c

    Clark Zinzow

    05/01/2024, 6:52 PM
    @Cory Grinstead Happy to answer any questions you have! And I'm really looking forward to having more performant local readers. 😄
  • c

    Cory Grinstead

    05/03/2024, 5:16 PM
    Hey so i finally have some time to get pretty heads down on this today. Considering I wrote most of the json reader in polars, I was planning on trying to port over as much of it as possible to not reinvent the wheel. However, I do think that given it'd be a "substantial portion" of code, it requires including the original MIT license on the ported code.
    s
    • 2
    • 3
  • c

    Cory Grinstead

    05/04/2024, 4:56 PM
    PR that swaps out json_deserializer for simd_json https://github.com/Eventual-Inc/Daft/pull/2228. I'm seeing ~10-20% faster times on this branch!
    🚀 1
    👀 1
    c
    • 2
    • 1
  • s

    Sammy Sidhu

    05/04/2024, 6:08 PM
    That’s incredible @Cory Grinstead!
  • s

    Sammy Sidhu

    05/04/2024, 6:09 PM
    @Clark Zinzow would prob be the best to review
  • c

    Cory Grinstead

    05/04/2024, 6:49 PM
    looks like there is one issue with the simd_json impl, it doesn't guarantee the ordering, so
    test_load_json
    is failing because the columns are out of order.
    👍 1
  • c

    Clark Zinzow

    05/04/2024, 7:13 PM
    @Cory Grinstead Yes, unfortunately preserving the order of keys in
    Object
    values isn't supported in `simd_json`: https://github.com/simd-lite/simd-json/issues/270
  • c

    Clark Zinzow

    05/04/2024, 7:27 PM
    I also left an initial review of the PR - outside of the preservation of column ordering, the only other question I had was around preservation of exponent format for ints/floats when decoding into a string column. Otherwise, the PR looks great!
  • c

    Clark Zinzow

    05/04/2024, 11:12 PM
    @Cory Grinstead I left another review! I can merge the PR as soon as you validate my understanding for the unsafe code safety conditions.
  • c

    Cory Grinstead

    05/09/2024, 5:50 PM
    I have a working POC of the local json reader! https://github.com/Eventual-Inc/Daft/pull/2264 It still needs quite a bit of work, but thought I'd at least share the initial draft!
    🙌 1
    s
    c
    • 3
    • 19
  • c

    Cory Grinstead

    05/10/2024, 6:58 PM
    After i finish up ☝️ I have a few additional ideas I want to try out. • There’s a few copies that need to be done by using the
    StreamDeserializer
    , I think if we created our own equivalent, we’d be able to avoid a few of those copies. • We can also make things a bit faster if we use the
    simd_json
    Tape
    directly instead of using the intermediate
    Value
    struct. It'll allow us to better utilize projection pushdowns.
  • s

    Sammy Sidhu

    05/10/2024, 7:08 PM
    I see, so skip decoding the columns that we don’t need?
  • c

    Cory Grinstead

    05/10/2024, 9:17 PM
    yeah exactly, Given that we have a known schema when decoding, we can skip irrelevant key/value pairs.
  • c

    Cory Grinstead

    05/10/2024, 10:09 PM
    It looks like the current streaming implementation applies the projection after parsing
  • s

    Sammy Sidhu

    05/13/2024, 5:57 PM
    Ah we should fix that and skip decoding those then!
  • s

    Sammy Sidhu

    05/13/2024, 5:58 PM
    Can you do a follow up and fix that in the streaming reader as well?
  • s

    Sammy Sidhu

    05/13/2024, 5:59 PM
    Also, is the local reader pr ready to review?
  • c

    Cory Grinstead

    05/13/2024, 6:01 PM
    Also, is the local reader pr ready to review?
    yes it's ready for review.
  • c

    Clark Zinzow

    05/13/2024, 6:07 PM
    @Cory Grinstead I don't think that's the case. IIRC, we push the column selection projection into parsing but include columns that are required for a filter predicate, and that line that you linked is a post-filter selection to ensure we drop any filter predicate columns not included in the projection.
  • c

    Clark Zinzow

    05/13/2024, 6:08 PM
    I could be misremembering, but I believe that was my intended implementation at least!
  • c

    Cory Grinstead

    05/13/2024, 6:34 PM
    I don't think that's the case. IIRC, we push the column selection projection into parsing but include columns that are required for a filter predicate, and that line that you linked is a post-filter selection to ensure we drop any filter predicate columns not included in the projection.
    are there any tests for this scenario? I want to make sure that the new local reader handles this case as well.
  • c

    Cory Grinstead

    05/16/2024, 2:53 AM
    I think i got most of the feedback on the JSON PR addressed. There are a few outstanding comments that I'll get to on Friday. (I'll be offline all day tomorrow).
  • s

    Sammy Sidhu

    05/16/2024, 7:50 PM
    Thanks Cory!