# general
  • r

    Rakesh Jain

    09/12/2025, 10:15 PM
    Hello Daft team, for the Lakevision project, which is for visualizing Iceberg based Lakehouses, we use daft for SQL and Sample Data, and we are very happy with it. Thanks for the great work!
    ❤️ 4
  • k

    Kyle

    09/15/2025, 6:22 AM
    Are there any plans to have a function to generate the perplexity of a model on a given text? E.g. perplexity of qwen1.5b on a particular string?
    s
    • 2
    • 2
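    (There is no built-in perplexity expression mentioned here; below is a minimal sketch of computing it yourself with a Daft UDF, assuming transformers and torch are installed. The Qwen checkpoint name is illustrative only.)
    Copy code
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import daft
    from daft import DataType, udf

    MODEL = "Qwen/Qwen2.5-1.5B"  # illustrative checkpoint name

    @udf(return_dtype=DataType.float64())
    def perplexity(texts: daft.Series):
        # Loading inside the UDF keeps the sketch short; in practice use a
        # stateful UDF so the model is loaded once per worker.
        tok = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
        out = []
        for text in texts.to_pylist():
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss  # mean token cross-entropy
            out.append(float(torch.exp(loss)))  # perplexity = exp(loss)
        return out

    df = daft.from_pydict({"text": ["hello world", "the quick brown fox"]})
    df = df.with_column("ppl", perplexity(df["text"]))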
  • u

    吕威

    09/16/2025, 7:22 AM
    Hi guys, I'm confused by DataType.image(). I'm trying to use a YOLO model to detect objects in images, then crop each detected object to pass to a downstream embedding step.
    Copy code
    import cv2
    import numpy as np
    from typing import List

    from daft import DataType, Series, udf

    @udf(
        return_dtype=DataType.list(
            DataType.struct(
                {
                    "class": DataType.string(),
                    "score": DataType.float64(),
                    "cropped_img": DataType.image(),
                    "bbox": DataType.list(DataType.int64()),
                }
            )
        ),
        num_gpus=1,
        batch_size=16,
    )
    class YOLOWorldOnnxObjDetect:
        def __init__(
            self,
            model_path: str,
            device: str = "cuda:0",
            confidence: float = 0.25,
        ):
            # init model (sets self.yolo and self.confidence)
            pass

        def __call__(self, images_2d_col: Series) -> List[List[dict]]:
            images: List[np.ndarray] = images_2d_col.to_pylist()
            results = self.yolo.predict(source=images, conf=self.confidence)
            objs: List[List[dict]] = []
            for r in results:
                img_result = []
                orig_img = r.orig_img
                for box in r.boxes:
                    x1, y1, x2, y2 = box.xyxy[0].cpu().numpy().astype(int)
                    x1, y1 = max(0, x1), max(0, y1)
                    x2, y2 = min(orig_img.shape[1], x2), min(orig_img.shape[0], y2)
                    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
                    cls = int(box.cls[0])
                    img_result.append(
                        {
                            "class": self.yolo.names[cls],
                            "score": float(box.conf[0]),
                            "cropped_img": {
                                "cropimg": cv2.cvtColor(
                                    orig_img[y1:y2, x1:x2], cv2.COLOR_BGR2RGB
                                ),
                            },
                            "bbox": [x1, y1, x2, y2],
                        }
                    )
                objs.append(img_result)
            return objs
    The cropped_img has to be returned wrapped in a dict; if I return the np.ndarray directly, it raises `Could not convert array(..., dtype=uint8) with type numpy.ndarray: was expecting tuple of (key, value) pair`. Why?
    e
    r
    • 3
    • 9
  • c

    ChanChan Mao

    09/22/2025, 4:00 PM
    hey all! i wanted to share this use case that we came across in our community 😊 authored by @YK
    When John Shelburne of CatFIX Technology needed to process 59 million bond market records (28M bonds + 31M prices + trade ideas) for his neural network model, he had a few options:
    • Continue using pandas (taking 3 days for processing)
    • Move to Vertex AI or another cloud service (more costs)
    Essentially, he was choosing between painfully slow local processing or expensive cloud compute. That's when he discovered Daft. By switching from pandas to Daft, John achieved:
    • 7.3 minutes total runtime (vs 3 days previously)
    • 600x faster processing
    • 50% lower memory usage
    • All running locally on his iMac
    As John put it: "HOLY SMOKES WAS IT FAST! My neural network model is now trainable in real time."
    See his original post: https://www.linkedin.com/posts/shelburne_daft-eventual-katana-activity-7371644995176460288-a02g
    daft party 6
    🙌 6
  • c

    ChanChan Mao

    09/23/2025, 12:33 AM
    Hey everyone! Just wanted to share in this channel that we're bringing back the Daft Contributor Sync series, where we'll highlight work in the open source project, cover the latest releases and features, and shout out our contributors! This month's contributor sync will be this Thursday, September 25, at 4pm PT. We'll be talking about major improvements that we've shipped in the last few months, like Model APIs, UDF improvements, integrations with Turbopuffer, Clickhouse, and Lance, and our new
    daft.File
    datatype. Following that, @Colin Ho will dive into his work on Flotilla, our distributed engine, and showcase some exciting benchmark results 👀 We'll leave plenty of time at the end for questions and discussions. Add to your calendar and we'll see you then! 👋
    daft party 4
  • g

    Garrett Weaver

    09/23/2025, 10:28 PM
    any recommendations on dealing with protobuf? technically I can convert to JSON, but I would like to convert to a structured type automatically for easier access.
    s
    • 2
    • 2
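    (One possible approach in the meantime: decode the protobuf bytes in a UDF and return a struct. A minimal sketch; MyEvent, my_proto_pb2, and the struct fields are hypothetical stand-ins for your compiled message class and its schema:)
    Copy code
    import daft
    from daft import DataType, udf
    from my_proto_pb2 import MyEvent  # hypothetical generated message class

    @udf(
        return_dtype=DataType.struct(
            {"id": DataType.int64(), "name": DataType.string()}
        )
    )
    def decode_events(raw: daft.Series):
        out = []
        for payload in raw.to_pylist():
            msg = MyEvent()
            msg.ParseFromString(payload)  # protobuf-encoded bytes -> message
            out.append({"id": msg.id, "name": msg.name})
        return out

    df = df.with_column("event", decode_events(df["raw_bytes"]))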
  • a

    Amir Shukayev

    09/24/2025, 11:38 PM
    any recommendations on running daft in a managed instance group on GCP?
    c
    • 2
    • 1
  • n

    Nathan Cai

    09/24/2025, 11:59 PM
    Hey guys, quick question, but isn't this supposed to say
    Copy code
    # Supply actual values for the s3
    Not
    Copy code
    # Supply actual values for the se
    in the docs? https://docs.daft.ai/en/stable/connectors/aws/#rely-on-environment
    Copy code
    from daft.io import IOConfig, S3Config
    
    # Supply actual values for the se
    io_config = IOConfig(s3=S3Config(key_id="key_id", session_token="session_token", secret_key="secret_key"))
    
    # Globally set the default IOConfig for any subsequent I/O calls
    daft.set_planning_config(default_io_config=io_config)
    
    # Perform some I/O operation
    df = daft.read_parquet("s3://my_bucket/my_path/**/*")
    d
    • 2
    • 2
  • c

    ChanChan Mao

    09/25/2025, 11:01 PM
    Hey everyone, we're live for the contributor sync! Join us 🙂 https://us06web.zoom.us/j/89647699067?pwd=b0jsNnL9yT1L2wTsDoG6kh9e83kcp7.1&jst=2
    ❤️ 3
  • g

    Garrett Weaver

    09/26/2025, 4:58 AM
    when running a UDF with native runner and
    use_process=True
    everything works fine locally (mac), but seeing
    /usr/bin/bash: line 1:    58 Bus error               (core dumped)
    when run on k8s (argo workflows). any thoughts?
    c
    • 2
    • 3
  • g

    Garrett Weaver

    09/26/2025, 5:11 PM
    I am using
    explode
    with the average result being 1 row --> 12 rows and the max 1 row --> 366 rows (~5m rows --> ~66m rows). I'm seeing decently high memory usage during the explode even with a repartition prior to the explode. Is the only remedy more partitions and/or a reduced number of CPUs to reduce parallelism?
    c
    k
    • 3
    • 10
  • n

    Nathan Cai

    09/29/2025, 1:00 PM
    Hi there, just a quick question, what value does Daft provide compared to me just using multi-threaded programming? In which use cases does Daft go above and beyond just multi-threading?
    c
    k
    • 3
    • 3
  • r

    Robert Howell

    09/30/2025, 4:39 PM
    👋 Looking to get started with Daft? We’ve curated some approachable Good First Issues which the team is happy to help you with!
    New Expressions: these are fun because you'll get to introduce entirely new capabilities.
    • Add `Expression.var`
    • Add `Expression.pow`
    • Add `Expression.product`
    • Support `ddof` argument for `stddev`
    • Support simple case and searched case with a list of branches
    • UUID function <-- very easy and very 🆒
    Enhancements: these are interesting because you'll get to learn from the existing functionality and extend current capabilities.
    • Support `Series[start:end]` slicing
    • Support image hash functions for deduplication
    • Equality on structured types <-- this would be a great addition
    • Hash rows of dataframes
    • Custom CSV delimiters <-- also a great addition!!
    • Custom CSV quote characters
    • Custom CSV date/time formatting
    Documentation: we are ALWAYS interested in documentation PRs that improve docs.daft.ai. Here's a mega-list which you could work through with coding agents to curate something nice!
    • https://github.com/Eventual-Inc/Daft/issues/4125
    Full List:
    • https://github.com/Eventual-Inc/Daft/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22&page=1
    ❤️ 6
    🙌 1
  • d

    Dan Reverri

    09/30/2025, 9:50 PM
    I’m trying to read a large csv file and want to iterate over arrow record batches but daft is reading the whole file before starting to yield batches. Are there any options to process the csv file in chunks?
    d
    c
    • 3
    • 19
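    (As a fallback outside Daft, assuming the file is readable locally and pyarrow is installed, pyarrow's streaming CSV reader yields record batches incrementally instead of materializing the whole file. A minimal sketch; process() is a hypothetical per-batch handler:)
    Copy code
    import pyarrow.csv as pacsv

    reader = pacsv.open_csv(
        "large.csv",
        read_options=pacsv.ReadOptions(block_size=64 * 1024 * 1024),  # 64 MB chunks
    )
    for batch in reader:  # each item is a pyarrow.RecordBatch
        process(batch)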
  • c

    ChanChan Mao

    10/03/2025, 9:35 PM
    If you're new to Daft and our community, I highly recommend reading this blog post by @Euan Lim giving an overview of Daft! The technical concepts were explained very well and are easily digestible. Thanks for this writeup!
    🙌 6
    e
    • 2
    • 1
  • c

    Colin Ho

    10/15/2025, 4:14 PM
    Hey everyone! We just pushed some updates to the Daft docs on optimization and debugging — especially around memory and partitioning.
    • Managing memory: tips on how to tune Daft to lower memory usage and avoid OOMs.
    • Optimizing partitions / batches: details on how Daft parallelizes data across cores / machines, and how to control it.
    Do check them out if you face any difficulties like high memory usage or low CPU utilization within your workloads!
    daft party 4
    s
    • 2
    • 1
  • f

    fr1ll

    10/15/2025, 11:36 PM
    Hi there- an API question for you. I’m working on a pipeline to generate input data for Apple’s new
    embedding-atlas
    library. At the end, I need to run UMAP against the full set of embeddings. I currently resort to the clunky approach below. Am I missing a better way to add a column from a numpy array of the same length as the dataframe?
    Copy code
    import numpy as np
    import pyarrow as pa

    vecs = np.stack([r["embedding"] for r in df.select("embedding").to_pylist()], axis=0)
    
    # little helper wrapping umap
    xy, knn_indices, knn_distances = umap_with_neighbors(vecs).values()
    
    df = df.with_column("_row_index", daft.sql_expr("row_number()") - 1)
    
    umap_cols = daft.from_pydict({
        "_row_index": list(range(xy.shape[0])),
        "umap_x": daft.Series.from_arrow(pa.array(xy[:,0])),
        "umap_y": daft.Series.from_arrow(pa.array(xy[:,1])),
    })
     
    df = df.join(umap_cols, on="_row_index")
    m
    • 2
    • 5
  • n

    Nathan Cai

    10/17/2025, 4:00 PM
    Hi there, I'm making a contribution for https://github.com/Eventual-Inc/Daft/issues/3786 I think I've done it, but I was wondering if someone could help explain to me how the testing works. I already see function signatures in
    tests/connect/test_io.py
    Copy code
    @pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
    def test_write_csv_with_delimiter(make_df, make_spark_df, spark_session, tmp_path):
        pass
    But the thing is, I don't know what these arguments mean, and I don't see that function being invoked anywhere else, so I don't know where those arguments are being provided from. The other thing is, its function signature is different from the ones above it:
    Copy code
    def test_csv_basic_roundtrip(make_spark_df, assert_spark_equals, spark_session, tmp_path):
    And I think I might need the
    assert_spark_equals
    function in my scenario. I was wondering if someone would be willing to point me in the right direction.
    r
    m
    • 3
    • 16
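    (For context on those arguments: they are pytest fixtures, injected by matching parameter names against fixtures registered in a conftest.py somewhere up the test tree, which is likely where make_df, make_spark_df, spark_session, and assert_spark_equals come from; tmp_path is a built-in pytest fixture. A minimal illustration with a hypothetical fixture:)
    Copy code
    import pytest

    @pytest.fixture
    def greeting():
        # pytest calls this and passes the return value to any test that
        # declares a parameter named `greeting`
        return "hello"

    def test_uses_fixture(greeting, tmp_path):
        # `tmp_path` is pytest's built-in temporary-directory fixture
        (tmp_path / "out.txt").write_text(greeting)
        assert (tmp_path / "out.txt").read_text() == "hello"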
  • b

    Ben Cornelis

    10/17/2025, 5:37 PM
    Hello! I have a question about flotilla runner memory usage. I'm running a daft query against a ray cluster running on k8s (azure). I'm attempting to read a large parquet table (~1.8T, 50k files) from azure blob store, performing a group by with a count aggregation, and then writing the resulting dataframe to another directory. The script and query look like:
    Copy code
    import daft
    from daft import col

    df = daft.read_parquet(
        "abfs://my_container@my_service_account.dfs.core.windows.net/in_table_path/"
    )
    df = df.groupby("group_col").agg(col("count_col").count())
    df.write_parquet("abfs://my_container@my_service_account.dfs.core.windows.net/out_table_path/")
    (it's worth noting that I have daft pinned to 0.5.19 since I currently get an error using
    write_parquet
    to azure blob store on later versions. It is the same issue as here: https://github.com/Eventual-Inc/Daft/issues/5336) When I submit a job running this script to my ray cluster, it eventually fails with this error:
    Copy code
    ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
    Memory on the node (IP: 10.224.2.36, ID: 637c809d2d9dfed8dd5c109e7ff9b81e4c10a024322086029a8f6def) where the task (actor ID: 53ff539b6349c1bee7efc01802000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1631, memory used=24.82GB) was running was 30.46GB / 32.00GB (0.951874), which exceeds the memory usage threshold of 0.95.
    Is it expected that the flotilla runner on the head node consumes this much memory?
    c
    • 2
    • 12
  • b

    Ben Cornelis

    10/17/2025, 9:34 PM
    I'm getting a connection reset error periodically reading parquet from ADLS:
    Copy code
    daft.exceptions.DaftCoreException: DaftError::External Cached error: Unable to open file abfs://path/to/my.parquet: Error { context: Full(Custom { kind: Io, error: reqwest::Error { kind: Decode, source: hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) } }, "error converting `reqwest` request into a byte stream") }
    With the job I mentioned above with 50k files it's happening often enough that the job always fails. Has anyone ever seen this?
    k
    • 2
    • 35
  • b

    Ben Cornelis

    10/20/2025, 8:13 PM
    Is it possible to pass an ADLS uri containing a storage account name instead of passing env / config variables? I'm getting this error `Azure Storage Account not set and is required. Set either
    AzureConfig.storage_account
    or the
    AZURE_STORAGE_ACCOUNT
    environment variable.` but I also see code attempting to parse it from the uri: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-io/src/azure_blob.rs#L116. I'm not sure why that's not working. I'm using an abfs uri of the form
    abfs://container@account_name.dfs.core.windows.net/path/
    k
    • 2
    • 6
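    (One workaround, assuming the config route is acceptable: set the storage account explicitly on an IOConfig and pass it to the read. A minimal sketch; the account and container names are placeholders:)
    Copy code
    import daft
    from daft.io import AzureConfig, IOConfig

    io_config = IOConfig(azure=AzureConfig(storage_account="account_name"))
    df = daft.read_parquet(
        "abfs://container@account_name.dfs.core.windows.net/path/",
        io_config=io_config,
    )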
  • n

    Nathan Cai

    10/21/2025, 9:03 PM
    By the way, does anyone actually run
    make test
    on their own devices? It burns a hole in my RAM when I run it on my laptop
    d
    m
    s
    • 4
    • 3
  • v

    VOID 001

    10/23/2025, 7:40 AM
    Hi, is there any native data source support for Daft reading from Hive?
    k
    • 2
    • 1
  • b

    Ben Cornelis

    10/24/2025, 7:40 PM
    I'm running a group by query on the 1.8TB dataset I mentioned above. The ray job is failing because some RaySwordfishActors are running out of memory, either during the repartition or write parquet step (it looks like they're interleaved, so it's hard to tell). The log I'm seeing is:
    (raylet) node_manager.cc:3193: 1 Workers (tasks / actors) killed due to memory pressure (OOM)
    I've tried various k8s pod sizes - currently I have 80 worker pods with 8 CPU / 32GB. Does anyone know why so much memory would be used, or how to tune / think about queries like this?
    ✅ 1
    d
    • 2
    • 14
  • g

    Garrett Weaver

    10/24/2025, 8:17 PM
    Is
    parquet_inflation_factor
    still respected in flotilla? I have a parquet file that is ~2GB but expands to ~16GB in memory. When trying to run a select with a window function, it tries to load all the data into memory and struggles (OOMing). What is the best course of action here?
    ✅ 1
    m
    c
    • 3
    • 8
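    (For reference, a sketch of raising the inflation hint via the execution config, assuming the knob is still exposed there in the version being used; whether flotilla respects it is exactly the open question above:)
    Copy code
    import daft

    # Hint that parquet data expands ~8x when decoded, so the planner
    # schedules smaller scan tasks / batches.
    daft.set_execution_config(parquet_inflation_factor=8.0)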
  • g

    Garrett Weaver

    10/24/2025, 9:35 PM
    another question: in the latest version, I am seeing an arrow schema mismatch on write, with some new metadata in the schema. It works fine with the pyarrow writer (
    native_parquet_writer=False
    )
    ✅ 1
    d
    • 2
    • 7
  • t

    takumi hayase

    10/28/2025, 12:39 PM
    I am looking for a new position as a full-stack developer now.
  • b

    Ben Cornelis

    10/28/2025, 3:40 PM
    I'm seeing lower than expected CPU utilization, 30-40% (I originally posted 5-20% but misread it), on my k8s pods running a query - any tips on debugging or optimizing this?
    j
    • 2
    • 4
  • v

    VOID 001

    10/29/2025, 1:33 PM
    I've found that Daft is now moving to a new stateful UDF implementation: daft.cls. For developers who are already using daft.udf, does the community suggest switching to daft.cls now (for production-related workloads)? And what functionality does daft.cls currently not support? Also, what's the cost of migrating from daft.udf to daft.cls? IMO users only need to change the decorator name plus the function call arguments. Is there anything else I need to be concerned about? Thanks!
    c
    e
    • 3
    • 8
  • n

    Nathan Cai

    10/30/2025, 10:30 PM
    Hi there, I was planning on working on this issue of adding a custom delimiter to the
    write_csv
    option https://github.com/Eventual-Inc/Daft/issues/3786, but I realized that since I'm using one field of PyArrow's csv WriteOptions (https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions), I might as well implement support for all the fields, as it shouldn't be too difficult. This is the original PyArrow class:
    Copy code
    @dataclass(kw_only=True)
    class WriteOptions(lib._Weakrefable):
        include_header: bool = field(default=True, kw_only=False)
        batch_size: int = 1024
        delimiter: str = ","
        quoting_style: Literal["needed", "all_valid", "none"] = "needed"
    
        def validate(self) -> None: ...
    I want to implement the corresponding class in Rust like this:
    Copy code
    #[derive(Debug, Clone)]
    pub struct CSVWriteOptions {
        pub include_header: bool,
        pub batch_size: usize,
        pub delimiter: char,
        pub quoting_style: CSVQuotingStyle,
    }
    
    #[derive(Debug, Clone)]
    pub enum CSVQuotingStyle {
        Needed,
        AllValid,
        None,
    }
    However, there are a couple of obstacles. The first is that it seems Rust doesn't support default field values, so I guess I'd create a special constructor (or a Default impl) for the defaults? But the main problem is the
    quoting_style
    field: it accepts only specific strings. Normally this is fine, but the problem is that when that Rust struct gets passed to Daft's Python CSV writer class, which hands those options to PyArrow, PyArrow accepts a string for
    quoting_style
    , not an enum. What is the cleanest approach to this? As far as I'm aware you can't restrict a string to specific values in Rust; you have to use an enum. Should I just create two structs: the one above, and a new one that's the same but accepts
    quoting_style
    as a String, with the original
    CSVWriteOptions
    having a function that converts it into the new struct by turning the
    quoting_style
    into a string? How does Daft deal with interoperability between Python string literals and Rust enums? I feel like this is something you have encountered before, and I want to know your standard way of doing this.
    d
    • 2
    • 4