# daft-github
  • g

    GitHub

    07/21/2025, 12:07 PM
    #4811 feat: support glob multiple path Pull request opened by stayrascal ## Changes Made Fix the docs for `daft.from_glob_path` and support passing multiple paths. Even though a glob pattern can be passed, it is sometimes hard to cover every case with a single pattern, so it is better to accept multiple glob paths; the result must not contain duplicate file paths when the glob paths overlap. ## Related Issues ## Checklist The docs need a refresh: `from_glob_path` didn't return a `type` column. • [x] Documented in API Docs (if applicable) • [x] Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • [ x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
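    A minimal sketch of how the proposed multi-path call might look (the list-argument form and deduplication behavior are assumptions based on the PR description, not a confirmed API):

    import daft

    # Hypothetical: pass several glob patterns; overlapping matches are deduplicated
    # so each file path appears only once in the result.
    df = daft.from_glob_path([
        "s3://bucket/images/**/*.jpeg",
        "s3://bucket/images/**/*.png",
    ])
    df.show()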
  • g

    GitHub

    07/21/2025, 7:09 PM
    #4812 fix: read/write embeddings for Parquet and Lance Pull request opened by malcolmgreaves Fixes reading embedding-typed columns from DataFrames saved to both the Parquet and Lance data formats. The primary fix is in `daft-core`'s `array/ops/cast.rs`, specifically the `FixedSizeListArray`'s `cast` implementation. There is now a match arm for `DataType::Embedding`: upon encountering it, Daft properly casts the data into an `EmbeddingArray` struct (from `datatypes::logical`). Before, this case was unrecognized and thus led to an `unimplemented!` error. Added a new Python-based test to cover the fix, `tests/io/test_roundtrip_embeddings.py`. The new `test_roundtrip_embedding` covers a roundtrip serialization to and from both Parquet and Lance. Also added new tests in `tests/series/test_cast.py`: `test_series_cast_fixed_shape_list_to_embedding` explicitly casts fixed-size list arrays into embeddings and `test_series_cast_embedding_to_fixed_shape_list` does the reverse. Supporting these tests is a new `random_numerical_embedding` helper function in `tests/utils.py`. Fixes: #4732 Eventual-Inc/Daft
    • 1
    • 1
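    A minimal roundtrip sketch in the spirit of the fix (column name, embedding size, and output path are illustrative, not taken from the test):

    import daft
    from daft import DataType

    df = daft.from_pydict({"emb": [[0.1, 0.2, 0.3, 0.4]] * 8})
    # Cast list -> fixed-size list -> embedding; this is the cast path the PR fixes.
    fixed = df["emb"].cast(DataType.fixed_size_list(DataType.float32(), 4))
    df = df.with_column("emb", fixed.cast(DataType.embedding(DataType.float32(), 4)))
    df.write_parquet("embeddings_roundtrip")
    back = daft.read_parquet("embeddings_roundtrip/**/*.parquet")
    assert back.schema()["emb"].dtype == DataType.embedding(DataType.float32(), 4)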
  • g

    GitHub

    07/21/2025, 9:42 PM
    #4813 feat: Include UDF Names in Progress Bar Pull request opened by srilman ## Changes Made As suggested, add UDF names to progress bars. Also tweaked some other names in the process to be more descriptive. Screen.Recording.2025-07-21.at.2.41.29.PM.mov ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/21/2025, 10:17 PM
    #2443 Use visitor pattern for `expr_has_agg` Issue created by kevinzwang The function `expr_has_agg` in src/daft-dsl/src/expr.rs currently traverses the expression by matching on all the expression types. We could use a tree visitor pattern instead to simplify it. Eventual-Inc/Daft
    • 1
    • 1
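    A tiny Python sketch of the idea (the actual code is Rust in src/daft-dsl; the class names here are illustrative): walk children generically and short-circuit, instead of writing a match arm per expression variant.

    class Expr:
        def children(self) -> list["Expr"]:
            return []

    class Agg(Expr):
        pass

    class Add(Expr):
        def __init__(self, left: Expr, right: Expr):
            self.left, self.right = left, right

        def children(self) -> list["Expr"]:
            return [self.left, self.right]

    def expr_has_agg(expr: Expr) -> bool:
        # Visitor-style traversal: one generic walk over children.
        if isinstance(expr, Agg):
            return True
        return any(expr_has_agg(c) for c in expr.children())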
  • g

    GitHub

    07/21/2025, 10:31 PM
    #4814 chore: new code path for scalar UDF Pull request opened by kevinzwang ## Changes Made I was trying to get kwargs to work with scalar UDFs and decided to bite the bullet and build out the entire code path for the new UDFs so that it no longer relies on the legacy UDF code. It turned out to be quite simple; the complexity of the legacy UDF code comes largely from the class-UDF support. Additional changes: • add the ability to specify kwargs to UDFs • remove `@daft.func.batch`. It was just an alias to `@daft.udf` anyway, and I think we should consider it a bit more since I don't want the batch UDF to behave exactly like the current UDF either. • Rename "Python UDF" to "legacy Python UDF" ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
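    A minimal sketch of calling a new-style scalar UDF with a keyword argument (assuming `@daft.func` infers the return dtype from the type hint and passes kwargs through, as described in this PR and the UDF roadmap below):

    import daft
    from daft import col

    @daft.func
    def scale(x: int, factor: int = 2) -> int:
        return x * factor

    df = daft.from_pydict({"x": [1, 2, 3]})
    df.select(scale(col("x"), factor=10)).show()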
  • g

    GitHub

    07/21/2025, 11:01 PM
    #2980 SQL `explode` does not properly broadcast against other columns Issue created by jaychia ### Describe the bug [image](https://private-user-images.githubusercontent.com/17691182/372313415-8196ea5c-1150-45e9-8516-613874635e46.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMxMzkyMDAsIm5iZiI6MTc1MzEzODkwMCwicGF0aCI6Ii8xNzY5MTE4Mi8zNzIzMTM0MTUtODE5NmVhNWMtMTE1MC00NWU5LTg1MTYtNjEzODc0NjM1ZTQ2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIxVDIzMDE0MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkxYTM4OWEyZjlkNzVlMTJjYjMzMTJiMjBiOTA3YmQ3NzZjY2RlYWJjYWY1MTRlN2Q0ZDAyODYyYmE4OTI1MTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.b3NqeJU-ppCyqzsWb33l178vTZjxQBeh-2kq7izydVw) ### To Reproduce
    Copy code
    import daft
    
    df = daft.from_pydict({"list": [[1,2,3], [4,5], [6]], "x": [1, 2, 3]})
    daft.sql("SELECT x, explode(list) FROM df").collect()
    ### Expected behavior No response ### Component(s) SQL ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
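    For comparison, a sketch of the expected broadcast semantics via the DataFrame API (assuming `DataFrame.explode` pairs each exploded element with its row's value of `x`); the SQL `explode` should match this:

    import daft

    df = daft.from_pydict({"list": [[1, 2, 3], [4, 5], [6]], "x": [1, 2, 3]})
    # Expected: x is broadcast against the exploded list, e.g. (1,1), (2,1), (3,1), (4,2), ...
    df.explode("list").collect()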
  • g

    GitHub

    07/21/2025, 11:02 PM
    #3016 Map type make is_nullable false for key Issue created by andrewgazelka In Daft/src/daft-schema/src/dtype.rs, line 256 as of commit dfccfe3, the Map key field is created as nullable: `arrow2::datatypes::Field::new("key", key.to_arrow()?, true)`. The key should use `is_nullable = false`; currently tests fail; need to look into it. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:04 PM
    #3060 write_parquet underestimates rowgroup sizes when targeting `parquet_target_row_group_size` Issue created by jaychia ### Describe the bug The `parquet_target_row_group_size` execution config variable is supposed to write data with a default row group size of 128MB. However, we noticed that Parquet data was being written with much larger row groups (~300+MB). Upon deeper inspection, it seems that we assume a 3x compression ratio of the data by default (`inflation_factor`) during our estimations: https://github.com/Eventual-Inc/Daft/blob/main/daft/table/table_io.py#L509-L510 This causes a 3x underestimation of the on-disk row group size when the data is actually materialized. ### To Reproduce No response ### Expected behavior No response ### Component(s) Parquet ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
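    Back-of-envelope illustration of the estimate: with an inflation factor of 3, Daft packs roughly 3 x 128MB = 384MB of in-memory data per row group, expecting it to compress down to 128MB on disk; if the data only compresses about 1:1, the written row groups land around 300+MB. A hedged sketch of tuning this (assuming `parquet_inflation_factor` is exposed on the execution config; check the actual parameter names before relying on them):

    import daft

    # Hypothetical mitigation: tell Daft not to expect any compression when
    # estimating on-disk row group sizes.
    daft.set_execution_config(
        parquet_target_row_group_size=128 * 1024 * 1024,
        parquet_inflation_factor=1.0,
    )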
  • g

    GitHub

    07/21/2025, 11:08 PM
    #3293 `is_in` does not work with none/null Issue created by universalmind303 ### Describe the bug Using `is_in` does not respect `None`s. ### To Reproduce

    import daft
    from daft import col

    df = daft.from_pydict({"nums": [1, 2, 3, None, 4, 5]})
    df.filter(col("nums").is_in([1, None])).collect()

    This produces:
    Copy code
    ╭───────╮
    │ nums  │
    │ ---   │
    │ Int64 │
    ╞═══════╡
    │ 1     │
    ╰───────╯
    (Showing first 1 of 1 rows)
    ### Expected behavior expect the output to include
    [1, None]
    not just
    [1]
    ### Component(s) SQL, Python Runner ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
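    A possible workaround until `is_in` handles nulls (a sketch that matches nulls explicitly with `is_null`):

    import daft
    from daft import col

    df = daft.from_pydict({"nums": [1, 2, 3, None, 4, 5]})
    df.filter(col("nums").is_in([1]) | col("nums").is_null()).collect()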
  • g

    GitHub

    07/21/2025, 11:12 PM
    #4815 feat: struct Expression.unnest() Pull request opened by kevinzwang ## Changes Made Add the `Expression.unnest` method. Since we already have the mechanism for unnesting with `Expression.struct.get("*")`, I'm just reusing that for now. We will need to change it once we introduce selectors anyway. ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
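    A short usage sketch (the data is illustrative; per the PR, `unnest` is currently sugar for `struct.get("*")`):

    import daft
    from daft import col

    df = daft.from_pydict({"s": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]})
    # Expands the struct column "s" into top-level columns "a" and "b".
    df.select(col("s").unnest()).show()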
  • g

    GitHub

    07/21/2025, 11:38 PM
    #2370 errors when calling write_deltalake Issue created by rkunnamp Describe the bug Getting the following error when calling write_deltalake:

    File /opt/conda/lib/python3.11/site-packages/daft/table/table_io.py:691, in write_deltalake.<locals>.file_visitor(written_file)
        689 def file_visitor(written_file: Any) -> None:
        690     path, partition_values = get_partitions_from_path(written_file.path)
    --> 691     stats = get_file_stats_from_metadata(written_file.metadata)
        693     # PyArrow added support for written_file.size in 9.0.0
        694     if ARROW_VERSION >= (9, 0, 0):
    File /opt/conda/lib/python3.11/site-packages/daft/table/table_io.py:687, in write_deltalake.<locals>.get_file_stats_from_metadata(metadata)
        686 def get_file_stats_from_metadata(metadata):
    --> 687     deltalake.writer.get_file_stats_from_metadata(metadata, -1)
    TypeError: get_file_stats_from_metadata() missing 1 required positional argument: 'columns_to_collect_stats'

    To Reproduce Steps to reproduce the behavior: 1. Go to https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and download the January 2024 yellow taxi trip record data in Parquet format (at the time of writing this bug the file was https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet) 2. Now execute the following code
    Copy code
    import daft
    dt = daft.read_parquet("yellow_tripdata_2024-01.parquet")
    dt.write_deltalake("t5")
    The error mentioned above is obtained. On inspecting the t5 folder, I found that the metadata files are not written. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:45 PM
    #1726 [BUG] Bug with tqdm progress bar display when running in Ray client mode with remote cluster Issue created by jaychia Describe the bug When running on a remote cluster via Ray client, progress bars seem to be broken:
    Copy code
    (SchedulerActor pid=180, ip=10.0.66.234) Exception in thread 0d287252-6ae5-445b-9b96-5e412af6ab5d:
    (SchedulerActor pid=180, ip=10.0.66.234) Traceback (most recent call last):
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    (SchedulerActor pid=180, ip=10.0.66.234)     self.run()
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/threading.py", line 953, in run
    (SchedulerActor pid=180, ip=10.0.66.234)     self._target(*self._args, **self._kwargs)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    (SchedulerActor pid=180, ip=10.0.66.234)     return method(self, *_args, **_kwargs)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/Users/jaychia/code/venv-demo/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/ray_runner.py", line 578, in _run_plan
    (SchedulerActor pid=180, ip=10.0.66.234)     pbar.mark_task_start(task)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/progress_bar.py", line 63, in mark_task_start
    (SchedulerActor pid=180, ip=10.0.66.234)     pb.total += 1
    (SchedulerActor pid=180, ip=10.0.66.234) AttributeError: 'tqdm' object has no attribute 'total'. Did you mean: '_total'?
    To reproduce, run a remote Ray cluster and try this:

    import daft
    import ray
    import boto3

    RAY_ADDRESS = "ray://localhost:10001"
    ray.init(runtime_env={"pip": ["getdaft==0.2.7"]}, address=RAY_ADDRESS)
    daft.context.set_runner_ray(address=RAY_ADDRESS)

    session = boto3.session.Session()
    creds = session.get_credentials()
    daft.set_planning_config(default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(
            key_id=creds.access_key,
            access_key=creds.secret_key,
            session_token=creds.token,
        )
    ))

    df = daft.read_csv("s3://noaa-global-hourly-pds/2023/**")
    df.show()

    (Running `ray==2.4.0` and `getdaft==0.2.7` both on the client side and on the cluster) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:52 PM
    #4816 New UDFs Roadmap Issue created by kevinzwang # Background As Daft becomes used for more and more multimodal/AI workflows, we see some increasing patterns around the usage of UDFs, and we'd like to redesign our UDFs to better work with these patterns. This issue tracks the progress on this redesign. The major differences between the designs of the existing (legacy) and new UDFs: 1. The default mode is to operate on a single row of data at a time, instead of a batch. 2. Instead of being separate concepts from expressions, the new UDFs will instead mirror our built-in functions as much as possible. UDFs are Daft functions, just ones that are defined by the user. 3. The legacy UDFs could be stateful using the `concurrency` parameter. The new UDFs will not be stateful. Instead, to do stateful things, we will introduce the concept of "resources". Resources are not covered in this roadmap but will be a separate issue. In addition, the scope of this work also includes some new ways to use UDFs, such as multi-column outputs, generator UDFs, async UDFs, and ergonomics around conversions between Python and Daft types. # Examples Simple scalar UDF

    @daft.func
    def my_udf(input: int) -> str:
        # return dtype is inferred from type hint
        return f"{input}"

    # can be used to construct expressions
    expr: daft.Expression = my_udf(col("a"))
    df.select(expr)

    # can also still be used with scalar values as a regular Python function
    val = my_udf(1)
    assert val == "1"

    Generator UDF

    @daft.func(return_dtype=DataType.list(DataType.int()))  # generators return list-type expressions
    def my_gen_udf(input: int):
        for i in range(input):
            yield i

    Async UDF

    # should work automatically
    @daft.func
    async def my_async_udf(input: int) -> str:
        return await api_call(input)

    Batch UDF

    @daft.func.batch
    def my_batch_udf(input: Series) -> Series:
        return input * input

    Type checking

    @daft.func
    def my_func(x: int): ...

    df = daft.from_pydict({"x": ["a", "b", "c"]})
    df.select(my_func(df["x"]))  # should error because column "x" is not compatible with int

    # Roadmap These tasks do not necessarily need to be done in order
    • MVP for scalar UDF
    • #4723
    • #4814
    • Unify and document Daft <-> Python type conversions
    • Generator UDF
    • Async UDF
    • Async generator UDF
    • Type checking/casting on UDF args using argument type hints
    • MVP for batch UDF
    • Aggregate UDF (design TBD)
    Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 12:50 AM
    #4817 fix: Fix `File writer must be created before bytes_written can be called` bug in native parquet writer Pull request opened by colin-ho ## Changes Made @desmondcheongzx this bug came up during the pdf parsing job. I don't know what the actual cause is, wonder if you know? ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 7:49 AM
    #4818 fix: update a new expression example in the contributing guide to work Pull request opened by r3stl355 ## Changes Made The "new expression" example in CONTRIBUTING.md needed a couple of small changes to work. The current implementations of expressions in the codebase are more succinct than this example, but the example is simpler to read, so instead of replacing it with the newer style used in the rest of the code I just fixed it so it works. ## Related Issues No related issues ## Checklist There are no changes to docs • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 11:16 AM
    #4819 chore: add vscode debug example with env Pull request opened by Jay-ju ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 8:57 PM
    #4821 Issue reading timestamp partitioned hive tables Issue created by cmditch ### Describe the bug We're exploring migrating from Spark to Daft. All of our data is hive managed parquet files written using Spark, and often partitioned by timestamps. Using `daft.read_parquet("/some/table/**", hive_partitioning=True)`
    results in a partition column with NaT (see attached photos). I have a fork opened which has fixed the bug for us, but I'm not sure the fix is super high quality. [Image](https://private-user-images.githubusercontent.com/15849320/469450348-7a2d7033-1b8c-47ea-bcb1-142669f8f8ea.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NTIsIm5iZiI6MTc1MzI5NTY1MiwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTAzNDgtN2EyZDcwMzMtMWI4Yy00N2VhLWJjYjEtMTQyNjY5ZjhmOGVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQxMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWIyNzY2NTZlZDk1YjgzNWYyZGZkN2M2M2U3NzU4YjhhOTAyNWY0NmQwYmNjZDlhNjIwYTI5MDNhZTgzN2Y4MDQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0._24FzKCUF1uE5br78860TmP3Dk0IIOhJZI9UHht59ZY) [Image](https://private-user-images.githubusercontent.com/15849320/469450395-2a1a8550-f4ef-455d-bd25-0457bb741f6a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NTIsIm5iZiI6MTc1MzI5NTY1MiwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTAzOTUtMmExYTg1NTAtZjRlZi00NTVkLWJkMjUtMDQ1N2JiNzQxZjZhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQxMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc2MGMzMzBlODQyNjdkMjYxYWFkYTZiMDE1YWJmODc0ZmE2NGNlNTNhYjRmM2M3ZTFjYTBkYzYzNjlhN2UyMDAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.CRtjCvKh7ROJyvC_gVEBCzKDF1X8rqdGT0f-REOS3vU) ### To Reproduce import daft df = daft.read_parquet("/my/hive/table_with_timestamp_partitions/**", hive_partitioning=True) df.limit(3).to_pandas() Eventual-Inc/Daft
  • g

    GitHub

    07/22/2025, 9:07 PM
    #4822 Issue reading iceberg table over NFS Issue created by cmditch ### Describe the bug I'm not sure the issue is with NFS, but it seems to stem from using `file:/` instead of `file://` when specifying the paths of the table's parquet files. I have a Daft fork where the issue is fixed, but I'm not sure the implementation is high quality (definitely some Cursor vibe coding going on). ### To Reproduce
    Copy code
    def get_iceberg_catalog() -> pyiceberg.catalog.Catalog:
        return pyiceberg.catalog.load_catalog(
            "hive_catalog",
            **{"type": "hive", "uri": "<thrift://cupk8:30083>"},
        )
    
    
    prices = daft.read_iceberg(get_iceberg_catalog().load_table("ercot.prices_iceberg_test"))
    prices.limit(10).to_pandas()
    Here is the error thrown, despite the file existing:
    Copy code
    IcebergScanOperator(ercot.prices_iceberg_test) has Partitioning Keys: [PartitionField(date#Timestamp(Microseconds, Some("UTC")), src=date#Timestamp(Microseconds, Some("UTC")), tfm=Identity)] but no partition filter was specified. This will result in a full table scan.
    Error when running pipeline node ScanTaskSource
    DaftError::External Internal IO Error when opening: file:/Volumes/zge-office/spark-warehouse/ercot.db/prices_iceberg_test/data/date=2023-04-04T00%3A00Z/00000-4412-79b5e377-a240-4055-afe3-26df45d1ffbb-00006.parquet:
    Details:
    No such file or directory (os error 2)
    [Image](https://private-user-images.githubusercontent.com/15849320/469452726-411e4c5b-1b51-4b5f-b6fb-db2c8f88c623.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NjEsIm5iZiI6MTc1MzI5NTY2MSwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTI3MjYtNDExZTRjNWItMWI1MS00YjVmLWI2ZmItZGIyYzhmODhjNjIzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQyMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWUxOWZkMjcxMWY1ZjA3ZjU3YjZhZGRkNmY3YThlMWQ3MjUxMmU4NDY3ZjYxZWVmN2I2YjNlMDUxZGYxYmJiZDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.3Ri7ydmMEFrG1I6l2DTy-eBhLsK7MNK4_WqJZjP-vDE) Eventual-Inc/Daft
  • g

    GitHub

    07/22/2025, 10:51 PM
    #4823 fix: add custom robots.txt Pull request opened by ccmao1130 ## Changes Made add custom robots.txt for google search indexing ## Related Issues n/a ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 12:15 AM
    #4825 feat(turbopuffer): Add client and write kwargs Pull request opened by desmondcheongzx ## Changes Made Turbopuffer client construction and writes accept many possible arguments. Instead of enumerating them and always trying to stay in sync, let's add support for kwargs. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 3:10 AM
    #4827 Added image extraction & embedding into document processing tutorial Pull request opened by malcolmgreaves ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 3:21 AM
    #4828 _from_arrow_type_with_ray_data_extensions needs to use right Ray ser/des APIs Issue created by srinathk10 ### Describe the bug With this PR, for ser/des of Arrow tensor extensions we are switching over to cloudpickle with a fallback to json. The issue is that Daft's `_from_arrow_type_with_ray_data_extensions` assumes json for deserialization. It would be best to use the appropriate APIs in Ray. ### To Reproduce
    Copy code
    import sys
    from unittest.mock import patch
    
    import daft
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pytest
    from packaging.version import parse as parse_version
    
    import ray
    
    # Daft needs to use ray.cloudpickle and json for fallback for
    # serialization/deserialization of Arrow tensor extension types.
    pytestmark = pytest.mark.skip(
        reason="Daft needs to use ray.cloudpickle and json for fallback for "
        "serialization/deserialization of Arrow tensor extension types.",
    )
    
    
    @pytest.fixture(scope="module")
    def ray_start(request):
        try:
            yield ray.init(num_cpus=16)
        finally:
            ray.shutdown()
    
    
    def test_from_daft_raises_error_on_pyarrow_14(ray_start):
        # This test assumes that `from_daft` calls `get_pyarrow_version` to get the
        # PyArrow version. We can't mock `__version__` on the module directly because
        # `get_pyarrow_version` caches the version.
        with patch(
            "ray.data.read_api.get_pyarrow_version", return_value=parse_version("14.0.0")
        ):
            with pytest.raises(RuntimeError):
                ray.data.from_daft(daft.from_pydict({"col": [0]}))
    
    
    @pytest.mark.skipif(
        parse_version(pa.__version__) >= parse_version("14.0.0"),
        reason="<https://github.com/ray-project/ray/issues/53278>",
    )
    def test_daft_round_trip(ray_start):
        data = {
            "int_col": list(range(128)),
            "str_col": [str(i) for i in range(128)],
            "nested_list_col": [[i] * 3 for i in range(128)],
            "tensor_col": [np.array([[i] * 3] * 3) for i in range(128)],
        }
        df = daft.from_pydict(data)
        ds = ray.data.from_daft(df)
        pd.testing.assert_frame_equal(ds.to_pandas(), df.to_pandas())
    
        df2 = ds.to_daft()
        df_pandas = df.to_pandas()
        df2_pandas = df2.to_pandas()
    
        for c in data.keys():
            # NOTE: tensor behavior on round-trip is different because Ray Data provides
            # Daft with more information about a column being a fixed-shape-tensor.
            #
            # Hence the Pandas representation of `df1` is "just" an object column, but
            # `df2` knows that this is actually a numpy fixed shaped tensor column
            if c == "tensor_col":
                np.testing.assert_equal(
                    np.array(list(df_pandas[c])), df2_pandas[c].to_numpy()
                )
            else:
                pd.testing.assert_series_equal(df_pandas[c], df2_pandas[c])
    
    
    if __name__ == "__main__":
        import sys
    
        sys.exit(pytest.main(["-v", __file__]))
    ### Expected behavior No response ### Component(s) Ray Runner ### Additional context No response Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 4:22 AM
    #4829 Support extremely flexible list datatype declarations Issue created by jaychia ### Is your feature request related to a problem? Writing datatypes in Daft today happens in a few places: 1. UDF return types (`return_dtype=...`) 2. .apply return types (`.apply(..., return_dtype=...)`) 3. Casting (`.cast(...)`) 4. Type hinting (`.read_parquet(schema=...)`) (I might have missed a few places) However, writing exact Daft types is quite verbose:

    from daft import DataType

    @daft.udf(return_dtype=DataType.float64())
    def f(): ...

    To fix this, we allowed some mapping of Python types to Daft types:

    @daft.udf(return_dtype=float)
    def f(): ...

    This also works for struct types, using Python dicts:

    @daft.udf(return_dtype={"foo": DataType.float64()})
    def f(): ...

    However, lists don't work! Thus building a highly complex type such as a list-of-list-of-struct-of-list is highly verbose:

    @daft.udf(return_dtype=DataType.list(DataType.list(DataType.struct({"foo": DataType.list(float)}))))
    def f(): ...

    ### Describe the solution you'd like Here is a proposal, which looks quite Pythonic, but I'm not aware of any other library that does this, which is maybe a bit concerning.

    @daft.udf(return_dtype=[[{"foo": [float, ...]}, ...], ...])

    In Python, there is the `Ellipsis` singleton. The above is completely valid Python, and expresses the same types as the highly verbose datatype variant, while remaining relatively readable. Look at this, it's beautiful!

    @daft.udf(return_dtype={
        "bboxes": [[float, 4], ...],  # fixed size syntax
        "objects": [str, ...],
        "image": [[[DataType.int8(), ...], ...], 3],  # mix and match for specific datatypes
        "metadata": dict[str, int],  # map type
    })

    ### Describe alternatives you've considered No response ### Additional Context No response ### Would you like to implement a fix? No Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 10:05 AM
    #4810 feat: Add `get_runner_type` method to support getting the currently used Runner type Pull request opened by plotor ## Changes Made We found that in some scenarios users need to obtain Daft's runner type inside a UDF, but currently it can only be obtained through `daft.context.get_context()._runner.name`. The problem is that a UDF running on a Ray worker gets a `None` result when calling `daft.context.get_context()._runner`, so this PR adds a `daft.context.get_context().get_runner_type()` method. The method works as follows: 1. Prefer `daft.context.get_context()._runner` to determine the runner type; 2. If `daft.context.get_context()._runner` is `None`, call the `detect_ray_state` method to determine whether we are currently running on Ray. If so, the runner type is considered to be `ray`, otherwise `native`. In addition, I found that when the `DAFT_RUNNER` env var is inconsistent with `set_runner_xxx`, Daft prioritizes the `set_runner_xxx` settings, so I added some warning logs to remind users. ## Related Issues No issue ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
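    A small sketch of the intended usage inside a UDF (column names are illustrative; per the PR, `get_runner_type()` resolves to `ray` or `native` even on Ray workers where `_runner` is `None`):

    import daft

    @daft.udf(return_dtype=daft.DataType.string())
    def tag_runner(xs):
        # Resolve the runner type from within the UDF.
        runner = daft.context.get_context().get_runner_type()
        return [runner] * len(xs)

    df = daft.from_pydict({"x": [1, 2, 3]})
    df.select(tag_runner(daft.col("x"))).show()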
  • g

    GitHub

    07/23/2025, 1:00 PM
    #4830 chore: upgrade py-spy version Pull request opened by Jay-ju ## Changes Made The current version of py-spy in ray is `py-spy==0.4.0`, while the version pinned in daft is `py-spy==0.3.14`; installing them simultaneously would cause a version conflict. Therefore, a small change was made here to change the py-spy requirement to a version range. ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 2:46 PM
    #4831 fix: expand ~ to home directory in deltatable read and write Pull request opened by r3stl355 ## Changes Made Added expansion of `~` to the HOME directory in `read_deltalake` and `write_deltalake`. These use different backends (read uses `delta-rs`, write uses `daft`), so I implemented the changes in the Python API. ## Related Issues Closes #4786. ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
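    A minimal sketch of the kind of change described (the helper name is hypothetical; the PR does this in the Python API before handing the path to either backend):

    import os

    def _expand_home(path: str) -> str:
        # Expand a leading "~" to the user's home directory; leave other paths untouched.
        return os.path.expanduser(path) if path.startswith("~") else path

    assert _expand_home("~/tables/events").startswith(os.path.expanduser("~"))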
  • g

    GitHub

    07/23/2025, 7:02 PM
    #4833 ci: fail on timeout Pull request opened by colin-ho ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 7:08 PM
    #4834 fix: read/write embeddings for Parquet and Lance Pull request opened by malcolmgreaves ## Changes Made Fixes reading embedding-typed columns from DataFrames saved to both the Parquet and Lance data formats. The primary fix is in `daft-core`'s `array/ops/cast.rs`, specifically the `FixedSizeListArray`'s `cast` implementation. There is now a match arm for `DataType::Embedding`: upon encountering it, Daft properly casts the data into an `EmbeddingArray` struct (from `datatypes::logical`). Before, this case was unrecognized and thus led to an `unimplemented!` error. The secondary fix is related to this Arrow 8 bug: https://issues.apache.org/jira/browse/ARROW-12201 . When writing to Parquet in Arrow 8, the logical datatype information for u32 values is incorrectly written; the values are written as signed 64-bit integers instead! The fix is to always use Parquet format version >= 2.0. In later versions of Arrow, the default switches from Parquet version 1.0 to >= 2.0. Here, the fix is to always set `version=2.6` in the `ParquetWriter` object used in `daft.io.writer`. Added a new Python-based test to cover the fix, `tests/io/test_roundtrip_embeddings.py`. The new `test_roundtrip_embedding` covers a roundtrip serialization to and from both Parquet and Lance. Also added new tests in `tests/series/test_cast.py`: `test_series_cast_fixed_shape_list_to_embedding` explicitly casts fixed-size list arrays into embeddings and `test_series_cast_embedding_to_fixed_shape_list` does the reverse. Supporting these tests is a new `random_numerical_embedding` helper function in `tests/utils.py`. One drive-by change: updated to use non-deprecated `rand` functions & types in `src/daft-core/src/array/ops/cast.rs`. Fixes: #4732 ## Related Issues #4732 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
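    A standalone PyArrow illustration of the format-version point (not Daft's writer; the file name is arbitrary): with `version="2.6"` an unsigned 32-bit column round-trips with its logical type intact, whereas the 1.0 format stores it as a signed 64-bit integer.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"ids": pa.array([1, 2, 3], type=pa.uint32())})
    with pq.ParquetWriter("ids.parquet", table.schema, version="2.6") as writer:
        writer.write_table(table)
    # The uint32 logical type is preserved when using format version 2.6.
    assert pq.read_table("ids.parquet").schema.field("ids").type == pa.uint32()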
  • g

    GitHub

    07/23/2025, 8:55 PM
    #4836 feat: Use daft-decoding for hive value deserialization Pull request opened by colin-ho ## Changes Made Use the existing daft-decoding code used in csv deserialization for hive partition values as well. Previously, we were using arrow2's utf8 cast to do this, which caused issues as seen in #4821. ## Related Issues Closes #4821 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 11:42 PM
    #4838 docs: categorize functions and make each function its own page Pull request opened by kevinzwang ## Changes Made Change Daft functions docs to categorize them, as I wanted to do with expressions before. Now, the functions are split into different files under
    daft/functions/
    , and if you add a function, it will automatically be added to the functions page in the category corresponding to the file it's in! Screenshots: [image](https://private-user-images.githubusercontent.com/20215378/470043426-c5f1d309-1c57-4571-97b6-d303328bfb5a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMzMjA1NzYsIm5iZiI6MTc1MzMyMDI3NiwicGF0aCI6Ii8yMDIxNTM3OC80NzAwNDM0MjYtYzVmMWQzMDktMWM1Ny00NTcxLTk3YjYtZDMwMzMyOGJmYjVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzI0VDAxMjQzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ4Y2VmZDhiOGFkYjE2ZTlhZGNhZDY0Njg2NjMyMjA3MmYyNTUzYzU0ZDI1NzJmMDA5Y2YwZjk3MTllM2UwMDkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.mrkzW1Ej0Pwqoe2t-5ZFtge9Oj1GIN6qP_N3Z-uzACc) [image](https://private-user-images.githubusercontent.com/20215378/470034468-4edc95bc-f2a1-4d09-97cb-b15eff960a02.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMzMjA1NzYsIm5iZiI6MTc1MzMyMDI3NiwicGF0aCI6Ii8yMDIxNTM3OC80NzAwMzQ0NjgtNGVkYzk1YmMtZjJhMS00ZDA5LTk3Y2ItYjE1ZWZmOTYwYTAyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzI0VDAxMjQzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTYxYTk5NWJhYWEyYjAwY2ZmMTQzZTljYWNhZDYxZmUyZjI2MjExYjJkMmQyMDk0YTY3ZGEyNzFmYzRkNDZjYjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.XaYGY9z3xcY6DFWuHfe0DdJmD9TcvHVfi8cglEu5zic) ## Related Issues #4737 #4824 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1