# daft-github
  • g

    GitHub

    07/21/2025, 12:07 PM
    #4811 feat: support glob multiple path Pull request opened by stayrascal ## Changes Made Fix the docs for `daft.from_glob_path` and support passing multiple paths. Even though a glob pattern can be passed, it is sometimes hard to cover every case with a single pattern, so it is better to accept multiple glob paths; the result must not contain duplicate file paths when the glob paths overlap. ## Related Issues ## Checklist The docs need a refresh: `from_glob_path` didn't return a `type` column. • [x] Documented in API Docs (if applicable) • [x] Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • [ x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
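    A minimal sketch of how the proposed multi-path call might look (the list-argument form and deduplication behavior are assumptions based on the PR description, not a confirmed API):

    import daft

    # Hypothetical: pass several glob patterns; overlapping matches are deduplicated
    # so each file path appears only once in the result.
    df = daft.from_glob_path([
        "s3://bucket/images/**/*.jpeg",
        "s3://bucket/images/**/*.png",
    ])
    df.show()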
  • g

    GitHub

    07/21/2025, 7:09 PM
    #4812 fix: read/write embeddings for Parquet and Lance Pull request opened by malcolmgreaves Fixes reading embedding-typed columns from DataFrames saved to both the Parquet and Lance data formats. The primary fix is in `daft-core`'s `array/ops/cast.rs`, specifically the `FixedSizeListArray`'s `cast` implementation. There is now a match arm for `DataType::Embedding`: upon encountering it, Daft properly casts the data into an `EmbeddingArray` struct (from `datatypes::logical`). Before, this case was unrecognized and thus led to an `unimplemented!` error. Added a new Python-based test to cover the fix, `tests/io/test_roundtrip_embeddings.py`. The new `test_roundtrip_embedding` covers a roundtrip serialization to and from both Parquet and Lance. Also added new tests in `tests/series/test_cast.py`: `test_series_cast_fixed_shape_list_to_embedding` explicitly casts fixed-size list arrays into embeddings and `test_series_cast_embedding_to_fixed_shape_list` does the reverse. Supporting these tests is a new `random_numerical_embedding` helper function in `tests/utils.py`. Fixes: #4732 Eventual-Inc/Daft
    • 1
    • 1
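    A minimal roundtrip sketch in the spirit of the fix (column name, embedding size, and output path are illustrative, not taken from the test):

    import daft
    from daft import DataType

    df = daft.from_pydict({"emb": [[0.1, 0.2, 0.3, 0.4]] * 8})
    # Cast list -> fixed-size list -> embedding; this is the cast path the PR fixes.
    fixed = df["emb"].cast(DataType.fixed_size_list(DataType.float32(), 4))
    df = df.with_column("emb", fixed.cast(DataType.embedding(DataType.float32(), 4)))
    df.write_parquet("embeddings_roundtrip")
    back = daft.read_parquet("embeddings_roundtrip/**/*.parquet")
    assert back.schema()["emb"].dtype == DataType.embedding(DataType.float32(), 4)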
  • g

    GitHub

    07/21/2025, 9:42 PM
    #4813 feat: Include UDF Names in Progress Bar Pull request opened by srilman ## Changes Made As suggested, add UDF names to progress bars. Also tweaked some other names in the process to be more descriptive. Screen.Recording.2025-07-21.at.2.41.29.PM.mov ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/21/2025, 10:17 PM
    #2443 Use visitor pattern for `expr_has_agg` Issue created by kevinzwang The function `expr_has_agg` in src/daft-dsl/src/expr.rs currently traverses the expression by matching on all the expression types. We could use a tree visitor pattern instead to simplify it. Eventual-Inc/Daft
    • 1
    • 1
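    A tiny Python sketch of the idea (the actual code is Rust in src/daft-dsl; the class names here are illustrative): walk children generically and short-circuit, instead of writing a match arm per expression variant.

    class Expr:
        def children(self) -> list["Expr"]:
            return []

    class Agg(Expr):
        pass

    class Add(Expr):
        def __init__(self, left: Expr, right: Expr):
            self.left, self.right = left, right

        def children(self) -> list["Expr"]:
            return [self.left, self.right]

    def expr_has_agg(expr: Expr) -> bool:
        # Visitor-style traversal: one generic walk over children.
        if isinstance(expr, Agg):
            return True
        return any(expr_has_agg(c) for c in expr.children())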
  • g

    GitHub

    07/21/2025, 10:31 PM
    #4814 chore: new code path for scalar UDF Pull request opened by kevinzwang ## Changes Made I was trying to get kwargs to work with scalar UDFs and decided to bite the bullet and build out the entire code path for the new UDFs so that it no longer relies on the legacy UDF code. It turned out to be quite simple; the complexity of the legacy UDF code comes largely from the class-UDF support. Additional changes: • add the ability to specify kwargs to UDFs • remove `@daft.func.batch`. It was just an alias to `@daft.udf` anyway, and I think we should consider it a bit more since I don't want the batch UDF to behave exactly like the current UDF either. • Rename "Python UDF" to "legacy Python UDF" ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
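    A minimal sketch of calling a new-style scalar UDF with a keyword argument (assuming `@daft.func` infers the return dtype from the type hint and passes kwargs through, as described in this PR and the UDF roadmap below):

    import daft
    from daft import col

    @daft.func
    def scale(x: int, factor: int = 2) -> int:
        return x * factor

    df = daft.from_pydict({"x": [1, 2, 3]})
    df.select(scale(col("x"), factor=10)).show()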
  • g

    GitHub

    07/21/2025, 11:01 PM
    #2980 SQL `explode` does not properly broadcast against other columns Issue created by jaychia ### Describe the bug [image](https://private-user-images.githubusercontent.com/17691182/372313415-8196ea5c-1150-45e9-8516-613874635e46.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMxMzkyMDAsIm5iZiI6MTc1MzEzODkwMCwicGF0aCI6Ii8xNzY5MTE4Mi8zNzIzMTM0MTUtODE5NmVhNWMtMTE1MC00NWU5LTg1MTYtNjEzODc0NjM1ZTQ2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIxVDIzMDE0MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkxYTM4OWEyZjlkNzVlMTJjYjMzMTJiMjBiOTA3YmQ3NzZjY2RlYWJjYWY1MTRlN2Q0ZDAyODYyYmE4OTI1MTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.b3NqeJU-ppCyqzsWb33l178vTZjxQBeh-2kq7izydVw) ### To Reproduce
    Copy code
    import daft
    
    df = daft.from_pydict({"list": [[1,2,3], [4,5], [6]], "x": [1, 2, 3]})
    daft.sql("SELECT x, explode(list) FROM df").collect()
    ### Expected behavior No response ### Component(s) SQL ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
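    For comparison, a sketch of the expected broadcast semantics via the DataFrame API (assuming `DataFrame.explode` pairs each exploded element with its row's value of `x`); the SQL `explode` should match this:

    import daft

    df = daft.from_pydict({"list": [[1, 2, 3], [4, 5], [6]], "x": [1, 2, 3]})
    # Expected: x is broadcast against the exploded list, e.g. (1,1), (2,1), (3,1), (4,2), ...
    df.explode("list").collect()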
  • g

    GitHub

    07/21/2025, 11:02 PM
    #3016 Map type make is_nullable false for key Issue created by andrewgazelka In Daft/src/daft-schema/src/dtype.rs, line 256 as of commit dfccfe3, the Map key field is created as nullable: `arrow2::datatypes::Field::new("key", key.to_arrow()?, true)`. The key should use `is_nullable = false`; currently tests fail; need to look into it. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:04 PM
    #3060 write_parquet underestimates rowgroup sizes when targeting `parquet_target_row_group_size` Issue created by jaychia ### Describe the bug The `parquet_target_row_group_size` execution config variable is supposed to write data with a default row group size of 128MB. However, we noticed that Parquet data was being written with much larger row groups (~300+MB). Upon deeper inspection, it seems that we assume a 3x compression ratio of the data by default (`inflation_factor`) during our estimations: https://github.com/Eventual-Inc/Daft/blob/main/daft/table/table_io.py#L509-L510 This causes a 3x underestimation of the on-disk row group size when the data is actually materialized. ### To Reproduce No response ### Expected behavior No response ### Component(s) Parquet ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
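    Back-of-envelope illustration of the estimate: with an inflation factor of 3, Daft packs roughly 3 x 128MB = 384MB of in-memory data per row group, expecting it to compress down to 128MB on disk; if the data only compresses about 1:1, the written row groups land around 300+MB. A hedged sketch of tuning this (assuming `parquet_inflation_factor` is exposed on the execution config; check the actual parameter names before relying on them):

    import daft

    # Hypothetical mitigation: tell Daft not to expect any compression when
    # estimating on-disk row group sizes.
    daft.set_execution_config(
        parquet_target_row_group_size=128 * 1024 * 1024,
        parquet_inflation_factor=1.0,
    )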
  • g

    GitHub

    07/21/2025, 11:08 PM
    #3293 `is_in` does not work with none/null Issue created by universalmind303 ### Describe the bug Using `is_in` does not respect `None`s. ### To Reproduce

    import daft
    from daft import col

    df = daft.from_pydict({"nums": [1, 2, 3, None, 4, 5]})
    df.filter(col("nums").is_in([1, None])).collect()

    This produces:
    Copy code
    ╭───────╮
    │ nums  │
    │ ---   │
    │ Int64 │
    ╞═══════╡
    │ 1     │
    ╰───────╯
    (Showing first 1 of 1 rows)
    ### Expected behavior expect the output to include
    [1, None]
    not just
    [1]
    ### Component(s) SQL, Python Runner ### Additional context No response Eventual-Inc/Daft
    • 1
    • 1
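    A possible workaround until `is_in` handles nulls (a sketch that matches nulls explicitly with `is_null`):

    import daft
    from daft import col

    df = daft.from_pydict({"nums": [1, 2, 3, None, 4, 5]})
    df.filter(col("nums").is_in([1]) | col("nums").is_null()).collect()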
  • g

    GitHub

    07/21/2025, 11:12 PM
    #4815 feat: struct Expression.unnest() Pull request opened by kevinzwang ## Changes Made Add the `Expression.unnest` method. Since we already have the mechanism for unnesting with `Expression.struct.get("*")`, I'm just reusing that for now. We will need to change it once we introduce selectors anyway. ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
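    A short usage sketch (the data is illustrative; per the PR, `unnest` is currently sugar for `struct.get("*")`):

    import daft
    from daft import col

    df = daft.from_pydict({"s": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]})
    # Expands the struct column "s" into top-level columns "a" and "b".
    df.select(col("s").unnest()).show()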
  • g

    GitHub

    07/21/2025, 11:38 PM
    #2370 errors when calling write_deltalake Issue created by rkunnamp Describe the bug Getting the following error when calling write_deltalake:

    File /opt/conda/lib/python3.11/site-packages/daft/table/table_io.py:691, in write_deltalake.<locals>.file_visitor(written_file)
        689 def file_visitor(written_file: Any) -> None:
        690     path, partition_values = get_partitions_from_path(written_file.path)
    --> 691     stats = get_file_stats_from_metadata(written_file.metadata)
        693     # PyArrow added support for written_file.size in 9.0.0
        694     if ARROW_VERSION >= (9, 0, 0):
    File /opt/conda/lib/python3.11/site-packages/daft/table/table_io.py:687, in write_deltalake.<locals>.get_file_stats_from_metadata(metadata)
        686 def get_file_stats_from_metadata(metadata):
    --> 687     deltalake.writer.get_file_stats_from_metadata(metadata, -1)
    TypeError: get_file_stats_from_metadata() missing 1 required positional argument: 'columns_to_collect_stats'

    To Reproduce Steps to reproduce the behavior: 1. Go to https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and download the January 2024 yellow taxi trip record data in Parquet format (at the time of writing this bug the file was https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet) 2. Now execute the following code
    Copy code
    import daft
    dt = daft.read_parquet("yellow_tripdata_2024-01.parquet")
    dt.write_deltalake("t5")
    The error mentioned above is obtained. On inspecting the t5 folder, I found that the metadata files are not written. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:45 PM
    #1726 [BUG] Bug with tqdm progress bar display when running in Ray client mode with remote cluster Issue created by jaychia Describe the bug When running on a remote cluster via Ray client, progress bars seem to be broken:
    Copy code
    (SchedulerActor pid=180, ip=10.0.66.234) Exception in thread 0d287252-6ae5-445b-9b96-5e412af6ab5d:
    (SchedulerActor pid=180, ip=10.0.66.234) Traceback (most recent call last):
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    (SchedulerActor pid=180, ip=10.0.66.234)     self.run()
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/threading.py", line 953, in run
    (SchedulerActor pid=180, ip=10.0.66.234)     self._target(*self._args, **self._kwargs)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    (SchedulerActor pid=180, ip=10.0.66.234)     return method(self, *_args, **_kwargs)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/Users/jaychia/code/venv-demo/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/ray_runner.py", line 578, in _run_plan
    (SchedulerActor pid=180, ip=10.0.66.234)     pbar.mark_task_start(task)
    (SchedulerActor pid=180, ip=10.0.66.234)   File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/progress_bar.py", line 63, in mark_task_start
    (SchedulerActor pid=180, ip=10.0.66.234)     pb.total += 1
    (SchedulerActor pid=180, ip=10.0.66.234) AttributeError: 'tqdm' object has no attribute 'total'. Did you mean: '_total'?
    To reproduce, run a remote Ray cluster and try this:

    import daft
    import ray
    import boto3

    RAY_ADDRESS = "ray://localhost:10001"
    ray.init(runtime_env={"pip": ["getdaft==0.2.7"]}, address=RAY_ADDRESS)
    daft.context.set_runner_ray(address=RAY_ADDRESS)

    session = boto3.session.Session()
    creds = session.get_credentials()
    daft.set_planning_config(default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(
            key_id=creds.access_key,
            access_key=creds.secret_key,
            session_token=creds.token,
        )
    ))

    df = daft.read_csv("s3://noaa-global-hourly-pds/2023/**")
    df.show()

    (Running `ray==2.4.0` and `getdaft==0.2.7` both on the client side and on the cluster) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/21/2025, 11:52 PM
    #4816 New UDFs Roadmap Issue created by kevinzwang # Background As Daft becomes used for more and more multimodal/AI workflows, we see some increasing patterns around the usage of UDFs, and we'd like to redesign our UDFs to better work with these patterns. This issue tracks the progress on this redesign. The major differences between the designs of the existing (legacy) and new UDFs: 1. The default mode is to operate on a single row of data at a time, instead of a batch. 2. Instead of being separate concepts from expressions, the new UDFs will instead mirror our built-in functions as much as possible. UDFs are Daft functions, just ones that are defined by the user. 3. The legacy UDFs could be stateful using the `concurrency` parameter. The new UDFs will not be stateful. Instead, to do stateful things, we will introduce the concept of "resources". Resources are not covered in this roadmap but will be a separate issue. In addition, the scope of this work also includes some new ways to use UDFs, such as multi-column outputs, generator UDFs, async UDFs, and ergonomics around conversions between Python and Daft types. # Examples Simple scalar UDF

    @daft.func
    def my_udf(input: int) -> str:
        # return dtype is inferred from type hint
        return f"{input}"

    # can be used to construct expressions
    expr: daft.Expression = my_udf(col("a"))
    df.select(expr)

    # can also still be used with scalar values as a regular Python function
    val = my_udf(1)
    assert val == "1"

    Generator UDF

    @daft.func(return_dtype=DataType.list(DataType.int()))  # generators return list-type expressions
    def my_gen_udf(input: int):
        for i in range(input):
            yield i

    Async UDF

    # should work automatically
    @daft.func
    async def my_async_udf(input: int) -> str:
        return await api_call(input)

    Batch UDF

    @daft.func.batch
    def my_batch_udf(input: Series) -> Series:
        return input * input

    Type checking

    @daft.func
    def my_func(x: int): ...

    df = daft.from_pydict({"x": ["a", "b", "c"]})
    df.select(my_func(df["x"]))  # should error because column "x" is not compatible with int

    # Roadmap These tasks do not necessarily need to be done in order
    • MVP for scalar UDF
    • #4723
    • #4814
    • Unify and document Daft <-> Python type conversions
    • Generator UDF
    • Async UDF
    • Async generator UDF
    • Type checking/casting on UDF args using argument type hints
    • MVP for batch UDF
    • Aggregate UDF (design TBD)
    Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 12:50 AM
    #4817 fix: Fix `File writer must be created before bytes_written can be called` bug in native parquet writer Pull request opened by colin-ho ## Changes Made @desmondcheongzx this bug came up during the pdf parsing job. I don't know what the actual cause is, wonder if you know? ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 7:49 AM
    #4818 fix: update a new expression example in the contributing guide to work Pull request opened by r3stl355 ## Changes Made The "new expression" example in CONTRIBUTING.md needed a couple of small changes to work. The current implementations of expressions in the codebase are more succinct than this example, but the example is simpler to read, so instead of replacing it with the newer style used in the rest of the code I just fixed it so it works. ## Related Issues No related issues ## Checklist There are no changes to docs • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 11:16 AM
    #4819 chore: add vscode debug example with env Pull request opened by Jay-ju ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/22/2025, 8:57 PM
    #4821 Issue reading timestamp partitioned hive tables Issue created by cmditch ### Describe the bug We're exploring migrating from Spark to Daft. All of our data is hive managed parquet files written using Spark, and often partitioned by timestamps. Using `daft.read_parquet("/some/table/**", hive_partitioning=True)`
    results in a partition column with NaT (see attached photos). I have a fork opened which has fixed the bug for us, but I'm not sure the fix is super high quality. [Image](https://private-user-images.githubusercontent.com/15849320/469450348-7a2d7033-1b8c-47ea-bcb1-142669f8f8ea.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NTIsIm5iZiI6MTc1MzI5NTY1MiwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTAzNDgtN2EyZDcwMzMtMWI4Yy00N2VhLWJjYjEtMTQyNjY5ZjhmOGVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQxMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWIyNzY2NTZlZDk1YjgzNWYyZGZkN2M2M2U3NzU4YjhhOTAyNWY0NmQwYmNjZDlhNjIwYTI5MDNhZTgzN2Y4MDQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0._24FzKCUF1uE5br78860TmP3Dk0IIOhJZI9UHht59ZY) [Image](https://private-user-images.githubusercontent.com/15849320/469450395-2a1a8550-f4ef-455d-bd25-0457bb741f6a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NTIsIm5iZiI6MTc1MzI5NTY1MiwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTAzOTUtMmExYTg1NTAtZjRlZi00NTVkLWJkMjUtMDQ1N2JiNzQxZjZhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQxMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc2MGMzMzBlODQyNjdkMjYxYWFkYTZiMDE1YWJmODc0ZmE2NGNlNTNhYjRmM2M3ZTFjYTBkYzYzNjlhN2UyMDAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.CRtjCvKh7ROJyvC_gVEBCzKDF1X8rqdGT0f-REOS3vU) ### To Reproduce import daft df = daft.read_parquet("/my/hive/table_with_timestamp_partitions/**", hive_partitioning=True) df.limit(3).to_pandas() Eventual-Inc/Daft
  • g

    GitHub

    07/22/2025, 9:07 PM
    #4822 Issue reading iceberg table over NFS Issue created by cmditch ### Describe the bug I'm not sure the issue is with NFS, but it seems to stem from using `file:/` instead of `file://` when specifying the paths of the table's parquet files. I have a Daft fork where the issue is fixed, but I'm not sure the implementation is high quality (definitely some Cursor vibe coding going on). ### To Reproduce
    Copy code
    def get_iceberg_catalog() -> pyiceberg.catalog.Catalog:
        return pyiceberg.catalog.load_catalog(
            "hive_catalog",
            **{"type": "hive", "uri": "<thrift://cupk8:30083>"},
        )
    
    
    prices = daft.read_iceberg(get_iceberg_catalog().load_table("ercot.prices_iceberg_test"))
    prices.limit(10).to_pandas()
    Here is the error thrown, despite the file existing:
    Copy code
    IcebergScanOperator(ercot.prices_iceberg_test) has Partitioning Keys: [PartitionField(date#Timestamp(Microseconds, Some("UTC")), src=date#Timestamp(Microseconds, Some("UTC")), tfm=Identity)] but no partition filter was specified. This will result in a full table scan.
    Error when running pipeline node ScanTaskSource
    DaftError::External Internal IO Error when opening: file:/Volumes/zge-office/spark-warehouse/ercot.db/prices_iceberg_test/data/date=2023-04-04T00%3A00Z/00000-4412-79b5e377-a240-4055-afe3-26df45d1ffbb-00006.parquet:
    Details:
    No such file or directory (os error 2)
    [Image](https://private-user-images.githubusercontent.com/15849320/469452726-411e4c5b-1b51-4b5f-b6fb-db2c8f88c623.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMyOTU5NjEsIm5iZiI6MTc1MzI5NTY2MSwicGF0aCI6Ii8xNTg0OTMyMC80Njk0NTI3MjYtNDExZTRjNWItMWI1MS00YjVmLWI2ZmItZGIyYzhmODhjNjIzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzIzVDE4MzQyMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWUxOWZkMjcxMWY1ZjA3ZjU3YjZhZGRkNmY3YThlMWQ3MjUxMmU4NDY3ZjYxZWVmN2I2YjNlMDUxZGYxYmJiZDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.3Ri7ydmMEFrG1I6l2DTy-eBhLsK7MNK4_WqJZjP-vDE) Eventual-Inc/Daft
  • g

    GitHub

    07/22/2025, 10:51 PM
    #4823 fix: add custom robots.txt Pull request opened by ccmao1130 ## Changes Made add custom robots.txt for google search indexing ## Related Issues n/a ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 12:15 AM
    #4825 feat(turbopuffer): Add client and write kwargs Pull request opened by desmondcheongzx ## Changes Made Turbopuffer client construction and writes accept many possible arguments. Instead of enumerating them and always trying to stay in sync, let's add support for kwargs. Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 3:10 AM
    #4827 Added image extraction & embedding into document processing tutorial Pull request opened by malcolmgreaves ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 3:21 AM
    #4828 _from_arrow_type_with_ray_data_extensions needs to use right Ray ser/des APIs Issue created by srinathk10 ### Describe the bug With this PR, for ser/des of Arrow tensor extensions we are switching over to cloudpickle with a fallback to json. The issue is that Daft's `_from_arrow_type_with_ray_data_extensions` assumes json for deserialization. It would be best to use the appropriate APIs in Ray. ### To Reproduce
    Copy code
    import sys
    from unittest.mock import patch
    
    import daft
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pytest
    from packaging.version import parse as parse_version
    
    import ray
    
    # Daft needs to use ray.cloudpickle and json for fallback for
    # serialization/deserialization of Arrow tensor extension types.
    pytestmark = pytest.mark.skip(
        reason="Daft needs to use ray.cloudpickle and json for fallback for "
        "serialization/deserialization of Arrow tensor extension types.",
    )
    
    
    @pytest.fixture(scope="module")
    def ray_start(request):
        try:
            yield ray.init(num_cpus=16)
        finally:
            ray.shutdown()
    
    
    def test_from_daft_raises_error_on_pyarrow_14(ray_start):
        # This test assumes that `from_daft` calls `get_pyarrow_version` to get the
        # PyArrow version. We can't mock `__version__` on the module directly because
        # `get_pyarrow_version` caches the version.
        with patch(
            "ray.data.read_api.get_pyarrow_version", return_value=parse_version("14.0.0")
        ):
            with pytest.raises(RuntimeError):
                ray.data.from_daft(daft.from_pydict({"col": [0]}))
    
    
    @pytest.mark.skipif(
        parse_version(pa.__version__) >= parse_version("14.0.0"),
        reason="<https://github.com/ray-project/ray/issues/53278>",
    )
    def test_daft_round_trip(ray_start):
        data = {
            "int_col": list(range(128)),
            "str_col": [str(i) for i in range(128)],
            "nested_list_col": [[i] * 3 for i in range(128)],
            "tensor_col": [np.array([[i] * 3] * 3) for i in range(128)],
        }
        df = daft.from_pydict(data)
        ds = ray.data.from_daft(df)
        pd.testing.assert_frame_equal(ds.to_pandas(), df.to_pandas())
    
        df2 = ds.to_daft()
        df_pandas = df.to_pandas()
        df2_pandas = df2.to_pandas()
    
        for c in data.keys():
            # NOTE: tensor behavior on round-trip is different because Ray Data provides
            # Daft with more information about a column being a fixed-shape-tensor.
            #
            # Hence the Pandas representation of `df1` is "just" an object column, but
            # `df2` knows that this is actually a numpy fixed shaped tensor column
            if c == "tensor_col":
                np.testing.assert_equal(
                    np.array(list(df_pandas[c])), df2_pandas[c].to_numpy()
                )
            else:
                pd.testing.assert_series_equal(df_pandas[c], df2_pandas[c])
    
    
    if __name__ == "__main__":
        import sys
    
        sys.exit(pytest.main(["-v", __file__]))
    ### Expected behavior No response ### Component(s) Ray Runner ### Additional context No response Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 4:22 AM
    #4829 Support extremely flexible list datatype declarations Issue created by jaychia ### Is your feature request related to a problem? Writing datatypes in Daft today happens in a few places: 1. UDF return types (`return_dtype=...`) 2. .apply return types (`.apply(..., return_dtype=...)`) 3. Casting (`.cast(...)`) 4. Type hinting (`.read_parquet(schema=...)`) (I might have missed a few places) However, writing exact Daft types is quite verbose:

    from daft import DataType

    @daft.udf(return_dtype=DataType.float64())
    def f(): ...

    To fix this, we allowed some mapping of Python types to Daft types:

    @daft.udf(return_dtype=float)
    def f(): ...

    This also works for struct types, using Python dicts:

    @daft.udf(return_dtype={"foo": DataType.float64()})
    def f(): ...

    However, lists don't work! Thus building a highly complex type such as a list-of-list-of-struct-of-list is highly verbose:

    @daft.udf(return_dtype=DataType.list(DataType.list(DataType.struct({"foo": DataType.list(float)}))))
    def f(): ...

    ### Describe the solution you'd like Here is a proposal, which looks quite Pythonic, but I'm not aware of any other library that does this, which is maybe a bit concerning.

    @daft.udf(return_dtype=[[{"foo": [float, ...]}, ...], ...])

    In Python, there is the `Ellipsis` singleton. The above is completely valid Python, and expresses the same types as the highly verbose datatype variant, while remaining relatively readable. Look at this, it's beautiful!

    @daft.udf(return_dtype={
        "bboxes": [[float, 4], ...],  # fixed size syntax
        "objects": [str, ...],
        "image": [[[DataType.int8(), ...], ...], 3],  # mix and match for specific datatypes
        "metadata": dict[str, int],  # map type
    })

    ### Describe alternatives you've considered No response ### Additional Context No response ### Would you like to implement a fix? No Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 10:05 AM
    #4810 feat: Add `get_runner_type` method to support getting the currently used Runner type Pull request opened by plotor ## Changes Made We found that in some scenarios users need to obtain Daft's runner type inside a UDF, but currently it can only be obtained through `daft.context.get_context()._runner.name`. The problem is that a UDF running on a Ray worker gets a `None` result when calling `daft.context.get_context()._runner`, so this PR adds a `daft.context.get_context().get_runner_type()` method. The method works as follows: 1. Prefer `daft.context.get_context()._runner` to determine the runner type; 2. If `daft.context.get_context()._runner` is `None`, call the `detect_ray_state` method to determine whether we are currently running on Ray. If so, the runner type is considered to be `ray`, otherwise `native`. In addition, I found that when the `DAFT_RUNNER` env var is inconsistent with `set_runner_xxx`, Daft prioritizes the `set_runner_xxx` settings, so I added some warning logs to remind users. ## Related Issues No issue ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
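    A small sketch of the intended usage inside a UDF (column names are illustrative; per the PR, `get_runner_type()` resolves to `ray` or `native` even on Ray workers where `_runner` is `None`):

    import daft

    @daft.udf(return_dtype=daft.DataType.string())
    def tag_runner(xs):
        # Resolve the runner type from within the UDF.
        runner = daft.context.get_context().get_runner_type()
        return [runner] * len(xs)

    df = daft.from_pydict({"x": [1, 2, 3]})
    df.select(tag_runner(daft.col("x"))).show()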
  • g

    GitHub

    07/23/2025, 1:00 PM
    #4830 chore: upgrade py-spy version Pull request opened by Jay-ju ## Changes Made The current version of py-spy in ray is `py-spy==0.4.0`, while the version pinned in daft is `py-spy==0.3.14`; installing them simultaneously would cause a version conflict. Therefore, a small change was made here to change the py-spy requirement to a version range. ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 2:46 PM
    #4831 fix: expand ~ to home directory in deltatable read and write Pull request opened by r3stl355 ## Changes Made Added expansion of `~` to the HOME directory in `read_deltalake` and `write_deltalake`. These use different backends (read uses `delta-rs`, write uses `daft`), so I implemented the changes in the Python API. ## Related Issues Closes #4786. ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
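    A minimal sketch of the kind of change described (the helper name is hypothetical; the PR does this in the Python API before handing the path to either backend):

    import os

    def _expand_home(path: str) -> str:
        # Expand a leading "~" to the user's home directory; leave other paths untouched.
        return os.path.expanduser(path) if path.startswith("~") else path

    assert _expand_home("~/tables/events").startswith(os.path.expanduser("~"))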
  • g

    GitHub

    07/23/2025, 7:02 PM
    #4833 ci: fail on timeout Pull request opened by colin-ho ## Changes Made ## Related Issues ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
  • g

    GitHub

    07/23/2025, 7:08 PM
    #4834 fix: read/write embeddings for Parquet and Lance Pull request opened by malcolmgreaves ## Changes Made Fixes reading embedding-typed columns from DataFrames saved to both the Parquet and Lance data formats. The primary fix is in `daft-core`'s `array/ops/cast.rs`, specifically the `FixedSizeListArray`'s `cast` implementation. There is now a match arm for `DataType::Embedding`: upon encountering it, Daft properly casts the data into an `EmbeddingArray` struct (from `datatypes::logical`). Before, this case was unrecognized and thus led to an `unimplemented!` error. The secondary fix is related to this Arrow 8 bug: https://issues.apache.org/jira/browse/ARROW-12201 . When writing to Parquet in Arrow 8, the logical datatype information for u32 values is incorrectly written; the values are written as signed 64-bit integers instead! The fix is to always use Parquet format version >= 2.0. In later versions of Arrow, the default switches from Parquet version 1.0 to >= 2.0. Here, the fix is to always set `version=2.6` in the `ParquetWriter` object used in `daft.io.writer`. Added a new Python-based test to cover the fix, `tests/io/test_roundtrip_embeddings.py`. The new `test_roundtrip_embedding` covers a roundtrip serialization to and from both Parquet and Lance. Also added new tests in `tests/series/test_cast.py`: `test_series_cast_fixed_shape_list_to_embedding` explicitly casts fixed-size list arrays into embeddings and `test_series_cast_embedding_to_fixed_shape_list` does the reverse. Supporting these tests is a new `random_numerical_embedding` helper function in `tests/utils.py`. One drive-by change: updated to use non-deprecated `rand` functions & types in `src/daft-core/src/array/ops/cast.rs`. Fixes: #4732 ## Related Issues #4732 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1
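    A standalone PyArrow illustration of the format-version point (not Daft's writer; the file name is arbitrary): with `version="2.6"` an unsigned 32-bit column round-trips with its logical type intact, whereas the 1.0 format stores it as a signed 64-bit integer.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"ids": pa.array([1, 2, 3], type=pa.uint32())})
    with pq.ParquetWriter("ids.parquet", table.schema, version="2.6") as writer:
        writer.write_table(table)
    # The uint32 logical type is preserved when using format version 2.6.
    assert pq.read_table("ids.parquet").schema.field("ids").type == pa.uint32()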
  • g

    GitHub

    07/23/2025, 8:55 PM
    #4836 feat: Use daft-decoding for hive value deserialization Pull request opened by colin-ho ## Changes Made Use the existing daft-decoding code used in csv deserialization for hive partition values as well. Previously, we were using arrow2's utf8 cast to do this, which caused issues as seen in #4821. ## Related Issues Closes #4821 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
  • g

    GitHub

    07/23/2025, 11:42 PM
    #4838 docs: categorize functions and make each function its own page Pull request opened by kevinzwang ## Changes Made Change Daft functions docs to categorize them, as I wanted to do with expressions before. Now, the functions are split into different files under
    daft/functions/
    , and if you add a function, it will automatically be added to the functions page in the category corresponding to the file it's in! Screenshots: [image](https://private-user-images.githubusercontent.com/20215378/470043426-c5f1d309-1c57-4571-97b6-d303328bfb5a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMzMjA1NzYsIm5iZiI6MTc1MzMyMDI3NiwicGF0aCI6Ii8yMDIxNTM3OC80NzAwNDM0MjYtYzVmMWQzMDktMWM1Ny00NTcxLTk3YjYtZDMwMzMyOGJmYjVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzI0VDAxMjQzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ4Y2VmZDhiOGFkYjE2ZTlhZGNhZDY0Njg2NjMyMjA3MmYyNTUzYzU0ZDI1NzJmMDA5Y2YwZjk3MTllM2UwMDkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.mrkzW1Ej0Pwqoe2t-5ZFtge9Oj1GIN6qP_N3Z-uzACc) [image](https://private-user-images.githubusercontent.com/20215378/470034468-4edc95bc-f2a1-4d09-97cb-b15eff960a02.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTMzMjA1NzYsIm5iZiI6MTc1MzMyMDI3NiwicGF0aCI6Ii8yMDIxNTM3OC80NzAwMzQ0NjgtNGVkYzk1YmMtZjJhMS00ZDA5LTk3Y2ItYjE1ZWZmOTYwYTAyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzI0VDAxMjQzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTYxYTk5NWJhYWEyYjAwY2ZmMTQzZTljYWNhZDYxZmUyZjI2MjExYjJkMmQyMDk0YTY3ZGEyNzFmYzRkNDZjYjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.XaYGY9z3xcY6DFWuHfe0DdJmD9TcvHVfi8cglEu5zic) ## Related Issues #4737 #4824 ## Checklist • Documented in API Docs (if applicable) • Documented in User Guide (if applicable) • If adding a new documentation page, doc is added to
    docs/mkdocs.yml
    navigation • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) Eventual-Inc/Daft
    • 1
    • 1