GitHub
07/21/2025, 12:07 PM
Support passing multiple paths in daft.from_glob_path. Even though we can pass a regex pattern, it is sometimes hard to cover every case with a single pattern, so it is better to support multiple glob paths. We need to ensure the result contains no duplicate file paths when the glob paths intersect.
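A minimal sketch of the intended usage, assuming the new list-accepting signature (the exact parameter shape may differ):
import daft

# Hypothetical: pass several glob patterns at once; overlapping matches
# should be deduplicated so each file path appears only once.
df = daft.from_glob_path([
    "s3://bucket/logs/2025-01/*.parquet",
    "s3://bucket/logs/2025-0*/*.parquet",  # overlaps with the first pattern
])
df.show()  # columns like path and size, with no duplicate paths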
## Related Issues
## Checklist
The documentation needs to be refreshed: from_glob_path does not return a type column.
• [x] Documented in API Docs (if applicable)
• [x] Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/21/2025, 7:09 PM
The fix is in array/ops/cast.rs, specifically the `FixedSizeListArray` cast implementation.
There is now a match arm for DataType::Embedding. Upon encountering it, Daft now properly casts the data into an EmbeddingArray struct (from datatypes::logical). Before, this case was unrecognized and thus led to an unimplemented! error.
Added a new Python-based test to cover the fix, tests/io/test_roundtrip_embeddings.py. The new test_roundtrip_embedding covers a roundtrip serialization to and from both Parquet and Lance.
Also added new tests in `tests/series/test_cast.py`: test_series_cast_fixed_shape_list_to_embedding explicitly casts fixed-size list arrays into embeddings, and test_series_cast_embedding_to_fixed_shape_list does the reverse.
Supporting these tests is a new random_numerical_embedding helper function in tests/utils.py.
Fixes: #4732
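A minimal sketch of the kind of roundtrip the fix enables (column names, sizes, and the cast chain here are illustrative, not the test's actual contents):
import daft
from daft import DataType, col

# Build a fixed-size list column and cast it to an embedding type.
df = daft.from_pydict({"e": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})
emb = DataType.embedding(DataType.float32(), 3)
df = df.select(col("e").cast(DataType.fixed_size_list(DataType.float32(), 3)).cast(emb))

# Roundtrip through Parquet; before the fix, the embedding cast path
# hit an unimplemented! error.
df.write_parquet("/tmp/embeddings")
back = daft.read_parquet("/tmp/embeddings")
assert back.schema() == df.schema()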
Eventual-Inc/Daft · GitHub
07/21/2025, 9:42 PM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/21/2025, 10:17 PM
expr_has_agg in src/daft-dsl/src/expr.rs currently traverses the expression by matching on all the expression types. We could use a tree visitor pattern instead to simplify it.
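For illustration, a minimal sketch of the idea in Python (Daft's actual implementation is in Rust, and these names are hypothetical): a generic walk over children plus one targeted check replaces a match over every expression variant.
class Expr:
    def children(self):
        return []

class Col(Expr): ...

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def children(self):
        return [self.left, self.right]

class Agg(Expr):
    def __init__(self, child):
        self.child = child
    def children(self):
        return [self.child]

def expr_has_agg(expr: Expr) -> bool:
    # The visitor only needs one rule; everything else is the generic walk.
    if isinstance(expr, Agg):
        return True
    return any(expr_has_agg(c) for c in expr.children())

assert expr_has_agg(Add(Col(), Agg(Col())))
assert not expr_has_agg(Add(Col(), Col()))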
Eventual-Inc/Daft · GitHub
07/21/2025, 10:31 PM
@daft.func.batch was just an alias to @daft.udf anyway. I think we should actually think about it a bit more, since I don't think I want the batch UDF to behave exactly like the current UDF either.
• Rename Python UDF to legacy Python UDF
## Related Issues
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/21/2025, 11:01 PM
import daft

df = daft.from_pydict({"list": [[1, 2, 3], [4, 5], [6]], "x": [1, 2, 3]})
daft.sql("SELECT x, explode(list) FROM df").collect()
### Expected behavior
No response
### Component(s)
SQL
### Additional context
No response
Eventual-Inc/Daft · GitHub
07/21/2025, 11:02 PM
GitHub
07/21/2025, 11:04 PM
The parquet_target_row_group_size execution config variable is supposed to write data with a default row group size of 128MB.
However, we noticed that Parquet data was being written with much larger row groups (~300+MB).
Upon deeper inspection, it seems that we assume by default a 3x compression ratio of the data (inflation_factor) during our estimations: https://github.com/Eventual-Inc/Daft/blob/main/daft/table/table_io.py#L509-L510
This causes a 3x underestimation of the on-disk row group size when the data is actually materialized.
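A back-of-the-envelope illustration of the effect (the numbers and variable names here are illustrative, not Daft's actual code):
# The writer estimates on-disk size as in-memory size / inflation_factor,
# so it packs rows until the *estimated* on-disk size hits the target.
target_row_group_size = 128 * 1024 * 1024  # 128MB on-disk target
inflation_factor = 3.0                     # assumed in-memory / on-disk ratio
in_memory_bytes_per_group = target_row_group_size * inflation_factor

# If the data barely compresses (e.g., already-compressed blobs), the
# materialized row group lands around 3x the target:
actual_compression = 1.0
actual_on_disk_mb = in_memory_bytes_per_group / actual_compression / (1024 * 1024)
print(actual_on_disk_mb)  # ~384MB, in line with the ~300+MB observed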
### To Reproduce
No response
### Expected behavior
No response
### Component(s)
Parquet
### Additional context
No response
Eventual-Inc/Daft · GitHub
07/21/2025, 11:08 PM
is_in does not respect `None`s
### To Reproduce
import daft
from daft import col

df = daft.from_pydict(
    {
        "nums": [1, 2, 3, None, 4, 5],
    }
)
df.filter(col("nums").is_in([1, None])).collect()
╭───────╮
│ nums │
│ --- │
│ Int64 │
╞═══════╡
│ 1 │
╰───────╯
(Showing first 1 of 1 rows)
### Expected behavior
Expect the output to include [1, None], not just [1].
### Component(s)
SQL, Python Runner
### Additional context
No response
Eventual-Inc/Daft · GitHub
07/21/2025, 11:12 PM
Adds an Expression.unnest method. Since we already have the mechanism for unnesting with Expression.struct.get("*"), I'm just reusing that for now. We will need to change it once we introduce selectors anyway.
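A minimal usage sketch (assuming unnest mirrors the struct.get("*") behavior it wraps; the column names are illustrative):
import daft
from daft import col

df = daft.from_pydict({"s": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]})

# Expands the struct column into one column per field, like col("s").struct.get("*").
df.select(col("s").unnest()).show()  # columns: a, b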
## Related Issues
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/21/2025, 11:38 PM
import daft
dt = daft.read_parquet("yellow_tripdata_2024-01.parquet")
dt.write_deltalake("t5")
The error mentioned above is obtained. On inspecting the t5 folder, I found that the metadata files are not written.
Eventual-Inc/Daft · GitHub
07/21/2025, 11:45 PM
(SchedulerActor pid=180, ip=10.0.66.234) Exception in thread 0d287252-6ae5-445b-9b96-5e412af6ab5d:
(SchedulerActor pid=180, ip=10.0.66.234) Traceback (most recent call last):
(SchedulerActor pid=180, ip=10.0.66.234) File "/home/ray/anaconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
(SchedulerActor pid=180, ip=10.0.66.234) self.run()
(SchedulerActor pid=180, ip=10.0.66.234) File "/home/ray/anaconda3/lib/python3.10/threading.py", line 953, in run
(SchedulerActor pid=180, ip=10.0.66.234) self._target(*self._args, **self._kwargs)
(SchedulerActor pid=180, ip=10.0.66.234) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
(SchedulerActor pid=180, ip=10.0.66.234) return method(self, *_args, **_kwargs)
(SchedulerActor pid=180, ip=10.0.66.234) File "/Users/jaychia/code/venv-demo/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
(SchedulerActor pid=180, ip=10.0.66.234) File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/ray_runner.py", line 578, in _run_plan
(SchedulerActor pid=180, ip=10.0.66.234) pbar.mark_task_start(task)
(SchedulerActor pid=180, ip=10.0.66.234) File "/tmp/ray/session_2023-12-14_02-16-08_035207_7/runtime_resources/pip/aeb7de99a29ab6cec9bf133f8005a8e5d32df3a9/virtualenv/lib/python3.10/site-packages/daft/runners/progress_bar.py", line 63, in mark_task_start
(SchedulerActor pid=180, ip=10.0.66.234) pb.total += 1
(SchedulerActor pid=180, ip=10.0.66.234) AttributeError: 'tqdm' object has no attribute 'total'. Did you mean: '_total'?
To reproduce, run a remote Ray cluster and try this:
import daft
import ray

RAY_ADDRESS = "ray://localhost:10001"
ray.init(runtime_env={"pip": ["getdaft==0.2.7"]}, address=RAY_ADDRESS)
daft.context.set_runner_ray(address=RAY_ADDRESS)

import boto3

session = boto3.session.Session()
creds = session.get_credentials()
daft.set_planning_config(default_io_config=daft.io.IOConfig(
    s3=daft.io.S3Config(
        key_id=creds.access_key,
        access_key=creds.secret_key,
        session_token=creds.token,
    )
))

df = daft.read_csv("s3://noaa-global-hourly-pds/2023/**")
df.show()
(Running ray==2.4.0 and getdaft==0.2.7 both on the client side and on the cluster)
Eventual-Inc/Daft · GitHub
07/21/2025, 11:52 PM
…concurrency parameter. The new UDFs will not be stateful. Instead, to do stateful things, we will introduce the concept of "resources". Resources are not covered in this roadmap but will be a separate issue.
In addition, the scope of this work also includes some new ways to use UDFs, such as multi-column outputs, generator UDFs, async UDFs, and ergonomics around conversions between Python and Daft types.
# Examples
Simple scalar UDF
@daft.func
def my_udf(input: int) -> str:  # return dtype is inferred from type hint
    return f"{input}"

# can be used to construct expressions
expr: daft.Expression = my_udf(col("a"))
df.select(expr)

# can also still be used with scalar values as a regular Python function
val = my_udf(1)
assert val == "1"
Generator UDF
@daft.func(return_dtype=DataType.list(DataType.int64()))  # generators return list-type expressions
def my_gen_udf(input: int):
    for i in range(input):
        yield i
Async UDF
# should work automatically
@daft.func
async def my_async_udf(input: int) -> str:
    return await api_call(input)
Batch UDF
@daft.func.batch
def my_batch_udf(input: Series) -> Series:
    return input * input
Type checking
@daft.func
def my_func(x: int):
    ...

df = daft.from_pydict({"x": ["a", "b", "c"]})
df.select(my_func(df["x"]))  # should error because column "x" is not compatible with int
# Roadmap
These tasks do not necessarily need to be done in order
• MVP for scalar UDF
• #4723
• #4814
• Unify and document Daft <-> Python type conversions
• Generator UDF
• Async UDF
• Async generator UDF
• Type checking/casting on UDF args using argument type hints
• MVP for batch UDF
• Aggregate UDF (design TBD)
Eventual-Inc/Daft · GitHub
07/22/2025, 12:50 AM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/22/2025, 7:49 AM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/22/2025, 11:16 AM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/22/2025, 8:57 PM
daft.read_parquet("/some/table/**", hive_partitioning=True) results in a partition column with NaT (see attached photos).
I have a fork opened which has fixed the bug for us, but I'm not sure the fix is super high quality.
[Image] [Image]
### To Reproduce
import daft
df = daft.read_parquet("/my/hive/table_with_timestamp_partitions/**", hive_partitioning=True)
df.limit(3).to_pandas()
Eventual-Inc/Daft · GitHub
07/22/2025, 9:07 PM
…file:/ instead of file:// when specifying the paths of the table's Parquet files.
I have a Daft fork where the issue is fixed, but I'm not sure the implementation is high quality (definitely some Cursor vibe coding going on).
### To Reproduce
import daft
import pyiceberg.catalog

def get_iceberg_catalog() -> pyiceberg.catalog.Catalog:
    return pyiceberg.catalog.load_catalog(
        "hive_catalog",
        **{"type": "hive", "uri": "thrift://cupk8:30083"},
    )

prices = daft.read_iceberg(get_iceberg_catalog().load_table("ercot.prices_iceberg_test"))
prices.limit(10).to_pandas()
Here is the error thrown, despite the file existing:
IcebergScanOperator(ercot.prices_iceberg_test) has Partitioning Keys: [PartitionField(date#Timestamp(Microseconds, Some("UTC")), src=date#Timestamp(Microseconds, Some("UTC")), tfm=Identity)] but no partition filter was specified. This will result in a full table scan.
Error when running pipeline node ScanTaskSource
DaftError::External Internal IO Error when opening: file:/Volumes/zge-office/spark-warehouse/ercot.db/prices_iceberg_test/data/date=2023-04-04T00%3A00Z/00000-4412-79b5e377-a240-4055-afe3-26df45d1ffbb-00006.parquet:
Details:
No such file or directory (os error 2)
[Image]
Eventual-Inc/Daft · GitHub
07/22/2025, 10:51 PM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 12:15 AM
GitHub
07/23/2025, 3:10 AM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 3:21 AM
_from_arrow_type_with_ray_data_extensions assumes json for deserialization. Best to use the appropriate APIs in Ray.
### To Reproduce
import sys
from unittest.mock import patch

import daft
import numpy as np
import pandas as pd
import pyarrow as pa
import pytest
from packaging.version import parse as parse_version

import ray

# Daft needs to use ray.cloudpickle and json for fallback for
# serialization/deserialization of Arrow tensor extension types.
pytestmark = pytest.mark.skip(
    reason="Daft needs to use ray.cloudpickle and json for fallback for "
    "serialization/deserialization of Arrow tensor extension types.",
)


@pytest.fixture(scope="module")
def ray_start(request):
    try:
        yield ray.init(num_cpus=16)
    finally:
        ray.shutdown()


def test_from_daft_raises_error_on_pyarrow_14(ray_start):
    # This test assumes that `from_daft` calls `get_pyarrow_version` to get the
    # PyArrow version. We can't mock `__version__` on the module directly because
    # `get_pyarrow_version` caches the version.
    with patch(
        "ray.data.read_api.get_pyarrow_version", return_value=parse_version("14.0.0")
    ):
        with pytest.raises(RuntimeError):
            ray.data.from_daft(daft.from_pydict({"col": [0]}))


@pytest.mark.skipif(
    parse_version(pa.__version__) >= parse_version("14.0.0"),
    reason="https://github.com/ray-project/ray/issues/53278",
)
def test_daft_round_trip(ray_start):
    data = {
        "int_col": list(range(128)),
        "str_col": [str(i) for i in range(128)],
        "nested_list_col": [[i] * 3 for i in range(128)],
        "tensor_col": [np.array([[i] * 3] * 3) for i in range(128)],
    }
    df = daft.from_pydict(data)
    ds = ray.data.from_daft(df)
    pd.testing.assert_frame_equal(ds.to_pandas(), df.to_pandas())

    df2 = ds.to_daft()
    df_pandas = df.to_pandas()
    df2_pandas = df2.to_pandas()
    for c in data.keys():
        # NOTE: tensor behavior on round-trip is different because Ray Data provides
        # Daft with more information about a column being a fixed-shape-tensor.
        #
        # Hence the Pandas representation of `df1` is "just" an object column, but
        # `df2` knows that this is actually a numpy fixed shaped tensor column
        if c == "tensor_col":
            np.testing.assert_equal(
                np.array(list(df_pandas[c])), df2_pandas[c].to_numpy()
            )
        else:
            pd.testing.assert_series_equal(df_pandas[c], df2_pandas[c])


if __name__ == "__main__":
    import sys

    sys.exit(pytest.main(["-v", __file__]))
### Expected behavior
No response
### Component(s)
Ray Runner
### Additional context
No response
Eventual-Inc/Daft · GitHub
07/23/2025, 4:22 AM
1. UDF return types (@daft.udf(return_dtype=...))
2. .apply return types (.apply(..., return_dtype=...))
3. Casting (.cast(...))
4. Type hinting (.read_parquet(schema=...))
(I might have missed a few places)
However, writing exact Daft types is quite verbose:
from daft import DataType

@daft.udf(return_dtype=DataType.float64())
def f():
    ...
To fix this, we allowed some mapping of Python types to Daft types:
@daft.udf(return_dtype=float)
def f():
    ...
This also works for struct types, using Python dicts:
@daft.udf(return_dtype={"foo": DataType.float64()})
def f():
    ...
However, lists don't work! Thus building a highly complex type such as a list-of-list-of-struct-of-list is highly verbose:
@daft.udf(return_dtype=DataType.list(DataType.list(DataType.struct({"foo": DataType.list(DataType.float64())}))))
def f():
    ...
### Describe the solution you'd like
Here is a proposal, which looks quite Pythonic, though I'm not aware of any other library that does this, which is maybe a bit concerning:
@daft.udf(return_dtype=[[{"foo": [float, ...]}, ...], ...])
In Python, there is the Ellipsis singleton. The above is completely valid Python, and expresses the same types as the highly verbose datatype variant, while remaining relatively readable. Look at this, it's beautiful!
@daft.udf(return_dtype={
    "bboxes": [[float, 4], ...],  # fixed size syntax
    "objects": [str, ...],
    "image": [[[DataType.int8(), ...], ...], 3],  # mix and match for specific datatypes
    "metadata": dict[str, int],  # map type
})
### Describe alternatives you've considered
No response
### Additional Context
No response
### Would you like to implement a fix?
No
Eventual-Inc/Daft · GitHub
07/23/2025, 10:05 AM
…daft.context.get_context()._runner.name. The problem is that a UDF running on a Ray worker gets a None result when calling daft.context.get_context()._runner, so I added a daft.context.get_context().get_runner_type() method in this PR. The method works as follows (see the sketch after this list):
1. Prioritize daft.context.get_context()._runner to determine the runner type;
2. If daft.context.get_context()._runner is None, call the detect_ray_state method to determine whether we are currently running on Ray. If so, the current runner type is considered to be ray; otherwise it is native.
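A minimal sketch of that fallback logic, using the names from the PR description (signatures and the Ray check are illustrative, not the PR's exact code):
import daft

def detect_ray_state() -> bool:
    # Hypothetical stand-in for the helper named in the PR: true when
    # the current process is running inside Ray.
    try:
        import ray
        return ray.is_initialized()
    except ImportError:
        return False

def get_runner_type() -> str:
    ctx = daft.context.get_context()
    # 1. Prefer the explicitly configured runner, if one exists.
    if ctx._runner is not None:
        return ctx._runner.name
    # 2. Otherwise infer from the Ray state.
    return "ray" if detect_ray_state() else "native"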
In addition, I found that when the DAFT_RUNNER env var is inconsistent with set_runner_xxx, Daft prioritizes the set_runner_xxx setting, so I added some warning logs to remind users.
## Related Issues
No issue
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 1:00 PM
…py-spy==0.4.0, while the version pinned in Daft is py-spy==0.3.14. Installing both simultaneously would cause a version conflict, so this small change relaxes the py-spy pin to a version range.
## Related Issues
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 2:46 PM
Expands ~ to the HOME directory in read_deltalake and write_deltalake. These use different backends (read uses delta-rs, write uses daft), so the changes are implemented in the Python API.
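Presumably this amounts to expanding the user path before handing it to either backend; a minimal sketch under that assumption (not the PR's actual code):
import os

def _expand_user_path(path: str) -> str:
    # Hypothetical helper: "~/tables/t5" -> "/home/<user>/tables/t5".
    # Only paths starting with ~ are touched; URLs like s3:// pass through.
    return os.path.expanduser(path) if path.startswith("~") else path

assert _expand_user_path("s3://bucket/t5") == "s3://bucket/t5"
assert _expand_user_path("~/tables/t5") == os.path.expanduser("~/tables/t5")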
## Related Issues
Closes #4786.
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 7:02 PM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 7:08 PM
The fix is in array/ops/cast.rs, specifically the `FixedSizeListArray` cast implementation.
There is now a match arm for DataType::Embedding. Upon encountering it, Daft now properly casts the data into an EmbeddingArray struct (from datatypes::logical). Before, this case was unrecognized and thus led to an unimplemented! error.
Secondary fix is related to this Arrow 8 bug: https://issues.apache.org/jira/browse/ARROW-12201. When writing to Parquet in Arrow 8, the logical datatype information for u32 values is incorrectly written: they are instead written as signed 64-bit integers! The fix is to always use Parquet version >= 2.0. In later versions of Arrow, the default switches from Parquet version 1.0 to >= 2.0. Here, the fix is to always set version=2.6 in the ParquetWriter object used in daft.io.writer.
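For reference, a minimal sketch of pinning the Parquet format version with PyArrow's writer (illustrative; the PR applies the equivalent setting inside daft.io.writer):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ids": pa.array([1, 2, 3], type=pa.uint32())})

# version="2.6" keeps the unsigned logical type intact; with version="1.0",
# older Arrow releases wrote u32 columns as signed 64-bit integers.
with pq.ParquetWriter("/tmp/ids.parquet", table.schema, version="2.6") as writer:
    writer.write_table(table)

assert pq.read_table("/tmp/ids.parquet").schema.field("ids").type == pa.uint32()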
Added a new Python-based test to cover the fix, tests/io/test_roundtrip_embeddings.py. The new test_roundtrip_embedding covers a roundtrip serialization to and from both Parquet and Lance.
Also added new tests in `tests/series/test_cast.py`: test_series_cast_fixed_shape_list_to_embedding explicitly casts fixed-size list arrays into embeddings, and test_series_cast_embedding_to_fixed_shape_list does the reverse.
Supporting these tests is a new random_numerical_embedding helper function in tests/utils.py.
One drive-by change: updated to use non-deprecated rand functions & types in src/daft-core/src/array/ops/cast.rs.
Fixes: #4732
## Related Issues
#4732
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 8:55 PM
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft · GitHub
07/23/2025, 11:42 PM
…daft/functions/, and if you add a function, it will automatically be added to the functions page in the category corresponding to the file it's in!
Screenshots:
[image]
[image]
## Related Issues
#4737
#4824
## Checklist
• Documented in API Docs (if applicable)
• Documented in User Guide (if applicable)
• If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
• Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Eventual-Inc/Daft