Andrew Fuqua
04/21/2025, 10:59 PM
Everett Kleven
04/24/2025, 4:45 PM
Yufan
04/25/2025, 3:56 PM
Andrew Kursar
04/25/2025, 9:29 PM
yashovardhan chaturvedi
05/02/2025, 2:34 PM
Neil Wadhvana
05/04/2025, 1:55 AM
daft
without needing to specify each column (since I can have anywhere from 1 to 5). Is there another syntax that would work? This is not working as is:
from typing import Dict, List

import numpy as np

import daft

@daft.udf(return_dtype=daft.DataType.python())
def mean_ensemble(*depth_value_series: daft.Series) -> List[Dict[str, np.ndarray]]:
    """Apply mean ensemble to depth maps."""
    depth_value_lists = [series.to_pylist() for series in depth_value_series]
    reduced_depth_maps: List[Dict[str, np.ndarray]] = []
    for depth_value_list in depth_value_lists:
        # Calculate mean and standard deviation across all depth maps in the list
        stacked_depths = np.stack(depth_value_list, axis=0)
        mean_depth = np.mean(stacked_depths, axis=0)
        std_depth = np.std(stacked_depths, axis=0)
        reduced_depth_maps.append(
            {
                "mean": mean_depth,
                "std": std_depth,
            }
        )
    return reduced_depth_maps
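For reference, here is a minimal sketch of how such a variadic UDF could be invoked with however many depth columns a given DataFrame has. The column names and the output column name are made up for illustration, and this assumes Daft accepts a list of column expressions unpacked as positional UDF arguments:
from daft import col

# Hypothetical column names; in practice there may be anywhere from 1 to 5.
depth_cols = ["depth_model_a", "depth_model_b", "depth_model_c"]
df = df.with_column("depth_ensemble", mean_ensemble(*[col(c) for c in depth_cols]))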
Garrett Weaver
05/14/2025, 1:05 PM
Yufan
05/21/2025, 11:29 PM
Yuri Gorokhov
05/23/2025, 9:29 PM
Attempting to downcast Map { key: Utf8, value: List(Utf8) } to "daft_core::array::list_array::ListArray"
Wondering if someone has seen this before?
Giridhar Pathak
05/25/2025, 1:37 AM
TypeError: pyarrow.lib.large_list() takes no keyword arguments
the code:
table = catalog.load_table(table)
return df.read_iceberg(table)
Has anyone experienced this before?
Giridhar Pathak
05/25/2025, 10:18 PM
daft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5).show()
Running this makes the process crash; it looks like memory goes through the roof. Not sure if it's trying to read the whole table into memory.
Pre-materialization, I can get the schema just fine.
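One way to narrow this down is to inspect the query plan before collecting, to see whether the timestamp filter and the limit are pushed into the Iceberg scan. A small sketch, assuming explain() is available on the DataFrame in the version being used:
df = daft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5)
# Print the unoptimized and optimized plans; whether the filter appears inside
# the scan node indicates whether pushdown happened.
df.explain(show_all=True)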
Everett Kleven
05/28/2025, 2:15 PM
Yuri Gorokhov
05/28/2025, 4:14 PM
.dropDuplicates(subset: Optional[List[str]] = None)
where you can specify which columns to consider?
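For context, the PySpark signature referenced above behaves like this (an illustrative PySpark snippet, not Daft code):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "val"])
# Deduplicate on a subset of columns; one arbitrary row is kept per id.
df.dropDuplicates(subset=["id"]).show()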
Pat Patterson
05/29/2025, 11:37 PM
# How many records are in the current Drive Stats dataset?
count, elapsed_time = time_collect(drivestats.count())
print(f'Total record count: {count.to_pydict()["count"][0]} ({elapsed_time:.2f} seconds)')
With the other systems I tested in my blog post, the equivalent query takes between a fraction of a second and 15 seconds. That Daft call to drivestats.count() takes 80 seconds. I’m guessing it’s doing way more work than it needs to - reading the record counts from each of the 365 Parquet files rather than simply reading total-records from the most recent metadata file. Since SELECT COUNT(*) is such a common operation, I think it’s worth short-circuiting the current behavior.
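For reference, the metadata shortcut being suggested would look roughly like this with pyiceberg, where table is the pyiceberg table object backing the DataFrame. This is a sketch under the assumption that the current snapshot's summary exposes Iceberg's total-records property as a mapping entry, which is worth verifying against the pyiceberg version in use:
# Read the record count from Iceberg table metadata instead of scanning
# every Parquet file (assumes Snapshot.summary acts like a mapping).
snapshot = table.current_snapshot()
if snapshot is not None and snapshot.summary is not None:
    total_records = int(snapshot.summary["total-records"])
    print(f"Total record count from metadata: {total_records}")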
Giridhar Pathak
06/03/2025, 2:17 PM
Everett Kleven
06/03/2025, 2:59 PM
Pat Patterson
06/06/2025, 10:47 PM
import daft
import ray
from pyiceberg.catalog import load_catalog

ray.init("ray://head_node_host:10001", runtime_env={"pip": ["daft"]})
daft.context.set_runner_ray("ray://head_node_host:10001")
catalog = load_catalog(
    'iceberg',
    **{
        'uri': 'sqlite:///:memory:',
        # configuration to access Backblaze B2's S3-compatible API such as
        # s3.endpoint, s3.region, etc
    }
)
catalog.create_namespace('default', {'location': 's3://my-bucket/'})
table = catalog.register_table('default.drivestats', metadata_location)
drivestats = daft.read_iceberg(table)
result = drivestats.count().collect()
print(f'Total record count: {result.to_pydict()["count"][0]}')
Presumably, the code to read Parquet files from Backblaze B2 via the AWS SDK executes on the Ray cluster, so I have to either install the necessary packages there ahead of time or specify them, and environment variables, in runtime_env? For example:
ray.init("ray://<head_node_host>:10001", runtime_env={
    "pip": ["daft==0.5.2", "boto3==1.34.162", "botocore==1.34.162", ...etc...],
    "env_vars": {
        "AWS_ENDPOINT_URL": os.environ["AWS_ENDPOINT_URL"],
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        ...etc...
    }
})
Dimitris
06/10/2025, 8:41 PM
Tabrez Mohammed
06/16/2025, 10:41 PM
Dimitris
06/16/2025, 11:22 PM
Sasha Shtern
06/17/2025, 8:06 PM
Dimitris
06/18/2025, 10:21 PM
Garrett Weaver
06/19/2025, 4:25 PM
catalog.namespace.table (e.g. write_table), but others break and seem to expect namespace.table (e.g. create_table_if_not_exists, which calls pyiceberg under the hood, and pyiceberg had a breaking change that no longer allows including the catalog). Any advice on how I should be providing identifiers?
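For illustration, these are the two identifier forms being contrasted; the catalog, namespace, and table names here are made up, and which form a given call accepts depends on the Daft and pyiceberg versions in use:
# Catalog-qualified identifier, reportedly accepted by calls like write_table.
catalog_qualified = "my_catalog.my_namespace.my_table"
# Namespace-qualified identifier, reportedly expected by calls such as
# create_table_if_not_exists after the pyiceberg change mentioned above.
namespace_qualified = "my_namespace.my_table"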
Garrett Weaver
06/20/2025, 7:52 PM
Marco Gorelli
06/24/2025, 8:58 AM
Everett Kleven
06/24/2025, 4:11 PM
Sammy Sidhu
06/24/2025, 4:19 PM
Artur Ciocanu
06/25/2025, 7:55 PM
ChanChan Mao
06/26/2025, 4:41 PM
Garrett Weaver
06/27/2025, 7:23 PM
pyarrow when using Ray data directly, and also seeing errors trying to read the same data with daft. pyspark works fine. I assume daft might be impacted by the same issue (too large row groups)?
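If oversized row groups are the suspect, a quick way to confirm is to inspect the Parquet metadata directly with pyarrow; a minimal sketch, with a placeholder file path:
import pyarrow.parquet as pq

# Placeholder path; point this at one of the problematic Parquet files.
meta = pq.ParquetFile("part-00000.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # total_byte_size is the uncompressed size of the row group's data.
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB")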