# ingestion
r
Hi all! I found that the ingestion process from Delta Lake can use a lot of memory (in my case more than 8 GB), and it looks like this could be reduced. The reduction is critical for me. The DataHub ingestion library uses the deltalake Python library, and that library creates a vector with all Parquet file names for all of the Delta table's states. The vector can be big. Very big! Huge! DataHub needs the vector only to calculate the number of files. The deltalake Python library is built on the deltalake Rust library, and the Rust library has a special flag (require_files) that controls whether the files vector is created at all. Skipping the vector should save memory.
pub struct DeltaTableLoadOptions {
    // ... (other fields omitted)
    /// Indicates whether DeltaTable should track files.
    /// This defaults to `true`
    ///
    /// Some append-only applications might have no need of tracking any files.
    /// Hence, DeltaTable will be loaded with significant memory reduction.
    pub require_files: bool,
}
The main problem is that the flag can't be set from the deltalake Python library (the library would need to be changed to expose it). Another question is how we could calculate the number of files in an alternative way.
DataHub's code (use of the DeltaTable class): https://github.com/datahub-project/datahub/blob/083ab9bc0e7b9d8ba293afcf9fae4ffb71c4f86c/metadata-ingestion/src/datahub/ingestion/source/delta_lake/delta_lake_utils.py#L24
deltalake Python library:
 - DeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/deltalake/table.py#L72
 - RawDeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/src/lib.rs#L78
deltalake Rust library:
 - DeltaTableBuilder class (require_files is in the options: DeltaTableLoadOptions field): https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/rust/src/builder.rs#L116
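On the alternative way to count files, here is a rough sketch of one idea, not something DataHub or deltalake does today: replay the JSON commit files under _delta_log and keep only a running counter instead of a list of file names. The function name count_active_files is made up for illustration. It assumes a table on the local filesystem, a commit log that is still complete from version 0 (no cleaned-up commits, checkpoints are ignored), and that every remove action matches exactly one earlier add, so the result is an estimate under those assumptions.

```python
import json
import re
from pathlib import Path

# Delta commit files are named with a zero-padded 20-digit version number.
COMMIT_NAME = re.compile(r"\d{20}\.json")


def count_active_files(table_path: str) -> int:
    """Estimate the number of live data files from the JSON commit log.

    Hypothetical helper: only a counter is kept, no list of file names.
    Assumes a local path, a complete log from version 0, and that each
    remove action cancels exactly one earlier add action.
    """
    log_dir = Path(table_path) / "_delta_log"
    count = 0
    for commit in log_dir.iterdir():
        if not COMMIT_NAME.fullmatch(commit.name):
            continue  # skip checkpoints, _last_checkpoint, .crc files, etc.
        with commit.open() as fh:
            for line in fh:  # one JSON action per line
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                if "add" in action:
                    count += 1
                elif "remove" in action:
                    count -= 1
    return count
```

Because only the difference between adds and removes is tracked, the order in which commits are read does not matter, and memory stays flat no matter how many files the table has. For tables on S3 or with truncated logs this would need a different reader and checkpoint handling.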
h
Hi @rich-battery-25772, thanks for reporting this issue along with the root cause. Please open a GitHub issue here and we will address this as soon as we can.