# ingestion
r
Hi! I have a contribution proposal and would like to know if it would be useful. We're ingesting a lot of entities, more than 500k, using stateful ingestion, so the checkpoint state becomes huge, about 50 MB, and compression is crucial for us. There is a warning in the code that compression cannot be turned on: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/checkpoint.py#L42 . If I understand correctly, the reason is that the aspect has to be represented as a string, not a byte array. So, in our custom CheckpointState, I compress the data with bz2 and then encode it to a string with base85. The result is about 5 times smaller than the aspect's JSON representation. I'm wondering if this approach would be useful in the CheckpointStateBase#to_bytes method, i.e. doing something like
base64.b85encode(bz2.compress(pickle.dumps(self)))
instead of
json_str_self.encode("utf-8")
here https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/checkpoint.py#L49 ? Thank you!
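For reference, the round trip described above could look roughly like the sketch below. It is only an illustration using the Python standard library: `encode_state`, `decode_state`, and `sample_state` are hypothetical names, and the dict is a stand-in for a real checkpoint state, not DataHub's actual `CheckpointStateBase`.

```python
import base64
import bz2
import json
import pickle


def encode_state(state: object) -> bytes:
    """Pickle, bz2-compress, then base85-encode so the result is ASCII-only."""
    return base64.b85encode(bz2.compress(pickle.dumps(state)))


def decode_state(blob: bytes) -> object:
    """Reverse of encode_state. Only unpickle data from a trusted source."""
    return pickle.loads(bz2.decompress(base64.b85decode(blob)))


# Hypothetical stand-in for a large checkpoint state (assumption, not real data).
sample_state = {
    "urns": [
        f"urn:li:dataset:(urn:li:dataPlatform:hive,db.table_{i},PROD)"
        for i in range(10_000)
    ]
}

encoded = encode_state(sample_state)
plain_json = json.dumps(sample_state).encode("utf-8")

# The encoded bytes are plain ASCII, so they can be stored as a string field.
encoded_str = encoded.decode("ascii")
restored = decode_state(encoded_str.encode("ascii"))

print(len(plain_json), len(encoded))
```

The actual size ratio depends on the data, and anything written this way also needs a matching decode path on read, since older checkpoints would still be stored as plain JSON.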
h
This is excellent @rich-machine-24265. Please feel free to contribute.
r
Thanks Ravindra, I'll create a PR.