Thanks for the bug confirmation <@U028L1V9BE1>. I...
# integrate-iceberg-datahub
m
Thanks for the bug confirmation @helpful-optician-78938. It will probably fix DecimalType, as well as TimeType, Timestamp and TimestampZ types... They all use avro logical types. I have another thing that bugs me regarding field descriptions and Avro. This will be easier to show once I update my PR. I will do that tomorrow morning and let you know.
👍 1
h
Hi @modern-monitor-81461, btw, the fix for this is now merged to master: https://github.com/datahub-project/datahub/pull/4433
👍 1
m
@helpful-optician-78938 I've updated my code with the latest changes from
master
and I still have an issue with logical types. In my test, I'm trying to map a
Date
to an avro
date
logical type and when I look at
self.actual___schema
, I see the following:
self._actual_schema = {"type": {"type": "int", "logicalType": "date", "native_data_type": "date", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}
"logicalType"
is not under
props
, so that code is not matching and that if is by-passing the lookup in the mapping table.. Am I missing something?
The temp fix you proposed the other day worked
Copy code
type=self._converter._get_column_type(
         actual_schema.type,
          (
              getattr(actual_schema, "logical_type", None)
              or actual_schema.props.get("logicalType")
           ),
),
h
Thanks @modern-monitor-81461! Could you share the output of
pip freeze | grep avro
?
m
Copy code
avro==1.10.2
avro-gen3==0.7.2
fastavro==1.4.10
@helpful-optician-78938 Any update on this? ^ Were you able to reproduce it?
I apologize for the lengthy post... 😊 I took the time to look into this I now understand why @helpful-optician-78938 asked for the avro version. Turns out there is a bug in version
1.10.2
and there is a fix in
1.11.1
, which hasn't been released yet 😢 (I have inquired when it would be. Still waiting...) Here is my documentation of the problem. I wrote a test case to show the problem when Decimal logical type is used:
Copy code
def test_avro_nullable():
    import avro.schema
    decimal_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}"""
    decimal_avro_schema = avro.schema.parse(decimal_avro_schema_string)
    print("\nDecimal")
    print(f"Before: {decimal_avro_schema_string}")
    print(f"After:  {decimal_avro_schema}")

    boolean_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "doc": "required field documentation"}]}"""
    boolean_avro_schema = avro.schema.parse(boolean_avro_schema_string)
    print("\nBoolean")
    print(f"Before: {boolean_avro_schema_string}")
    print(f"After:  {boolean_avro_schema}")
the output is:
Copy code
Decimal
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}
After:  {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}]}

Boolean
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "doc": "required field documentation"}]}
After:  {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}]}
Notice for the Decimal output,
_nullable
and
native_data_type
are dropped. In the avro code, those are considered
other_props
. In avro 1.10.2, here is how
BytesDecimalSchema
is created: https://github.com/apache/avro/blob/release-1.10.2/lang/py/avro/schema.py#L1022. In avro 1.11.1 (master branch), here is the fix: https://github.com/apache/avro/blob/master/lang/py/avro/schema.py#L1072 So after looking some more at avro's code, I noticed that there is an alternate path for Decimal type. I added the following test and now I'm able to get
_nullable
and `native_data_type`:
Copy code
decimal_fixed_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "fixed", "size": 16, "name": "bogusName", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}"""
    decimal_fixed_avro_schema = avro.schema.parse(decimal_fixed_avro_schema_string)
    print("\nDecimal (fixed)")
    print(f"Before: {decimal_fixed_avro_schema_string}")
    print(f"After:  {decimal_fixed_avro_schema}")
returns this output:
Copy code
Decimal (fixed)
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "fixed", "size": 16, "name": "bogusName", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}
After:  {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "fixed", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false, "name": "bogusName", "size": 16}, "name": "required_field", "doc": "required field documentation"}]}
All of this to say that @helpful-optician-78938, you don't have to look into this anymore (in case it was on your to-do list). @mammoth-bear-12532 I've said this at the beginning, but decided to still go that route to go faster, but I really don't like the fact that we go through Avro to generate SchemaFields. I understand it was to avoid duplicated code in many sources, but I would be curious to get your opinion on whether we could implement a builder to do that. The builder would track the field paths in a nested record... I might be missing something, but there's got to be a better way then going through avro and being at the mercy of bugs like this. Thoughts?
h
Hi @modern-monitor-81461, thanks for digging into this. I'll get back to you by tomorrow with the answers to your questions.
m
Just to add to my previous comment about questioning the use of avro, I'm facing another "bug" with timestamp. Iceberg support timestamps with and without timezone. The avro spec supports local-timestamp-millis and local-timestamp-micros, but the actual Python implementation does not (but the Java one does...). I tried to use
local-timestamp-micros
and it wasn't being recognized. I had to dig into avro's code to figure it has been "forgotten". I totally back the need for having common code that can be re-used across ingestion sources, but I'm afraid having to rely on avro is not the best dev-experience. I'm just relating my experience for you guys to maybe make things better...
@helpful-optician-78938 If I may ask since you will answer my questions 😅 , I am trying to figure out why my Arrays are always showing with a
nestedType=None
. I have looked around and I can't figure out where we would be setting this. The only place I can see is here in schema_util.py, but we are always going to call the
ArrayTypeClass.__init__()
with no argument, so effectively setting
nestedType=None
. Am I right to believe that the current code will never set the array nested type?
h
will look into this as well.
m
Totally agree @modern-monitor-81461. We need to plan this though. Will be a medium-large size project.