modern-monitor-81461
03/15/2022, 11:46 PM
helpful-optician-78938
03/18/2022, 5:25 PM
modern-monitor-81461
03/18/2022, 8:06 PM
`master` and I still have an issue with logical types. In my test, I'm trying to map a Date to an avro `date` logical type, and when I look at `self._actual_schema`, I see the following:
self._actual_schema = {"type": {"type": "int", "logicalType": "date", "native_data_type": "date", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}
`"logicalType"` is not under `props`, so that code is not matching and the `if` is bypassing the lookup in the mapping table. Am I missing something?
modern-monitor-81461
03/18/2022, 8:08 PM
type=self._converter._get_column_type(
    actual_schema.type,
    (
        getattr(actual_schema, "logical_type", None)
        or actual_schema.props.get("logicalType")
    ),
),
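A minimal sketch of the fallback in that snippet, in case it helps to test the lookup in isolation. The helper name `resolve_logical_type` and the `SimpleNamespace` stand-ins are hypothetical; they only mimic objects that expose a `logical_type` attribute and/or a `props` dict, and are not real avro schema classes.

```python
from types import SimpleNamespace


def resolve_logical_type(schema):
    """Return the logical type name for a schema-like object, or None."""
    # Prefer a dedicated attribute when the schema class exposes one.
    logical_type = getattr(schema, "logical_type", None)
    if logical_type:
        return logical_type
    # Fall back to the generic property bag; tolerate objects without `props`.
    props = getattr(schema, "props", None) or {}
    return props.get("logicalType")


# Hypothetical stand-ins for parsed schema objects.
with_attr = SimpleNamespace(type="int", logical_type="date", props={})
with_prop = SimpleNamespace(type="int", props={"logicalType": "date"})
neither = SimpleNamespace(type="int", props={})

for candidate in (with_attr, with_prop, neither):
    print(resolve_logical_type(candidate))
```

If the logical type lives somewhere other than these two places on the parsed object, the helper returns `None` and the mapping-table lookup is skipped, which is the symptom described above.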
helpful-optician-78938
03/18/2022, 8:56 PM
`pip freeze | grep avro`?
modern-monitor-81461
03/18/2022, 10:13 PM
avro==1.10.2
avro-gen3==0.7.2
fastavro==1.4.10
modern-monitor-81461
03/28/2022, 11:53 AM
modern-monitor-81461
03/30/2022, 9:15 PM
`1.10.2` and there is a fix in `1.11.1`, which hasn't been released yet 😢 (I have inquired when it would be. Still waiting...)
Here is my documentation of the problem.
I wrote a test case to show the problem when Decimal logical type is used:
def test_avro_nullable():
    import avro.schema

    # Decimal logical type on top of "bytes", with custom props attached
    decimal_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}"""
    decimal_avro_schema = avro.schema.parse(decimal_avro_schema_string)
    print("\nDecimal")
    print(f"Before: {decimal_avro_schema_string}")
    print(f"After: {decimal_avro_schema}")

    # Plain boolean with the same custom props, for comparison
    boolean_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "doc": "required field documentation"}]}"""
    boolean_avro_schema = avro.schema.parse(boolean_avro_schema_string)
    print("\nBoolean")
    print(f"Before: {boolean_avro_schema_string}")
    print(f"After: {boolean_avro_schema}")
The output is:
Decimal
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}
After: {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}]}
Boolean
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "doc": "required field documentation"}]}
After: {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "boolean", "native_data_type": "boolean", "_nullable": false}, "name": "required_field", "doc": "required field documentation"}]}
Notice that for the Decimal output, `_nullable` and `native_data_type` are dropped. In the avro code, those are considered `other_props`.
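One way to check for dropped props mechanically rather than by eye is to diff the type-level keys of each field before and after the round trip. This is only a sketch: `dropped_field_props` is a hypothetical helper, it only looks at top-level record fields, and the two schema strings below are hand-written illustrations, not captured output.

```python
import json


def _type_keys(type_def):
    # A bare string type such as "bytes" carries only the type name itself.
    return set(type_def) if isinstance(type_def, dict) else {"type"}


def dropped_field_props(before_json, after_json):
    """Map each field name to the sorted type-level keys missing after the round trip."""
    before = {f["name"]: f["type"] for f in json.loads(before_json)["fields"]}
    after = {f["name"]: f["type"] for f in json.loads(after_json)["fields"]}
    dropped = {}
    for name, before_type in before.items():
        missing = sorted(_type_keys(before_type) - _type_keys(after.get(name, {})))
        if missing:
            dropped[name] = missing
    return dropped


# Illustrative input: a schema with custom props, and a copy without them.
before = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}}]}"""
after = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2}}]}"""

result = dropped_field_props(before, after)
print(result)
```

In the test above, this could be called as `dropped_field_props(decimal_avro_schema_string, str(decimal_avro_schema))`, assuming `str()` of the parsed schema yields its JSON form, as the f-string output in the test suggests.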
In avro 1.10.2, here is how `BytesDecimalSchema` is created: https://github.com/apache/avro/blob/release-1.10.2/lang/py/avro/schema.py#L1022.
In avro 1.11.1 (master branch), here is the fix: https://github.com/apache/avro/blob/master/lang/py/avro/schema.py#L1072
So after looking some more at avro's code, I noticed that there is an alternate path for the Decimal type. I added the following test and now I'm able to get `_nullable` and `native_data_type`:
    # Decimal logical type on top of "fixed" instead of "bytes"
    decimal_fixed_avro_schema_string = """{"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "fixed", "size": 16, "name": "bogusName", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}"""
    decimal_fixed_avro_schema = avro.schema.parse(decimal_fixed_avro_schema_string)
    print("\nDecimal (fixed)")
    print(f"Before: {decimal_fixed_avro_schema_string}")
    print(f"After: {decimal_fixed_avro_schema}")
It returns this output:
Decimal (fixed)
Before: {"type": "record", "name": "__struct_", "fields": [{"name": "required_field", "type": {"type": "fixed", "size": 16, "name": "bogusName", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false}, "doc": "required field documentation"}]}
After: {"type": "record", "name": "__struct_", "fields": [{"type": {"type": "fixed", "logicalType": "decimal", "precision": 3, "scale": 2, "native_data_type": "decimal(3, 2)", "_nullable": false, "name": "bogusName", "size": 16}, "name": "required_field", "doc": "required field documentation"}]}
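If someone wanted to apply the fixed-based path programmatically, a rough sketch could look like the following. Both helper names are hypothetical, the fixed type's `name` has to be supplied by the caller (like `bogusName` above), and the default size is just the smallest byte count whose signed range can hold the given precision. Nothing here is part of avro or DataHub.

```python
import copy


def min_fixed_size(precision):
    """Smallest byte count whose signed two's-complement range holds `precision` digits."""
    size = 1
    while 2 ** (8 * size - 1) - 1 < 10 ** precision - 1:
        size += 1
    return size


def bytes_decimal_to_fixed(type_def, name, size=None):
    """Return a fixed-based copy of a bytes-based decimal type definition."""
    if type_def.get("type") != "bytes" or type_def.get("logicalType") != "decimal":
        return type_def  # not a bytes decimal, leave untouched
    fixed = copy.deepcopy(type_def)  # keeps custom props such as native_data_type
    fixed["type"] = "fixed"
    fixed["name"] = name
    fixed["size"] = size if size is not None else min_fixed_size(fixed["precision"])
    return fixed


bytes_decimal = {
    "type": "bytes",
    "logicalType": "decimal",
    "precision": 3,
    "scale": 2,
    "native_data_type": "decimal(3, 2)",
    "_nullable": False,
}
converted = bytes_decimal_to_fixed(bytes_decimal, name="bogusName", size=16)
print(converted)
```

The input dict is left unmodified, and every key other than `type`, `name`, and `size` is carried over as-is.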
All of this to say that @helpful-optician-78938, you don't have to look into this anymore (in case it was on your to-do list).
@mammoth-bear-12532 I said this at the beginning and still decided to go that route to move faster, but I really don't like the fact that we go through Avro to generate SchemaFields. I understand it was to avoid duplicated code in many sources, but I would be curious to get your opinion on whether we could implement a builder to do that. The builder would track the field paths in a nested record... I might be missing something, but there's got to be a better way than going through avro and being at the mercy of bugs like this. Thoughts?
helpful-optician-78938
03/30/2022, 9:20 PM
modern-monitor-81461
03/30/2022, 10:52 PM
`local-timestamp-micros` and it wasn't being recognized. I had to dig into avro's code to figure out that it had been "forgotten". I totally back the need for common code that can be reused across ingestion sources, but I'm afraid that having to rely on avro is not the best dev experience. I'm just relating my experience so that you guys can maybe make things better...
modern-monitor-81461
03/31/2022, 12:48 AM
`nestedType=None`. I have looked around and I can't figure out where we would be setting this. The only place I can see is here in schema_util.py, but we are always going to call `ArrayTypeClass.__init__()` with no argument, so effectively setting `nestedType=None`. Am I right to believe that the current code will never set the array nested type?
helpful-optician-78938
03/31/2022, 12:49 AM
mammoth-bear-12532