# ingestion
l
Hi all, I am experimenting with running DataHub as the internal data catalog for my company. Our datasets are all in Trino, but the SQLAlchemy-based Trino source doesn't cut it for us, because we have structured fields in Trino. Has there been any attempt to create a Trino source that is not SQLAlchemy-based?
m
Hi @lemon-lion-66467, welcome to DataHub! We are starting to look at structured fields across our sources (Hive, Trino, etc.), so the timing is great 🙂
Is there a library you would recommend for Trino ingestion?
l
I don't have a recommendation for a library yet. I am still experimenting with different libraries. I'll update here. Have you looked at any libraries @mammoth-bear-12532?
m
One option would be PySpark.
p
@mammoth-bear-12532 and @lemon-lion-66467 I tried building a Trino metadata ingestion source and am using it for my PoC. I'm not a Python guy, so I was struggling with writing test cases 😃 https://github.com/rahulbsw/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/trino.py sqlalchemy_trino returns structs as the ROW type, so I registered a custom type mapping to RecordTypeClass: register_custom_type(datatype.ROW, RecordTypeClass). Hope this helps.
We are using an older version of Trino (i.e. 350), so I had to make a lot of changes to the sqlalchemy_trino module: https://github.com/rahulbsw/sqlalchemy-trino/tree/v350_Support
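For readers following along, the custom type mapping idea can be sketched roughly as below. This is a self-contained illustration and not DataHub's actual code: `ROW`, `VARCHAR`, `RecordTypeClass`, `StringTypeClass`, `NullTypeClass`, the `_type_mapping` registry, and `resolve_type` are all stand-ins defined locally so the snippet runs without `datahub` or `sqlalchemy_trino` installed. Only the name `register_custom_type` and the ROW-to-RecordTypeClass pairing come from the message above.

```python
from typing import Dict, Type


# Stand-ins for the real classes (assumptions for illustration only).
class ROW:
    """Placeholder for the ROW column type returned for Trino structs."""


class VARCHAR:
    """Placeholder for a plain string column type."""


class RecordTypeClass:
    """Placeholder for the catalog's record (struct) schema type."""


class StringTypeClass:
    """Placeholder for the catalog's string schema type."""


class NullTypeClass:
    """Placeholder used when a column type has no known mapping."""


# Hypothetical registry from SQL column type to catalog schema type.
_type_mapping: Dict[Type, Type] = {VARCHAR: StringTypeClass}


def register_custom_type(sql_type: Type, output: Type) -> None:
    """Record that columns of `sql_type` should be reported as `output`."""
    _type_mapping[sql_type] = output


def resolve_type(column_type: object) -> Type:
    """Return the mapped schema type for a column type instance.

    Falls back to NullTypeClass when nothing in the registry matches.
    """
    for sql_type, schema_type in _type_mapping.items():
        if isinstance(column_type, sql_type):
            return schema_type
    return NullTypeClass


before = resolve_type(ROW())  # ROW is not registered yet
register_custom_type(ROW, RecordTypeClass)
after = resolve_type(ROW())  # ROW now resolves to the record type
```

Without the registration, a struct column falls through to the "unknown" type, which is presumably why the extra mapping was needed.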
l
Ah I see, thanks for that @prehistoric-yak-75049, I'll try this out. Any particular problems you're running into when parsing ROW as a Record?
p
If the struct is multilevel, it didn't work well. I tested it on 0.8.10; it seems there is a fix for multilevel structs in 0.8.11. One more thing: please make sure your Trino catalog's metadata is up to date. What I noticed was that SHOW TABLES includes stale/dropped tables:
SHOW TABLES FROM <CATALOG>.<SCHEMA>
But when you describe the table, it throws an error and stops processing:
SHOW COLUMNS FROM <CATALOG>.<SCHEMA>.<TABLE>
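One way to keep a stale table from stopping the whole run is to describe each table inside its own try/except and skip the ones that fail. The sketch below is only an illustration under assumptions: `collect_columns` and `run_query` are hypothetical names (any callable that executes SQL and returns rows would do), `fake_run_query` simulates a dropped table, and identifiers are interpolated without quoting for brevity.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, ...]


def collect_columns(
    run_query: Callable[[str], List[Row]],
    catalog: str,
    schema: str,
) -> Tuple[Dict[str, List[Row]], List[str]]:
    """List tables, describe each one, and skip tables that cannot be described."""
    columns: Dict[str, List[Row]] = {}
    skipped: List[str] = []
    for (table, *_rest) in run_query(f"SHOW TABLES FROM {catalog}.{schema}"):
        try:
            columns[table] = run_query(
                f"SHOW COLUMNS FROM {catalog}.{schema}.{table}"
            )
        except Exception:
            # Broad catch for the sketch; real code should catch the
            # driver's specific error type and log the failure.
            skipped.append(table)
    return columns, skipped


def fake_run_query(sql: str) -> List[Row]:
    """Simulated backend: 'dropped_table' is listed but cannot be described."""
    if sql.startswith("SHOW TABLES"):
        return [("orders",), ("dropped_table",)]
    if sql.endswith(".dropped_table"):
        raise RuntimeError("table not found")
    return [("id", "bigint"), ("payload", "row(a varchar)")]


columns, skipped = collect_columns(fake_run_query, "hive", "sales")
```

Here `skipped` ends up holding the stale table name while `columns` still contains the healthy tables, so the rest of the ingestion can continue.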
m
Yes, you need to use the new field path specification going forward to make sure DataHub can render it correctly: https://datahubproject.io/docs/advanced/field-path-spec-v2/
p
Hi @mammoth-bear-12532, the Trino metadata ingestion is based on sql_common, so just merging with the 0.8.11 code base should take care of using the SchemaFieldPath spec v2, right?