#advice-data-transformation
Jules Druelle

03/30/2022, 10:17 AM
Open to any advice & tips 🙂
Matt Sterling

03/30/2022, 1:52 PM
denesting is hard, basically. You can write your own, but the edge cases will quickly drive you crazy. Airbyte will do denesting in many cases though - is that not working here?
Marie

03/30/2022, 2:13 PM
I think the issue is that denesting is done for the first-level object, but when there's another nested object inside, it gets parsed as a string. I encountered the same issue with MongoDB documents and am also looking for a solution (a dbt transformation?)
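(A minimal Python sketch of the problem Marie describes: when a loader leaves a second-level object as a JSON string, it has to be re-parsed before it can be flattened. The field names here are made up for illustration.)

```python
import json

def deep_parse(value):
    """Recursively re-parse values that arrived as JSON-encoded strings,
    so second-level objects become dicts again.
    Caveat: numeric-looking strings like "42" also get converted."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except (ValueError, TypeError):
            return value  # an ordinary string, not embedded JSON
        return deep_parse(parsed)
    if isinstance(value, dict):
        return {k: deep_parse(v) for k, v in value.items()}
    if isinstance(value, list):
        return [deep_parse(v) for v in value]
    return value

# The inner "geo" object is itself a JSON string inside a JSON string.
row = {"user": {"address": '{"city": "Lille", "geo": "{\\"lat\\": 50.6}"}'}}
print(deep_parse(row))
# {'user': {'address': {'city': 'Lille', 'geo': {'lat': 50.6}}}}
```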
Jordan Fox

03/30/2022, 9:27 PM
What platform are you using? I've written multiple different styles of semi-structured normalization scripts, either pure Python or PySpark or Scala Spark. It's much easier to do this pre-SQL and then just land the normalized dataframe into SQL and select what you need. Unfortunately with some IoT devices, you get gibberish across the network. Combining schemas for some set of, say, drilling devices created ~500 columns, a lot of which were the same thing just identified differently in the JSON stream. Takes effort to map them.
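(The pure-Python style of normalization Jordan mentions usually boils down to a recursive flatten before loading. A minimal sketch, with made-up field names; lists are left alone since they typically need a separate explode step.)

```python
def flatten(obj, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single-level dict.
    Nested keys are joined with `sep`; lists are passed through as-is."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

record = {"device": {"id": "d1", "sensor": {"temp": 21.5}}, "ts": 1}
print(flatten(record))
# {'device_id': 'd1', 'device_sensor_temp': 21.5, 'ts': 1}
```

The flat dict then maps directly to tabular columns in the destination.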
Marie

03/31/2022, 8:56 AM
For me, I'm trying to move my MongoDB collection to PostgreSQL. In your case, have you scheduled your Python/PySpark jobs to extract the data?
Jordan Fox

03/31/2022, 9:03 PM
@Marie if that was targeted at me, yes, I'd have a source > lake raw > lake normalized > destination pipeline that extracts json and normalizes it at the lake layer, and then we can push it into a destination in tabular format.
Jules Druelle

04/01/2022, 10:32 AM
@Matt Sterling Nope, Airbyte doesn't handle this case
10:33 AM
@Jordan Fox In my case I am currently extracting data from MongoDB and loading it into a BigQuery database. I'm thinking about implementing a generic dbt function that can iterate over nested properties and objects
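(One way to sketch that "iterate over nested properties" idea: walk a sample document and generate a BigQuery SELECT with one `JSON_EXTRACT_SCALAR` per leaf path. `JSON_EXTRACT_SCALAR` is a real BigQuery function; the table and column names below are placeholders, and a real dbt macro would do this in Jinja rather than Python.)

```python
def json_paths(obj, prefix="$"):
    """Walk a sample document and yield (json_path, column_name)
    pairs for every leaf, covering all nested properties."""
    for key, value in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            yield from json_paths(value, path)
        else:
            yield path, path[2:].replace(".", "_")

def build_select(sample, source_table, json_column):
    """Emit a BigQuery SELECT that extracts every leaf of `sample`
    from a JSON column into its own flat column."""
    cols = ",\n  ".join(
        f"JSON_EXTRACT_SCALAR({json_column}, '{path}') AS {name}"
        for path, name in json_paths(sample)
    )
    return f"SELECT\n  {cols}\nFROM {source_table}"

sample = {"user": {"name": "x", "address": {"city": "y"}}}
print(build_select(sample, "`project.dataset.raw`", "payload"))
```

Generating the column list from a sample document sidesteps writing the extraction paths by hand, at the cost of trusting the sample to cover the schema.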