Aditya Guru

05/19/2021, 3:08 AM
Hey guys, I have this MongoDB collection with thousands of docs:
• connected Airbyte to it
• it tries to find the schema
• it returns me some fields
The returned fields do not include fields that were seldom used in the collection. I checked the code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mongodb/lib/mongodb_types_explorer.rb Apparently the approach it takes is to just check the first 1000 records, so this is not even sampling of any sort. Correct? Is there a way to edit the fields (add to them) via the UI or backend API for the Mongo source connector? For now I am trying to read https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html but would appreciate any help.
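For illustration, a minimal sketch of this first-1000-documents approach, written in Python with pymongo rather than the connector's actual Ruby; the connection URI and db/collection names are placeholders:

```python
from collections import defaultdict
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["mydb"]["mycollection"]        # placeholder db/collection

# Inspect only the first 1000 documents, as the linked explorer does.
# A field that first appears in document 1001 is never discovered.
field_types = defaultdict(set)
for doc in collection.find().limit(1000):
    for field, value in doc.items():
        field_types[field].add(type(value).__name__)

print(dict(field_types))
```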

user

05/19/2021, 4:08 AM
Hi Aditya! Yep, that's right, this is done sequentially today, without sampling. Curious: if we had an option to do a full scan so types are 100% accurate, would you be willing to wait?

user

05/19/2021, 4:09 AM
It's not possible to change the discovered schema via the API. I think you would have to exec into a running pod and check the schema saved to the volume. @charles (since I see you online) does that make sense?

user

05/19/2021, 4:15 AM
yeah. agreed. though i got to say that's pretty tricky to do.

user

05/19/2021, 4:18 AM
are we thinking about mongo wrong? isn't the point of mongo that it's schema-less?

user

05/19/2021, 4:18 AM
this approach of sampling the data is always going to be wrong.

user

05/19/2021, 4:19 AM
and reading all of the data to get the schema seems incredibly wasteful (and also not really a guarantee of anything since the next record that comes along could have a new field)

user

05/19/2021, 4:22 AM
good point on reading everything = wasteful

user

05/19/2021, 4:22 AM
any thoughts Aditya?

user

05/19/2021, 4:23 AM
naively it feels like the stream name should be the collection name, and then the schema for the stream is
{ "type": "object" }

user

05/19/2021, 4:23 AM
or something like that anyway 🤷‍♀️

user

05/19/2021, 4:39 AM
@Davin Chia (Airbyte) To "Willing to wait" -> probably not, and I also would not want to kill my Mongo server. To "Thoughts" -> there should be an API that allows us to set the schema for Mongo. For instance, there is a CData connector which allows you to do something similar (https://cdn.cdata.com/help/DGF/odbc/pg_DefinedSchemas.htm). @charles Though I agree that reading everything is wasteful, I would like to differ on not specifying any schema beyond { "type": "object" }. I think that might be okay for the Extract and Load phases, but I guess the Transform phase needs to know more about the schema, even for Basic Normalization. Thoughts?
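As a sketch of what such a set-schema API might accept (the field names here are hypothetical), an explicit schema gives Basic Normalization typed columns to work with, unlike the bare object schema above:

```json
{
  "type": "object",
  "properties": {
    "order_id": { "type": "string" },
    "total": { "type": "number" },
    "placed_at": { "type": "string" }
  }
}
```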

user

05/19/2021, 4:51 AM
@Davin Chia (Airbyte) It would be the workspace volume attached to airbyte-server?

user

05/19/2021, 5:33 AM
interesting point on ODBC allowing schemas to be defined; maybe a dumb question that might touch on what Charles was saying: if schema is important, what's the reason for not using a SQL db? is it legacy?

user

05/19/2021, 5:46 AM
Yes, it is kind of legacy. I have been thinking for a while about whether it is worth the effort to migrate to Postgres/MySQL, now that this use case has popped up. Schema is important once the data reaches the destination (say Snowflake/BigQuery), not much before.

user

05/19/2021, 5:47 AM
let me poke around first, I'm not entirely familiar with this part of our system

user

05/19/2021, 6:01 AM
Thanks for the pointers. Let me try updating the schema

user

05/19/2021, 6:18 AM
```
docker run -it --rm --volume airbyte_data:/data busybox
```
should get you into the right volume

user

05/19/2021, 6:21 AM
you want to do

```
# navigate to the config directory
$ cd data/config/
$ ls
DESTINATION_CONNECTION/           STANDARD_DESTINATION_DEFINITION/  STANDARD_SYNC/                    STANDARD_WORKSPACE/
SOURCE_CONNECTION/                STANDARD_SOURCE_DEFINITION/       STANDARD_SYNC_SCHEDULE/

$ cd STANDARD_SYNC
/data/config/STANDARD_SYNC $ ls
31f0495e-1251-499f-9a64-a1239a99609d.json

# look at the standard sync - I believe the json_schema field here is what you want to modify
/data/config/STANDARD_SYNC $ cat 31f0495e-1251-499f-9a64-a1239a99609d.json
{"prefix":"","sourceId":"617fd945-566d-4cc4-905b-2c2c31627cba","destinationId":"0a3cde72-0a83-479b-b1db-4024ebf9516f","connectionId":"31f0495e-1251-499f-9a64-a1239a99609d","name":"default","catalog":{"streams":[{"stream":{"name":"med_table","json_schema":{"type":"object","properties":{"date_prod":{"type":"string"},"code":{"type":"string"},"len":{"type":"string"},"kind":{"type":"string"},"title":{"type":"string"},"did":{"type":"string"}}},"supported_sync_modes":["full_refresh","incremental"],"default_cursor_field":[],"source_defined_primary_key":[["code"]],"namespace":"public"},"sync_mode":"full_refresh","cursor_field":[],"destination_sync_mode":"append","primary_key":[["code"]]}]},"status":"active"}
```

user

05/19/2021, 6:21 AM
I haven't done the full end-to-end cycle so I'm not 100% sure modifying this here works

user

05/20/2021, 3:21 AM
I tried it and it worked. Thanks 🙏