# contributing-to-airbyte
u
Hey guys, I have this MongoDB collection with thousands of docs:
• connected Airbyte to it
• it tries to find the schema
• it returns me some fields, but it does not include fields that were seldom used in the collection
I checked the code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mongodb/lib/mongodb_types_explorer.rb
Apparently the approach it takes is to just check the first 1000 records, so this isn't even sampling of any sort. Correct? Is there a way to edit the fields (add to them) in the UI or backend API for the Mongo source connector? For now I am trying to read https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html but would appreciate any help
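For context, here is roughly what "check the first 1000 records" amounts to, as a minimal Python/pymongo sketch (the real connector is Ruby; the connection URI and db/collection names here are made up):
```python
# Illustrative sketch of first-N-records type discovery, NOT the actual
# connector code. Any field that never appears in the first 1000 documents
# is never seen at all.
from collections import defaultdict
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
collection = client["mydb"]["mycollection"]        # hypothetical names

discovered = defaultdict(set)
for doc in collection.find().limit(1000):  # first 1000 docs, in natural order
    for field, value in doc.items():
        discovered[field].add(type(value).__name__)

print(dict(discovered))

# A random sample would at least be unbiased (though still not exhaustive):
# collection.aggregate([{"$sample": {"size": 1000}}])
```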
u
Hi Aditya! Yep, that's right, this is done serially today without any sampling. Curious: if we had an option to do a full scan so types are 100% accurate, would you be willing to wait?
u
It's not possible to change the discovered schema via the API. I think you'd have to exec into a running pod and edit the schema saved to the volume. @charles (since I see you online) does that make sense?
u
yeah. agreed. though i got to say that's pretty tricky to do.
u
are we thinking about mongo wrong? isn't the point of mongo that it's schema-less?
u
this approach of sampling the data is always going to be wrong.
u
and reading all of the data to get the schema seems incredibly wasteful (and also not really a guarantee of anything since the next record that comes along could have a new field)
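to make that concrete with made-up documents:
```
# the first 1000 documents all look like this:
{ "_id": 1, "sku": "A-100", "price": 9.99 }
# document 1001 introduces a field no up-front scan could have predicted:
{ "_id": 1001, "sku": "B-200", "price": 4.99, "discount": 0.1 }
```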
u
good point on reading everything = wasteful
u
any thoughts Aditya?
u
naively it feels like the stream name should just be the collection name, and then the schema for the stream should be
{ "type": "object" }
u
or something like that anyway 🤷‍♀️
u
@Davin Chia (Airbyte) To "Willing to wait" -> probably not and also would not want to kill my mongo server To "Thoughts" -> There should be an API that allows us to set the schema for mongo. For instance there is a cdata connector which allows to do something similar (https://cdn.cdata.com/help/DGF/odbc/pg_DefinedSchemas.htm). @charles Though I agree to the fact that reading everything is wasteful, I would like to differ at not specifying the schema
{ "type": "object" }
. I think that might be okay for the Extract and Load phases, but the Transform phase needs to know more about the schema, even for Basic Normalization. Thoughts?
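For example, with a typed schema like this (field names are hypothetical), normalization can emit properly typed columns in the warehouse instead of one opaque JSON blob:
```json
{
  "type": "object",
  "properties": {
    "order_id": { "type": "string" },
    "amount": { "type": "number" },
    "created_at": { "type": "string", "format": "date-time" }
  }
}
```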
u
@Davin Chia (Airbyte) It would be the workspace volume attached to airbyte-server?
u
interesting point on ODBC allowing schemas to be defined. Maybe a dumb question that might touch on what Charles was saying: if schema is important, what's the reason for not using a SQL db? Is it legacy?
u
Yes, it is kind of legacy. I have been thinking for a while about whether it is worth the effort to migrate to Postgres/MySQL, now that this use case has popped up. The schema is important once data reaches the destination (say Snowflake/BigQuery), not much before
u
let me poke around first, I'm not entirely familiar with this part of our system
u
Thanks for the pointers. Let me try updating the schema
u
```
docker run -it --rm --volume airbyte_data:/data busybox
```
should get you into the right volume
u
you want to do
```
# navigate to the config directory
$ cd data/config/
$ ls
DESTINATION_CONNECTION/           STANDARD_DESTINATION_DEFINITION/  STANDARD_SYNC/                    STANDARD_WORKSPACE/
SOURCE_CONNECTION/                STANDARD_SOURCE_DEFINITION/       STANDARD_SYNC_SCHEDULE/

$ cd STANDARD_SYNC
/data/config/STANDARD_SYNC $ ls
31f0495e-1251-499f-9a64-a1239a99609d.json

# look at the standard sync - I believe the json_schema field here is what you want to modify
/data/config/STANDARD_SYNC $ cat 31f0495e-1251-499f-9a64-a1239a99609d.json
{"prefix":"","sourceId":"617fd945-566d-4cc4-905b-2c2c31627cba","destinationId":"0a3cde72-0a83-479b-b1db-4024ebf9516f","connectionId":"31f0495e-1251-499f-9a64-a1239a99609d","name":"default","catalog":{"streams":[{"stream":{"name":"med_table","json_schema":{"type":"object","properties":{"date_prod":{"type":"string"},"code":{"type":"string"},"len":{"type":"string"},"kind":{"type":"string"},"title":{"type":"string"},"did":{"type":"string"}}},"supported_sync_modes":["full_refresh","incremental"],"default_cursor_field":[],"source_defined_primary_key":[["code"]],"namespace":"public"},"sync_mode":"full_refresh","cursor_field":[],"destination_sync_mode":"append","primary_key":[["code"]]}]},"status":"active"}
```
u
I haven't done the full end-to-end cycle, so I'm not 100% sure that modifying this here works
u
I tried it and it worked. Thanks 🙏