Aditya Guru

05/19/2021, 3:08 AM
Hey guys, I have this MongoDB collection with thousands of docs:
• connected Airbyte to it
• it tries to find the schema
• it returns me some fields
The returned fields do not include fields that were seldom used in the collection. I checked the code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mongodb/lib/mongodb_types_explorer.rb Apparently the approach it takes is to just check the first 1000 records, so this is not even sampling of any sort. Correct? Is there a way to edit the fields (add to them) via the UI or backend API for the Mongo source connector? For now I am trying to read https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html but would appreciate any help.
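For illustration, a minimal sketch of this first-1000-documents approach, written in Python with pymongo rather than the connector's actual Ruby; the connection URI and db/collection names are placeholders:

```python
from collections import defaultdict
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["mydb"]["mycollection"]        # placeholder db/collection

# Inspect only the first 1000 documents, as the linked explorer does.
# A field that first appears in document 1001 is never discovered.
field_types = defaultdict(set)
for doc in collection.find().limit(1000):
    for field, value in doc.items():
        field_types[field].add(type(value).__name__)

print(dict(field_types))
```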

user

05/19/2021, 4:08 AM
Hi Aditya! Yep, that's right, this is done sequentially today, without sampling. Curious: if we had an option to do a full scan so types are 100% accurate, would you be willing to wait?

user

05/19/2021, 4:09 AM
It's not possible to change the discovered schema via the API. I think you would have to exec into a running pod and check the schema saved to the volume. @charles (since I see you online) does that make sense?

user

05/19/2021, 4:15 AM
yeah. agreed. though i got to say that's pretty tricky to do.

user

05/19/2021, 4:18 AM
are we thinking about mongo wrong? isn't the point of mongo that it's schema-less?

user

05/19/2021, 4:18 AM
this approach of sampling the data is always going to be wrong.

user

05/19/2021, 4:19 AM
and reading all of the data to get the schema seems incredibly wasteful (and also not really a guarantee of anything since the next record that comes along could have a new field)

user

05/19/2021, 4:22 AM
good point on reading everything = wasteful

user

05/19/2021, 4:22 AM
any thoughts Aditya?

user

05/19/2021, 4:23 AM
naively it feels like the stream name should be the collection name, and then the schema for the stream is
{ "type": "object" }

user

05/19/2021, 4:23 AM
or something like that anyway 🤷‍♀️

user

05/19/2021, 4:39 AM
@Davin Chia (Airbyte) To "Willing to wait" -> probably not, and I also would not want to kill my Mongo server. To "Thoughts" -> there should be an API that allows us to set the schema for Mongo. For instance, there is a CData connector which allows you to do something similar (https://cdn.cdata.com/help/DGF/odbc/pg_DefinedSchemas.htm). @charles Though I agree that reading everything is wasteful, I would like to differ on not specifying any schema beyond { "type": "object" }. I think that might be okay for the Extract and Load phases, but I guess the Transform phase needs to know more about the schema, even for Basic Normalization. Thoughts?
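As a sketch of what such a set-schema API might accept (the field names here are hypothetical), an explicit schema gives Basic Normalization typed columns to work with, unlike the bare object schema above:

```json
{
  "type": "object",
  "properties": {
    "order_id": { "type": "string" },
    "total": { "type": "number" },
    "placed_at": { "type": "string" }
  }
}
```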

user

05/19/2021, 4:51 AM
@Davin Chia (Airbyte) It would be the workspace volume attached to airbyte-server?

user

05/19/2021, 5:33 AM
interesting point on ODBC allowing schemas to be defined; maybe a dumb question that might touch on what Charles was saying: if schema is important, what's the reason for not using a SQL db? is it legacy?

user

05/19/2021, 5:46 AM
Yes, it is kind of legacy. I have been thinking for a while about whether it is worth the effort to migrate to Postgres/MySQL, now that this use case has popped up. Schema is important once the data reaches the destination (say Snowflake/BigQuery), not much before.

user

05/19/2021, 5:47 AM
let me poke around first, I'm not entirely familiar with this part of our system

user

05/19/2021, 6:01 AM
Thanks for the pointers. Let me try updating the schema

user

05/19/2021, 6:18 AM
```
docker run -it --rm --volume airbyte_data:/data busybox
```
should get you into the right volume

user

05/19/2021, 6:21 AM
you want to do

```
# navigate to the config directory
$ cd data/config/
$ ls
DESTINATION_CONNECTION/           STANDARD_DESTINATION_DEFINITION/  STANDARD_SYNC/                    STANDARD_WORKSPACE/
SOURCE_CONNECTION/                STANDARD_SOURCE_DEFINITION/       STANDARD_SYNC_SCHEDULE/

$ cd STANDARD_SYNC
/data/config/STANDARD_SYNC $ ls
31f0495e-1251-499f-9a64-a1239a99609d.json

# look at the standard sync - I believe the json_schema field here is what you want to modify
/data/config/STANDARD_SYNC $ cat 31f0495e-1251-499f-9a64-a1239a99609d.json
{"prefix":"","sourceId":"617fd945-566d-4cc4-905b-2c2c31627cba","destinationId":"0a3cde72-0a83-479b-b1db-4024ebf9516f","connectionId":"31f0495e-1251-499f-9a64-a1239a99609d","name":"default","catalog":{"streams":[{"stream":{"name":"med_table","json_schema":{"type":"object","properties":{"date_prod":{"type":"string"},"code":{"type":"string"},"len":{"type":"string"},"kind":{"type":"string"},"title":{"type":"string"},"did":{"type":"string"}}},"supported_sync_modes":["full_refresh","incremental"],"default_cursor_field":[],"source_defined_primary_key":[["code"]],"namespace":"public"},"sync_mode":"full_refresh","cursor_field":[],"destination_sync_mode":"append","primary_key":[["code"]]}]},"status":"active"}
```

user

05/19/2021, 6:21 AM
I haven't done the full end-to-end cycle so I'm not 100% sure modifying this here works

user

05/20/2021, 3:21 AM
I tried it and it worked. Thanks 🙏