# ask-ai
@kapa.ai So, I am trying to move data from an S3 folder to a Postgres database via Airbyte, which was working very well with all the connection setup configuration. But now I have come across a situation where a new type of JSON file has arrived that contains a list of JSON objects, like `[{"key1": "value1", "key2": "value2"}]`. I just want them loaded under a single column via Airbyte, and I can parse them using dbt later on. But the issue is, the default schemaless behavior of Airbyte is
```json
{
  "data": "object"
}
```
which leads to the issue that my JSON file contains a list of JSON objects, not a single object. And if I try to use Airbyte's infer schema option, it gives me a file configuration error. Has anyone faced this same issue?
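One pragmatic workaround (a hedged sketch, not Airbyte-specific advice) is to convert such files to newline-delimited JSON before Airbyte reads them, since file-based sources generally handle one-object-per-line input more predictably than a top-level array. The helper name `json_array_to_jsonl` below is hypothetical:

```python
import json

def json_array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Read a file containing a JSON array of objects and rewrite it
    as newline-delimited JSON (one object per line).
    Returns the number of records written.
    Hypothetical helper; not part of Airbyte."""
    with open(src_path) as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)
```

Run once per incoming file (for example in a small pre-processing step on the S3 side) before the sync picks it up.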
It sounds like you're encountering an issue with Airbyte's schema inference when dealing with JSON files that contain a list of JSON objects. Airbyte's default behavior is to treat the data as a single JSON object, which is causing problems with your new file format. To address this, you can use Airbyte's basic normalization feature to handle the JSON data appropriately. Here's a high-level overview:

1. Load the JSON Data as Raw Data: Ensure that Airbyte loads the JSON data into your Postgres database as raw data. This stores the JSON blob in a single column, which you can then parse using dbt.
2. Use Basic Normalization: Enable basic normalization in Airbyte to transform the JSON data into a format that can be easily consumed by your destination. This helps structure the data from the source into a relational table format.

Here's an example of how Airbyte handles basic normalization:
```sql
CREATE TABLE "_airbyte_raw_cars" (
  -- metadata added by airbyte
  "_airbyte_ab_id" VARCHAR, -- uuid assigned by connectors to each row written to the destination
  "_airbyte_emitted_at" TIMESTAMP WITH TIME ZONE, -- time at which the record was emitted
  "_airbyte_data" JSONB -- data stored as a JSON blob
);
```
And the normalized table would look like this:
```sql
CREATE TABLE "cars" (
  "_airbyte_ab_id" VARCHAR,
  "_airbyte_emitted_at" TIMESTAMP WITH TIME ZONE,
  "_airbyte_cars_hashid" VARCHAR,
  "_airbyte_normalized_at" TIMESTAMP WITH TIME ZONE,

  -- data from source
  "make" VARCHAR,
  "model" VARCHAR,
  "horsepower" INTEGER
);
```
3. Transform Data Using dbt: Once the data is loaded into your Postgres database, you can use dbt to parse and transform the JSON data. Here’s an example of how you can use dbt to parse a JSON blob stored in a single column:
```sql
{{ config(alias="covid_epidemiology_ab1", schema="_airbyte_quarantine", tags=["top-level-intermediate"]) }}
-- SQL model to parse JSON blob stored in a single column and extract into
-- separate field columns as described by the JSON Schema
select
  {{ json_extract_scalar('_airbyte_data', ['key']) }} as {{ adapter.quote('key') }},
  {{ json_extract_scalar('_airbyte_data', ['date']) }} as {{ adapter.quote('date') }},
  {{ json_extract_scalar('_airbyte_data', ['new_tested']) }} as new_tested,
  {{ json_extract_scalar('_airbyte_data', ['new_deceased']) }} as new_deceased,
  {{ json_extract_scalar('_airbyte_data', ['total_tested']) }} as total_tested,
  {{ json_extract_scalar('_airbyte_data', ['new_confirmed']) }} as new_confirmed,
  {{ json_extract_scalar('_airbyte_data', ['new_recovered']) }} as new_recovered,
  {{ json_extract_scalar('_airbyte_data', ['total_deceased']) }} as total_deceased,
  {{ json_extract_scalar('_airbyte_data', ['total_confirmed']) }} as total_confirmed,
  {{ json_extract_scalar('_airbyte_data', ['total_recovered']) }} as total_recovered,
  _airbyte_emitted_at
from {{ source('quarantine', '_airbyte_raw_covid_epidemiology') }}
-- covid_epidemiology
```
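As a rough illustration of what that dbt parsing step amounts to, here is a minimal Python sketch (the `extract_rows` helper is hypothetical, not an Airbyte or dbt API). It expands blobs that may be either a single object or a list of objects and pulls out scalar fields, roughly what Postgres's `jsonb_array_elements()` followed by `->>` extraction would do in SQL:

```python
import json

def extract_rows(raw_blobs, fields):
    """Approximate in Python what the dbt model above does in SQL:
    take raw JSON blobs (each either a single object or a list of
    objects), expand lists into individual records, and pull out the
    requested scalar fields. Hypothetical helper for illustration."""
    out = []
    for blob in raw_blobs:
        data = json.loads(blob)
        # A list blob becomes several records; a single object stays one.
        records = data if isinstance(data, list) else [data]
        for rec in records:
            out.append({f: rec.get(f) for f in fields})
    return out
```

The list-expansion step is the part that matters for your case: once the whole array sits in one raw column, each element can still become its own row downstream.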
For more detailed information, you can refer to the [Basic Normalization documentation](https://docs.airbyte.com/using-airbyte).
@kapa.ai The 1st step you mentioned is not working. It is not able to load the raw data in the first place.
A message from kapa.ai
If Airbyte is not able to load the raw data in the first place, it might be due to the schema inference issue you mentioned. Here are some steps you can take to troubleshoot and resolve this:

1. Check the JSON File Format: Ensure that your JSON file is correctly formatted. Since you mentioned that the JSON file contains a list of JSON objects, it should look something like this:
```json
[
  {"key1": "value1", "key2": "value2"},
  {"key1": "value3", "key2": "value4"}
]
```
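To make that check concrete, a small sanity check along these lines (a sketch; `check_json_array_of_objects` is a hypothetical helper, not part of Airbyte) can diagnose whether a file actually matches the expected array-of-objects shape before you point a connector at it:

```python
import json

def check_json_array_of_objects(text: str):
    """Return (ok, message) describing whether `text` is a JSON array
    whose elements are all objects. Hypothetical diagnostic helper."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    if not isinstance(data, list):
        return False, f"top-level value is {type(data).__name__}, not a list"
    bad = [i for i, item in enumerate(data) if not isinstance(item, dict)]
    if bad:
        return False, f"elements at indexes {bad} are not objects"
    return True, "ok"
```

Note that something like `["key1": "value1"]` fails this check at the parse stage: a colon is not valid inside a JSON array, so the file would be rejected before schema inference even runs.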
2. Disable Schema Inference: If schema inference is causing issues, you can try disabling it and loading the data as raw JSON. This can be done by configuring the source connector to treat the data as a single JSON blob.
3. Use a Custom Schema: If disabling schema inference is not an option, you can define a custom schema that matches the structure of your JSON data. This helps Airbyte understand how to parse and load the data.
4. Review Airbyte Logs: Check the Airbyte logs for any errors or warnings that might indicate why the data is not being loaded. This can provide insight into what is going wrong.

Here's an example of how you can configure the source connector to load raw JSON data:
```json
{
  "source": {
    "name": "S3",
    "config": {
      "bucket": "your-bucket-name",
      "path": "path/to/your/json/files",
      "format": "json",
      "schema": {
        "type": "object",
        "properties": {
          "data": {
            "type": "array",
            "items": {"type": "object"}
          }
        }
      }
    }
  },
  "destination": {
    "name": "Postgres",
    "config": {
      "host": "your-postgres-host",
      "port": 5432,
      "database": "your-database",
      "username": "your-username",
      "password": "your-password"
    }
  }
}
```
If you continue to face issues, you can refer to the [Airbyte documentation](https://docs.airbyte.com) for more detailed troubleshooting steps or seek help from the Airbyte community on the [Airbyte forum](https://discuss.airbyte.io). If you have specific error messages or logs, please share them, and I can provide more targeted assistance.
@kapa.ai What should I put in the input schema field if I am using only the UI?
To handle the issue of loading JSON files that contain a list of JSON objects into a single column via the Airbyte UI, you can specify a custom input schema. This helps Airbyte understand how to parse and load your data correctly. Here's how to configure it in the Airbyte UI:

1. Navigate to the Source Configuration: Go to the source configuration page for your S3 connector in the Airbyte UI.
2. Specify the Input Schema: In the input schema field, define a custom schema that matches the structure of your JSON data. Since you want to load the entire JSON list into a single column, you can use the following schema:
```json
{
  "type": "array",
  "items": {"type": "object"}
}
```
3. Save and Test the Configuration: Save the configuration and test the connection to ensure that Airbyte can correctly parse and load the data.

Here's a step-by-step example:

1. Open the Airbyte UI and navigate to your S3 source configuration.
2. Find the Input Schema Field: Look for the field where you can specify the input schema.
3. Enter the Custom Schema: Copy and paste the following JSON schema into the input schema field:
```json
{
  "type": "array",
  "items": {"type": "object"}
}
```
4. Save the Configuration: Click the save button to apply the changes.
5. Test the Connection: Test the connection to ensure that Airbyte can correctly parse and load the JSON data.

By specifying this custom schema, you instruct Airbyte to treat the input data as an array of JSON objects, which should resolve the issue of loading the data into a single column. For more details on providing a custom schema, you can refer to the [User Schema section](https://docs.airbyte.com/integrations/sources/microsoft-onedrive#user-schema) in the Airbyte documentation.
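To illustrate the end state this setup aims for, here is a minimal sketch using SQLite's TEXT column as a stand-in for Postgres's JSONB (the table and column names mirror Airbyte's raw-table convention, but the snippet itself is illustrative, not Airbyte code). The whole JSON list lands in one `_airbyte_data` column and is parsed back out later, as dbt would do:

```python
import json
import sqlite3

# In-memory SQLite standing in for the Postgres destination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _airbyte_raw_files (_airbyte_data TEXT)")

# The entire list of objects is stored as one blob in a single column.
payload = [
    {"key1": "value1", "key2": "value2"},
    {"key1": "value3", "key2": "value4"},
]
conn.execute("INSERT INTO _airbyte_raw_files VALUES (?)",
             (json.dumps(payload),))

# Later (e.g. in a dbt model) the blob is read back and expanded.
(stored,) = conn.execute(
    "SELECT _airbyte_data FROM _airbyte_raw_files").fetchone()
records = json.loads(stored)
```

In Postgres itself, the equivalent expansion would typically be done with `jsonb_array_elements()` inside the dbt model rather than in application code.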