# ask-ai
Is there any need to do database maintenance on the Airbyte databases, such as deleting old rows from any of the tables?
Yes, there is a need for database maintenance on Airbyte databases, particularly for managing job history logs and temporary tables. Here are some key points to consider:

1. Job history logs: The Airbyte database can grow significantly over time due to job history logs. It's important to manage this growth to prevent disk space issues. As mentioned in a GitHub issue, after just a week of data processing with one connection, the Airbyte Postgres DB had already consumed 1GB [GitHub Issue](https://github.com/airbytehq/airbyte/issues/39358).
2. Temporal database: You can set the `TEMPORAL_HISTORY_RETENTION_IN_DAYS` parameter to help clean up the Temporal database [GitHub Issue](https://github.com/airbytehq/airbyte/issues/39358).
3. Manual cleanup: For the Airbyte database (db-airbyte), there isn't an automatic cleanup process. Users have reported manually deleting logs folders to free up space [Forum Post](https://discuss.airbyte.io/t/3303).
4. Temporary tables: It's okay to delete `_AIRBYTE_TMP` tables. These should ideally be deleted after a sync is completed, but if they persist, you can safely remove them [Forum Post](https://discuss.airbyte.io/t/1388).
5. Clearing data: Airbyte provides a "Clear data" feature that allows you to remove all previously synced data from the destination for a connection. This can be done for all streams or a single stream [Clearing your data](https://docs.airbyte.com/operator-guides/clear#steps-to-clear-data).
6. Orphaned tables: If you have any orphaned tables or files that are no longer being synced to, they should be cleaned up separately, as Airbyte will not clean them up automatically [Clear behavior](https://docs.airbyte.com/operator-guides/clear#clear-behavior).

It's important to note that there isn't a built-in automatic cleanup process for all aspects of the Airbyte database. Regular monitoring and manual intervention may be necessary to manage database growth effectively.
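Before deleting anything, it helps to see which internal tables are actually driving the growth. The following is only a sketch: it assumes the default docker-compose deployment described later in this thread, where the internal Postgres runs in the `airbyte-db` container with user `docker` and database `airbyte`.

```bash
# Sketch: list the ten largest tables in the internal Airbyte database.
# Assumes the default docker-compose setup (container "airbyte-db", user "docker",
# database "airbyte"); adjust the names if your deployment differs.
docker exec -ti airbyte-db psql -U docker -d airbyte -c "
  SELECT relname AS table_name,
         pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_catalog.pg_statio_user_tables
  ORDER BY pg_total_relation_size(relid) DESC
  LIMIT 10;"
```

In most reports of runaway growth it is the job-history tables that dominate, which is why the retention settings above are the usual fix.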
@kapa.ai after a month of running airbyte the storage for my container is at 40% usage. How can I remove old data to prevent my instance from running out of storage?
To address your concern about storage usage in Airbyte, there are a few approaches you can take to remove old data and prevent your instance from running out of storage:

1. Configure log retention: You can set the `TEMPORAL_HISTORY_RETENTION_IN_DAYS` environment variable to control how long Airbyte keeps logs. For example, setting it to a lower value such as 7 days instead of the default 30 days can help reduce storage usage [Airbyte Rotation Logs](https://discuss.airbyte.io/t/332).
2. Clear data for specific streams or connections: Airbyte allows you to clear data for individual streams or entire connections, either through the UI or the API. To clear data for a single stream, navigate to the Connection's status page, click the three grey dots next to the stream, and select "Clear data". To clear all streams in a connection, go to the connection's Settings tab and click "Clear data" [Clearing your data](https://docs.airbyte.com/operator-guides/clear#steps-to-clear-data).
3. Clean up Minio storage (if applicable): If you're using Kubernetes, Minio stores logs that can accumulate over time. You can clean these up by executing into the Minio pod and deleting files:
   ```bash
   kubectl exec -it -n airbyte airbyte-minio-0 -- /bin/bash
   ```
   Then remove files in `/storage/airbyte-dev-logs/job-logging/workspace/` [How to clean up airbyte-minio and what's stored in minio?](https://discuss.airbyte.io/t/3101).
4. Adjust workspace retention settings: Although this feature may have been removed in newer versions, you can check whether your version still supports these environment variables: `MINIMUM_WORKSPACE_RETENTION_DAYS`, `MAXIMUM_WORKSPACE_RETENTION_DAYS`, and `MAXIMUM_WORKSPACE_SIZE_MB`. These can help manage workspace data retention [Airbyte Rotation Logs](https://discuss.airbyte.io/t/332).
5. Regular database maintenance: If you're using Postgres as a destination, remember that it doesn't automatically free up disk space. Consider running VACUUM operations regularly to reclaim space [Github Issue Comment](https://github.com/airbytehq/airbyte/issues/36095#issuecomment-1499330181).

Remember to be cautious when clearing data, as these operations cannot be undone. Always ensure you have the necessary backups before performing any data removal operations.
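As a concrete illustration of points 1 and 5, here is a minimal sketch for a docker-compose deployment. It assumes Airbyte reads its environment from a `.env` file next to `docker-compose.yml` (as in the default OSS setup), and the VACUUM example uses placeholder connection details for your own Postgres destination.

```bash
# Sketch: shorten Temporal/log retention (point 1). Assumes the default OSS
# docker-compose layout where Airbyte reads variables from .env.
echo "TEMPORAL_HISTORY_RETENTION_IN_DAYS=7" >> .env
docker compose up -d   # recreate the containers so the new value takes effect

# Sketch: reclaim dead space on a Postgres destination (point 5).
# Host, user, and database below are placeholders for your own destination.
psql -h my-destination-host -U my_user -d my_destination_db \
     -c "VACUUM (VERBOSE, ANALYZE);"
```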
@kapa.ai I'm running OSS on an EC2 instance. What would you recommend out of those options?
A message from kapa.ai
Based on your situation of running Airbyte OSS on an EC2 instance, I would recommend the following options:

1. Configure log retention: Set the `TEMPORAL_HISTORY_RETENTION_IN_DAYS` environment variable to a lower value. This can significantly reduce storage usage over time. According to the [Airbyte forum post](https://discuss.airbyte.io/t/2417), you can change this variable to control how long Airbyte keeps logs:
   ```bash
   TEMPORAL_HISTORY_RETENTION_IN_DAYS=7
   ```
   This keeps logs for 7 days instead of the default, which should help manage your storage usage.
2. Clear data for specific streams or connections: Regularly review and clear data for streams or connections that you no longer need. This can be done through the Airbyte UI as described in the previous answer.
3. Manage disk space: The [Airbyte documentation on scaling](https://docs.airbyte.com/operator-guides/scaling-airbyte#disk-space) recommends allocating a minimum of 30GB of disk space per node. Since you're running on EC2, you have the flexibility to increase your EBS volume size if needed. Be aware, however, that long-running syncs can produce a fair amount of logs, which can consume disk space.
4. Consider external storage: If you're consistently running into storage issues, consider using external storage for state and logging information. Airbyte recommends an object storage solution such as S3 for this purpose, as mentioned in the [Airbyte documentation on state and logging storage](https://docs.airbyte.com/deploying-airbyte/integrations/storage). This can offload some of the storage requirements from your EC2 instance.
5. Regular maintenance: Periodically check the `/var/lib/docker/volumes/airbyte_workspace/_data` directory, as this seems to be a common source of growing storage usage according to the [forum post](https://discuss.airbyte.io/t/2417). You might need to manually clean up this directory if it grows too large.

Remember to always back up your data before performing any cleanup operations, and monitor your EC2 instance's storage usage regularly to prevent unexpected outages due to full disks.
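To make point 5 actionable, here is a small sketch of the disk checks you could run on the EC2 host itself; the volume path is the one from the forum post above and may differ on your machine.

```bash
# Sketch: find out where the disk space is going on the EC2 host.
df -h /                                                        # overall root-volume usage
sudo du -sh /var/lib/docker/volumes/airbyte_workspace/_data    # workspace volume from the forum post
docker system df                                               # disk usage as Docker accounts for it
```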
@kapa.ai does airbyte 1.0+ have anything that takes care of these storage issues on its own?
A message from kapa.ai
Based on the provided knowledge sources, I don't have specific information about Airbyte 1.0+ having built-in features that automatically take care of storage issues. However, there are some relevant points to consider:

1. Log retention: Airbyte version 0.45.5 and later reimplemented a feature to control log retention. You can set the `TEMPORAL_HISTORY_RETENTION_IN_DAYS` variable to manage how long logs are kept [Forum Post: Quickly run out of storage](https://discuss.airbyte.io/t/2417). This helps in automatically managing log storage.
2. Workspace cleanup: Airbyte schedules periodic jobs to delete logs from syncs. By default, these logs are deleted every month or when their size exceeds 5GB [Airbyte Rotation Logs](https://discuss.airbyte.io/t/332). However, this information may be outdated, as the configuration mentioned in that post was removed along with the Scheduler.
3. Current approach: As of version 0.40.5, Airbyte cleans up files from the workspace by default every 30 days [Airbyte log sweeping](https://discuss.airbyte.io/t/2714). This is configurable, so you can retain logs for a shorter period by setting the environment variable `TEMPORAL_HISTORY_RETENTION_IN_DAYS=1`.

While these features help manage storage to some extent, they may not completely solve all storage issues automatically. It's still recommended to monitor your storage usage and implement additional measures if needed, especially for long-running syncs or deployments with multiple connectors. For the most up-to-date information on storage management features in Airbyte 1.0+, I recommend checking the latest official Airbyte documentation or reaching out to Airbyte support directly.
@kapa.ai I've seen some tips about running sql statements against the internal airbyte db but I don't know how to connect and do that
A message from kapa.ai
Based on the information provided in the knowledge sources, I can guide you on how to connect and run SQL statements against the internal Airbyte database. This process is specifically for accessing the default database located in the Docker container `airbyte-db`. Here's how you can do it:

1. First, ensure you're using the default `airbyte-db` Postgres database. If you've set up an external database, these instructions may not apply.
2. The default credentials for the database are specified in the `.env` file used to run Airbyte. By default, they are:
   ```bash
   DATABASE_USER=docker
   DATABASE_PASSWORD=docker
   DATABASE_DB=airbyte
   ```
   If you've changed these defaults, you'll need to use your custom values instead.
3. To access the database using `psql`, run the following command in your terminal:
   ```bash
   docker exec -ti airbyte-db psql -U docker -d airbyte
   ```
   This command connects you to the Postgres database inside the `airbyte-db` container [Accessing the default database located in docker airbyte-db](https://docs.airbyte.com/operator-guides/configuring-airbyte-db#accessing-the-default-database-located-in-docker-airbyte-db).
4. Once connected, you can run SQL queries against the various tables in the Airbyte database. Some of the important tables include: `workspace`, `actor_definition`, `actor`, `actor_oauth_parameter`, `operation`, `connection`, `connection_operation`, and `state`.
5. To see all available tables, you can use the `\d` command in psql.
6. To view the structure of a specific table, use `\d table_name`, for example `\d actor`.

Remember that this database contains important configuration and operational data for Airbyte. Be cautious when making any changes, and it's recommended to only perform read operations unless you're absolutely sure about what you're doing. Also, note that no actual data from your source or destination connectors is stored in this internal database [Documentation > Operator-guides > Configuring the Airbyte Database](https://docs.airbyte.com/operator-guides/configuring-airbyte-db).
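As an example of the kind of read-only query that is safe to start with, here is a sketch using the tables listed in step 4. The column names (`id`, `name`, `status`) are assumptions about the current schema, so verify them first with `\d workspace` and `\d connection` inside psql.

```bash
# Sketch: a couple of read-only queries against the internal database.
# Table names come from the list above; column names are assumptions, so
# check them with \d workspace and \d connection before relying on them.
docker exec -ti airbyte-db psql -U docker -d airbyte \
  -c "SELECT id, name FROM workspace;"
docker exec -ti airbyte-db psql -U docker -d airbyte \
  -c "SELECT id, name, status FROM connection ORDER BY name LIMIT 20;"
```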
@kapa.ai Error response from daemon: No such container: airbyte-db
A message from kapa.ai
It seems you're encountering an error when trying to access the `airbyte-db` container. This error means the Docker daemon cannot find a container named `airbyte-db`. There could be a few reasons for this:

1. Airbyte is not currently running: Make sure you've started Airbyte using the appropriate command (usually `docker-compose up` or `docker compose up`, depending on your Docker version).
2. The container name has changed: In some Airbyte versions or custom setups, the container name might be different.
3. You're using a different deployment method: If you're using Kubernetes or a cloud-managed version of Airbyte, the approach to accessing the database would be different.

To troubleshoot this issue:

1. First, check whether Airbyte is running. You can list all running Docker containers with:
   ```bash
   docker ps
   ```
   Look for containers with names starting with `airbyte_`.
2. If you don't see any Airbyte containers, try starting Airbyte. Navigate to your Airbyte directory and run:
   ```bash
   docker-compose up -d
   ```
3. If Airbyte is running but you still don't see the `airbyte-db` container, check the container names. The database container might have a slightly different name; look for a container with "db" in its name (see the sketch below).
4. If you're using a newer version of Airbyte, the container names might have changed. You can try using the container ID instead of the name. First, list all containers (including stopped ones):
   ```bash
   docker ps -a
   ```
   Then look for the container ID of the database container and use it in place of `airbyte-db` in the command.

If you're still having issues, it might be helpful to check your Airbyte version and deployment method, as the exact steps can vary. Also, ensure that you're running the command from the same environment where Airbyte is deployed. Remember, accessing the internal database directly should be done cautiously, as mentioned in the [Airbyte documentation](https://docs.airbyte.com/operator-guides/configuring-airbyte-db#accessing-the-default-database-located-in-docker-airbyte-db).
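If the container simply has a different name, a quick filter over the running containers will usually surface it; this is just a convenience sketch on top of the `docker ps` step above.

```bash
# Sketch: list running containers and keep only those with "db" in the name.
docker ps --format '{{.ID}}\t{{.Names}}' | grep -i db
```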
@kapa.ai I only see one container: `airbyte-abctl-control-plane`
A message from kapa.ai
Based on the information you've provided, it seems you're using the `abctl` tool to manage Airbyte, which uses a different deployment method than the standard Docker Compose setup. This explains why you're only seeing one container, named `airbyte-abctl-control-plane`. The `abctl` tool uses Kubernetes to deploy Airbyte, even when running locally. This means that instead of individual Docker containers, Airbyte components run as Kubernetes pods within a local Kubernetes cluster (likely managed by Kind). To interact with this setup:

1. First, ensure you have `kubectl` installed and configured to work with the local Kubernetes cluster created by `abctl`.
2. You can list the Airbyte pods by running:
   ```bash
   kubectl get pods -n airbyte-abctl
   ```
   This should show you the various Airbyte components running as pods [Documentation > Contributing-to-airbyte > Developing Locally > Prerequisites > Using abctl for Airbyte development Part 2](https://docs.airbyte.com/contributing-to-airbyte/developing-locally#using-abctl-for-airbyte-development).
3. To access the database in this setup, you'll need to find the database pod. It's likely named something like `airbyte-db-0`.
4. To connect to the database, you can use:
   ```bash
   kubectl exec -it -n airbyte-abctl airbyte-db-0 -- psql -U airbyte
   ```
   This command might need to be adjusted based on the exact pod name and database user (a combined sketch follows this list).
5. If you need to view the logs for a particular component, you can use:
   ```bash
   kubectl logs -n airbyte-abctl <pod-name>
   ```
   For example:
   ```bash
   kubectl logs -n airbyte-abctl airbyte-abctl-server-74465db7fd-gk25q
   ```

Remember, when using `abctl` you're working with a Kubernetes-based deployment, so the standard Docker commands won't work directly; you'll need to use Kubernetes commands (`kubectl`) to interact with the Airbyte components. If you're having issues with the `abctl` installation, you might want to check the [Github issue about abctl local install failing](https://github.com/airbytehq/airbyte/issues/44931), which discusses some common problems and solutions when using `abctl`.
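Putting steps 1-4 together for the storage question earlier in this thread, a sketch like the following would locate the database pod and open a psql session under an abctl deployment. The pod name, namespace, user, and database names are assumptions taken from this conversation, so confirm them against the `kubectl get pods` output first.

```bash
# Sketch for an abctl (kind-based) deployment; pod name, user, and database
# names are taken from this thread and may differ on your installation.
export KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig
kubectl get pods -n airbyte-abctl | grep -i db                 # confirm the database pod name
kubectl exec -it -n airbyte-abctl airbyte-db-0 -- \
  psql -U airbyte -d postgres -c '\l'                          # list the databases on the server
kubectl exec -it -n airbyte-abctl airbyte-db-0 -- \
  psql -U airbyte -d db-airbyte                                # open an interactive session
```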
@kapa.ai
E1028 17:13:04.893049 14761 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
E1028 17:13:04.894577 14761 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
E1028 17:13:04.896075 14761 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
E1028 17:13:04.897600 14761 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
E1028 17:13:04.899066 14761 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
A message from kapa.ai
Based on the error message you've provided, it appears you're encountering a connection refused error when trying to connect to the Kubernetes API server. This is a common issue when using `abctl` to manage Airbyte locally. Let's break down the problem and potential solutions:

1. The error suggests that the Kubernetes API server is not accessible at `localhost:8080`. This could be because:
   - The local Kubernetes cluster (likely managed by Kind) is not running
   - The API server is not exposed on the expected port
   - There's a network issue preventing the connection
2. This error is similar to the one reported in the [Github issue about abctl local install failing](https://github.com/airbytehq/airbyte/issues/40231), which indicates it's a known issue with the `abctl` tool.

To resolve this issue, you can try the following steps (a combined sketch follows this list):

1. Ensure that the local Kubernetes cluster is running. You can check this by running:
   ```bash
   kind get clusters
   ```
   If you don't see a cluster named `airbyte-abctl`, you may need to create it.
2. If the cluster exists, try to get the correct API server address and port:
   ```bash
   kubectl cluster-info
   ```
   This should show you the correct address and port for the Kubernetes API server.
3. Check whether the `abctl` kubeconfig file is correctly configured. It should be located at `~/.airbyte/abctl/abctl.kubeconfig`. You can try setting it explicitly:
   ```bash
   export KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig
   ```
4. If the issue persists, you might need to recreate the local Kubernetes cluster. You can do this by running:
   ```bash
   abctl local uninstall
   abctl local install
   ```
5. If you're still encountering issues, it might be worth checking your firewall settings to ensure it's not blocking the connection to localhost:8080.

Remember, `abctl` uses a local Kubernetes setup, so troubleshooting steps for standard Docker-based Airbyte installations may not apply here. If these steps don't resolve the issue, you might want to consider reporting it as a bug to the Airbyte GitHub repository, as it seems to be a recurring issue with the `abctl` tool. [Documentation > Deploying-airbyte > Troubleshooting abctl > Common Errors > Connection Refused](https://docs.airbyte.com/deploying-airbyte/troubleshoot-deploy#connection-refused) provides similar error messages and context, confirming that this is a known issue with `abctl`.
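The checks above can be run in one pass. Here is a sketch that points `kubectl` explicitly at the abctl kubeconfig (so it stops falling back to `localhost:8080`) and verifies that the kind node is actually up; the cluster and container names are the defaults mentioned earlier in the thread.

```bash
# Sketch: verify the kind cluster and talk to it via the abctl kubeconfig
# instead of the default localhost:8080 fallback.
kind get clusters                                               # expect "airbyte-abctl"
docker ps --filter name=airbyte-abctl-control-plane             # the kind node container should be running
kubectl --kubeconfig ~/.airbyte/abctl/abctl.kubeconfig cluster-info
kubectl --kubeconfig ~/.airbyte/abctl/abctl.kubeconfig get pods -n airbyte-abctl
```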
@kapa.ai what is in the different databases: db-airbyte, postgres, templates, and temporal?
A message from kapa.ai
Based on the knowledge sources provided, I can give you information about some of the databases you mentioned, though not all of them. Here is what I know:

1. db-airbyte (Config and Job databases): Airbyte uses two internal databases to store different types of data:
   - Config database: Stores data about connectors (sources and destinations), sync connections, and various Airbyte configuration objects [Documentation > Operator-guides > Configuring the Airbyte Database](https://docs.airbyte.com/operator-guides/configuring-airbyte-db).
   - Job database: Contains data about executions of Airbyte jobs and various runtime metadata. It also stores data about Temporal.io, the internal orchestrator used by Airbyte (tasks, workflow data, events, and visibility data) [Documentation > Operator-guides > Configuring the Airbyte Database](https://docs.airbyte.com/operator-guides/configuring-airbyte-db).
   It's important to note that no actual data from the source or destination connectors is stored in these internal databases.
2. Temporal database: While I don't have specific information about a separate "temporal" database, the Job database mentioned above does contain data related to Temporal.io, which is the internal orchestrator used by Airbyte [Documentation > Operator-guides > Configuring the Airbyte Database](https://docs.airbyte.com/operator-guides/configuring-airbyte-db).
3. Postgres: Postgres is the default database system used by Airbyte for its internal databases (Config and Job). It's described in the `docker-compose.yml` file as `airbyte/db` [Documentation > Operator-guides > Configuring the Airbyte Database](https://docs.airbyte.com/operator-guides/configuring-airbyte-db).

I don't have specific information about a "templates" database in the provided knowledge sources. To get more detailed or up-to-date information about these databases and their contents, I recommend checking the latest Airbyte documentation or reaching out to Airbyte support directly.
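If you want to see those databases (and their sizes) for yourself, a sketch like the following works against the internal Postgres server; it assumes the default docker-compose `airbyte-db` container and `docker` user, so for an abctl deployment run the same psql command via `kubectl exec` with your own credentials instead. Note that `template0` and `template1`, which will also show up in the output, are standard Postgres template databases rather than anything Airbyte-specific.

```bash
# Sketch: list every database on the internal Postgres server with its size.
# Assumes the docker-compose "airbyte-db" container and "docker" user; for an
# abctl deployment, run the same psql command via kubectl exec instead.
docker exec -ti airbyte-db psql -U docker -d postgres -c "
  SELECT datname AS database,
         pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
  ORDER BY pg_database_size(datname) DESC;"
```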