# ask-ai
Dhiraj Bhalerao:
@kapa.ai, the Shopify historical data fetch is taking a long time to pull all the data. We have installed Airbyte on a 32 GB EC2 instance, and the job spends a long time waiting for the data streams fetched via the GraphQL BULK API. Here is our configuration of the Shopify connection. Could you please let us know the ideal values for the following parameters so we can fetch the data more quickly?

- GraphQL BULK Date Range in Days
- BULK Job checkpoint (rows collected)
- BULK Job termination threshold
kapa.ai:
Sorry @Dhiraj Bhalerao, it looks like your question contains files, which I cannot process. Please reformulate your question without including any files.
Dhiraj Bhalerao:
Okay @kapa.ai, here is my question. The Shopify historical data fetch is taking a long time to pull all the data. We have installed Airbyte on a 32 GB EC2 instance, and the job spends a long time waiting for the data streams fetched via the GraphQL BULK API. Here is our configuration of the Shopify connection. Could you please let us know the ideal values for the following parameters so we can fetch the data more quickly?

- GraphQL BULK Date Range in Days
- BULK Job checkpoint (rows collected)
- BULK Job termination threshold
kapa.ai:
The ideal values for these parameters depend on the density of your data and the specific requirements of your use case. However, here are some general guidelines:

- GraphQL BULK Date Range in Days: this parameter defines the date range covered by each BULK job. If you have a high density of data (many records within a short period), you might want to set it to a smaller value to avoid overloading the server; if your data is spread out over a longer period, a larger value might be more efficient.
- BULK Job checkpoint (rows collected): this parameter is not mentioned in the provided knowledge sources, so I can't provide specific advice on it.
- BULK Job termination threshold: this parameter controls the limit at which a BULK job is canceled. If you have a lot of data for a specific stream, you might want to set it to a larger value to allow the job to run longer. For example, a comment on a GitHub issue suggests setting it to `10800`, which corresponds to 3 hours per single BULK job run ([source](https://github.com/airbytehq/airbyte/issues/39499)).

Remember to monitor your jobs and adjust these parameters as needed based on their performance. If you're still having trouble, you might want to reach out to the Airbyte community for more specific advice ([source](https://docs.airbyte.com/operator-guides/scaling-airbyte#feedback)).
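As a concrete illustration, these options can also be set programmatically rather than through the UI. Below is a minimal sketch using the Airbyte API; the base URL, the partial-update behavior of the PATCH endpoint, and the internal field names `bulk_window_in_days`, `job_checkpoint_interval`, and `job_termination_threshold` are assumptions mapped from the UI labels above, so verify them against your connector version's spec and your deployment's API reference before relying on this.

```python
import requests

# Minimal sketch: update the Shopify source's BULK tuning options via the
# Airbyte API. All identifiers below are assumptions/placeholders, not
# verified values; check your deployment's API docs and connector spec.
AIRBYTE_API = "https://api.airbyte.com/v1"  # adjust for a self-hosted instance
SOURCE_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical source id
API_TOKEN = "your-api-token"                        # hypothetical token

payload = {
    "configuration": {
        "sourceType": "shopify",
        # Assumed internal names for the UI labels in the question:
        "bulk_window_in_days": 30,          # GraphQL BULK Date Range in Days
        "job_checkpoint_interval": 200000,  # BULK Job checkpoint (rows collected)
        "job_termination_threshold": 10800, # BULK Job termination threshold (3 h)
    }
}

resp = requests.patch(
    f"{AIRBYTE_API}/sources/{SOURCE_ID}",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

If your Airbyte version requires the full source configuration on update rather than a partial patch, resend the complete configuration with only these three fields changed.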
Dhiraj Bhalerao:
@kapa.ai, we have a huge amount of data covering a long period. What would you suggest?
kapa.ai:
If you have a large amount of data spread over a long period, you might want to consider the following:

- GraphQL BULK Date Range in Days: set this to a larger value to cover a longer period per BULK job. Be aware, however, that this can produce larger jobs that take longer to complete.
- BULK Job termination threshold: set this to a larger value to allow each job to run longer. This helps ensure that each job has enough time to complete, especially when you're dealing with large amounts of data.

Additionally, you might want to optimize your data infrastructure for large datasets. For example, you could:

- Extend the retention period of CDC data: in SQL Server, only three days of data are retained in the change tables by default. Unless you are running very frequent syncs, it is suggested to increase this retention so that if a sync fails or is paused, there is still some bandwidth to resume from the last point in the incremental sync. This can be done using the stored procedure `sys.sp_cdc_change_job` ([source](https://docs.airbyte.com/integrations/sources/mssql#4-extend-the-retention-period-of-cdc-data)).
- Scale your setup: consider scaling your setup to handle more data, more sources, and more transformations, and optimize your configurations and resources to ensure smooth and efficient processing of larger datasets ([source](https://airbyte.com/tutorials/e-commerce-analytics-with-airbyte-dbt-dagster-and-bigquery)).

Remember, these are just suggestions and the ideal configuration depends on many factors. Monitor your jobs and adjust these parameters as needed based on their performance. If you're still having trouble, consider reaching out to the Airbyte community for more specific advice ([source](https://discuss.airbyte.io/t/3783)).
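To make the trade-off between the date window and the termination threshold concrete, here is a small back-of-the-envelope sketch. The history length and window size are hypothetical numbers chosen for illustration, not measurements from this setup:

```python
# Illustrative planning arithmetic for a Shopify BULK backfill.
# All inputs are hypothetical assumptions for the sketch.

history_days = 730               # ~2 years of historical data to backfill
window_days = 30                 # "GraphQL BULK Date Range in Days"
termination_threshold_s = 10800  # 3 hours per BULK job, as suggested above

jobs = -(-history_days // window_days)  # ceiling division: BULK jobs needed
worst_case_hours = jobs * termination_threshold_s / 3600

print(f"{jobs} BULK jobs, worst case ~{worst_case_hours:.0f} h if every job "
      f"runs up to the {termination_threshold_s // 3600} h threshold")
# -> 25 BULK jobs, worst case ~75 h if every job runs up to the 3 h threshold
```

Widening the window reduces the number of jobs but makes each one heavier; raising the termination threshold gives each job more time before it is canceled.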