Slackbot
09/01/2023, 2:50 PM
Ionut Andrei Oanca
09/01/2023, 2:51 PM
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
  "entries": [
    {
      "url": "postgresql://postgres@example_db:5432/accounts",
      "config": {
        "fetcher": "PostgresFetchProvider",
        "query": "SELECT * from profiles;",
        "connection_params": {
          "password": "postgres"
        }
      },
      "topics": [
        "policy_data"
      ],
      "dst_path": "accounts/profiles"
    }
  ]
}'
Ionut Andrei Oanca
09/01/2023, 2:51 PM
Ionut Andrei Oanca
09/01/2023, 2:55 PM
Or Weis
09/01/2023, 3:19 PM
asyncpg
A more exact answer would require some investigation.
Here are a few workarounds to consider:
1. Break down the large select into multiple queries and build up the data you need with multiple update entries
   a. Make sure to use a different dst_path for each entry
2. Create a new data-fetcher that uses an external process to read the data from the DB and then passes it to the fetcher itself
   a. This could also be an optional feature to add to the current Postgres fetcher (if you feel like contributing back)
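A minimal sketch of workaround 1, assuming the table has a stable sort column (the `id` column, the `chunk_N` path suffix, and the helper name `build_chunked_entries` are hypothetical, not part of OPAL). It only builds the JSON payload for POST /data/config, reusing the entry shape from the curl call earlier in the thread. Policies would then have to read across the per-chunk paths.

```python
import json


def build_chunked_entries(base_url, table, total_rows, chunk_size, dst_root,
                          order_by="id", topics=("policy_data",)):
    """Split one large SELECT into several entries, each with its own dst_path.

    `order_by` is an assumed stable sort column; without a deterministic
    order, LIMIT/OFFSET chunks may overlap or skip rows.
    """
    entries = []
    for index, offset in enumerate(range(0, total_rows, chunk_size)):
        query = (
            f"SELECT * FROM {table} ORDER BY {order_by} "
            f"LIMIT {chunk_size} OFFSET {offset};"
        )
        entries.append({
            "url": base_url,
            "config": {"fetcher": "PostgresFetchProvider", "query": query},
            "topics": list(topics),
            # Hypothetical naming scheme: one destination path per chunk.
            "dst_path": f"{dst_root}/chunk_{index}",
        })
    return {"entries": entries}


if __name__ == "__main__":
    payload = build_chunked_entries(
        "postgresql://postgres@example_db:5432/accounts",
        "profiles",
        total_rows=1_000_000,
        chunk_size=400_000,
        dst_root="accounts/profiles",
    )
    # The result can be sent as the --data-raw body of the curl call.
    print(json.dumps(payload, indent=2))
```

Connection parameters (such as the password in the original entry) are left out here and would need to be added to each entry's config.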
Hope this helps
Edoardo Viviani
09/01/2023, 4:00 PM
asyncpg as we observed the same behaviour in our mongo fetcher using motor (we're just trying to make sure it works as expected before publishing it)
Or Weis
09/01/2023, 4:19 PM
"What would probably be a nice to have is the possibility for fetchers to push data progressively into opal instead of having to return a single big dataset as result of the procedure, even though i see complications coming from this approach, again, data integrity would be hard to guarantee if the fetcher fails after it has partially pushed some data into opal, but it would be a way to let the fetcher handle the chunking of a single update trigger."
This is something we (in particular @Shaul Kremer) are working on for the next major release of OPAL - the idea there is to have a virtual document both the client and the server can keep track of as it's being assembled from various sources.
"Also, i think that we should not totally blame asyncpg as we observed the same behavior in our mongo fetcher using motor"
Totally agree, I just said it's my prime suspect - which isn't saying it's definitely the culprit. What the two libs you mention have in common is external C code managing memory to/from Python - which is a known sensitive area for the type of issues described. CC: @Asaf Cohen
Ionut Andrei Oanca
09/08/2023, 12:50 PM
- OPAL_FETCHING_WORKER_COUNT=2
- GOMAXPROCS=1
- GOGC=1
- GOMEMLIMIT=3500
we managed to keep the opal-client container well under the 6GB cap, while still loading a 120MB JSON file with one million records.
One big mystery still remains though: once the system has warmed up and loaded the initial data, the memory used by opal-client doubles when triggering an update. OPA does this too but frees the memory over time once the GC kicks in. opal-client, on the other hand, seems to keep a copy of the previous data, like some sort of FIFO of size 2.
For these new tests we used only the HttpFetcher, in order to reduce the variables at play and better pinpoint the source of these behaviours. If you wish, I can invite you to my testbench repository on GitHub, just let me know.
Questions time!
Maybe more like a confirmation - there's no way to have multiple entries (in the config) update the same dst_path but have them join/merge the data into the same final object, right?
Something like OPAL_SPLIT_ROOT_DATA where dst_path="/map". I suppose that such a feature would need to be implemented either by OPAL (by "manually" targeting "/map/key1", where key1 is a field of the result from the fetcher) or by OPA, by exposing in the REST API a method to merge the given input with the object at the specified target.
# in config.entries
entry1: { dst_path: "/map", data: { "key1": object } }
entry2: { dst_path: "/map", data: { "key2": object } }
# final data
/map = { "key1": object, "key2": object }
OPA seems to be doing this merge when using the CLI but does not have something similar in the REST API.
We could, I guess, use the PATCH save_method of POST /data/config, but what doesn't convince us is:
• whether the save_method can be used with any Fetch Provider or just the Http one
• and the overhead generated by the extra fields "op": "add", "path": "/key1" when we're handling this many entries (we would be partitioning the 1M rows at this point, so 1M / n with n carefully tuned)
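To make the overhead question concrete, here is a rough sketch that turns fetched rows into JSON Patch "add" operations keyed by one field and compares the serialized size against a plain keyed object. The helper names and the `id` key field are hypothetical. That the patch paths are resolved relative to the entry's dst_path is an assumption, not something confirmed in this thread.

```python
import json


def escape_pointer_token(key):
    """Escape a key for use in a JSON Pointer path (RFC 6901).

    "~" must be replaced before "/" so that "~1" is not double-escaped.
    """
    return str(key).replace("~", "~0").replace("/", "~1")


def rows_to_patch(rows, key_field):
    """Build one JSON Patch "add" operation per row, keyed by `key_field`."""
    return [
        {
            "op": "add",
            "path": "/" + escape_pointer_token(row[key_field]),
            "value": row,
        }
        for row in rows
    ]


def patch_overhead_bytes(rows, key_field):
    """Extra serialized characters of the patch form vs. a plain keyed object."""
    plain = json.dumps({str(row[key_field]): row for row in rows})
    patched = json.dumps(rows_to_patch(rows, key_field))
    return len(patched) - len(plain)


if __name__ == "__main__":
    # Hypothetical sample rows; real rows would come from the fetcher.
    rows = [{"id": f"key{i}", "name": f"user{i}"} for i in range(1000)]
    print(json.dumps(rows_to_patch(rows[:2], "id"), indent=2))
    print("overhead in characters:", patch_overhead_bytes(rows, "id"))
```

The per-row overhead is a fixed handful of characters for the "op", "path" and "value" keys, so it matters less as individual rows get larger.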
Thanks again for your time!
Or Weis
09/08/2023, 1:11 PM
Or Weis
09/08/2023, 1:17 PM
Ionut Andrei Oanca
09/08/2023, 2:11 PM
"You can use the patch method technically with any fetcher, but the question is if the data fetched will match the patch format. In most cases you'd probably need to write a custom fetcher"
This definitely makes sense, in MongoDB it might still be possible to transform the output from the query to match the patch format (not the prettiest query probably), but I see how other fetchers might need a specific implementation.
As for the data assembly, thanks for confirming the available routes
Ro'e Katz
10/09/2023, 3:07 PM
roekatz / roe@roekatz.com)
Ionut Andrei Oanca
11/15/2023, 3:39 PM
Oded Bd
11/22/2023, 11:15 AM
Ionut Andrei Oanca
11/23/2023, 5:21 PM
opal-client not freeing memory once its job is done, given that a subsequent call with less data seems to free the memory used by the previous call? In the repository I've tried to provide a way to reproduce what I'm referring to, let me know if you want an invite as well :)
Oded Bd
11/23/2023, 5:49 PM