# opal
i
The request to trigger data updates:
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "<postgresql://postgres@example_db:5432/accounts>",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * from profiles;",
                "connection_params": {
                    "password": "postgres"
                }
            },
            "topics": [
                "policy_data"
            ],
            "dst_path": "accounts/profiles"
        }
    ]
}'
If needed, I can try and collect some logs from the containers :)
Could it be that the 1M rows do actually take up that much space because of how Python stores the data in memory?
o
Hi @Ionut Andrei Oanca - thanks for sharing this information. DataFetchers in general, the Postgres one included, are designed more for multiple light updates that compile into something larger in aggregate. Yes, the memory bloat here can be due to Python, the fetcher code, and my prime suspect: the underlying Postgres lib used, asyncpg.
A more exact answer would require some investigation. Here are a few workarounds to consider:
1. Break down the large select into multiple queries and build up the data you need with multiple update entries
   a. Make sure to use a different dst_path for each entry
2. Create a new data-fetcher that uses an external process to read the data from the DB and then pass it to the fetcher itself
   a. This could also maybe be an optional feature to add to the current Postgres fetcher (if you feel like contributing back)
Hope this helps šŸ™‚
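To make option 1 concrete, here's a rough sketch based on the curl request you shared above - it assumes the table has a stable ordering column such as id, and the LIMIT/OFFSET split, chunk size and _0/_1 path suffixes are purely illustrative, so tune them to your data:
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "postgresql://postgres@example_db:5432/accounts",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * FROM profiles ORDER BY id LIMIT 500000 OFFSET 0;",
                "connection_params": { "password": "postgres" }
            },
            "topics": ["policy_data"],
            "dst_path": "accounts/profiles_0"
        },
        {
            "url": "postgresql://postgres@example_db:5432/accounts",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * FROM profiles ORDER BY id LIMIT 500000 OFFSET 500000;",
                "connection_params": { "password": "postgres" }
            },
            "topics": ["policy_data"],
            "dst_path": "accounts/profiles_1"
        }
    ]
}'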
šŸ‘ 1
e
Hey @Or Weis, we will surely try your first suggestion. We have in fact been getting this issue only in our initial data fetching, where we're trying to pull quite a lot of entries to create our initial state; our triggered updates, which only target a single entity or small chunks of the data set, do not seem to be affected (I also accidentally triggered a couple hundred thousand updates in a few seconds, and it only started throwing errors when Mongo ran out of available connections - the opal-client pods did not seem to care much). Doing so means chunking our first initial query into small ones and providing OPAL with many small update entries in our initial config, and even if I am a bit concerned about data integrity on a live database, I did see that coming and we will sort it out somehow.
What would probably be nice to have is the possibility for fetchers to push data progressively into OPAL instead of having to return a single big dataset as the result of the procedure. I do see complications coming from this approach - again, data integrity would be hard to guarantee if the fetcher fails after it has partially pushed some data into OPAL - but it would be a way to let the fetcher handle the chunking of a single update trigger.
Also, I think that we should not totally blame asyncpg, as we observed the same behaviour in our Mongo fetcher using motor (we're just trying to make sure it works as expected before publishing it šŸ˜…)
o
Hi @Edoardo Viviani - sounds good.
What would probably be nice to have is the possibility for fetchers to push data progressively into OPAL instead of having to return a single big dataset as the result of the procedure. I do see complications coming from this approach - again, data integrity would be hard to guarantee if the fetcher fails after it has partially pushed some data into OPAL - but it would be a way to let the fetcher handle the chunking of a single update trigger.
This is something we (in particular @Shaul Kremer) are working on for the next major release of OPAL - the idea there is to have a virtual document that both the client and the server can keep track of as it's being assembled from various sources.
Also, I think that we should not totally blame asyncpg, as we observed the same behaviour in our Mongo fetcher using motor
Totally agree, I just said it's my prime suspect - which isn't saying it's definitely the culprit. What the two libs you mention have in common is external C code managing memory to/from Python - which is a known sensitive area for the type of issues described. CC: @Asaf Cohen
āž• 1
i
Hi @Or Weis, following up on this (that is, the ā€œinitial data load where a map with 1M keys is loaded in one goā€): it turns out it was probably a combination of too many workers, not enough garbage collection on both the Python and Go sides, and too much data. Eventually, by setting
Copy code
- OPAL_FETCHING_WORKER_COUNT=2
- GOMAXPROCS=1
- GOGC=1
- GOMEMLIMIT=3500
we managed to keep the opal-client container well under the 6GB cap, while still loading a 120MB JSON file with one million records.
One big mystery still remains though: once the system has warmed up and loaded the initial data, the memory used by opal-client doubles when triggering an update. OPA does this too, but it frees the memory over time when its GC kicks in; opal-client, on the other hand, seems to keep a copy of the previous data, like some sort of FIFO of size 2. For these new tests we used only the HttpFetcher, in order to reduce the variables at play and better pinpoint the source of these behaviours. If you wish, I can invite you to my testbench repository on GitHub, just let me know.
Questions time! Maybe more like a confirmation - there's no way to have multiple entries (in the config) update the same dst_path but have them join/merge the data into the same final object, right? Something like OPAL_SPLIT_ROOT_DATA where dst_path="/map". I suppose that such a feature would need to be implemented either by OPAL (by ā€œmanuallyā€ targeting "/map/key1", where key1 is a field of the result from the fetcher) or by OPA, by exposing in the REST API a method to merge the given input with the object at the specified target.
Copy code
# in config.entries
entry1: { dst_path: "/map", data: { "key1": object } }
entry2: { dst_path: "/map", data: { "key2": object } }

# final data
/map = { "key1": object, "key2": object }
OPA seems to be doing this merge when using the CLI, but does not have something similar in the REST API. We could, I guess, use the PATCH save_method of POST /data/config, but what doesn't convince us is:
• whether the save_method can be used with any Fetch Provider or just the Http one
• the overhead generated by the extra fields "op": "add", "path": "/key1" when we're handling this many entries (we would be partitioning the 1M rows at this point, so 1M / n with n carefully tuned)
Thanks again for your time! ✨
o
Hi @Ionut Andrei Oanca thank you for sharing this, very useful information. I'll have the team investigate this (@Ro'e Katz, @Asaf Cohen, @Shaul Kremer) - this might be improved by our planned move to Pydantic 2. Re: data assembly - you can either avoid filling in nested keys, have an intermediate server combine the data, or use a custom data-fetcher in combination with the patch method.
You can use the patch method technically with any fetcher, but the question is if the data fetched will match the patch format. In most cases you'd probably need to write a custom fetcher
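For example, an entry along these lines should do the kind of merge you described (sketch only - I'm writing the patch list inline for clarity; with a real fetcher the fetched result would have to come back already shaped as this kind of JSON Patch list, and the exact entry fields are worth double-checking against the docs):
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "",
            "save_method": "PATCH",
            "data": [
                { "op": "add", "path": "/key1", "value": { "example": "object" } },
                { "op": "add", "path": "/key2", "value": { "another": "object" } }
            ],
            "topics": ["policy_data"],
            "dst_path": "/map"
        }
    ]
}'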
i
You can use the patch method technically with any fetcher, but the question is if the data fetched will match the patch format. In most cases you’d probably need to write a custom fetcher
This definitely makes sense. In MongoDB it might still be possible to transform the output from the query to match the patch format (not the prettiest query, probably), but I see how other fetchers might need a specific implementation. As for the data assembly, thanks for confirming the available routes šŸ‘
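(Roughly what I have in mind is an aggregation stage that reshapes each document into a patch op - just a sketch, assuming we key the path on _id, and untested against our actual schema:)
Copy code
[
  {
    "$project": {
      "_id": 0,
      "op": { "$literal": "add" },
      "path": { "$concat": [ "/", { "$toString": "$_id" } ] },
      "value": "$$ROOT"
    }
  }
]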
r
@Ionut Andrei Oanca - Can you please invite me to the said repository? šŸ™‚ (my GitHub user is roekatz / roe@roekatz.com)
i
Hi @Ro'e Katz, sorry for the long delay - I completely missed the notification on this Slack šŸ˜…. I've just sent you the invitation to the testbench repo, let me know if you need more info
o
Hi @Ionut Andrei Oanca, thanks for sharing - we will create a ticket and prioritize this. If you have more information to share, it will help, and we will add it to the ticket.
i
Hi @Oded Bd, I guess this whole adventure boils down to the following question: why is opal-client not freeing memory once its job is done, given that a subsequent call with less data seems to free the memory used by the previous call? In the repository I've tried to provide a way to reproduce what I'm referring to, let me know if you want an invite as well :)
o
Thanks for the clear explanation šŸ™‚
šŸ™Œ 1