# opal
i
The request to trigger data updates:
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "<postgresql://postgres@example_db:5432/accounts>",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * from profiles;",
                "connection_params": {
                    "password": "postgres"
                }
            },
            "topics": [
                "policy_data"
            ],
            "dst_path": "accounts/profiles"
        }
    ]
}'
If needed, I can try and collect some logs from the containers :)
Could it be that the 1M rows do actually take up that much space because of how Python stores the data in memory?
o
Hi @Ionut Andrei Oanca - thanks for sharing this information. DataFetchers in general, the Postgres one included, are designed more for multiple light updates that compile into something larger in aggregate. Yes, the memory bloat here can be due to Python, the fetcher code, and my prime suspect: the underlying Postgres lib used, asyncpg.
A more exact answer would require some investigation. Here are a few workarounds to consider:
1. Break down the large select into multiple queries and build up the data you need with multiple update entries
   a. Make sure to use a different dst_path for each entry
2. Create a new data-fetcher that uses an external process to read the data from the DB and then pass it to the fetcher itself
   a. This could also maybe be an optional feature to add to the current Postgres fetcher (if you feel like contributing back)
Hope this helps šŸ™‚
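To make option 1 concrete, here's a rough sketch based on the curl request you shared above - it assumes the table has a stable ordering column such as id, and the LIMIT/OFFSET split, chunk size and _0/_1 path suffixes are purely illustrative, so tune them to your data:
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "postgresql://postgres@example_db:5432/accounts",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * FROM profiles ORDER BY id LIMIT 500000 OFFSET 0;",
                "connection_params": { "password": "postgres" }
            },
            "topics": ["policy_data"],
            "dst_path": "accounts/profiles_0"
        },
        {
            "url": "postgresql://postgres@example_db:5432/accounts",
            "config": {
                "fetcher": "PostgresFetchProvider",
                "query": "SELECT * FROM profiles ORDER BY id LIMIT 500000 OFFSET 500000;",
                "connection_params": { "password": "postgres" }
            },
            "topics": ["policy_data"],
            "dst_path": "accounts/profiles_1"
        }
    ]
}'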
šŸ‘ 1
e
Hey @Or Weis, we will surely try your first suggestion. We have in fact been getting this issue only in our initial data fetching, where we're trying to pull quite a lot of entries to create our initial state; our triggered updates, which only target a single entity or small chunks of the data set, do not seem to be affected (I also accidentally triggered a couple hundred thousand updates in a few seconds, and it only started throwing errors when Mongo ran out of available connections - the opal-client pods did not seem to care much). Doing so means chunking our first initial query into small ones and providing OPAL with many small update entries in our initial config, and even if I am a bit concerned about data integrity on a live database, I did see that coming and we will sort it out somehow.
What would probably be nice to have is the possibility for fetchers to push data progressively into OPAL instead of having to return a single big dataset as the result of the procedure. I do see complications coming from this approach - again, data integrity would be hard to guarantee if the fetcher fails after it has partially pushed some data into OPAL - but it would be a way to let the fetcher handle the chunking of a single update trigger.
Also, I think that we should not totally blame asyncpg, as we observed the same behaviour in our Mongo fetcher using motor (we're just trying to make sure it works as expected before publishing it šŸ˜…)
o
Hi @Edoardo Viviani - sounds good.
What would probably be nice to have is the possibility for fetchers to push data progressively into OPAL instead of having to return a single big dataset as the result of the procedure. I do see complications coming from this approach - again, data integrity would be hard to guarantee if the fetcher fails after it has partially pushed some data into OPAL - but it would be a way to let the fetcher handle the chunking of a single update trigger.
This is something we (in particular @Shaul Kremer) are working on for the next major release of OPAL - the idea there is to have a virtual document that both the client and the server can keep track of as it's being assembled from various sources.
Also, I think that we should not totally blame asyncpg, as we observed the same behaviour in our Mongo fetcher using motor
Totally agree, I just said it's my prime suspect - which isn't saying it's definitely the culprit. What the two libs you mention have in common is external C code managing memory to/from Python - which is a known sensitive area for the type of issues described. CC: @Asaf Cohen
āž• 1
i
Hi @Or Weis, following up on this (that is, the ā€œinitial data load where a map with 1M keys is loaded in one goā€): it turns out it was probably a combination of too many workers, not enough garbage collection on both the Python and Go sides, and too much data. Eventually, by setting
Copy code
- OPAL_FETCHING_WORKER_COUNT=2
- GOMAXPROCS=1
- GOGC=1
- GOMEMLIMIT=3500
we managed to keep the opal-client container well under the 6GB cap, while still loading a 120MB JSON file with one million records.
One big mystery still remains though: once the system has warmed up and loaded the initial data, the memory used by opal-client doubles when triggering an update. OPA does this too, but it frees the memory over time when its GC kicks in; opal-client, on the other hand, seems to keep a copy of the previous data, like some sort of FIFO of size 2. For these new tests we used only the HttpFetcher, in order to reduce the variables at play and better pinpoint the source of these behaviours. If you wish, I can invite you to my testbench repository on GitHub, just let me know.
Questions time! Maybe more like a confirmation - there's no way to have multiple entries (in the config) update the same dst_path but have them join/merge the data into the same final object, right? Something like OPAL_SPLIT_ROOT_DATA where dst_path="/map". I suppose that such a feature would need to be implemented either by OPAL (by ā€œmanuallyā€ targeting "/map/key1", where key1 is a field of the result from the fetcher) or by OPA, by exposing in the REST API a method to merge the given input with the object at the specified target.
Copy code
# in config.entries
entry1: { dst_path: "/map", data: { "key1": object } }
entry2: { dst_path: "/map", data: { "key2": object } }

# final data
/map = { "key1": object, "key2": object }
OPA seems to be doing this merge when using the CLI, but does not have something similar in the REST API. We could, I guess, use the PATCH save_method of POST /data/config, but what doesn't convince us is:
• whether the save_method can be used with any Fetch Provider or just the Http one
• the overhead generated by the extra fields "op": "add", "path": "/key1" when we're handling this many entries (we would be partitioning the 1M rows at this point, so 1M / n with n carefully tuned)
Thanks again for your time! ✨
o
Hi @Ionut Andrei Oanca thank you for sharing this, very useful information. I'll have the team investigate this (@Ro'e Katz, @Asaf Cohen, @Shaul Kremer) - this might be improved by our planned move to Pydantic 2. Re: data assembly - you can either avoid filling in nested keys, have an intermediate server combine the data, or use a custom data-fetcher in combination with the patch method.
You can use the patch method technically with any fetcher, but the question is if the data fetched will match the patch format. In most cases you'd probably need to write a custom fetcher
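For example, an entry along these lines should do the kind of merge you described (sketch only - I'm writing the patch list inline for clarity; with a real fetcher the fetched result would have to come back already shaped as this kind of JSON Patch list, and the exact entry fields are worth double-checking against the docs):
Copy code
curl --location 'http://localhost:7002/data/config' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entries": [
        {
            "url": "",
            "save_method": "PATCH",
            "data": [
                { "op": "add", "path": "/key1", "value": { "example": "object" } },
                { "op": "add", "path": "/key2", "value": { "another": "object" } }
            ],
            "topics": ["policy_data"],
            "dst_path": "/map"
        }
    ]
}'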
i
You can use the patch method technically with any fetcher, but the question is if the data fetched will match the patch format. In most cases you’d probably need to write a custom fetcher
This definitely makes sense. In MongoDB it might still be possible to transform the output from the query to match the patch format (not the prettiest query, probably), but I see how other fetchers might need a specific implementation. As for the data assembly, thanks for confirming the available routes šŸ‘
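(Roughly what I have in mind is an aggregation stage that reshapes each document into a patch op - just a sketch, assuming we key the path on _id, and untested against our actual schema:)
Copy code
[
  {
    "$project": {
      "_id": 0,
      "op": { "$literal": "add" },
      "path": { "$concat": [ "/", { "$toString": "$_id" } ] },
      "value": "$$ROOT"
    }
  }
]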
r
@Ionut Andrei Oanca - Can you please invite me to the said repository? šŸ™‚ (my GitHub user is roekatz / roe@roekatz.com)
i
Hi @Ro'e Katz, sorry for the long delay - I completely missed the notification on this Slack šŸ˜…. I've just sent you the invitation to the testbench repo, let me know if you need more info
o
Hi @Ionut Andrei Oanca, thanks for sharing - we will create a ticket and prioritize this. If you have more information to share, it will help, and we will add it to the ticket.
i
Hi @Oded Bd, I guess this whole adventure boils down to the following question: why is opal-client not freeing memory once its job is done, given that a subsequent call with less data seems to free the memory used by the previous call? In the repository I've tried to provide a way to reproduce what I'm referring to, let me know if you want an invite as well :)
o
Thanks for the clear explanation šŸ™‚
šŸ™Œ 1