# 🔥-django-htmx
  • p

    prehistoric-cat-63987

    05/05/2023, 7:22 PM
    My bad, I forgot to mention that the process has nothing to do with a DB; we don't have any database on the microservice to start with, and its task is scraping and crawling related. The 0.5 seconds would be considered fast given the amount of crawling needed.
  • p

    prehistoric-cat-63987

    05/05/2023, 7:24 PM
    The microservice isn't dependent on a database; it's a pure scraping and crawling job, as explained above
  • a

    ancient-shoe-86801

    05/05/2023, 7:28 PM
    So, it's like passing a list of keywords, and then for each keyword it makes one (or more) network requests looking for that keyword on some websites?
  • p

    prehistoric-cat-63987

    05/05/2023, 8:08 PM
    Yes, then it returns a list of responses for each name over the API
  • a

    ancient-shoe-86801

    05/05/2023, 8:09 PM
    are you crawling the list items concurrently? That should help a bit
  • a

    ancient-shoe-86801

    05/05/2023, 8:11 PM
    I know I'm not answering your original question, but it still feels like the improvement could come from somewhere else (with my limited knowledge of the system)
  • b

    bitter-machine-55943

    05/05/2023, 10:14 PM
    Do you have control of the microservice? Or is it a third-party thing? Seems like you either have to optimize the microservice, e.g. by processing the entire list in parallel, or add caching. Or cache in Django in front of the microservice, if the nature of the problem allows that (it might not).
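
A minimal sketch of the Django-side caching idea, assuming results can be keyed per keyword; `fetch_from_microservice` is a hypothetical helper standing in for the call to the microservice:

```python
# Sketch: caching per-keyword results from the microservice in Django's cache
# framework, so repeated lookups skip the slow crawl entirely.
from django.core.cache import cache

def get_results(keyword):
    key = f"crawl:{keyword}"
    results = cache.get(key)
    if results is None:
        results = fetch_from_microservice(keyword)  # hypothetical service call
        cache.set(key, results, timeout=60 * 15)    # keep for 15 minutes
    return results
```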
  • b

    bitter-machine-55943

    05/05/2023, 10:15 PM
    When you say crawling, is it like web scraping, or like crawling a dataset to calculate some report/results?
  • g

    great-cartoon-12331

    05/05/2023, 10:18 PM
    a query doesn't necessarily mean a database query 🙂 the N+1 problem can happen even when you're doing network requests like crawling. This is very much a system-design question of reducing network requests as much as possible
  • p

    prehistoric-cat-63987

    05/06/2023, 5:36 PM
    No, we've actually improved the performance; it took a lot longer before, and we all think the performance is good for now
  • p

    prehistoric-cat-63987

    05/06/2023, 5:37 PM
    Yes, I have absolute control of the microservice. How do I process the entire list in parallel? Please shed more light
  • p

    prehistoric-cat-63987

    05/06/2023, 5:38 PM
    The request initially took a lot longer; it was reduced to this point after a lot of optimization, and we all think it's good for now
  • b

    bitter-machine-55943

    05/06/2023, 5:40 PM
    What does the microservice do? Like in general terms, you mentioned crawling something, but what’s happening there? Like making connections to the web, or crawling a database, or folders of files, or what?
  • p

    prehistoric-cat-63987

    05/06/2023, 5:55 PM
    It makes connections to the web (not just 1 site, we have like 25), crawls each one checking for some specific data for each item in the list the user provided, and returns that data from each site for each item on the user's list.
  • b

    bitter-machine-55943

    05/06/2023, 5:59 PM
    Are the web connections to unknown locations (like “anywhere on the internet”)? Or is it a set of known locations (like “one of these 200 locations”)?
  • p

    prehistoric-cat-63987

    05/06/2023, 6:10 PM
    Yes, they're all known locations with huge databases, say YouTube
  • b

    bitter-machine-55943

    05/06/2023, 6:35 PM
    Ok. I’m thinking of that as “unknown”, in the sense that you can’t cache or pre-compute all of the results. In that case you can try to run all of the crawls in parallel, with a pool of workers for instance. So if one crawl takes 0.5s, then instead of 20 crawls taking 10s, running 10 processes/threads as workers lets you do 10 crawls in about 0.5s. Then it’s a matter of scaling up the number of workers within the microservice to 100 or 1000 or however many you need and can afford. At some point you’ll probably hit rate limiting or issues with getting blocked, which is a different problem.
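
A minimal sketch of the worker-pool idea, using a `concurrent.futures` thread pool; `crawl_one` is a placeholder for whatever performs one ~0.5s crawl:

```python
# Sketch: run the keyword crawls through a pool of worker threads, so ten
# ~0.5s crawls finish in roughly 0.5s instead of 5s (assuming I/O is the bottleneck).
import time
from concurrent.futures import ThreadPoolExecutor

def crawl_one(keyword):
    # Placeholder for the real ~0.5s crawl across the known sites.
    time.sleep(0.5)
    return {"keyword": keyword, "hits": []}

def crawl_all(keywords, max_workers=10):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(crawl_one, keywords))

print(crawl_all(["alpha", "beta", "gamma"]))
```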
  • p

    prehistoric-cat-63987

    05/06/2023, 6:41 PM
    This is the point: someone suggested that we run all 100 processes in parallel at once, so we can run the whole 100 in 0.5s instead of 50s, but I don't know how I can trigger 100 requests at once
  • g

    great-cartoon-12331

    05/06/2023, 6:44 PM
    normal way, https://docs.python.org/3/library/multiprocessing.html ?
  • b

    bitter-machine-55943

    05/06/2023, 6:58 PM
    Right. Python’s `multiprocessing` would be the easiest, if you have the server resources to run 100 processes. You can also run 100 threads, which will consume much fewer resources, but threads are easier to break if you don’t know how they work
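
A minimal sketch of the `multiprocessing` approach mentioned above, again with a placeholder `crawl_one`:

```python
# Sketch: the multiprocessing variant, one OS process per worker instead of a
# thread. Heavier on memory, but not subject to the GIL at all.
import time
from multiprocessing import Pool

def crawl_one(keyword):
    # Placeholder for the real crawl; must live at module level so worker
    # processes can import and call it.
    time.sleep(0.5)
    return {"keyword": keyword, "hits": []}

if __name__ == "__main__":
    with Pool(processes=10) as pool:
        print(pool.map(crawl_one, ["alpha", "beta", "gamma"]))
```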
  • b

    bitter-machine-55943

    05/06/2023, 7:00 PM
    If you use threads I’d use the built-in Python queue module. It lets you pass thread-safe messages between threads
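
A minimal sketch of the thread-plus-`queue` pattern described above, with the same placeholder crawl function:

```python
# Sketch: hand-rolled worker threads that pull keywords from a thread-safe
# queue and push results onto another one.
import queue
import threading
import time

def crawl_one(keyword):
    time.sleep(0.5)  # placeholder for the real crawl
    return {"keyword": keyword, "hits": []}

def worker(tasks, results):
    while True:
        keyword = tasks.get()
        if keyword is None:      # sentinel: no more work for this worker
            break
        results.put(crawl_one(keyword))

keywords = ["alpha", "beta", "gamma"]
tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(10)]
for t in workers:
    t.start()
for kw in keywords:
    tasks.put(kw)
for _ in workers:
    tasks.put(None)              # one sentinel per worker thread
for t in workers:
    t.join()
print([results.get() for _ in range(len(keywords))])
```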
  • g

    great-cartoon-12331

    05/06/2023, 7:01 PM
    aren't Python threads running one at a time because of the GIL?
  • b

    bitter-machine-55943

    05/06/2023, 7:15 PM
    Well, there is that minor detail 😂
  • b

    bitter-machine-55943

    05/06/2023, 7:21 PM
    Might be okay actually. The GIL usually affects CPU-limited tasks, by not letting you use all cores on a CPU. For I/O I think it’s okay. It may depend on which library you use; `requests` might release the GIL while waiting for a response, but another one might not.
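
One way to sanity-check that claim is to time the same I/O-bound calls run sequentially and in threads; the URL here is just a placeholder, not one of the real target sites:

```python
# Sketch: if the GIL is released while waiting on the socket (requests does
# this), the threaded run should take roughly the time of the slowest request.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

urls = ["https://example.com"] * 5  # placeholder targets

start = time.perf_counter()
for url in urls:
    requests.get(url, timeout=10)
print("sequential:", time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(lambda u: requests.get(u, timeout=10), urls))
print("threaded:", time.perf_counter() - start)
```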
  • a

    ancient-shoe-86801

    05/06/2023, 10:32 PM
    If the bottleneck is I/O, threads are the recommended approach. If it's CPU, multiprocessing is recommended.
  • r

    refined-waiter-90422

    05/07/2023, 7:07 AM
    Yeah... Python threads are exactly like Node.js: only for blocking I/O (async/await for context switching), rather than running Python in parallel inside the process.
  • r

    refined-waiter-90422

    05/07/2023, 7:08 AM
    Like Node, Python has to spin up many workers to run multiple Python things at the same time.
  • r

    refined-waiter-90422

    05/07/2023, 7:17 AM
    The easiest way, if you're in Django land, is to use one of the ASGI servers, like Daphne, then just add the appropriate await statements where the blocking calls are happening.
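
A minimal sketch of what that could look like as an async Django view served under ASGI (e.g. Daphne); `httpx` and the target URL are assumptions for illustration, not what the project actually uses:

```python
# Sketch: an async view that fires all outbound requests concurrently and
# awaits them together, instead of making them one at a time.
import asyncio

import httpx
from django.http import JsonResponse

async def crawl_view(request):
    keywords = request.GET.getlist("keyword")
    async with httpx.AsyncClient(timeout=10) as client:
        responses = await asyncio.gather(
            *(client.get("https://example.com/search", params={"q": kw})  # placeholder URL
              for kw in keywords)
        )
    return JsonResponse({"statuses": [r.status_code for r in responses]})
```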
  • r

    refined-waiter-90422

    05/07/2023, 7:21 AM
    also, "Django Channels" = ASGI, with Daphne as the server
  • p

    prehistoric-cat-63987

    05/07/2023, 3:53 PM
    I'm using channels now