Performance decay when using openai api_server #308

Closed
sleepwalker2017 opened this issue Aug 25, 2023 · 6 comments
Comments

@sleepwalker2017
Contributor

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

It seems the implementation uses only one thread to process requests and relies on coroutines to handle multiple requests.

The problem is:

In LlamaV2.cc, when LlamaV2::forward is called from the Python level, the enqueuing thread waits until the request is finished, so no further requests can be enqueued while a request is being processed.

So the requests are processed one by one. Is that correct?

// LlamaV2.cc: rank 0 enqueues the requests and then blocks on their futures
if (rank == 0) {
    TM_LOG_INFO("[forward] Enqueue requests");
    auto futures = shared_state_->request_queue.enqueue(std::move(requests));

    TM_LOG_INFO("[forward] Wait for requests to complete ...");
    for (auto& f : futures) {
        auto ec = f.get();  // blocks the calling thread until this request finishes
        error_codes.push_back(ec);
        if (ec) {
            has_error = true;
        }
    }
}
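
For illustration, here is a rough Python analogy of this pattern (not the actual lmdeploy code): the thread that enqueues a request also blocks on its future, so it cannot enqueue the next request until the current one completes.

# Rough Python analogy of the C++ snippet above -- illustration only.
import queue
import threading
from concurrent.futures import Future

request_queue = queue.Queue()

def worker():
    # Background thread: pops one request at a time and fulfils its future,
    # standing in for the model forward pass.
    while True:
        prompt, future = request_queue.get()
        future.set_result(f"response to {prompt!r}")

threading.Thread(target=worker, daemon=True).start()

def forward(prompt: str) -> str:
    # Enqueue, then wait on the *calling* thread -- analogous to LlamaV2::forward
    # waiting on the futures returned by request_queue.enqueue().
    future = Future()
    request_queue.put((prompt, future))
    return future.result()  # blocks here; nothing else can be enqueued by this thread

# Called from a single thread, requests are processed strictly one by one.
print(forward("Hello!"))
print(forward("Hello! How are you?"))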

Reproduction

 python -u -m lmdeploy.serve.openai.api_server ./benchmark/workspace 0.0.0.0 10086 --instance_num 32 --tp 2

Error traceback

No response

@AllentDan
Collaborator

You could test this by opening two clients with different instance_id values, e.g. 1 and 2, and checking whether the responses are generated simultaneously.

curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello!",
    "instance_id": 1,
    "sequence_start": true,
    "sequence_end": true
  }'
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "instance_id": 2,
    "sequence_start": true,
    "sequence_end": true
  }'

@sleepwalker2017
Contributor Author

Got it. Thank you for the fast response.

In the case you mentioned, the requests are processed simultaneously.

But if you use one process with multiple threads to send requests simultaneously, the requests are not processed simultaneously. I wonder why that is?

@AllentDan
Collaborator

Check out benchmark/profile_restful_api.py. If you use one process with multiple threads to send requests simultaneously, the requests can be processed simultaneously only if you set a proper instance_id for each. If you use the same instance_id, the instance must wait until the previous request ends. Besides, we did not implement any sophisticated scheduling method, only a plain one, so it is possible that session 1 ends up waiting for session 33 when instance_num is 32, even if these are the only two requests.
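
For illustration, a rough sketch of such a multi-threaded client, where each thread gets its own instance_id; the endpoint and fields follow the curl examples above, and the URL and instance_num match the reproduction command.

# Illustrative client sketch: each thread uses a distinct instance_id so that
# requests land on different instances instead of queuing behind one another.
import threading
import requests

URL = "http://0.0.0.0:10086/generate"  # server from the reproduction command
INSTANCE_NUM = 32                      # matches --instance_num 32

def send(i: int, prompt: str):
    requests.post(URL, json={
        "prompt": prompt,
        "instance_id": i % INSTANCE_NUM + 1,  # distinct id per thread, 1..32
        "sequence_start": True,
        "sequence_end": True,
    })

threads = [threading.Thread(target=send, args=(i, f"prompt {i}")) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()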

@sleepwalker2017
Contributor Author

sleepwalker2017 commented Aug 25, 2023

Check out benchmark/profile_restful_api.py. If you use one process with multiple threads to send requests simultaneously, the requests can be processed simultaneously only if you set a proper instance_id for each. If you use the same instance_id, the instance must wait until the previous request ends. Besides, we did not implement any sophisticated scheduling method, only a plain one, so it is possible that session 1 ends up waiting for session 33 when instance_num is 32, even if these are the only two requests.

Hi, I ran this file and need to clarify something:

When stream is set to true, the requests can be processed simultaneously.
When stream is set to false, persistent batching doesn't work, no matter whether I use multiple threads or multiple processes.

I want to check the following points about this endpoint:

@app.post('/generate')
async def generate(request: GenerateRequest, raw_request: Request = None):
  1. The async API in lmdeploy.serve.openai.api_server is a single-threaded function, and FastAPI uses only one thread (the event loop) to process all requests. Is that right?
  2. When it calls into the LlamaV2::forward function in C++, the request is added to the queue, another thread reads the request from the queue, and the calling thread waits until the request is finished. Is that true?
  3. While that thread is waiting, newly arriving requests cannot be added to the queue. Is that the case? (See the sketch below.)
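
For reference, a self-contained FastAPI sketch (illustration only, not lmdeploy code) of the behaviour asked about in points 1 and 3: a blocking call inside an async endpoint stalls the single event loop, while an awaited coroutine does not.

# Illustration of points 1 and 3 -- not lmdeploy code.
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

@app.post('/blocking')
async def blocking():
    # Blocking call: freezes the single event loop, so concurrent requests
    # are effectively served one by one.
    time.sleep(5)
    return {'done': True}

@app.post('/non_blocking')
async def non_blocking():
    # Cooperative wait: the event loop keeps serving other requests.
    await asyncio.sleep(5)
    return {'done': True}

Hitting /blocking from two clients at once shows the second response starting only after the first finishes, while /non_blocking serves both concurrently.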

I see this happening in the log, so I guess there may be something wrong with the async '/generate' API.

I hope for your reply; please correct me if anything above is wrong. Thank you.

@AllentDan
Collaborator

Yes, you are right. It works only when streaming is true; otherwise the request is stuck in model forwarding until it finishes.

@sleepwalker2017
Contributor Author

Yes, you are right. It works only when streaming is true; otherwise the request is stuck in model forwarding until it finishes.

Thank you.
