Performance decay when using openai api_server #308

Closed
sleepwalker2017 opened this issue Aug 25, 2023 · 6 comments
Comments

@sleepwalker2017
Contributor

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

It seems the implementation uses only one thread to process requests and relies on coroutines to handle multiple requests.

The problem is:

In LlamaV2.cc, when LlamaV2::forward is called from the Python level, the enqueuing thread waits until the request is finished, so no further requests can be enqueued while a request is being processed.

So the requests are processed one by one. Is that correct?

// LlamaV2.cc: rank 0 enqueues the requests and then blocks on their futures
if (rank == 0) {
    TM_LOG_INFO("[forward] Enqueue requests");
    auto futures = shared_state_->request_queue.enqueue(std::move(requests));

    TM_LOG_INFO("[forward] Wait for requests to complete ...");
    for (auto& f : futures) {
        auto ec = f.get();  // blocks the calling thread until this request finishes
        error_codes.push_back(ec);
        if (ec) {
            has_error = true;
        }
    }
}
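
For illustration, here is a rough Python analogy of this pattern (not the actual lmdeploy code): the thread that enqueues a request also blocks on its future, so it cannot enqueue the next request until the current one completes.

# Rough Python analogy of the C++ snippet above -- illustration only.
import queue
import threading
from concurrent.futures import Future

request_queue = queue.Queue()

def worker():
    # Background thread: pops one request at a time and fulfils its future,
    # standing in for the model forward pass.
    while True:
        prompt, future = request_queue.get()
        future.set_result(f"response to {prompt!r}")

threading.Thread(target=worker, daemon=True).start()

def forward(prompt: str) -> str:
    # Enqueue, then wait on the *calling* thread -- analogous to LlamaV2::forward
    # waiting on the futures returned by request_queue.enqueue().
    future = Future()
    request_queue.put((prompt, future))
    return future.result()  # blocks here; nothing else can be enqueued by this thread

# Called from a single thread, requests are processed strictly one by one.
print(forward("Hello!"))
print(forward("Hello! How are you?"))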

Reproduction

 python -u -m lmdeploy.serve.openai.api_server ./benchmark/workspace 0.0.0.0 10086 --instance_num 32 --tp 2

Error traceback

No response

@AllentDan
Collaborator

You could test this by opening two clients with different instance_id values, e.g. 1 and 2, and checking whether the responses are generated simultaneously.

curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello!",
    "instance_id": 1,
    "sequence_start": true,
    "sequence_end": true
  }'
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "instance_id": 2,
    "sequence_start": true,
    "sequence_end": true
  }'

@sleepwalker2017
Contributor Author

Got it. Thank you for the fast response.

In the case you mentioned, the requests are processed simultaneously.

But if you use one process with multiple threads to send requests simultaneously, the requests are not processed simultaneously. I wonder why that is?

@AllentDan
Collaborator

Check out benchmark/profile_restful_api.py. If you use one process with multiple threads to send requests simultaneously, the requests can be processed simultaneously only if you set a proper instance_id for each. If you use the same instance_id, the instance must wait until the previous request ends. Besides, we did not implement any sophisticated scheduling method, only a plain one, so it is possible that session 1 ends up waiting for session 33 when instance_num is 32, even if these are the only two requests.
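
For illustration, a rough sketch of such a multi-threaded client, where each thread gets its own instance_id; the endpoint and fields follow the curl examples above, and the URL and instance_num match the reproduction command.

# Illustrative client sketch: each thread uses a distinct instance_id so that
# requests land on different instances instead of queuing behind one another.
import threading
import requests

URL = "http://0.0.0.0:10086/generate"  # server from the reproduction command
INSTANCE_NUM = 32                      # matches --instance_num 32

def send(i: int, prompt: str):
    requests.post(URL, json={
        "prompt": prompt,
        "instance_id": i % INSTANCE_NUM + 1,  # distinct id per thread, 1..32
        "sequence_start": True,
        "sequence_end": True,
    })

threads = [threading.Thread(target=send, args=(i, f"prompt {i}")) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()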

@sleepwalker2017
Contributor Author

sleepwalker2017 commented Aug 25, 2023

Check out benchmark/profile_restful_api.py. If you use one process with multiple threads to send requests simultaneously, the requests can be processed simultaneously only if you set a proper instance_id for each. If you use the same instance_id, the instance must wait until the previous request ends. Besides, we did not implement any sophisticated scheduling method, only a plain one, so it is possible that session 1 ends up waiting for session 33 when instance_num is 32, even if these are the only two requests.

Hi, I ran this file and need to clarify something:

When stream is set to true, the requests can be processed simultaneously.
When stream is set to false, persistent batching doesn't work, no matter whether I use multiple threads or multiple processes.

I want to check the following points about this endpoint:

@app.post('/generate')
async def generate(request: GenerateRequest, raw_request: Request = None):
  1. The async API in lmdeploy.serve.openai.api_server is a single-threaded function, and FastAPI uses only one thread (the event loop) to process all requests. Is that right?
  2. When it calls into the LlamaV2::forward function in C++, the request is added to the queue, another thread reads the request from the queue, and the calling thread waits until the request is finished. Is that true?
  3. While that thread is waiting, newly arriving requests cannot be added to the queue. Is that the case? (See the sketch below.)
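
For reference, a self-contained FastAPI sketch (illustration only, not lmdeploy code) of the behaviour asked about in points 1 and 3: a blocking call inside an async endpoint stalls the single event loop, while an awaited coroutine does not.

# Illustration of points 1 and 3 -- not lmdeploy code.
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

@app.post('/blocking')
async def blocking():
    # Blocking call: freezes the single event loop, so concurrent requests
    # are effectively served one by one.
    time.sleep(5)
    return {'done': True}

@app.post('/non_blocking')
async def non_blocking():
    # Cooperative wait: the event loop keeps serving other requests.
    await asyncio.sleep(5)
    return {'done': True}

Hitting /blocking from two clients at once shows the second response starting only after the first finishes, while /non_blocking serves both concurrently.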

I see this happening in the log, so I guess there may be something wrong with the async '/generate' API.

I hope for your reply; please correct me if anything above is wrong. Thank you.

@AllentDan
Collaborator

Yes, you are right. It works only when streaming is true; otherwise the request is stuck in model forwarding until it finishes.

@sleepwalker2017
Contributor Author

Yes, you are right. It works only when streaming is true; otherwise the request is stuck in model forwarding until it finishes.

Thank you.
