Performance decay when using openai api_server #308
You may try a test like opening two clients with different `instance_id` values (1 and 2), and check whether the responses are generated simultaneously:

```bash
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello!",
    "instance_id": 1,
    "sequence_start": true,
    "sequence_end": true
  }'
```

```bash
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "instance_id": 2,
    "sequence_start": true,
    "sequence_end": true
  }'
```
Got it, thank you for the fast response. In the case you mentioned, the requests are processed simultaneously. But if you use one process with multiple threads to send requests simultaneously, the requests are not processed simultaneously. I wonder why that is?
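A minimal sketch of the multi-threaded client I mean, assuming the same `/generate` payload as the curl example above (the server address placeholder and the `requests` dependency are assumptions, not part of the project):

```python
import threading
import requests  # third-party HTTP client, assumed to be installed

SERVER = "http://{server_ip}:{server_port}"  # fill in your server address

def send(instance_id: int, prompt: str) -> None:
    # Each thread sends one non-streaming /generate request and prints the status.
    resp = requests.post(
        f"{SERVER}/generate",
        json={
            "prompt": prompt,
            "instance_id": instance_id,
            "sequence_start": True,
            "sequence_end": True,
        },
    )
    print(instance_id, resp.status_code, len(resp.text))

threads = [
    threading.Thread(target=send, args=(1, "Hello!")),
    threading.Thread(target=send, args=(2, "Hello! How are you?")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the two responses come back one after the other instead of overlapping, the requests are being serialized somewhere on the server side.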
check
Hi, I ran this file and I need to clarify something: when `stream` is set to true, the requests can be processed simultaneously. I want to check the following points:

```python
@app.post('/generate')
async def generate(request: GenerateRequest, raw_request: Request = None):
```

I see this happening in the log, so I guess there may be something wrong in the async `/generate` API. Hoping for your reply; please correct it if there is something wrong, thank you.
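To illustrate the suspicion above: if a blocking call (for example, a synchronous model forward) runs directly inside an `async def` handler, it blocks the event loop, so concurrent non-streaming requests end up being served one at a time. Below is a minimal, self-contained sketch of that effect; the endpoint names and the `time.sleep(2)` stand-in are assumptions for illustration, not the project's actual code:

```python
import time
import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.post('/blocking-generate')
async def blocking_generate():
    # A synchronous, long-running call inside an async handler blocks the
    # event loop: a second request cannot even start until this returns.
    time.sleep(2)  # stands in for a blocking model forward
    return {"done": True}

@app.post('/offloaded-generate')
async def offloaded_generate():
    # Running the blocking call in a thread pool keeps the event loop free,
    # so multiple requests can be in flight at the same time.
    await asyncio.get_running_loop().run_in_executor(None, time.sleep, 2)
    return {"done": True}
```

With two concurrent clients, the first endpoint answers them roughly 2 seconds apart, while the second answers both after about 2 seconds.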
Yes, you are right. It works only when streaming is true; otherwise the request is stuck in model forwarding until it finishes.
Thank you.
Checklist
Describe the bug
It seems the implementation uses only one thread to process requests and relies on coroutines to handle multiple requests.
The problem is:
In LlamaV2.cc, when LlamaV2::forward is called from the Python level, the enqueue thread waits until the request is finished, so no further requests can be enqueued while one is being processed.
So the requests are processed one by one. Is that the case?
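A tiny Python model of the behavior described above (the names here are hypothetical; the real logic lives in the C++ code), showing why a forward call that enqueues a request and then waits for its completion keeps the calling thread occupied for the whole duration:

```python
import queue
import threading

request_queue: "queue.Queue[tuple[str, threading.Event]]" = queue.Queue()

def batch_worker() -> None:
    # Stand-in for the engine's internal loop that consumes queued requests.
    while True:
        prompt, done = request_queue.get()
        # ... run the model on `prompt` here ...
        done.set()

threading.Thread(target=batch_worker, daemon=True).start()

def forward(prompt: str) -> None:
    """Hypothetical analogue of the described LlamaV2::forward behavior:
    enqueue the request, then block until that request completes, so the
    calling (enqueue) thread cannot submit another request in the meantime."""
    done = threading.Event()
    request_queue.put((prompt, done))
    done.wait()  # blocks the caller until this request finishes

# If the API server calls forward() from a single thread (or directly inside a
# non-streaming async handler), requests are effectively processed one by one.
forward("Hello!")
forward("Hello! How are you?")
```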
Reproduction
Error traceback
No response