bug-fix: when using stream is False, continuous batching doesn't work #346
Conversation
@sleepwalker2017 Please fix the linting problems. You can install pre-commit on your host, which can help fix them:

    pip install pre-commit
    # Move to the root path of the lmdeploy repository and execute the following command
    pre-commit install

You can run pre-commit run --all-files to check all files at once.
lmdeploy/serve/async_engine.py
Outdated
    @@ -136,6 +136,8 @@ async def generate(
    repetition_penalty=repetition_penalty,
    ignore_eos=ignore_eos,
    random_seed=seed if sequence_start else None):
    if outputs is None:
generate_openai should be updated too.
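For context, a minimal sketch of the kind of guard being suggested, assuming generate_openai contains a similar loop over the engine's streamed outputs (the function and names below are illustrative only, not the repository's actual code):

    from typing import AsyncIterator, Optional

    async def consume_outputs(output_stream: AsyncIterator[Optional[str]]):
        # Hypothetical consumer loop mirroring the guard added to `generate` above.
        async for outputs in output_stream:
            if outputs is None:
                # stream_infer may now yield an empty placeholder while its queue
                # is empty in non-stream mode; skip it instead of emitting it.
                continue
            yield outputs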
lmdeploy/turbomind/turbomind.py
Outdated
    @@ -329,6 +329,9 @@ def _broadcast_np(data, dtype, shape=(batch_size, )):
    while self.que.qsize() > 1:
        self.que.get()

    if stream_output == False and self.que.qsize() == 0:
        yield
I found this made GPU-Util quite low. And the benchmark went down to an unacceptable number.
> I found this made GPU-Util quite low. And the benchmark went down to an unacceptable number.

I found that the first version had this problem in streaming mode:

    if self.que.qsize() == 0:
        yield

So I modified it, and I tested that GPU utilization is 100% in both modes:

    if stream_output == False and self.que.qsize() == 0:
        yield

Anyway, I followed your advice and used streaming inference to process non-stream requests.
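As an aside, a minimal sketch of what "use streaming inference to process non-stream requests" can look like: consume every streamed chunk and keep only the final one (the names and the cumulative-output assumption below are mine, not lmdeploy's actual API):

    from typing import AsyncIterator, Optional

    async def non_stream_result(stream: AsyncIterator[Optional[str]]) -> Optional[str]:
        final = None
        async for out in stream:
            if out is None:   # empty placeholder yielded to keep the loop responsive
                continue
            final = out       # assume each chunk carries the cumulative output so far
        return final          # the last chunk is the complete result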
Another method for this is that we do ...
Force-pushed from 710e892 to 73e21f9.
Hi, I think this is a better solution, since my solution keeps the main thread spinning in the while loop doing little work. I tried this method and it works well. Does the openai generate function also need to be modified? I'm not familiar with it and don't know how to test it. If it needs more changes, you can refer to my changes and send a new PR to replace this one. Thank you.
Not a final solution; maybe there are other methods. @grimoire any comments?
The core problem is that the main thread waits on the output queue until some output data has been generated, so when a non-stream request comes in, the thread can't process new requests. I have a question: what is this line for?
Why do you drop all results until only one is left?
The turbomind instance is designed to process the session in its own threads. I think it is the coroutine that blocks the processing of continuous batching. Awaiting the queue get might help.
Assume we have a queue of outputs: we only output the last one.
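In other words, a small illustrative helper (not the repository's code; it just reproduces the drain loop from the diff above):

    import queue

    def latest_output(que: queue.Queue):
        # Drop everything except the newest entry; older intermediate results
        # are superseded by the later ones.
        while que.qsize() > 1:
            que.get()
        # Return the remaining (newest) entry; this blocks if the queue is empty.
        return que.get()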
Force-pushed from 73e21f9 to 0770aaf.
Can we wrap the queue-get while loop inside the wrapper?
The problem comes from the situation when the queue is empty. In non-stream mode, the queue is empty for a long time, until a sequence is finished. So this line waits for a long time and can't process new requests. That's the problem. I think this line is the cause in non-stream mode.
I see. I guess that Queue is not awaitable since there is another Queue in module ...
The code I posted above tries to wrap que.get() in an awaitable way, but it doesn't work, and I wonder why. Using multiple threads is OK, but the api_server file may have to be modified a lot, or it may need to be refactored.
An object is awaitable only when it has an __await__ method.
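For what it's worth, one common way to let a blocking queue.Queue.get() cooperate with the event loop is to off-load it to an executor. A minimal sketch under that assumption (not the approach this PR ends up taking):

    import asyncio
    import queue

    async def async_get(que: queue.Queue):
        loop = asyncio.get_running_loop()
        # The blocking get() runs in a worker thread; awaiting the returned future
        # keeps the event loop free to schedule other incoming requests meanwhile.
        return await loop.run_in_executor(None, que.get)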
Motivation
Bug fix: when stream=False is used, continuous batching does not work.
See details in this issue: #308
Modification
In non-streaming mode, the main thread, which is responsible for receiving requests, gets stuck in the queue-reading code (see the sketch after this paragraph), since self.que.get() waits until the queue is not empty.
But in non-stream mode the queue stays empty until a request is finished.
So a newly arriving request is not processed until the previous one is done, and batching does not work.
I modified this to avoid the long wait; see the commits for details.
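For reference, a sketch of the problematic pattern just described (the drain loop is taken from the diff earlier on this page; the final blocking get() and its variable name are my assumptions about the surrounding code):

    while self.que.qsize() > 1:
        self.que.get()            # keep only the newest entry
    outputs = self.que.get()      # blocks here; in non-stream mode the queue
                                  # stays empty until the whole sequence finishes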
I'm not sure whether there is a better way to fix this; if there is, please comment.
I have a question: what is this line for? Won't it cause output tokens to be lost?
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist