misc: playing with rl/verifiers #445
base: master
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master     #445   +/-   ##
=======================================
  Coverage   67.21%   67.21%
=======================================
  Files          71       71
  Lines        6304     6304
=======================================
  Hits         4237     4237
  Misses       2067     2067
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
❌ Changes requested. Reviewed everything up to 58d1c2d in 3 minutes and 11 seconds
More details
- Looked at
299
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
8
drafted comments based on config settings.
1. scripts/gptme_verifiers.py:164
   - Draft comment: Using a lambda that always returns True for tool confirmation can be a security risk.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). This is an RL training environment, not a production system: the tools are executed in a controlled training context where automatic confirmation is likely intentional for the learning process, so the security risk is minimal and the comment, while technically correct, misses that context. I could be wrong about the security implications; even in a training environment, automatically confirming all tool executions could be dangerous if malicious inputs are provided. However, this is clearly a controlled RL environment where tool execution is part of the training process, and adding confirmation checks would interfere with the automated training, so the tradeoff appears intentional and appropriate for this use case. The comment should be deleted because the automatic confirmation is an intentional design choice appropriate for this training environment, not a security oversight.
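   If stricter confirmation were ever wanted here, a minimal sketch could gate auto-confirmation on an allowlist instead of always returning True. This assumes the confirm callback takes the prompt string and returns a bool, as the draft comment implies; the tool names are hypothetical:

   ```python
   # Hypothetical sketch: auto-confirm only allowlisted tools instead of `lambda _: True`.
   ALLOWED_TOOLS = {"shell", "save", "read"}  # assumption: tool names used in training tasks

   def confirm_tool(prompt: str) -> bool:
       """Auto-confirm only if the confirmation prompt references an allowlisted tool."""
       return any(tool in prompt for tool in ALLOWED_TOOLS)

   # Passed wherever the always-true lambda is used today, e.g.:
   # execute_msg(msg, confirm_tool)
   ```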
2. scripts/gptme_verifiers.py:103
   - Draft comment: Shallow copying the files dict may be insufficient if nested structures are expected.
   - Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
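   For reference, a minimal sketch of the distinction this draft comment points at, assuming the environment state holds a files dict whose values may themselves be mutable:

   ```python
   import copy

   state = {"files": {"main.py": {"content": "print('hi')"}}}

   shallow = dict(state["files"])        # copies only the top-level mapping
   deep = copy.deepcopy(state["files"])  # also copies nested dicts/lists

   shallow["main.py"]["content"] = "mutated"
   # The shallow copy shares nested objects, so the original is mutated too:
   assert state["files"]["main.py"]["content"] == "mutated"
   # The deep copy is unaffected:
   assert deep["main.py"]["content"] == "print('hi')"
   ```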
3. scripts/gptme_verifiers.py:258
   - Draft comment: Use the defined 'max_history' instead of hardcoding '10' for the tool history length.
   - Reason this comment was not posted: Marked as duplicate.
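   A small self-contained sketch of the suggested change; the history structure is an assumption based on the comment, not the actual code:

   ```python
   def recent_tool_history(tool_history: list[str], max_history: int = 10) -> list[str]:
       """Return the most recent tool calls, bounded by the configured limit
       rather than a hardcoded literal 10."""
       return tool_history[-max_history:]

   # Example:
   assert recent_tool_history([f"tool{i}" for i in range(15)], max_history=10) == [
       f"tool{i}" for i in range(5, 15)
   ]
   ```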
4. scripts/gptme_verifiers.py:209
   - Draft comment: Consider computing redundant tool usage penalties in a single pass rather than subtracting within the loop.
   - Reason this comment was not posted: Marked as duplicate.
5. scripts/gptme_verifiers.py:164
   - Draft comment: The always-true lambda passed to execute_msg may be risky for untrusted inputs; ensure this is intentional.
   - Reason this comment was not posted: Marked as duplicate.
6. scripts/gptme_verifiers.py:238
   - Draft comment: Clarify the partial-success condition in the task reward; verify that using 'elif any(results)' is the intended behavior.
   - Reason this comment was not posted: Comment did not seem useful (usefulness confidence = 0% <= threshold 50%). The comment asks the author to verify their intention, which is against the rules; it does not provide a specific suggestion or ask for a specific test to be written, so it should be removed.
7. scripts/gptme_verifiers.py:125
   - Draft comment: The action 'tool' index is used without bounds checking; validate input to prevent a potential IndexError.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). While it is technically true that there is no bounds checking, this is an internal RL environment where the action space is controlled: the example usage shows proper bounds checking with randint, and the class inherits from MultiStepEnv, which likely handles action-space validation. Without seeing the verifiers framework code, I can't be 100% certain how actions are validated, and I may be making assumptions about the MultiStepEnv parent class; there could be ways to call step() directly with invalid actions. Even then, that would be misuse of the internal API, and the code follows standard RL environment patterns where action-space validation happens at a higher level. The comment should be deleted: bounds checking is handled appropriately through the RL framework's action-space mechanisms.
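   If explicit validation were still wanted at the top of step(), a minimal sketch could look like the following; the action layout and tool list are assumptions based on the draft comment, not the actual code:

   ```python
   def resolve_tool(action: dict, tools: list[str]) -> str:
       """Validate the 'tool' index in an action dict and return the tool name."""
       idx = action.get("tool")
       if not isinstance(idx, int) or not 0 <= idx < len(tools):
           raise ValueError(f"invalid tool index: {idx!r} (expected 0..{len(tools) - 1})")
       return tools[idx]

   # Example:
   tools = ["shell", "save", "read"]
   assert resolve_tool({"tool": 1}, tools) == "save"
   ```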
8. scripts/gptme_verifiers.py:180
   - Draft comment: Ensure error values in tool results are normalized (e.g., None vs. empty string) to avoid misclassification in reward calculations.
   - Reason this comment was not posted: Comment looked like it was already resolved.
Workflow ID: wflow_B0RWI5jJMgULEV15

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
```python
tool_name = tool.split(":")[0]
tool_counts[tool_name] = tool_counts.get(tool_name, 0) + 1
if tool_counts[tool_name] > 2:
    reward -= 0.2
```
Redundant tool usage penalty loop may over-penalize due to per-iteration subtraction. Consider calculating penalty based on counts instead.
Suggested change:

```diff
- reward -= 0.2
+ reward -= 0.2 * (tool_counts[tool_name] - 2)
```
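An alternative, following the "calculate penalty based on counts" idea, is to compute the penalty in a single pass over the final counts after the loop. A self-contained sketch; the history format and the thresholds are assumptions taken from the excerpt above:

```python
from collections import Counter

def redundancy_penalty(tool_history: list[str], free_uses: int = 2, per_extra: float = 0.2) -> float:
    """Compute the redundant-tool-usage penalty from final counts in one pass,
    charging per_extra for each use of a tool beyond its first free_uses calls."""
    counts = Counter(tool.split(":")[0] for tool in tool_history)
    return sum(per_extra * (n - free_uses) for n in counts.values() if n > free_uses)

# Example: "shell" used 4 times -> 2 extra uses -> penalty of 0.4
history = ["shell:ls", "shell:pwd", "shell:cat", "shell:echo", "read:file"]
assert abs(redundancy_penalty(history) - 0.4) < 1e-9
```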
```python
# Define limits
self.max_history = 10
self.max_args_length = 1024
```
Unused variable 'max_args_length'; consider enforcing tool argument length based on this limit.
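A sketch of one way the limit could be enforced; where tool arguments are received is an assumption, and the actual hook would live in the environment's tool-execution path:

```python
def validate_tool_args(args: str, max_args_length: int = 1024) -> str:
    """Reject tool arguments that exceed the configured limit."""
    if len(args) > max_args_length:
        raise ValueError(f"tool arguments too long: {len(args)} > {max_args_length}")
    return args

# Example:
assert validate_tool_args("ls -la") == "ls -la"
```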
```python
raise RuntimeError("Environment not initialized, call reset() first")

return {
    "files": str(self.current_state["files"]),
```
Observation returns 'files' as a string; consider returning the dictionary directly for richer downstream processing.
"files": str(self.current_state["files"]), | |
"files": self.current_state["files"], |
```python
self.current_state: State | None = None

# Store tool format
self.tool_format = tool_format
```
The 'tool_format' attribute is stored but not used; consider integrating it into tool message formatting or removing it.
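A hypothetical sketch of how the stored attribute could be wired in; the prompt-building helper and the format names here are assumptions for illustration, not the actual gptme API:

```python
def build_tool_prompt(tool_names: list[str], tool_format: str) -> str:
    """Render the available tools in the configured format (hypothetical helper)."""
    if tool_format == "markdown":
        return "Available tools:\n" + "\n".join(f"- `{name}`" for name in tool_names)
    if tool_format == "xml":
        return "<tools>" + "".join(f"<tool>{name}</tool>" for name in tool_names) + "</tools>"
    raise ValueError(f"unsupported tool_format: {tool_format!r}")

# e.g. inside the environment, self.tool_format would select the rendering:
print(build_tool_prompt(["shell", "save"], "markdown"))
```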
Didn't get it to run, was just vibecoding with gptme.
Important

Adds gptme_verifiers.py to implement an RL environment for training LLMs using gptme and verifiers, with classes for tool execution, state management, and reward calculation.

- Adds gptme_verifiers.py to implement an RL environment for training LLMs with gptme and verifiers.
- GPTMeRLEnv: Manages the RL environment, tool execution, and reward calculation.
- ToolResult, State: Define types for tool execution results and environment state.
- ToolUseReward, TaskReward: Define reward structures for tool usage and task completion.
- reset(), step(), _execute_tool(), _calculate_tool_reward(), _calculate_task_reward(), _get_observation(), _is_done(): Manage the environment lifecycle and reward logic.
- main(): Demonstrates example usage of the environment.

This description was created by Ellipsis for 58d1c2d. It will automatically update as commits are pushed.
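For orientation, a minimal sketch of what a driver loop like main() might look like; the constructor arguments, action layout, and step() return signature are assumptions rather than the actual code:

```python
import random

def run_episode(env, tools: list[str], max_steps: int = 10) -> float:
    """Roll out one episode with random tool choices and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Sample a valid tool index, mirroring the bounds-checked randint usage
        # mentioned in the review above.
        action = {"tool": random.randint(0, len(tools) - 1), "args": ""}
        obs, reward, done = env.step(action)  # assumed (obs, reward, done) return
        total_reward += reward
        if done:
            break
    return total_reward
```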