misc: playing with rl/verifiers #445
base: master
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master     #445   +/-   ##
=======================================
  Coverage   67.21%   67.21%
=======================================
  Files          71       71
  Lines        6304     6304
=======================================
  Hits         4237     4237
  Misses       2067     2067
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
❌ Changes requested. Reviewed everything up to 58d1c2d in 3 minutes and 11 seconds
More details
- Looked at
299
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
8
drafted comments based on config settings.
1. scripts/gptme_verifiers.py:164
   - Draft comment: Using a lambda that always returns True for tool confirmation can be a security risk.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). This is an RL training environment, not a production system: the tools are executed in a controlled training context where automatic confirmation is likely intentional for the learning process, so the security risk is minimal and the comment, while technically correct, misses that context. I could be wrong about the security implications; even in a training environment, automatically confirming all tool executions could be dangerous if malicious inputs are provided. However, this is clearly a controlled RL environment where tool execution is part of the training process, and adding confirmation checks would interfere with the automated training, so the tradeoff appears intentional and appropriate for this use case. The comment should be deleted because the automatic confirmation is an intentional design choice appropriate for this training environment, not a security oversight.
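   If stricter confirmation were ever wanted here, a minimal sketch could gate auto-confirmation on an allowlist instead of always returning True. This assumes the confirm callback takes the prompt string and returns a bool, as the draft comment implies; the tool names are hypothetical:

   ```python
   # Hypothetical sketch: auto-confirm only allowlisted tools instead of `lambda _: True`.
   ALLOWED_TOOLS = {"shell", "save", "read"}  # assumption: tool names used in training tasks

   def confirm_tool(prompt: str) -> bool:
       """Auto-confirm only if the confirmation prompt references an allowlisted tool."""
       return any(tool in prompt for tool in ALLOWED_TOOLS)

   # Passed wherever the always-true lambda is used today, e.g.:
   # execute_msg(msg, confirm_tool)
   ```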
2. scripts/gptme_verifiers.py:103
   - Draft comment: Shallow copying the files dict may be insufficient if nested structures are expected.
   - Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
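   For reference, a minimal sketch of the distinction this draft comment points at, assuming the environment state holds a files dict whose values may themselves be mutable:

   ```python
   import copy

   state = {"files": {"main.py": {"content": "print('hi')"}}}

   shallow = dict(state["files"])        # copies only the top-level mapping
   deep = copy.deepcopy(state["files"])  # also copies nested dicts/lists

   shallow["main.py"]["content"] = "mutated"
   # The shallow copy shares nested objects, so the original is mutated too:
   assert state["files"]["main.py"]["content"] == "mutated"
   # The deep copy is unaffected:
   assert deep["main.py"]["content"] == "print('hi')"
   ```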
3. scripts/gptme_verifiers.py:258
   - Draft comment: Use the defined 'max_history' instead of hardcoding '10' for the tool history length.
   - Reason this comment was not posted: Marked as duplicate.
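   A small self-contained sketch of the suggested change; the history structure is an assumption based on the comment, not the actual code:

   ```python
   def recent_tool_history(tool_history: list[str], max_history: int = 10) -> list[str]:
       """Return the most recent tool calls, bounded by the configured limit
       rather than a hardcoded literal 10."""
       return tool_history[-max_history:]

   # Example:
   assert recent_tool_history([f"tool{i}" for i in range(15)], max_history=10) == [
       f"tool{i}" for i in range(5, 15)
   ]
   ```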
4. scripts/gptme_verifiers.py:209
   - Draft comment: Consider computing redundant tool usage penalties in a single pass rather than subtracting within the loop.
   - Reason this comment was not posted: Marked as duplicate.
5. scripts/gptme_verifiers.py:164
   - Draft comment: The always-true lambda passed to execute_msg may be risky for untrusted inputs; ensure this is intentional.
   - Reason this comment was not posted: Marked as duplicate.
6. scripts/gptme_verifiers.py:238
   - Draft comment: Clarify the partial-success condition in the task reward; verify that using 'elif any(results)' is the intended behavior.
   - Reason this comment was not posted: Comment did not seem useful (usefulness confidence = 0% <= threshold 50%). The comment asks the author to verify their intention, which is against the rules; it does not provide a specific suggestion or ask for a specific test to be written, so it should be removed.
7. scripts/gptme_verifiers.py:125
   - Draft comment: The action 'tool' index is used without bounds checking; validate input to prevent a potential IndexError.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). While it is technically true that there is no bounds checking, this is an internal RL environment where the action space is controlled: the example usage shows proper bounds checking with randint, and the class inherits from MultiStepEnv, which likely handles action-space validation. Without seeing the verifiers framework code, I can't be 100% certain how actions are validated, and I may be making assumptions about the MultiStepEnv parent class; there could be ways to call step() directly with invalid actions. Even then, that would be misuse of the internal API, and the code follows standard RL environment patterns where action-space validation happens at a higher level. The comment should be deleted: bounds checking is handled appropriately through the RL framework's action-space mechanisms.
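   If explicit validation were still wanted at the top of step(), a minimal sketch could look like the following; the action layout and tool list are assumptions based on the draft comment, not the actual code:

   ```python
   def resolve_tool(action: dict, tools: list[str]) -> str:
       """Validate the 'tool' index in an action dict and return the tool name."""
       idx = action.get("tool")
       if not isinstance(idx, int) or not 0 <= idx < len(tools):
           raise ValueError(f"invalid tool index: {idx!r} (expected 0..{len(tools) - 1})")
       return tools[idx]

   # Example:
   tools = ["shell", "save", "read"]
   assert resolve_tool({"tool": 1}, tools) == "save"
   ```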
8. scripts/gptme_verifiers.py:180
   - Draft comment: Ensure error values in tool results are normalized (e.g., None vs. empty string) to avoid misclassification in reward calculations.
   - Reason this comment was not posted: Comment looked like it was already resolved.
Workflow ID: wflow_B0RWI5jJMgULEV15

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
```python
tool_name = tool.split(":")[0]
tool_counts[tool_name] = tool_counts.get(tool_name, 0) + 1
if tool_counts[tool_name] > 2:
    reward -= 0.2
```
Redundant tool usage penalty loop may over-penalize due to per-iteration subtraction. Consider calculating penalty based on counts instead.
Suggested change:

```diff
- reward -= 0.2
+ reward -= 0.2 * (tool_counts[tool_name] - 2)
```
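An alternative, following the "calculate penalty based on counts" idea, is to compute the penalty in a single pass over the final counts after the loop. A self-contained sketch; the history format and the thresholds are assumptions taken from the excerpt above:

```python
from collections import Counter

def redundancy_penalty(tool_history: list[str], free_uses: int = 2, per_extra: float = 0.2) -> float:
    """Compute the redundant-tool-usage penalty from final counts in one pass,
    charging per_extra for each use of a tool beyond its first free_uses calls."""
    counts = Counter(tool.split(":")[0] for tool in tool_history)
    return sum(per_extra * (n - free_uses) for n in counts.values() if n > free_uses)

# Example: "shell" used 4 times -> 2 extra uses -> penalty of 0.4
history = ["shell:ls", "shell:pwd", "shell:cat", "shell:echo", "read:file"]
assert abs(redundancy_penalty(history) - 0.4) < 1e-9
```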
```python
# Define limits
self.max_history = 10
self.max_args_length = 1024
```
Unused variable 'max_args_length'; consider enforcing tool argument length based on this limit.
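A sketch of one way the limit could be enforced; where tool arguments are received is an assumption, and the actual hook would live in the environment's tool-execution path:

```python
def validate_tool_args(args: str, max_args_length: int = 1024) -> str:
    """Reject tool arguments that exceed the configured limit."""
    if len(args) > max_args_length:
        raise ValueError(f"tool arguments too long: {len(args)} > {max_args_length}")
    return args

# Example:
assert validate_tool_args("ls -la") == "ls -la"
```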
```python
raise RuntimeError("Environment not initialized, call reset() first")

return {
    "files": str(self.current_state["files"]),
```
Observation returns 'files' as a string; consider returning the dictionary directly for richer downstream processing.
"files": str(self.current_state["files"]), | |
"files": self.current_state["files"], |
```python
self.current_state: State | None = None

# Store tool format
self.tool_format = tool_format
```
The 'tool_format' attribute is stored but not used; consider integrating it into tool message formatting or removing it.
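A hypothetical sketch of how the stored attribute could be wired in; the prompt-building helper and the format names here are assumptions for illustration, not the actual gptme API:

```python
def build_tool_prompt(tool_names: list[str], tool_format: str) -> str:
    """Render the available tools in the configured format (hypothetical helper)."""
    if tool_format == "markdown":
        return "Available tools:\n" + "\n".join(f"- `{name}`" for name in tool_names)
    if tool_format == "xml":
        return "<tools>" + "".join(f"<tool>{name}</tool>" for name in tool_names) + "</tools>"
    raise ValueError(f"unsupported tool_format: {tool_format!r}")

# e.g. inside the environment, self.tool_format would select the rendering:
print(build_tool_prompt(["shell", "save"], "markdown"))
```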
Didn't get it to run, was just vibecoding with gptme.
Important

Adds gptme_verifiers.py to implement an RL environment for training LLMs using gptme and verifiers, with classes for tool execution, state management, and reward calculation.

- Adds gptme_verifiers.py to implement an RL environment for training LLMs with gptme and verifiers.
- GPTMeRLEnv: Manages the RL environment, tool execution, and reward calculation.
- ToolResult, State: Define types for tool execution results and environment state.
- ToolUseReward, TaskReward: Define reward structures for tool usage and task completion.
- reset(), step(), _execute_tool(), _calculate_tool_reward(), _calculate_task_reward(), _get_observation(), _is_done(): Manage the environment lifecycle and reward logic.
- main(): Demonstrates example usage of the environment.

This description was created by Ellipsis for 58d1c2d. It will automatically update as commits are pushed.
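For orientation, a minimal sketch of what a driver loop like main() might look like; the constructor arguments, action layout, and step() return signature are assumptions rather than the actual code:

```python
import random

def run_episode(env, tools: list[str], max_steps: int = 10) -> float:
    """Roll out one episode with random tool choices and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Sample a valid tool index, mirroring the bounds-checked randint usage
        # mentioned in the review above.
        action = {"tool": random.randint(0, len(tools) - 1), "args": ""}
        obs, reward, done = env.step(action)  # assumed (obs, reward, done) return
        total_reward += reward
        if done:
            break
    return total_reward
```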