
misc: playing with rl/verifiers #445 (Draft)

Wants to merge 1 commit into master.
Conversation

@ErikBjare (Owner) commented Feb 17, 2025

Didn't get it to run, was just vibecoding with gptme.


Important

Adds gptme_verifiers.py to implement an RL environment for training LLMs using gptme and verifiers, with classes for tool execution, state management, and reward calculation.

  • New Script:
    • Adds gptme_verifiers.py to implement an RL environment for training LLMs with gptme and verifiers.
  • Classes:
    • GPTMeRLEnv: Manages RL environment, tool execution, and reward calculation.
    • ToolResult, State: Define types for tool execution results and environment state.
    • ToolUseReward, TaskReward: Define reward structures for tool usage and task completion.
  • Functions:
    • reset(), step(), _execute_tool(), _calculate_tool_reward(), _calculate_task_reward(), _get_observation(), _is_done(): Manage environment lifecycle and reward logic.
    • main(): Demonstrates example usage of the environment.

This description was created by Ellipsis for 58d1c2d. It will automatically update as commits are pushed.
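For orientation, here is a minimal sketch of the structure the summary describes. It is not the actual scripts/gptme_verifiers.py; the TypedDict fields, base-class-free layout, and method signatures are assumptions inferred from the summary above.

```python
# Minimal sketch of the structure summarized above -- not the actual
# scripts/gptme_verifiers.py. Field names and signatures are assumptions.
from typing import Any, TypedDict

class ToolResult(TypedDict):
    tool: str            # e.g. "shell:ls"
    output: str
    error: str | None

class State(TypedDict):
    files: dict[str, str]
    tool_history: list[str]

class GPTMeRLEnv:
    def reset(self) -> dict[str, Any]:
        """Start a fresh episode and return the first observation."""
        self.current_state: State = {"files": {}, "tool_history": []}
        return self._get_observation()

    def step(self, action: dict) -> tuple[dict, float, bool, dict]:
        """Run one tool call; return (observation, reward, done, info)."""
        result = self._execute_tool(action)
        reward = self._calculate_tool_reward(result)
        reward += self._calculate_task_reward()
        return self._get_observation(), reward, self._is_done(), {}

    def _execute_tool(self, action: dict) -> ToolResult: ...
    def _calculate_tool_reward(self, result: ToolResult) -> float: ...
    def _calculate_task_reward(self) -> float: ...
    def _get_observation(self) -> dict[str, Any]: ...
    def _is_done(self) -> bool: ...
```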

@ErikBjare ErikBjare marked this pull request as draft February 17, 2025 22:28
@codecov-commenter commented Feb 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.21%. Comparing base (8845a1a) to head (58d1c2d).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #445   +/-   ##
=======================================
  Coverage   67.21%   67.21%           
=======================================
  Files          71       71           
  Lines        6304     6304           
=======================================
  Hits         4237     4237           
  Misses       2067     2067           
Flag                                 Coverage Δ
anthropic/claude-3-haiku-20240307    65.97% <ø> (ø)
openai/gpt-4o-mini                   65.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@ellipsis-dev bot (Contributor) left a comment


❌ Changes requested. Reviewed everything up to 58d1c2d in 3 minutes and 11 seconds

More details
  • Looked at 299 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 8 drafted comments based on config settings.
1. scripts/gptme_verifiers.py:164
  • Draft comment:
    Using a lambda that always returns True for tool confirmation can be a security risk.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    This is a RL training environment, not a production system. The tools are being executed in a controlled training context where automatic confirmation is likely intentional for the learning process. The security risk is minimal since this is meant for training/testing. The comment, while technically correct, may be missing the context that this is a training environment.
    I could be wrong about the security implications - even in a training environment, automatically confirming all tool executions could potentially be dangerous if malicious inputs are provided.
    However, this is clearly a controlled RL environment where the tool executions are part of the training process. Adding confirmation checks would interfere with the automated training. The security tradeoff appears intentional and appropriate for this specific use case.
    The comment should be deleted because the automatic confirmation is an intentional design choice appropriate for this training environment context, not a security oversight.
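For comparison, a hedged sketch of what a guarded confirmation callback could look like. The str -> bool callback shape and the allowlist contents are assumptions for illustration, not gptme's confirmed API.

```python
# Hedged sketch: an allowlist-based confirm callback instead of
# `lambda *_: True`. The callback shape and allowlist are assumptions.
ALLOWED_TOOLS = {"read", "save", "shell"}  # hypothetical allowlist

def confirm(prompt: str) -> bool:
    # Auto-approve only prompts that mention an allowlisted tool;
    # anything else is rejected rather than blindly confirmed.
    return any(tool in prompt for tool in ALLOWED_TOOLS)
```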
2. scripts/gptme_verifiers.py:103
  • Draft comment:
    Shallow copying files dict may be insufficient if nested structures are expected.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
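A short illustration of the difference this draft comment is pointing at, using a hypothetical nested files dict:

```python
import copy

files = {"main.py": {"content": "print('hi')", "mtime": 0}}

shallow = dict(files)        # top level copied; nested dicts still shared
deep = copy.deepcopy(files)  # nested dicts copied too

shallow["main.py"]["mtime"] = 1
print(files["main.py"]["mtime"])  # 1 -- mutation leaked through shallow copy
print(deep["main.py"]["mtime"])   # 0 -- deep copy is isolated
```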
3. scripts/gptme_verifiers.py:258
  • Draft comment:
    Use the defined 'max_history' instead of hardcoding '10' for tool history length.
  • Reason this comment was not posted:
    Marked as duplicate.
4. scripts/gptme_verifiers.py:209
  • Draft comment:
    Consider computing redundant tool usage penalties in a single pass rather than subtracting within the loop.
  • Reason this comment was not posted:
    Marked as duplicate.
5. scripts/gptme_verifiers.py:164
  • Draft comment:
    The always-true lambda passed to execute_msg may be risky for untrusted inputs; ensure this is intentional.
  • Reason this comment was not posted:
    Marked as duplicate.
6. scripts/gptme_verifiers.py:238
  • Draft comment:
    Clarify partial success condition in task reward; verify that using 'elif any(results)' is the intended behavior.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the author to verify their intention, which is against the rules. It does not provide a specific suggestion or ask for a specific test to be written. Therefore, it should be removed.
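For concreteness, one hypothetical shape of the branch being questioned; the reward values and function name are illustrative, not the PR's actual code:

```python
# Hypothetical shape of the branch under review; values are illustrative.
def task_reward(results: list[bool]) -> float:
    if all(results):
        return 1.0        # every check passed: full reward
    elif any(results):    # the `elif any(results)` branch in question:
        return 0.3        # partial credit for a partially solved task
    return 0.0
```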
7. scripts/gptme_verifiers.py:125
  • Draft comment:
    Action 'tool' index is used without bounds checking; validate input to prevent potential IndexError.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    While technically true that there's no bounds checking, this is an internal RL environment where the action space is controlled. The example usage shows proper bounds checking with randint. The class inherits from MultiStepEnv which likely handles action space validation. Without seeing the verifiers framework code, I can't be 100% certain how actions are validated.
    I may be making assumptions about the MultiStepEnv parent class. There could be ways to call step() directly with invalid actions.
    Even if step() is called directly, this would be a misuse of the internal API. The code follows standard RL environment patterns where action space validation happens at a higher level.
    The comment should be deleted. While technically correct, bounds checking is handled appropriately through the RL framework's action space mechanisms.
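A hedged sketch of the guard the draft comment had in mind; `available_tools` is a placeholder for however the environment actually stores its tool list:

```python
# Hypothetical guard; `available_tools` stands in for the env's real
# tool list attribute.
def _get_tool(self, action: dict) -> str:
    idx = action["tool"]
    if not 0 <= idx < len(self.available_tools):
        raise IndexError(
            f"tool index {idx} out of range (0..{len(self.available_tools) - 1})"
        )
    return self.available_tools[idx]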
8. scripts/gptme_verifiers.py:180
  • Draft comment:
    Ensure error values in tool results are normalized (e.g., None vs empty string) to avoid misclassification in reward calculations.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
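For illustration, the kind of normalization the draft comment describes:

```python
# Sketch: collapse "" and None into a single "no error" value before
# any reward logic inspects the result.
def normalize_error(error: str | None) -> str | None:
    return error or None  # "" -> None; non-empty strings pass through

# Downstream reward code can then test `result["error"] is None` safely.
```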

Workflow ID: wflow_B0RWI5jJMgULEV15


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

tool_name = tool.split(":")[0]
tool_counts[tool_name] = tool_counts.get(tool_name, 0) + 1
if tool_counts[tool_name] > 2:
    reward -= 0.2

Redundant tool usage penalty loop may over-penalize due to per-iteration subtraction. Consider calculating penalty based on counts instead.

Suggested change:
-    reward -= 0.2
+    reward -= 0.2 * (tool_counts[tool_name] - 2)
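A quick worked version of the count-based calculation, applied once after counting rather than inside the loop (tool names are made up):

```python
# Count-based penalty computed in one pass after counting; 0.2 per use
# beyond the second, matching the snippet above.
tool_counts: dict[str, int] = {}
for tool in ["shell:ls", "shell:ls", "shell:cat", "shell:ls", "shell:ls"]:
    name = tool.split(":")[0]
    tool_counts[name] = tool_counts.get(name, 0) + 1

penalty = sum(0.2 * (n - 2) for n in tool_counts.values() if n > 2)
# "shell" used 5 times -> 3 excess uses -> penalty of 0.6, subtracted
# once at the end instead of on every loop iteration.
```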


# Define limits
self.max_history = 10
self.max_args_length = 1024

Unused variable 'max_args_length'; consider enforcing tool argument length based on this limit.
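A minimal sketch of enforcing the limit, assuming tool arguments arrive as a single string (the method name is hypothetical):

```python
# Sketch of enforcing the otherwise-unused limit; assumes string args.
def _check_args(self, args: str) -> str:
    if len(args) > self.max_args_length:
        raise ValueError(
            f"tool arguments exceed max_args_length ({self.max_args_length})"
        )
    return args
```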

raise RuntimeError("Environment not initialized, call reset() first")

return {
    "files": str(self.current_state["files"]),

Observation returns 'files' as a string; consider returning the dictionary directly for richer downstream processing.

Suggested change:
-    "files": str(self.current_state["files"]),
+    "files": self.current_state["files"],

self.current_state: State | None = None

# Store tool format
self.tool_format = tool_format

The 'tool_format' attribute is stored but not used; consider integrating it into tool message formatting or removing it.
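If kept, one plausible (hypothetical) use of the attribute is in the system prompt that tells the model how to emit tool calls; gptme's real prompt plumbing may differ:

```python
# Hypothetical integration of the stored attribute.
def _system_prompt(self) -> str:
    return (
        "You can use tools by emitting tool calls "
        f"in the {self.tool_format!r} format."
    )
```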
