Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(GroupBy): adding group by functionality #1603

Merged
merged 6 commits into from
Feb 12, 2025

Conversation

scaliseraoul
Copy link
Contributor

@scaliseraoul scaliseraoul commented Feb 12, 2025


Important

Adds group_by functionality to SQL queries, updating schema validation, query builders, and data loaders, with corresponding tests.

  • Behavior:
    • Adds group_by parameter to create() in pandasai/__init__.py for grouping columns in SQL queries.
    • Updates SemanticLayerSchema in semantic_layer_schema.py to include group_by validation.
    • Implements _apply_grouping() in local_loader.py to handle DataFrame grouping and aggregation.
    • Modifies BaseQueryBuilder and ViewQueryBuilder to support GROUP BY clauses in SQL queries.
  • Validation:
    • Adds validation for group_by in SemanticLayerSchema to ensure only non-aggregated columns are included.
  • Tests:
    • Adds test_group_by.py to test group_by functionality in query builders.
    • Updates existing tests in test_loader.py and test_pandasai_init.py to cover new group_by logic.
  • Misc:
    • Minor import reordering in test_azure_openai.py.

This description was created by Ellipsis for 36c7505. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 36c7505 in 2 minutes and 12 seconds

More details
  • Looked at 810 lines of code in 11 files
  • Skipped 0 files when reviewing.
  • Skipped posting 14 drafted comments based on config settings.
1. pandasai/__init__.py:48
  • Draft comment:
    Docstring updated to include group_by; ensure user examples match new behavior and validation errors are descriptive.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. pandasai/core/cache.py:98
  • Draft comment:
    Using df.schema.name for cache key may lead to collisions if names aren’t unique; ensure schema naming uniqueness.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%
    The comment points out a real potential issue - if two different schemas have the same name, they would generate the same cache key even though they represent different data. This could lead to incorrect cache hits. However, the comment is somewhat speculative ("may lead") and asks for verification ("ensure") rather than suggesting a specific fix.
    The comment doesn't propose a concrete solution and falls into the pattern of asking the author to "ensure" something. Also, schema naming conventions might be defined elsewhere in the codebase.
    While the comment style isn't ideal, this is a significant change that could introduce subtle bugs. Moving from a hash-based key to a name-based key is a fundamental change in caching strategy.
    The comment should be rephrased to be more actionable, suggesting a specific solution like "Consider using both schema name and a hash of column names/types to ensure cache key uniqueness."
3. pandasai/data_loader/local_loader.py:75
  • Draft comment:
    Grouping logic correctly maps aggregation functions; verify that _apply_grouping gracefully handles cases with no aggregations.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50%
    None
4. pandasai/data_loader/semantic_layer_schema.py:305
  • Draft comment:
    Validation in _validate_group_by_columns is clear; consider rewording error messages to be more user‐friendly.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50%
    None
5. pandasai/query_builders/base_query_builder.py:16
  • Draft comment:
    Group by clause is added consistently; ensure column normalization is applied uniformly in all builders.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50%
    None
6. pandasai/query_builders/view_query_builder.py:70
  • Draft comment:
    SQL injection prevention via sanitize_view_column_name is implemented; double-check alias normalization to avoid unexpected SQL syntax.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50%
    None
7. tests/unit_tests/query_builders/test_group_by.py:117
  • Draft comment:
    Group by tests cover both correct and invalid schemas; ensure edge cases (e.g., empty group_by) are also tested.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50%
    None
8. tests/unit_tests/test_pandasai_init.py:22
  • Draft comment:
    Unit tests for create function now include group_by functionality; ensure new examples are in sync with updated docs.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50%
    None
9. pandasai/core/cache.py:98
  • Draft comment:
    Using 'df.schema.name' for the cache key may not uniquely identify different dataframes. Consider retaining a content-based hash (e.g., the previous 'df.column_hash') to avoid cache collisions.
  • Reason this comment was not posted:
    Marked as duplicate.
10. pandasai/data_loader/local_loader.py:75
  • Draft comment:
    In _apply_grouping, note that grouping only happens if any column has an aggregation expression. If group_by is set but no aggregation is defined, the DataFrame returns unchanged. It may be useful to clarify this behavior in the docstring.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is purely informative and suggests updating the docstring to clarify behavior. It doesn't provide a specific code suggestion or address a potential issue directly related to the code changes. Therefore, it violates the rule against making purely informative comments.
11. pandasai/query_builders/base_query_builder.py:51
  • Draft comment:
    The aggregation and alias handling in _get_columns is clean and concise. No issues detected.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
12. pandasai/query_builders/view_query_builder.py:32
  • Draft comment:
    Good use of alias normalization in _get_group_by_columns to prevent injection. Verify consistency across implementations.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
13. tests/unit_tests/query_builders/test_group_by.py:119
  • Draft comment:
    Comprehensive tests in test_invalid_group_by validate error conditions for group_by configuration. Excellent test coverage.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
14. tests/unit_tests/test_pandasai_init.py:424
  • Draft comment:
    Tests for dataset creation with description verify that the description is properly set. The use of clear assertions improves maintainability.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None

Workflow ID: wflow_WCveOlhKJZhL0Ksh


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

scaliseraoul and others added 2 commits February 12, 2025 15:39
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Copy link

codebeaver-ai bot commented Feb 12, 2025

I opened a Pull Request with the following:

🔄 3 tests added.
🐛 No bugs detected in your changes
🛠️ 3/3 tests passed

🔄 Test Updates

I've added 3 tests. They all pass ☑️
New Tests:

  • tests/unit_tests/core/code_generation/test_code_cleaning.py
  • tests/test_local_loader.py
  • tests/unit_tests/dataframe/test_semantic_layer_schema.py

No existing tests required updates.

🐛 Bug Detection

No bugs detected in your changes. Good job!

🛠️ Test Results

All 3 tests passed.

☂️ Coverage Improvements

Coverage improvements by file:

  • tests/unit_tests/core/code_generation/test_code_cleaning.py

    New coverage: 75.61%
    Improvement: +0.00%

  • tests/test_local_loader.py

    New coverage: 79.73%
    Improvement: +79.73%

  • tests/unit_tests/dataframe/test_semantic_layer_schema.py

    New coverage: 94.33%
    Improvement: +5.15%

🎨 Final Touches

  • I ran the hooks included in the pre-commit config.

Settings | Logs | CodeBeaver

@gventuri gventuri merged commit 71145e7 into sinaptik-ai:main Feb 12, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants