feature(GroupBy): adding group by functionality #1603

scaliseraoul · 2025-02-12T14:31:38Z

Tests added and passed if fixing a bug or adding a new feature.
All code checks passed.

Important

Adds group_by functionality to SQL queries, updating schema validation, query builders, and data loaders, with corresponding tests.

Behavior:
- Adds group_by parameter to create() in pandasai/__init__.py for grouping columns in SQL queries.
- Updates SemanticLayerSchema in semantic_layer_schema.py to include group_by validation.
- Implements _apply_grouping() in local_loader.py to handle DataFrame grouping and aggregation.
- Modifies BaseQueryBuilder and ViewQueryBuilder to support GROUP BY clauses in SQL queries.
Validation:
- Adds validation for group_by in SemanticLayerSchema to ensure only non-aggregated columns are included.
Tests:
- Adds test_group_by.py to test group_by functionality in query builders.
- Updates existing tests in test_loader.py and test_pandasai_init.py to cover new group_by logic.
Misc:
- Minor import reordering in test_azure_openai.py.

^{This description was created by}^{for 36c7505. It will automatically update as commits are pushed.}

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

ellipsis-dev

❌ Changes requested. Reviewed everything up to 36c7505 in 2 minutes and 12 seconds

More details

Looked at 810 lines of code in 11 files
Skipped 0 files when reviewing.
Skipped posting 14 drafted comments based on config settings.

1. pandasai/__init__.py:48

Draft comment:
Docstring updated to include group_by; ensure user examples match new behavior and validation errors are descriptive.
Reason this comment was not posted:
Comment looked like it was already resolved.

2. pandasai/core/cache.py:98

Draft comment:
Using df.schema.name for cache key may lead to collisions if names aren’t unique; ensure schema naming uniqueness.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%
The comment points out a real potential issue - if two different schemas have the same name, they would generate the same cache key even though they represent different data. This could lead to incorrect cache hits. However, the comment is somewhat speculative ("may lead") and asks for verification ("ensure") rather than suggesting a specific fix.
The comment doesn't propose a concrete solution and falls into the pattern of asking the author to "ensure" something. Also, schema naming conventions might be defined elsewhere in the codebase.
While the comment style isn't ideal, this is a significant change that could introduce subtle bugs. Moving from a hash-based key to a name-based key is a fundamental change in caching strategy.
The comment should be rephrased to be more actionable, suggesting a specific solution like "Consider using both schema name and a hash of column names/types to ensure cache key uniqueness."

3. pandasai/data_loader/local_loader.py:75

Draft comment:
Grouping logic correctly maps aggregation functions; verify that _apply_grouping gracefully handles cases with no aggregations.
Reason this comment was not posted:
Confidence changes required: 20% <= threshold 50%
None

4. pandasai/data_loader/semantic_layer_schema.py:305

Draft comment:
Validation in _validate_group_by_columns is clear; consider rewording error messages to be more user‐friendly.
Reason this comment was not posted:
Confidence changes required: 30% <= threshold 50%
None

5. pandasai/query_builders/base_query_builder.py:16

Draft comment:
Group by clause is added consistently; ensure column normalization is applied uniformly in all builders.
Reason this comment was not posted:
Confidence changes required: 20% <= threshold 50%
None

6. pandasai/query_builders/view_query_builder.py:70

Draft comment:
SQL injection prevention via sanitize_view_column_name is implemented; double-check alias normalization to avoid unexpected SQL syntax.
Reason this comment was not posted:
Confidence changes required: 20% <= threshold 50%
None

7. tests/unit_tests/query_builders/test_group_by.py:117

Draft comment:
Group by tests cover both correct and invalid schemas; ensure edge cases (e.g., empty group_by) are also tested.
Reason this comment was not posted:
Confidence changes required: 20% <= threshold 50%
None

8. tests/unit_tests/test_pandasai_init.py:22

Draft comment:
Unit tests for create function now include group_by functionality; ensure new examples are in sync with updated docs.
Reason this comment was not posted:
Confidence changes required: 20% <= threshold 50%
None

9. pandasai/core/cache.py:98

Draft comment:
Using 'df.schema.name' for the cache key may not uniquely identify different dataframes. Consider retaining a content-based hash (e.g., the previous 'df.column_hash') to avoid cache collisions.
Reason this comment was not posted:
Marked as duplicate.

10. pandasai/data_loader/local_loader.py:75

Draft comment:
In _apply_grouping, note that grouping only happens if any column has an aggregation expression. If group_by is set but no aggregation is defined, the DataFrame returns unchanged. It may be useful to clarify this behavior in the docstring.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
This comment is purely informative and suggests updating the docstring to clarify behavior. It doesn't provide a specific code suggestion or address a potential issue directly related to the code changes. Therefore, it violates the rule against making purely informative comments.

11. pandasai/query_builders/base_query_builder.py:51

Draft comment:
The aggregation and alias handling in _get_columns is clean and concise. No issues detected.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

12. pandasai/query_builders/view_query_builder.py:32

Draft comment:
Good use of alias normalization in _get_group_by_columns to prevent injection. Verify consistency across implementations.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

13. tests/unit_tests/query_builders/test_group_by.py:119

Draft comment:
Comprehensive tests in test_invalid_group_by validate error conditions for group_by configuration. Excellent test coverage.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

14. tests/unit_tests/test_pandasai_init.py:424

Draft comment:
Tests for dataset creation with description verify that the description is properly set. The use of clear assertions improves maintainability.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

Workflow ID: wflow_WCveOlhKJZhL0Ksh

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

pandasai/data_loader/semantic_layer_schema.py

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

codebeaver-ai · 2025-02-12T15:04:31Z

I opened a Pull Request with the following:

🔄 3 tests added.
🐛 No bugs detected in your changes
🛠️ 3/3 tests passed

🔄 Test Updates

I've added 3 tests. They all pass ☑️
New Tests:

tests/unit_tests/core/code_generation/test_code_cleaning.py
tests/test_local_loader.py
tests/unit_tests/dataframe/test_semantic_layer_schema.py

No existing tests required updates.

🐛 Bug Detection

No bugs detected in your changes. Good job!

🛠️ Test Results

All 3 tests passed.

☂️ Coverage Improvements

Coverage improvements by file:

tests/unit_tests/core/code_generation/test_code_cleaning.py

New coverage: 75.61%
Improvement: +0.00%
tests/test_local_loader.py

New coverage: 79.73%
Improvement: +79.73%
tests/unit_tests/dataframe/test_semantic_layer_schema.py

New coverage: 94.33%
Improvement: +5.15%

🎨 Final Touches

I ran the hooks included in the pre-commit config.

_{Settings | Logs | CodeBeaver}

scaliseraoul and others added 4 commits February 11, 2025 12:06

fix(schema): make columns mandatory for view

fcbfcda

Update pandasai/data_loader/semantic_layer_schema.py

66e768d

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

fix (cache): better handling of cache for sqldataframe

a5caf8c

feature(GroupBy): adding group by functionality

36c7505

ellipsis-dev bot reviewed Feb 12, 2025

View reviewed changes

pandasai/data_loader/semantic_layer_schema.py Outdated Show resolved Hide resolved

scaliseraoul and others added 2 commits February 12, 2025 15:39

Update pandasai/data_loader/semantic_layer_schema.py

3f9339e

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

fix: fixing ellipsis formatting

b720f36

codebeaver-ai bot mentioned this pull request Feb 12, 2025

feature(GroupBy): adding group by functionality - Unit Tests #1604

Closed

This was referenced Feb 12, 2025

feature(GroupBy): adding group by functionality - Unit Tests #1605

Closed

test(GroupBy): adding group by functionality #1606

Closed

gventuri merged commit 71145e7 into sinaptik-ai:main Feb 12, 2025
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature(GroupBy): adding group by functionality #1603

feature(GroupBy): adding group by functionality #1603

scaliseraoul commented Feb 12, 2025 •

edited by ellipsis-dev bot

Loading

ellipsis-dev bot left a comment

codebeaver-ai bot commented Feb 12, 2025

feature(GroupBy): adding group by functionality #1603

feature(GroupBy): adding group by functionality #1603

Conversation

scaliseraoul commented Feb 12, 2025 • edited by ellipsis-dev bot Loading

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

codebeaver-ai bot commented Feb 12, 2025

🔄 Test Updates

🐛 Bug Detection

🛠️ Test Results

☂️ Coverage Improvements

🎨 Final Touches

scaliseraoul commented Feb 12, 2025 •

edited by ellipsis-dev bot

Loading