-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature(GroupBy): adding group by functionality #1603
Conversation
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Reviewed everything up to 36c7505 in 2 minutes and 12 seconds
More details
- Looked at
810
lines of code in11
files - Skipped
0
files when reviewing. - Skipped posting
14
drafted comments based on config settings.
1. pandasai/__init__.py:48
- Draft comment:
Docstring updated to include group_by; ensure user examples match new behavior and validation errors are descriptive. - Reason this comment was not posted:
Comment looked like it was already resolved.
2. pandasai/core/cache.py:98
- Draft comment:
Using df.schema.name for cache key may lead to collisions if names aren’t unique; ensure schema naming uniqueness. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%
The comment points out a real potential issue - if two different schemas have the same name, they would generate the same cache key even though they represent different data. This could lead to incorrect cache hits. However, the comment is somewhat speculative ("may lead") and asks for verification ("ensure") rather than suggesting a specific fix.
The comment doesn't propose a concrete solution and falls into the pattern of asking the author to "ensure" something. Also, schema naming conventions might be defined elsewhere in the codebase.
While the comment style isn't ideal, this is a significant change that could introduce subtle bugs. Moving from a hash-based key to a name-based key is a fundamental change in caching strategy.
The comment should be rephrased to be more actionable, suggesting a specific solution like "Consider using both schema name and a hash of column names/types to ensure cache key uniqueness."
3. pandasai/data_loader/local_loader.py:75
- Draft comment:
Grouping logic correctly maps aggregation functions; verify that _apply_grouping gracefully handles cases with no aggregations. - Reason this comment was not posted:
Confidence changes required:20%
<= threshold50%
None
4. pandasai/data_loader/semantic_layer_schema.py:305
- Draft comment:
Validation in _validate_group_by_columns is clear; consider rewording error messages to be more user‐friendly. - Reason this comment was not posted:
Confidence changes required:30%
<= threshold50%
None
5. pandasai/query_builders/base_query_builder.py:16
- Draft comment:
Group by clause is added consistently; ensure column normalization is applied uniformly in all builders. - Reason this comment was not posted:
Confidence changes required:20%
<= threshold50%
None
6. pandasai/query_builders/view_query_builder.py:70
- Draft comment:
SQL injection prevention via sanitize_view_column_name is implemented; double-check alias normalization to avoid unexpected SQL syntax. - Reason this comment was not posted:
Confidence changes required:20%
<= threshold50%
None
7. tests/unit_tests/query_builders/test_group_by.py:117
- Draft comment:
Group by tests cover both correct and invalid schemas; ensure edge cases (e.g., empty group_by) are also tested. - Reason this comment was not posted:
Confidence changes required:20%
<= threshold50%
None
8. tests/unit_tests/test_pandasai_init.py:22
- Draft comment:
Unit tests for create function now include group_by functionality; ensure new examples are in sync with updated docs. - Reason this comment was not posted:
Confidence changes required:20%
<= threshold50%
None
9. pandasai/core/cache.py:98
- Draft comment:
Using 'df.schema.name' for the cache key may not uniquely identify different dataframes. Consider retaining a content-based hash (e.g., the previous 'df.column_hash') to avoid cache collisions. - Reason this comment was not posted:
Marked as duplicate.
10. pandasai/data_loader/local_loader.py:75
- Draft comment:
In _apply_grouping, note that grouping only happens if any column has an aggregation expression. If group_by is set but no aggregation is defined, the DataFrame returns unchanged. It may be useful to clarify this behavior in the docstring. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%
<= threshold50%
This comment is purely informative and suggests updating the docstring to clarify behavior. It doesn't provide a specific code suggestion or address a potential issue directly related to the code changes. Therefore, it violates the rule against making purely informative comments.
11. pandasai/query_builders/base_query_builder.py:51
- Draft comment:
The aggregation and alias handling in _get_columns is clean and concise. No issues detected. - Reason this comment was not posted:
Confidence changes required:0%
<= threshold50%
None
12. pandasai/query_builders/view_query_builder.py:32
- Draft comment:
Good use of alias normalization in _get_group_by_columns to prevent injection. Verify consistency across implementations. - Reason this comment was not posted:
Confidence changes required:0%
<= threshold50%
None
13. tests/unit_tests/query_builders/test_group_by.py:119
- Draft comment:
Comprehensive tests in test_invalid_group_by validate error conditions for group_by configuration. Excellent test coverage. - Reason this comment was not posted:
Confidence changes required:0%
<= threshold50%
None
14. tests/unit_tests/test_pandasai_init.py:424
- Draft comment:
Tests for dataset creation with description verify that the description is properly set. The use of clear assertions improves maintainability. - Reason this comment was not posted:
Confidence changes required:0%
<= threshold50%
None
Workflow ID: wflow_WCveOlhKJZhL0Ksh
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
I opened a Pull Request with the following: 🔄 3 tests added. 🔄 Test UpdatesI've added 3 tests. They all pass ☑️
No existing tests required updates. 🐛 Bug DetectionNo bugs detected in your changes. Good job! 🛠️ Test ResultsAll 3 tests passed. ☂️ Coverage ImprovementsCoverage improvements by file:
🎨 Final Touches
Settings | Logs | CodeBeaver |
Important
Adds
group_by
functionality to SQL queries, updating schema validation, query builders, and data loaders, with corresponding tests.group_by
parameter tocreate()
inpandasai/__init__.py
for grouping columns in SQL queries.SemanticLayerSchema
insemantic_layer_schema.py
to includegroup_by
validation._apply_grouping()
inlocal_loader.py
to handle DataFrame grouping and aggregation.BaseQueryBuilder
andViewQueryBuilder
to supportGROUP BY
clauses in SQL queries.group_by
inSemanticLayerSchema
to ensure only non-aggregated columns are included.test_group_by.py
to testgroup_by
functionality in query builders.test_loader.py
andtest_pandasai_init.py
to cover newgroup_by
logic.test_azure_openai.py
.This description was created by
for 36c7505. It will automatically update as commits are pushed.