BUG: Memory leak when creating a df inside a loop #60897

Open
3 tasks done
Chuck321123 opened this issue Feb 9, 2025 · 14 comments
Labels
Bug; Constructors (Series/DataFrame/Index/pd.array Constructors); Performance (Memory or execution speed performance); Windows (Windows OS)

Comments

@Chuck321123

Chuck321123 commented Feb 9, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import tracemalloc
import numpy as np
import time
import gc

# Start memory tracking
tracemalloc.start()

iteration = 0

Row_Number = 20000

while iteration < 1000:
    
    test_lst = [*range(12)]
    
    for i in range(12):
        
        # Create a DataFrame with Row_Number rows
        df = pd.DataFrame({
            "A": np.arange(Row_Number),  # Sequential integers from 0 to Row_Number - 1
            "B": np.random.rand(Row_Number),  # Random floats in [0, 1)
            "C": np.random.randint(0, 100, size=Row_Number),  # Random integers between 0 and 99
            "D": np.random.choice(["apple", "banana", "cherry"], size=Row_Number),  # Random categories
            "E": np.random.randn(Row_Number)  # Normally distributed random numbers
        })

        test_lst[i] = df  # The bug also appears without storing df in the list

        del df  # Deleting df at the end of the loop doesn't affect the memory leak

    del test_lst  # Deleting the list at the end of the loop doesn't affect the memory leak
        
    time.sleep(0.01)
    
    iteration += 1

    # Check memory usage for 3rd party packages
    if iteration % 1 == 0:
    
        snapshot = tracemalloc.take_snapshot()
        
        # Get memory statistics **without filtering** first
        top_stats = snapshot.statistics("lineno")
        
        print(f"\n[ Memory Snapshot at iteration {iteration} ]")
        for stat in top_stats[:5]:  # Show top memory-consuming locations
            print(stat)

Issue Description

Using tracemalloc (a tool for tracking memory allocations), I can see that pandas doesn't release memory when DataFrames are created inside a loop. The problem seems to come from pandas\core\internals\blocks.py around line 228. It would be nice if anyone could find a fix for this.

Expected Behavior

The memory should not leak.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.13.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : Norwegian Bokmål_Norway.1252

pandas : 2.2.3
numpy : 2.2.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : 8.1.3
IPython : 8.31.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : 3.10.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None

@Chuck321123 added the Bug and Needs Triage labels on Feb 9, 2025
@rhshadrach
Member

rhshadrach commented Feb 9, 2025

Thanks for the report; I cannot reproduce this on Linux. Can you include the stdout from your reproducer?

Further investigations are welcome!

@rhshadrach added the Performance, Windows, and Constructors labels and removed the Needs Triage label on Feb 9, 2025
@narukaze132

narukaze132 commented Feb 9, 2025

I use Windows, so I tried running the reproducer. Here's an excerpt of what I got from stdout, since GitHub isn't letting me post the whole thing. (Note: I don't have Pandas installed in the core Python path, so I manually redacted my user directory out of the result.)

[ Memory Snapshot at iteration 1 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=2820 B, count=49, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=2240 B, count=35, average=64 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:2215: size=1856 B, count=26, average=71 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=1384 B, count=12, average=115 B
<frozen abc>:123: size=896 B, count=8, average=112 B

[ Memory Snapshot at iteration 2 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=4906 B, count=85, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2104 B, count=18, average=117 B
C:\Program Files\Python311\Lib\tracemalloc.py:505: size=1904 B, count=34, average=56 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:2215: size=1904 B, count=27, average=71 B

[ Memory Snapshot at iteration 3 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=7042 B, count=122, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
C:\Program Files\Python311\Lib\tracemalloc.py:505: size=2688 B, count=48, average=56 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2584 B, count=22, average=117 B
C:\Program Files\Python311\Lib\tracemalloc.py:498: size=2304 B, count=48, average=48 B

[ Memory Snapshot at iteration 4 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=8488 B, count=147, average=58 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=3200 B, count=62, average=52 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=3117 B, count=36, average=87 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2944 B, count=25, average=118 B

[ Memory Snapshot at iteration 5 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=9410 B, count=163, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=4138 B, count=48, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=3640 B, count=64, average=57 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3184 B, count=27, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 6 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=10.6 KiB, count=188, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=5159 B, count=60, average=86 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3304 B, count=28, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=2904 B, count=53, average=55 B

[ Memory Snapshot at iteration 7 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=12.2 KiB, count=216, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=6182 B, count=72, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=5528 B, count=100, average=55 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3544 B, count=30, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 8 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=12.7 KiB, count=225, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=7206 B, count=84, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=4776 B, count=88, average=54 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 9 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=13.9 KiB, count=247, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=8229 B, count=96, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=5544 B, count=97, average=57 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 10 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=15.1 KiB, count=268, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=9252 B, count=108, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=6448 B, count=120, average=54 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
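
Several of the top entries in the snapshots above point at tracemalloc.py itself, that is, at allocations made by the tracing tool rather than by pandas. A minimal sketch of excluding those frames before ranking, using tracemalloc.Filter from the standard library. The helper name top_stats_without_tracemalloc and the bytes payload are illustrative assumptions and are not part of the reproducer:

```python
import tracemalloc


def top_stats_without_tracemalloc(snapshot, limit=5):
    """Return the top `limit` per-line statistics, skipping tracemalloc's own frames.

    Hypothetical helper for illustration; not part of the original reproducer.
    """
    # An exclusive filter (inclusive=False) drops traces whose most recent
    # frame lives in tracemalloc.py.
    filtered = snapshot.filter_traces(
        [tracemalloc.Filter(False, tracemalloc.__file__)]
    )
    return filtered.statistics("lineno")[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    payload = [bytes(1000) for _ in range(100)]  # stand-in allocation
    tracemalloc.take_snapshot()  # an earlier snapshot, as in the loop above
    snapshot = tracemalloc.take_snapshot()
    for stat in top_stats_without_tracemalloc(snapshot):
        print(stat)
    tracemalloc.stop()
```

This only changes which lines are reported; it does not by itself say whether the remaining growth at blocks.py:228 is a leak.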

@Chuck321123
Author

Chuck321123 commented Feb 9, 2025

@rhshadrach What Linux OS and architecture are you using? I'm using Ubuntu on an aarch Raspberry Pi and I still get the memory leak.

@rhshadrach
Member

rhshadrach commented Feb 9, 2025

@narukaze132 - thanks, I neglected to see how many times the reproducer was looping. I've cut your output down to the first 10 iterations; this is more than enough already.

@Chuck321123 -

INSTALLED VERSIONS
------------------
commit                : 846b2b532dfe81855fed2148c77b0d57727306d7
python                : 3.12.3
python-bits           : 64
OS                    : Linux
OS-release            : 6.8.0-52-generic
Version               : #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1798.g846b2b532d
numpy                 : 2.2.1
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : 3.0.11
sphinx                : 8.1.3
IPython               : 8.29.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : 1.4.2
dataframe-api-compat  : None
fastparquet           : 2024.5.0
fsspec                : 2024.10.0
html5lib              : 1.1
hypothesis            : 6.115.5
gcsfs                 : 2024.10.0
jinja2                : 3.1.4
lxml.etree            : 5.3.0
matplotlib            : 3.9.2
numba                 : None
numexpr               : 2.10.1
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : 1.4.6
pyarrow               : 18.1.0
pyreadstat            : 1.2.8
pytest                : 8.3.3
python-calamine       : None
pyxlsb                : 1.0.10
s3fs                  : 2024.10.0
scipy                 : 1.14.1
sqlalchemy            : 2.0.36
tables                : 3.10.1
tabulate              : 0.9.0
xarray                : 2024.9.0
xlrd                  : 2.0.1
xlsxwriter            : 3.2.0
zstandard             : 0.23.0
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

@jacobus-herman

Hi!
I am seeing the same issue as reported above in two different environments (Linux and macOS).
The first is our production environment, which runs on AWS ECS Fargate.
There I loop over a list of parquet files, reading each file and writing it to S3.
Each iteration's df is short-lived, so there is no need for it to hold memory after the iteration finishes.
When the third iteration is reached, the task fails with an out-of-memory error.

I also ran the script above on my dev environment and I see the same type of behaviour.

If there's something more you need, please let me know :)

The production environment versions:

INSTALLED VERSIONS
------------------
commit                : fd3f57170aa1af588ba877e8e28c158a20a4886d
python                : 3.11.6.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 5.10.230-223.885.amzn2.x86_64
Version               : #1 SMP Tue Dec 3 14:36:00 UTC 2024
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : C.UTF-8
LANG                  : C.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.0
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 66.1.1
pip                   : 23.2.1
Cython                : None
pytest                : 8.1.2
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : 3.2.0
lxml.etree            : 5.1.0
html5lib              : None
pymysql               : None
psycopg2              : 2.9.9
jinja2                : 3.1.3
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.2.0
gcsfs                 : 2024.2.0
matplotlib            : None
numba                 : 0.59.0
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : 0.21.0
pyarrow               : 15.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : 0.9.0
xarray                : None
xlrd                  : 2.0.1
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

My dev environment versions:

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.11.0
python-bits           : 64
OS                    : Darwin
OS-release            : 24.3.0
Version               : Darwin Kernel Version 24.3.0: Thu Jan  2 20:24:06 PST 2025
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
pip                   : 25.0.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.3
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : 2.9.9
pymysql               : None
pyarrow               : 15.0.2
pyreadstat            : None
pytest                : 8.1.1
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : 0.9.0
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

My dev run script output:

[ Memory Snapshot at iteration 1 ]
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=2944 B, count=25, average=118 B
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=2940 B, count=51, average=58 B
.venv/lib/python3.11/site-packages/pandas/core/internals/managers.py:1778: size=1856 B, count=29, average=64 B
.venv/lib/python3.11/site-packages/pandas/core/internals/managers.py:2215: size=1736 B, count=27, average=64 B

[ Memory Snapshot at iteration 2 ]
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=5824 B, count=49, average=119 B
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=5538 B, count=96, average=58 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.venv/lib/python3.11/site-packages/pandas/core/internals/managers.py:1778: size=1856 B, count=29, average=64 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=1792 B, count=32, average=56 B

[ Memory Snapshot at iteration 3 ]
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8704 B, count=73, average=119 B
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=6998 B, count=121, average=58 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2464 B, count=44, average=56 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:498: size=2112 B, count=44, average=48 B

[ Memory Snapshot at iteration 4 ]
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=8152 B, count=141, average=58 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=3552 B, count=67, average=53 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2352 B, count=42, average=56 B

[ Memory Snapshot at iteration 5 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=9198 B, count=159, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=3776 B, count=69, average=55 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2408 B, count=43, average=56 B

[ Memory Snapshot at iteration 6 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=10.6 KiB, count=187, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=3888 B, count=70, average=56 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2408 B, count=43, average=56 B

[ Memory Snapshot at iteration 7 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=11.5 KiB, count=203, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=5560 B, count=103, average=54 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2576 B, count=46, average=56 B

[ Memory Snapshot at iteration 8 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=12.7 KiB, count=224, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=5536 B, count=101, average=55 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2296 B, count=41, average=56 B

[ Memory Snapshot at iteration 9 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=13.4 KiB, count=237, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=6040 B, count=109, average=55 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2408 B, count=43, average=56 B

[ Memory Snapshot at iteration 10 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=15.0 KiB, count=265, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=7240 B, count=133, average=54 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2352 B, count=42, average=56 B

[ Memory Snapshot at iteration 11 ]
.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:228: size=15.3 KiB, count=271, average=58 B
.venv/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: size=8824 B, count=74, average=119 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:558: size=7616 B, count=138, average=55 B
.venv/lib/python3.11/site-packages/pandas/core/construction.py:605: size=3464 B, count=4, average=866 B
.pyenv/versions/3.11.0/lib/python3.11/tracemalloc.py:505: size=2408 B, count=43, average=56 B

@rhshadrach
Member

rhshadrach commented Feb 16, 2025

@jacobus-herman - can you post the output of this from your Linux env? It is modified from the OP.

Code
import pandas as pd
import tracemalloc
import numpy as np
import time
import gc

# Start memory tracking
tracemalloc.start()

iteration = 0

Row_Number = 20000

prev_snapshot = tracemalloc.take_snapshot()

while iteration < 500:
    test_lst = [*range(12)]
    for i in range(12):
        # Create a DataFrame with Row_Number rows
        df = pd.DataFrame({
            "A": np.arange(Row_Number),  # Sequential integers from 0 to Row_Number - 1
            "B": np.random.rand(Row_Number),  # Random floats in [0, 1)
            "C": np.random.randint(0, 100, size=Row_Number),  # Random integers between 0 and 99
            "D": np.random.choice(["apple", "banana", "cherry"], size=Row_Number),  # Random categories
            "E": np.random.randn(Row_Number)  # Normally distributed random numbers
        })
        test_lst[i] = df  # The bug also appears without storing df in the list
        del df  # Deleting df at the end of the loop doesn't affect the memory leak

    del test_lst  # Deleting the list at the end of the loop doesn't affect the memory leak
    iteration += 1

    # Check memory usage for 3rd party packages
    if iteration % 50 == 0:
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.compare_to(prev_snapshot, 'lineno')
        print(f"{iteration=}")
        for k, stat in enumerate(top_stats[:10]):
            if "site-packages/pandas" in str(stat.traceback):
                print("  ", k, stat)
        prev_snapshot = snapshot
tracemalloc.stop()

@jacobus-herman

@rhshadrach, sure, the Linux environment output is below.
It seems that gc.collect() makes all the difference.
Is it expected to follow this pattern to ensure that the memory doesn't get out of hand?

iteration=50
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=21.5 KiB (+21.5 KiB), count=380 (+380), average=58 B
   1 python3.11/site-packages/pandas/core/construction.py:605: size=3480 B (+3480 B), count=4 (+4), average=870 B
   2 python3.11/site-packages/pandas/core/internals/managers.py:2303: size=1296 B (+1296 B), count=18 (+18), average=72 B
   3 python3.11/site-packages/pandas/core/internals/managers.py:2214: size=1008 B (+1008 B), count=14 (+14), average=72 B
   6 python3.11/site-packages/pandas/core/indexes/base.py:599: size=328 B (+328 B), count=2 (+2), average=164 B
iteration=100
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=24.9 KiB (+3422 B), count=439 (+59), average=58 B
iteration=150
   3 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=25.7 KiB (+812 B), count=453 (+14), average=58 B
iteration=200
   4 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=25.9 KiB (+290 B), count=458 (+5), average=58 B
iteration=250
iteration=300
   9 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=25.9 KiB (+0 B), count=458 (+0), average=58 B
iteration=350
iteration=400
   8 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=25.9 KiB (+0 B), count=458 (+0), average=58 B
   9 python3.11/site-packages/pandas/core/construction.py:605: size=3480 B (+0 B), count=4 (+0), average=870 B
iteration=450
iteration=500
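
On the gc.collect() question: in CPython, an object that is part of a reference cycle is not freed by del alone; it is reclaimed only when the cyclic garbage collector runs, either automatically or through an explicit gc.collect(). The sketch below shows that general mechanism with the standard library only. Whether the objects created in the reproducer actually form such cycles is not established in this thread, and the Node class is a hypothetical stand-in:

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for any object caught in a reference cycle."""

    def __init__(self):
        self.me = self  # self-reference creates a cycle


def cycle_survives_del():
    """Return (alive_after_del, alive_after_collect) for a cyclic object."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        node = Node()
        probe = weakref.ref(node)  # lets us observe the object without keeping it alive
        del node  # the cycle keeps the reference count above zero
        alive_after_del = probe() is not None
        gc.collect()  # the cyclic collector finds and frees the cycle
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


if __name__ == "__main__":
    print(cycle_survives_del())
```

With the automatic collector enabled, such objects are still freed eventually, which would look like growth followed by a plateau rather than unbounded growth.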

@rhshadrach
Member

rhshadrach commented Feb 16, 2025

I am expecting to see the same without gc.collect(); can you remove that line and post that as well? If that doesn't show the memory leak but you still believe you can reproduce on Linux, can you post the reduced output from the OP (maybe make it print every 50 iterations)?

@Chuck321123
Author

Now I get that line 184 accumulates instead. ChatGPT said this:
Image

@rhshadrach
Member

rhshadrach commented Feb 16, 2025

Thanks @Chuck321123 - LLMs can be helpful assistants, but I think at this time their responses need to go through a human filter to discern whether they are accurate / helpful. In this particular case, do you think ChatGPT's response was helpful? If you aren't able to tell, I would highly recommend not posting it into issues. Doing so can lead to extra noise without signal, and makes issues harder to understand.

@Chuck321123
Author

@rhshadrach Sorry, just wanted to help.

@jacobus-herman

jacobus-herman commented Feb 18, 2025

> I am expecting to see the same without gc.collect(); can you remove that line and post that as well? If that doesn't show the memory leak but you still believe you can reproduce on Linux, can you post the reduced output from the OP (maybe make it print every 50 iterations)?

Hey Richard! I ran it without the gc.collect() and the results are as follows:

iteration=50
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=40.0 KiB (+40.0 KiB), count=716 (+716), average=57 B
   3 python3.11/site-packages/pandas/core/construction.py:605: size=3480 B (+3480 B), count=4 (+4), average=870 B
   4 python3.11/site-packages/pandas/core/generic.py:283: size=2928 B (+2928 B), count=52 (+52), average=56 B
   5 python3.11/site-packages/pandas/core/internals/managers.py:2279: size=2800 B (+2800 B), count=50 (+50), average=56 B
   6 python3.11/site-packages/pandas/core/internals/base.py:84: size=2800 B (+2800 B), count=50 (+50), average=56 B
   7 python3.11/site-packages/pandas/core/internals/managers.py:1777: size=1856 B (+1856 B), count=29 (+29), average=64 B
   8 python3.11/site-packages/pandas/core/internals/managers.py:2214: size=1736 B (+1736 B), count=27 (+27), average=64 B
   9 python3.11/site-packages/pandas/core/internals/managers.py:2303: size=1296 B (+1296 B), count=18 (+18), average=72 B
iteration=100
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=58.0 KiB (+18.0 KiB), count=1044 (+328), average=57 B
   2 python3.11/site-packages/pandas/core/generic.py:283: size=5608 B (+2680 B), count=100 (+48), average=56 B
   4 python3.11/site-packages/pandas/core/internals/managers.py:2279: size=5432 B (+2632 B), count=97 (+47), average=56 B
   5 python3.11/site-packages/pandas/core/internals/base.py:84: size=5432 B (+2632 B), count=97 (+47), average=56 B
iteration=150
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=73.1 KiB (+15.1 KiB), count=1320 (+276), average=57 B
   4 python3.11/site-packages/pandas/core/internals/managers.py:2279: size=8064 B (+2632 B), count=144 (+47), average=56 B
   5 python3.11/site-packages/pandas/core/internals/base.py:84: size=8064 B (+2632 B), count=144 (+47), average=56 B
   6 python3.11/site-packages/pandas/core/generic.py:283: size=8136 B (+2528 B), count=145 (+45), average=56 B
iteration=200
   0 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=78.3 KiB (+5376 B), count=1416 (+96), average=57 B
   6 python3.11/site-packages/pandas/core/internals/managers.py:2279: size=9072 B (+1008 B), count=162 (+18), average=56 B
   7 python3.11/site-packages/pandas/core/internals/base.py:84: size=9072 B (+1008 B), count=162 (+18), average=56 B
   8 python3.11/site-packages/pandas/core/generic.py:283: size=9136 B (+1000 B), count=163 (+18), average=56 B
iteration=250
   1 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=77.1 KiB (-1230 B), count=1394 (-22), average=57 B
   9 python3.11/site-packages/pandas/core/generic.py:283: size=8864 B (-272 B), count=158 (-5), average=56 B
iteration=300
   4 python3.11/site-packages/pandas/core/internals/managers.py:1777: size=1856 B (+192 B), count=29 (+3), average=64 B
   9 python3.11/site-packages/pandas/core/indexes/base.py:664: size=704 B (-64 B), count=11 (-1), average=64 B
iteration=350
   6 python3.11/site-packages/pandas/core/internals/blocks.py:228: size=77.0 KiB (-112 B), count=1391 (-2), average=57 B
iteration=400
   8 python3.11/site-packages/pandas/core/generic.py:283: size=8912 B (+48 B), count=159 (+1), average=56 B
iteration=450
   4 python3.11/site-packages/pandas/core/internals/managers.py:1777: size=1664 B (-192 B), count=26 (-3), average=64 B
   7 python3.11/site-packages/pandas/core/generic.py:282: size=704 B (+128 B), count=11 (+2), average=64 B
   9 python3.11/site-packages/pandas/core/indexes/base.py:664: size=768 B (+64 B), count=12 (+1), average=64 B
iteration=500

So while gc.collect() does make a difference, I don't think it's an issue in/with pandas. I have done further investigation on my side, using memray, and my issue seems related to general Python memory management and optimising its reuse.
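
The outputs above grow quickly at first and then level off. One way to tell that pattern apart from unbounded growth is to sample traced memory at intervals and check whether the later samples still increase. This is a rough sketch under stated assumptions: the function names, the tolerance of 1024 bytes, and the two toy workloads are all hypothetical and were not used by anyone in this thread.

```python
import tracemalloc


def sample_traced_memory(workload, iterations, every):
    """Run workload() repeatedly; record traced bytes every `every` iterations."""
    samples = []
    tracemalloc.start()
    try:
        for i in range(1, iterations + 1):
            workload()
            if i % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


def has_plateaued(samples, tolerance_bytes=1024):
    """True if memory grew by at most tolerance_bytes over the second half."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    half = len(samples) // 2
    return samples[-1] - samples[half] <= tolerance_bytes


if __name__ == "__main__":
    kept = []  # deliberately retains every allocation

    def leaky():
        kept.append(bytes(10_000))

    def steady():
        [bytes(100) for _ in range(100)]  # result is discarded immediately

    print("leaky plateaued: ", has_plateaued(sample_traced_memory(leaky, 200, 20)))
    print("steady plateaued:", has_plateaued(sample_traced_memory(steady, 200, 20)))
```

A plateau under this check is consistent with caches or allocator reuse reaching a steady state; it does not rule out that the steady-state footprint is larger than necessary.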

@rhshadrach
Member

Thanks @jacobus-herman! @Chuck321123 - can you run the code in #60897 (comment) and post the output you get?

@Chuck321123
Author

Chuck321123 commented Feb 22, 2025

@rhshadrach I tried, and I get results similar to @jacobus-herman's. It seems to stabilize after 400-500 iterations. However, the memory usage is higher than it should be: it stabilizes at around 8.5 KB if you iterate for longer.


4 participants