[Sqlalchemy-tickets] Issue #4296: Memory Leak in large table insertion (zzzeek/sqlalchemy)
Brought to you by:
zzzeek
From: Shaun R. <iss...@bi...> - 2018-06-30 07:13:11
|
New issue 4296: Memory Leak in large table insertion https://bitbucket.org/zzzeek/sqlalchemy/issues/4296/memory-leak-in-large-table-insertion Shaun Rong: I was trying to insert a large pandas csv file (~200MB file) into a MySQL table with sqlalchemy. In the middle of running the script, I found the insertion program eats a huge chunk of the memory, and that in the end, the machine froze and I have to reboot the machine to kill the running program. To prove there's a memory leak, I wrote output to print out the available memories every 10k rows inserted from the pandas dataframe. I also collect garbages with `gc.collect()` to ensure this memory leakage doesn't come from elsewhere. The output are ``` #!txt Row No 0, CPU usage 49.9, Memory available 0.6730947909561938 Row No 100000, CPU usage 2.4, Memory available 0.620737552516592 Row No 200000, CPU usage 2.9, Memory available 0.5690685967649676 Row No 300000, CPU usage 2.5, Memory available 0.5178657630750353 Row No 400000, CPU usage 2.5, Memory available 0.4670138185985817 Row No 500000, CPU usage 2.2, Memory available 0.41836375584339547 Row No 600000, CPU usage 2.4, Memory available 0.36772255252178265 Row No 700000, CPU usage 2.5, Memory available 0.3198458994235243 ``` The codes executed are below. The machine is Ubuntu 16.04, 8G memory, the sqlalchemy version 1.2.9, python version 3.6.5. Driver is installed using `pip install mysql-connector-python` ``` #!python Ccm_Base = declarative_base() class MonthlyPriceTable(Ccm_Base): __tablename__ = 'monthly_price' gvkey = Column(Integer, primary_key=True) liid = Column(Integer, primary_key=True) lpermno = Column(Integer, primary_key=True) datadate = Column(Date, primary_key=True) ajexm = Column(Numeric(10, 6), nullable=False) spcsrc = Column(Integer) prccm_adj = Column(Numeric(16, 8), nullable=False) prchm_adj = Column(Numeric(16, 8), nullable=False) prclm_adj = Column(Numeric(16, 8), nullable=False) cheqvm_adj = Column(Numeric(8, 2), nullable=False) dvpspm_adj = Column(Numeric(8, 2), nullable=False) cshtrm_adj = Column(BigInteger) cshoq_adj = Column(Numeric(12, 6)) import pandas as pd from sqlalchemy import create_engine from sqlalchemy.orm import Session import argparse from settings import DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, CCM_DB from settings import Monthly_Security_Table_Dir import psutil import gc def nan_converter(data): if pd.isnull(data): return None return data def dt_converter(ds): if pd.isnull(ds): return None return datetime.strptime(str(int(ds)), Date_Format).date() if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('-r', help='row number to start inserting', type=int, required=True) parser.add_argument('-f', help='output file', type=str, required=True) args = parser.parse_args() engine = create_engine("mysql+mysqlconnector://{}:{}@{}:{}/{}".format(DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, CCM_DB)) df = pd.read_csv(Monthly_Security_Table_Dir) Ccm_Base.metadata.create_all(engine) sess = Session(bind=engine) for idx, row in df.loc[args.r:, :].iterrows(): price = MonthlyPriceTable(gvkey=row['GVKEY'], liid=row['LIID'], lpermno=row['LPERMNO'], datadate=dt_converter(row['DATADATE']), ajexm=row['AJEXM'], spcsrc=nan_converter(row['SPCSRC']), prccm_adj=row['PRCCM_ADJ'], prchm_adj=row['PRCHM_ADJ'], prclm_adj=row['PRCLM_ADJ'], cheqvm_adj=row['CHEQVM_ADJ'], dvpspm_adj=row['DVPSPM_ADJ'], cshtrm_adj=nan_converter(row['CSHTRM_ADJ']), cshoq_adj=nan_converter(row['CSHOQ_ADJ'])) sess.add(price) sess.flush() sess.commit() if idx % 100000 == 0: gc.collect() with open(args.f, 'a') as f: f.write("Row No {}, CPU usage {}, Memory available {}\n".format(idx, psutil.cpu_percent(), psutil.virtual_memory().available / psutil.virtual_memory().total)) ``` To run the scripts reproducing the results, please set DB access in settings.py and point Monthly_Security_Table_Dir to the csv file. Use command `python script.py -r 0 -f output_record.txt` to run script. The memory output is in the output_record.txt. The file I used is larger than 100MB so I can't upload it here. I uploaded in dropbox https://www.dropbox.com/s/vywopo1c0b1zrl5/security_monthly.csv?dl=0. |