From: Ian H. <iw...@go...> - 2008-07-10 10:03:57
Hi all,

My colleagues and I use, and have used, matplotlib and its TeX capabilities quite extensively to create plots for the gravitational wave searches we perform (and it has been a great tool for us :-) ). Recently, however, we have been running into problems since we started automating our plot generation by running multiple plotting jobs concurrently using the condor scheduler (and dagmans). Many of our plotting jobs fail with messages such as the one below:

---snip---
Traceback (most recent call last):
  File "/home/romain/Projects/ligovirgo/s5_2yr_lv_lowcbc_20080625/868815014-868901414/868815014-868901414/inj001_summary_plots/../executables/plotinjnum", line 298, in ?
    'eff_dist_h')
  File "/home/romain/Projects/ligovirgo/s5_2yr_lv_lowcbc_20080625/868815014-868901414/868815014-868901414/inj001_summary_plots/../executables/plotinjnum", line 119, in plot_found_missed
    fname_thumb = InspiralUtils.savefig_pylal(filename=fname, doThumb=True, dpi_thumb=opts.figure_resolution)
  File "/home/romain/codes/s5_2yr_lv_lowcbc_20080625/pylal/lib64/python2.4/site-packages/pylal/InspiralUtils.py", line 58, in savefig_pylal
    fig.savefig(filename_thumb, dpi=dpi_thumb)
  ....
  File "/usr/lib64/python2.4/site-packages/matplotlib/texmanager.py", line 259, in make_png
    os.remove(outfile)
OSError: [Errno 2] No such file or directory: '/home/romain/.matplotlib/tex.cache/ae479c90ff242327b54af004a0846188.output'
---snip---

My feeling is that when the code invokes the TeX step it creates temp files in ~/.matplotlib/tex.cache and then, when the TeX step finishes, deletes them along with all other temp TeX files in that directory. This would cause problems if one job is in the middle of running TeX when another job deletes those temp files!
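To make the suspected failure mode concrete, here is a minimal sketch of it. This is only an illustration of my guess, not matplotlib's actual code: the file name and the helper remove_if_present are made up, and a temporary directory stands in for the shared tex.cache. Two "jobs" clean up the same file; the second plain os.remove fails with errno 2, as in the traceback above, while a removal that tolerates a missing file does not.

```python
import errno
import os
import tempfile


def remove_if_present(path):
    """Delete path, ignoring the case where it is already gone.

    Returns True if this call removed the file, False if it was missing.
    (Hypothetical helper, not part of matplotlib.)
    """
    try:
        os.remove(path)
        return True
    except OSError as exc:
        # ENOENT is the "[Errno 2] No such file or directory" case.
        if exc.errno != errno.ENOENT:
            raise
        return False


def demo():
    # Stand-in for a cache directory shared by several jobs.
    cache_dir = tempfile.mkdtemp()
    outfile = os.path.join(cache_dir, "example.output")  # made-up file name
    with open(outfile, "w") as handle:
        handle.write("tex output")

    # "Job A" cleans up the shared file first.
    first = remove_if_present(outfile)

    # "Job B" then calls plain os.remove on the same path and fails.
    try:
        os.remove(outfile)
        plain_error = None
    except OSError as exc:
        plain_error = exc.errno

    # A tolerant removal by "job B" returns quietly instead.
    second = remove_if_present(outfile)
    os.rmdir(cache_dir)
    return first, plain_error, second
```

If the guess is right, tolerating the missing file would only hide the symptom; jobs could still lose intermediate files that they need while TeX is running.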
We are running a slightly old version of matplotlib (0.87.7). As we run on multiple clusters, our sys admins tend to update software only when there is a need to, and we have had no other problems with matplotlib. I apologize if this has been fixed in the meantime (I did a quick search of the mailing list archive but found nothing). All our clusters currently run Fedora Core 4 (we're going to move to CentOS 5).

Currently we are getting around this by forcing condor to retry the failed jobs two or three times, which catches most of these errors. Another solution would be to limit the number of concurrently running jobs to 1, BUT as we run dagmans from within one 'super' dagman, it would be difficult to limit jobs across multiple dagmans.

Anyway, if anyone has any ideas on how to solve this I would appreciate it. Also, if there are any options that let us set the location of these temp TeX files and use a different directory for each job (or stop matplotlib deleting other jobs' temp files), that would help us.

Thanks in advance for any help,
Ian Harry

--
---------------------------------------------------------------------------
Ian Harry
School of Physics & Astronomy
Queens Buildings, The Parade
Cardiff, CF24 3AA
Email: Ian...@as...
Phone: (+44) 29 208 75120
Mobile: (+44) 7890 479090
---------------------------------------------------------------------------
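P.S. The per-job directory idea might look roughly like the sketch below. It assumes that the installed matplotlib reads the MPLCONFIGDIR environment variable and keeps tex.cache beneath that directory; I have not confirmed this for 0.87.7. The function name make_job_config_dir and the job_id argument are made up for illustration.

```python
import os
import tempfile


def make_job_config_dir(job_id, base_dir=None):
    """Create a private directory for one job and point MPLCONFIGDIR at it.

    Assumption: matplotlib consults MPLCONFIGDIR when it is first imported,
    so this must run before any "import matplotlib" in the plotting script.
    """
    if base_dir is None:
        base_dir = tempfile.gettempdir()
    # mkdtemp guarantees a fresh directory, so two jobs never share one.
    job_dir = tempfile.mkdtemp(prefix="mpl-%s-" % job_id, dir=base_dir)
    os.environ["MPLCONFIGDIR"] = job_dir
    return job_dir
```

A plotting script would call make_job_config_dir(os.getpid()) at the very top, before importing matplotlib or pylab. One cost of this approach is that each job starts with an empty cache, and the directories need to be cleaned up afterwards.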