[Matplotlib-users] Error when running multiple jobs utilizing the Tex utilities in matplotlib


Hi all,

My colleagues and I have used matplotlib and its TeX capabilities quite
extensively to create plots to assist in the gravitational wave searches we
perform (and it has been a great tool for us :-) ). Recently, however, we
have been running into problems since we started automating our plot
generation by running multiple plotting jobs concurrently using the condor
scheduler (and dagmans). Many of our plotting jobs fail with messages such
as the one below:

---snip---

Traceback (most recent call last):
  File "/home/romain/Projects/ligovirgo/s5_2yr_lv_lowcbc_20080625/868815014-868901414/868815014-868901414/inj001_summary_plots/../executables/plotinjnum", line 298, in ?
    'eff_dist_h')
  File "/home/romain/Projects/ligovirgo/s5_2yr_lv_lowcbc_20080625/868815014-868901414/868815014-868901414/inj001_summary_plots/../executables/plotinjnum", line 119, in plot_found_missed
    fname_thumb = InspiralUtils.savefig_pylal(filename=fname, doThumb=True, dpi_thumb=opts.figure_resolution)
  File "/home/romain/codes/s5_2yr_lv_lowcbc_20080625/pylal/lib64/python2.4/site-packages/pylal/InspiralUtils.py", line 58, in savefig_pylal
    fig.savefig(filename_thumb, dpi=dpi_thumb)
....
  File "/usr/lib64/python2.4/site-packages/matplotlib/texmanager.py", line 259, in make_png
    os.remove(outfile)
OSError: [Errno 2] No such file or directory: '/home/romain/.matplotlib/tex.cache/ae479c90ff242327b54af004a0846188.output'

---snip---

My feeling is that when the code invokes the TeX 'bit' it creates a temp
file in ~/.matplotlib/tex.cache and then, when it finishes the TeX 'bit',
deletes it along with all the other temp TeX files. This would cause
problems if another job is in the middle of running TeX when the first job
deletes its temp files!
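
To make the suspected failure mode concrete, here is a minimal sketch. It is
not matplotlib's actual code: the helper name remove_if_present and the file
name example.output are made up for illustration. Two cleanups target the
same file; the second would raise the same "[Errno 2] No such file or
directory" seen in the traceback unless the missing file is tolerated.

```python
import errno
import os
import tempfile


def remove_if_present(path):
    """Delete path; return False instead of raising if it is already gone."""
    try:
        os.remove(path)
    except OSError as exc:
        # ENOENT means someone else (e.g. a concurrent job) removed it first.
        if exc.errno != errno.ENOENT:
            raise
        return False
    return True


# Simulate two jobs cleaning up the same cache file (hypothetical name).
cache_dir = tempfile.mkdtemp()
shared = os.path.join(cache_dir, "example.output")
open(shared, "w").close()

first = remove_if_present(shared)   # this "job" actually deletes the file
second = remove_if_present(shared)  # this "job" finds it already gone
```

A plain os.remove(shared) in place of the second call would fail with
OSError, which is consistent with what we see when jobs overlap.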

We are running a slightly old version of matplotlib (0.87.7). As we run on
multiple clusters, our sys admins tend to update software only when there is
a need to, and we have had no other problems with matplotlib. I apologize if
this has been fixed in the meantime (I did do a quick search of the mailing
list archive but found nothing). All our clusters currently run Fedora Core
4 (we're going to move to CentOS 5).

Currently we are getting around this by forcing condor to retry the failed
jobs two or three times, which catches most of these errors. Another
solution would be to limit the number of jobs running to 1, BUT as we run
dagmans from within one 'super' dagman it would prove difficult to limit
jobs across multiple dagmans.

Anyway, if anyone has any ideas of how to solve this I would appreciate it.
Also, if there are any options that let us set the location of these temp
TeX files and use a different directory for each job (or stop matplotlib
deleting other temp files), that would help us.
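
The kind of per-job setup we have in mind might look like the sketch below.
It rests on an unverified assumption: that the installed matplotlib reads
the MPLCONFIGDIR environment variable and places tex.cache under that
directory. Whether 0.87.7 does so would need checking. The helper name
use_private_config_dir and the "mpl-job-" prefix are hypothetical.

```python
import os
import tempfile


def use_private_config_dir(prefix="mpl-job-"):
    """Create a fresh directory and point MPLCONFIGDIR at it.

    Assumption: matplotlib consults MPLCONFIGDIR at import time and keeps
    its TeX cache beneath it, so each job would get its own cache.
    """
    job_dir = tempfile.mkdtemp(prefix=prefix)
    os.environ["MPLCONFIGDIR"] = job_dir
    return job_dir


job_config_dir = use_private_config_dir()
# matplotlib would have to be imported only after this point, e.g.:
# import matplotlib
```

Each job would then need to remove its own directory when it finishes, and
any customised matplotlibrc would have to be copied into it first.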

Thanks in advance for any help

Ian Harry

-- 
---------------------------------------------------------------------------
Ian Harry
School of Physics & Astronomy
Queens Buildings, The Parade
Cardiff, CF24 3AA
Email: Ian...@as...
Phone: (+44) 29 208 75120
Mobile: (+44) 7890 479090
---------------------------------------------------------------------------