|
From: nertskull <ner...@gm...> - 2014-05-01 12:09:11
|
I am trying to create a multipage pdf of about 750 different graphs. Each graph has around 5,000-15,000 data points, giving me roughly 7 million points across the pdf. I make it as a large pdf with a page length of about 20 inches and plot about 10 graphs to a page, so I end up with basically 75 pages in my pdf. I'm basically trying to graph a line of XY data points.

The problem is that the pdf is unbearably slow when plotting as a scatter plot or as a line with markers. If I make a regular line plot, with no markers, just a single line, it is plotted and the pdf is fine. But then it connects my points, which I don't want.

I assume this is all because it's making the pdf in vector format. When I convert it to single lines, I only have ~750 line vectors. But when I try to scatter plot, or line plot with markers, I end up with millions of vectors.

I've tried 'rasterized=True' and that definitely works, but the quality is really bad. I need to be able to zoom in close on the pdf and still see rough resolution of the points. For clarity, I don't actually need to see each individual point. The graphs have two lines on them, and I just need to be able to distinguish between the two lines; the two lines are just made up of thousands of points each.

Is there any way to keep scalable vectors and do this? Or will I just be forced to go to a rasterized image file in order to load the pdf in a reasonable time?

Thanks.

--
View this message in context: http://matplotlib.1069221.n5.nabble.com/Millions-of-data-points-saved-to-pdf-tp43338.html
Sent from the matplotlib - users mailing list archive at Nabble.com.
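[Editor's note: one middle ground worth knowing about here, sketched below rather than taken from the thread, is that `rasterized=True` only rasterizes the artist it is applied to (axes, ticks and text stay as vectors), and the `dpi` argument to `savefig` controls the resolution of that rasterized layer. A higher dpi can make the rasterized points hold up much better under zoom. The data and file name are illustrative.]

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')          # headless backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 5000)
y = np.sin(40 * x) + 0.05 * np.random.randn(5000)

fig, ax = plt.subplots()
# rasterized=True embeds only this artist as an image inside the pdf;
# the surrounding axes remain vector graphics.  The dpi passed to
# savefig sets the resolution of the rasterized layer, so 600 dpi
# keeps zoomed-in views far sharper than the default.
ax.plot(x, y, '.', markersize=1, rasterized=True)

out = os.path.join(tempfile.mkdtemp(), 'demo.pdf')
fig.savefig(out, dpi=600)
size = os.path.getsize(out)
```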
|
From: Alan G I. <ala...@gm...> - 2014-05-01 12:41:11
|
Suppose each data point is only 1 point (1/72 ") in diameter. A solid line across a 20" page is less than 1500 points. You're using a fraction of a page per graph and trying to plot 5,000-15,000 points per graph. This is pointless (pun intended) for visual display, especially since you do not care about the individual points. What happens if you decimate the points? Is the result acceptable? Perhaps you could do even better than that, given your posted description. Fit a line to the points, and only plot the fitted line. Or use something like `hexbin`. Alan Isaac |
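[Editor's note: the decimation Alan suggests is just a stride slice; a minimal sketch with made-up names and data follows.]

```python
def decimate(xs, ys, step=10):
    """Keep every `step`-th point (simple uniform decimation)."""
    return xs[::step], ys[::step]

x = list(range(10000))
y = [v * 2 for v in x]

# 10,000 points reduced to 1,000 -- usually indistinguishable on paper
xd, yd = decimate(x, y, step=10)
```

The caveat raised later in the thread applies: plain every-Nth decimation can drop isolated outliers and very short features.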
|
From: Shahar S. K. <ka...@po...> - 2014-05-01 12:48:54
|
How about different line styles or colors instead of markers?
|
From: Shahar S. K. <ka...@po...> - 2014-05-01 13:08:44
|
What do you consider a gap? Perhaps, if you know that, you can find those in your data; and if you really want to visualize the gaps, plot those instead of the data.
|
From: Dominik K. <dk...@as...> - 2014-05-01 13:22:18
|
Hi,

when reading the number of points you have in each plot, I have to ask why you need so many (plotted) data points. If you plot e.g. every 10th or 50th data point, you reduce the number of points by a factor of 10 (or 50). This should make the PDF smaller and faster, and even if you zoom into each plot you should still be able to see enough detail (of course, if there are one or two outliers you might not see them). And you are probably not able to distinguish between two data points that are too close to each other anyway, so you probably don't need every data point.

Cheers,
Dominik

--
Dominik Klaes
Argelander-Institut für Astronomie, Bonn
|
From: nertskull <ner...@gm...> - 2014-05-01 13:28:17
|
No, we definitely aren't really interested in the gaps. Gaps are just where we were unable to collect the data.

I don't know if we can attach pictures to this thread or not, but I'm going to try. The attached is roughly what I want, but with all 750 as vectors. I want to see the 'movement' of the line, but I need the gaps to remain, so I know where they are.

The problem with plotting a reduced data set is that I lose some of the very small sections of line. I'll play around with that idea, but we want to be able to zoom in on a vector file and see the tiny areas of fewer than 10 points that would be lost if we plot a reduced data set.

But it sounds like this is unlikely to work in vector graphics form. It's just too much to do without reducing the dataset.

<http://matplotlib.1069221.n5.nabble.com/file/n43344/figure_1.png>
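[Editor's note: matplotlib breaks a line at NaN values, which fits this exact requirement: insert NaNs at the gaps and plot with '-' only, so the gaps stay visible without any markers. A sketch, with an assumed gap threshold `max_dx` and made-up helper name:]

```python
import math

def with_gaps(x, y, max_dx=1.0):
    """Insert a NaN wherever consecutive x values are farther apart
    than max_dx, so ax.plot(x, y, '-') leaves a visible gap instead
    of connecting across missing data."""
    xo, yo = [x[0]], [y[0]]
    for i in range(1, len(x)):
        if x[i] - x[i - 1] > max_dx:
            xo.append(float('nan'))
            yo.append(float('nan'))
        xo.append(x[i])
        yo.append(y[i])
    return xo, yo

x = [0, 1, 2, 10, 11]
y = [5, 5, 5, 7, 7]
xg, yg = with_gaps(x, y)
# ax.plot(xg, yg, '-') would now draw two separate segments
```

This keeps the output a pure vector line plot (a handful of moveto/lineto commands per graph), avoiding markers entirely.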
|
From: Benjamin R. <ben...@ou...> - 2014-05-01 13:35:49
|
This makes me wonder if you would be better served with something like bokeh: http://bokeh.pydata.org/

Cheers!
Ben Root
|
From: Jouni K. S. <jk...@ik...> - 2014-05-01 17:19:03
|
nertskull <ner...@gm...> writes:
> The problem, is the pdf is unbearably slow when plotting as a scatter plot
> or as a line with markers.
>
> If I make a regular line plot, with no markers, just a single line, it is
> plotted and the pdf is fine. But then it connects my points which I don't
> want.
Others have commented on the volume of data, but that paragraph makes
me curious: are you saying that the results are acceptable if you do
something like
plot(x, y, '-')
but not if you do
plot(x, y, 'o') or plot(x, y, '-o')?
The amount of data in the pdf file should be within a constant factor in
all cases, but in the '-' case there are only moveto and lineto commands,
while the two other cases render markers as something called an XObject,
which is repeated a lot of times on the page. I wonder if the overhead
from using an XObject is making the rendering application slow.
Does it help at all to use a simpler marker, e.g. plot(x, y, ',')? One
change you could try if you're feeling adventurous is the following
function in lib/matplotlib/backends/backend_pdf.py:
def draw_markers(self, gc, marker_path, marker_trans, path, trans,
                 rgbFace=None):
    # For simple paths or small numbers of markers, don't bother
    # making an XObject
    if len(path) * len(marker_path) <= 10:
        RendererBase.draw_markers(self, gc, marker_path, marker_trans,
                                  path, trans, rgbFace)
        return
    # ...
The comment is not quite right: only if the path is short *and* the
number of markers is small does the XObject code get skipped. You could
just change the if statement to "if True:" and rerun your code (possibly
with the ',' marker style). If that helps, it's evidence that we need to
revisit the condition for using XObjects for markers.
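[Editor's note: for anyone who would rather not edit the installed backend_pdf.py, the same experiment can be tried with a monkeypatch. This is only a sketch (the function name is made up), forcing the pdf renderer onto the generic, non-XObject marker path; it assumes the `RendererPdf` class and the `RendererBase.draw_markers` signature shown above.]

```python
from matplotlib.backend_bases import RendererBase
from matplotlib.backends import backend_pdf

def draw_markers_no_xobject(self, gc, marker_path, marker_trans,
                            path, trans, rgbFace=None):
    # Always fall back to the generic implementation, which emits each
    # marker as part of the page content instead of an XObject.
    RendererBase.draw_markers(self, gc, marker_path, marker_trans,
                              path, trans, rgbFace)

# Patch before saving the figure to pdf.
backend_pdf.RendererPdf.draw_markers = draw_markers_no_xobject
```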
--
Jouni K. Seppänen
http://www.iki.fi/jks
|
|
From: nertskull <ner...@gm...> - 2014-05-01 17:50:44
|
That definitely helps. Here's what I did.

First: yeah, the results are totally acceptable if I do '-' as my line/marker. The pdf renders and loads just fine. If I do 'o' or even ',' as my marker, then the pdf is horrendously slow. I'm talking minutes to render a page.

So I tried your idea of altering the backend. If I change that line to "if True:" then I get MUCH better results. But I also get enormous file sizes.

I've taken a subset of 10 of my 750 graphs. Those 10, before changing the backend, would make file sizes of about 290KiB. After changing the backend, if I use plot(x, y, '-') I still get a file size of about 290KiB. But after changing the backend, if I use plot(x, y, '.') for my markers, my file size is now 21+ MB, just for 10 of my graphs. I'm afraid making all 750 in the same pdf may be impossible at that size.

BUT, at least now I can render those 10 in vector format. Before, it took the pdf minutes to load a page; now it only takes maybe 15-20 seconds to load a page of 10 graphs. So that definitely helped. Thanks!

Is there any way to do this even better? At this rate I'd have to split my pdf into multiple chunks, and it really isn't ideal to have to send people 70 pdf files. Is there any way to have reasonable pdf sizes as well as this improved performance, while keeping them in vector format?

Thanks again.
|
From: Daniele N. <da...@gr...> - 2014-05-02 12:25:45
|
On 01/05/2014 19:50, nertskull wrote:
> Is there any way to have reasonable pdf sizes as well as this improved
> performance for keeping them in vector format?

As others have tried to explain, plotting that many points in a single plot does not make much sense. The only thing that makes sense is to down-sample your data to a manageable size. Depending on which features of your data you are interested in, there are different methods for doing that.

PS: which viewer are you using to render the PDF? I believe different renderers may have substantially different performance on such PDFs...

Cheers,
Daniele
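[Editor's note: down-sampling can preserve the small features nertskull worries about if done with a min/max envelope per bucket rather than plain every-Nth decimation: spikes survive because each bucket contributes its extremes. A sketch with a made-up helper name:]

```python
def minmax_downsample(y, bucket=50):
    """Collapse each bucket of samples to its (min, max) pair,
    preserving spikes that every-Nth decimation would drop."""
    out = []
    for i in range(0, len(y), bucket):
        chunk = y[i:i + bucket]
        out.append(min(chunk))
        out.append(max(chunk))
    return out

y = [0.0] * 1000
y[137] = 9.9                       # a one-sample spike
ys = minmax_downsample(y, bucket=50)
# 1000 samples -> 40, and the spike survives
```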
|
From: Jouni K. S. <jk...@ik...> - 2014-05-02 15:54:23
|
nertskull <ner...@gm...> writes:
> If I change that line to "if True:" then I get MUCH better results.
> But I also get enormous file sizes.
That's interesting! It means that your pdf viewing program (which one,
by the way? Adobe Reader or some alternative?) is slow at compositing a
large number of prerendered markers, or perhaps it just renders each of
them again and again instead of prerendering, and does so more slowly
than if they were part of the same path.
> I've taken a subset of 10 of my 750 graphs.
>
> Those 10, before changing the backend, would make file sizes of about
> 290KiB. After changing the backend, if I use plot(x, y, '-') I still
> get a file size about 290KiB.
>
> But after changing the backend, if I use plot(x, y, '.') for my markers,
> my file size is now 21+ MB. Just for 10 of my graphs. I'm afraid making
> all 750 in the same pdf may be impossible at those size.
Does using ',' (comma) instead of '.' (full stop) as the marker help? I
think the '.' marker is a circle, just at a small size, while the ','
marker is just two very short lines in the pdf backend. If the ','
marker produces an acceptable file size but its shape is not good
enough, we could experiment with creating a marker of intermediate
complexity.
One thing that I never thought about much is the precision in the
numbers the pdf backend outputs in the file. It seems that they are
being output with a fixed precision of ten digits after the decimal
point, which is probably overkill. There is currently no way to change
this except by editing the source code - the critical line is
r = ("%.10f" % obj).encode('ascii')
where 10 is the number of digits used. The same precision is used for
all floating-point numbers, including various transformation matrices,
so I can't offer a simple rule for how large deviations you will cause
by reducing the precision - you could experiment by making one figure
with the existing code and another with '%.3f', and see if the latter
looks good enough at the kind of zoom levels you are going to use (and
if it really reduces the file size much - there's a compression layer on
top of the ASCII representation).
That reminds me: one thing that could have an effect is the
pdf.compression setting, which defaults to 6 but you can set it to 9
to make the compressed size a little bit smaller, at the expense of
spending more time when writing the file. That's not going to be a major
difference, though.
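[Editor's note: the pdf.compression setting Jouni mentions can be changed per-script through rcParams rather than an rc file; a minimal sketch:]

```python
import matplotlib

# 0 disables compression, 6 is the default, 9 is smallest but slowest.
matplotlib.rcParams['pdf.compression'] = 9
level = matplotlib.rcParams['pdf.compression']
```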
> Is there anyway to have reasonable pdf sizes as well as this improved
> performance for keeping them in vector format?
Like others have recommended, rendering huge clouds of single points is
a problematic task. I think it's an entirely valid thing to ask for, but
it's not likely that there will be a perfect solution, and some other
way of visualizing the data may be needed. Bokeh (suggested by Benjamin
Root) looks like something that could fit your needs better than a pdf
file in a viewer.
--
Jouni K. Seppänen
http://www.iki.fi/jks
|
|
From: <cl...@br...> - 2014-05-02 17:05:29
|
Dear colleagues,
I had a similar issue with a large plot and several thousand elements,
printed under Linux with the Qt4Agg back-end. With the PDF renderer I got
some vector overlay and distortion of markers in the drawing, so I changed
the plotting output into a two-step process: first generating a high
resolution ".png" file, and then using the Python Imaging Library to
compress it into a much smaller .jpeg output, which produces a browser
friendly file or an input source for PDF editors like OpenOffice.
Source:
from PIL import Image  # works with both classic PIL and Pillow
# figure size in inches; the PNG is rendered at dpi_resolution and
# later resized to 16000 x 12000 pixels
w = 80
h = 60
#
dpi_resolution = 400
fig.set_size_inches(w, h)
DPI = fig.get_dpi()
print("DPI:", DPI)
Size = fig.get_size_inches()
print("Size in inches:", Size)
myformats = plt.gcf().canvas.get_supported_filetypes()
print("Supported formats are: " + str(myformats))
mybackend = plt.get_backend()
print("Backend used is: " + str(mybackend))
# save a high-resolution screen copy
fig.savefig('myplot.png', format='png', dpi=dpi_resolution)
# JPEG compression with quality of 10
myimage = Image.open('myplot.png')
myimage = myimage.resize((16000, 12000), Image.ANTIALIAS)
# quality = 10% .. very high compression with few blurs
quality_val = 10
myimage.save('myplot.jpg', 'JPEG', quality=quality_val)
The visual result looks acceptable, with no distortion. This process gives
you some control over compression and quality.
Hope this is useful.
Regards,
Claude
Claude Falbriard
Certified IT Specialist L2 - Middleware
AMS Hortolândia / SP - Brazil
phone: +55 13 9 9760 0453
cell: +55 13 9 8117 3316
e-mail: cl...@br...
|