PhpGedView / Feature Requests / #657 Use MySQL blob-based tables for images

KosherJava - 2006-08-18

Logged In: YES
user_id=634811

This was discussed on the developer list in Jan 2004. Here
is a post from John:

By: John Finlay - yalnifjProject AdminAccepting Donations
RE: mySQL Blobs for images?
2004-01-08 19:36
There are some disadvantages to storing the images as blobs
in MySQL and it can be difficult to make work.

The big disadvantage is that the queries are slow. When I
originally did the Employee Scheduler application I stored
the employee's pictures in a blob field in the same user
table. This slowed down the database queries on the order of
100 times slower.

As long as you use a seperate table just for storing blobs I
don't think there will be any major database problems.

The other problems you will run into is setting the MIME
type headers correctly and using PHP to print out the binary
data. If you get those right it should work great.

--John

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Conlon - 2006-09-18

Logged In: YES
user_id=67152

I've done a lot of applications that store images in mysql,
and I find many advantages to this approach.

Any suggestions on how to get involved?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Greis - 2006-09-18

Logged In: YES
user_id=1578309

The PhpGedView team could best respond on how you could get
involved.

Thinking out loud, it seems that this can be divided into
two pieces - the data schema for the images and the changes
to the PHP coding that manages the images.

One possible schema would have several tables:
- One table with the full size images
- One table with the thumbnails and a pointer to the table
entry for the full size images, along with information about
the image (including text for searching for the image).
- One table with pairwise entries, with each entry
consisting of a family/individual identifier and a media
identifier, so you can easily query and select all the
images associated with an individual or family

Each individual or family can have a primary (thumbnail)
image that is displayed on its page and on charts with it.
You'd have to give some thought to how best to represent
that, since you'd want fast lookup and retrieval of the
thumbnails for display (hence putting the full-size images
in a separate table).

If there were a one-to-one relationship between
indviduals/families and PRIMARY images, you could include a
field in the "thumbnail" table designating that image as the
PRIMARY image for a specific individual/family. However,
there is nothing that currently prohibits you from having
the same image serve as the primary image for multiple
individuals. You could either enforce that one-to-one
(probably less desirable) or put those relationships elsewhere.

Another question would be how to make the names of all the
people linked to the image searchable as well (i.e., when
you search the photos, you'd like to be able to use the name
of a person who has been linked to the image, presumably as
he or she is in it).

The second part would involve changes to the PHP so the
images are retrieved from MySQL rather than a directory on
the disk. The "manage media" portion would be relatively
self-contained, but the changes for the use of the images
would be more extensive throughout the code and would, I
imagine, require the cooperation of a number of the coders
working on PHP.

Perhaps one of the owners/coders could offer some thoughts.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Conlon - 2006-09-18

Logged In: YES
user_id=67152

Having done this many times, generally all files get stored
in a single table. Doesn't matter what size they are.

A related tables manages image meta data (size, description,
copyright, etc.).

Related tables allow many to many relationships of
images/GED entities -- eg., many people can be in one image;
one person can be in many images.

Choosing primary image to be displayed is presumably already
done. Presumably, we only need to change the table/key
relationship in existing code that manages iamges.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Finlay - 2006-09-18

Logged In: YES
user_id=300048

I see no advantage to storing images in the database as
opposed to storing them on the file system.

Why would we want to implement this?

--John

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Conlon - 2006-09-18

Logged In: YES
user_id=67152

Here are a few reasons.

1. unified access control
Images stored in the file system rely on completely
different authentication and access control mechanism, such
as .htacess. all of the admin controls that apply to
existing data elements would apply to image data. (As
opposed to merely controlling a file system pointer).
Setting up .htaccess control for nested file systems and
multiple users, with different privileges is not for the
faint of heart.

2. scalability
Though less of an issue with modern file systems,
directories of many thousands of files would get slow.
Databases don't have this issue. Also I would rather use
mySQL's replication services than mirror the file systems.

3. The many-to-many relationships of images to persons
seems to me to be a logical extension of the GedView
capabilities. Of course, there is the issue of how to
re-export GEDCOMs.

4. File ssytem security. Don't know about you, but I'm
awful reluctant to give the php user access to anything
other than a specified database. I don't want it traversing
my file system, I certainly don't want it writing to my file
system, and I especially don't want want it executing
anything. In any event, it's hard to execute a trojan
stored in the database, as opposed to getting it into the
file system.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Greis - 2006-09-19

Logged In: YES
user_id=1578309

In addition to those system/access reasons, it just seems a
lot cleaner to have all that data in a managed environment,
where it is more easily backed up and more portable.

The limitations of the GEDCOM format and standard have been
worked around by creating references to external files for
images, which is a real kluge. I haven't followed this long
enough to have a sense whether an XML standard will really
be implemented, but there is an inherent logic to defining
and using containers for objects that are not bound by the
constraint of a file system. (If you think about it, the
idea of long text passages implemented as continuation lines
in a GEDCOM file could potentially be better implemented in
an external environment as well.)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Greis - 2006-09-19

Logged In: YES
user_id=1578309

Question for wmconlon -

I suggested separating the images from the thumbnails
because the images themselves are only displayed when
specifically requested, whereas the thumbnails are used all
over the PhpGedView screens. I'm not a SQL performance
person, but it didn't seem to make sense to be putting
potentially very large items in a table that has to have
fast lookup and response. (I'm thinking of the system
itself and how it handles and retrieves the data from the
tables; I know the keys will ensure fast searching.) And
since there's a one-to-one correspondence with the
thumbnails (you create a thumbnail for each image), it
seemed reasonable to separate the full-size images from the
thumbnails.

Would it in fact make little or no difference from a
performance standpoint?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Finlay - 2006-09-19

Logged In: YES
user_id=300048

1. unified access control
I can see this as a valid reason, however it can be acheived
without a database. Just put the image files in a context
outside of the web directory and use the same logic for
displaying them from a database for displaying them from
another location.

2. scalability
Having implemented blob storage for images before, I would
rather not have a bloated and slow database. Eventually all
storage goes back to the file system and the hard drive even
if it is in the database. All it is doing is shifting where
the resources are.

PGV allows you to have a hierarchy of folders in your media
directory. So not all files are in a single directory.

If you have the ability to replicate a database, then I
would also think that you could setup rsync to replicate the
filesystem.

3. The many-to-many relationships of images to persons...
This already exists and is supported by GEDCOM. You don't
need a database for it.

4. File sytem security.
Beyond what you listed in #1, I don't see it as being any
more of a security risk. If you allow users to put
something "anywhere" on the server, and then allow them to
get it back, then it is a security risk.

Here are some disadvantages to storing images in a database:
A. FTP access
You cannot upload images outside of the program. Most of
the PGV admins use FTP to upload/download files. Without
FTP access to their images we would have to also provide
clients for batch processing etc.

B. PHP upload limits
Related to "A" above, most PHP installations limit you to a
2MB upload size. If you have objects that are larger than
2MB you will have no way to get them into the database.
There are also things besides images that need to be linked
to people.

C. Memory limits
PGV is already pushing the memory limits of PHP. If we then
have to grab blobs out of the database just to show the
pictures, we will be increasing the required memory limits.

--John

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Conlon - 2006-09-19

Logged In: YES
user_id=67152

1. unified access control.

disagree. If you want to control access to the image, the
HTTP GET request must be processed by the applicaion. images
in the file system bypass the applicaiton (even if links to
those images are generated by the app) and are accessible
directly via a URI, subject only to apache access control,
which is not as granular nor as easy to manage as
application-based ACLs.

2. scalability.
Not al file systems are created equal. Your point is valid
about disk I/O -- certainly the image needs to come off the
disk no matter how it is stored. My experience is clearly
different, as I find that retrieving images from mySQL ISAM
tables is very fast.

Yes, we could run rsync on every update. Tell me how a web
farm would be scaled. User A uploads image A.jpg to Server
A and User B uploads image A.jpg to Server B. They might or
might not be the same image, but someone will have their
image overwritten during mirroring.

This brings up another issue -- naming. Images in the file
system depend on their path/name to be "unique" whereas
images in the db have a unique key. We can use an MD5 hash
to identify suspected duplicates.

3. Actually, I don't know anything about the GEDCOM
standard. I'm glad to hear it wouldn't be a problem to
deploy the many-to-many relationships back to a downloaded
GEDCOM.

4. security.
Assuming I can get a shell, I can ./my_trojan.jpg, but I
can't execute a BLOB

A. FTP.
ok. I don't run FTP on my systems for my own security
reasons. It seems to me that if the master image is
maintained at a GEDview server, the downloaded GEDCOM would
just reference the URI at the server:
http://myserver.com/image.php?_uid1=FE123
so there wouldn't be a need for bulk downloads. Regarding
uploads, yes, it would be convenient to bulk process images
into the db.

B. Upload limits.
Yes. And mysql max_allowed_packet comes into play. This
might be a reason to write a utility for uploading large
images. Speaking of which, I agree that any mime type could
be stored (that's what I do in my apps).

C. Memory
Hmm. That is a problem. I guess I need to install
phpGEDView to see this in action. But we could use mod_perl
to show the images:
http://myserver.com/cgi-perl/image.pl?_uid1=FE123
to relieve the php interpreter.

--bill

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Finlay - 2006-09-19

Logged In: YES
user_id=300048

Bill,

1. unified access control
I think you misunderstood me. If the image is on the file
system outside the web root, you cannot access it from the
webserver using a URL. Just as you would stream it from the
database and alter the HTTP headers to return it from a
database, you would do the same thing to stream it from
another location on the filesystem. You would have to do
the same thing to alter the HTTP headers to set the correct
MIME type etc that you do from a database. It is exactly
the same concept, but instead of getting the blob from the
database you get it from a non accessible location on disk
outside the web root. Then the program has to provide all
access to it and you don't have to deal with .htaccess.

2. Scalabilty
I used to work for a large library on their digital
collections department. We evaluated a lot of software to
help run our online digital collections. Those that stored
digital media on the file system stored them using a hash or
renamed the images.

But for 99% of PGV admins this is unnecessary. Most of them
will not be scaling to multiple servers and most of them
will only be confused if their images do not have the same
filename on the server, in the GEDCOM, and on their home
computers.

4. Security
If you can get a shell, you have a lot more to worry about
than someone executing a file they uploaded. It is slightly
more difficult to execute something from the db, but still
just as possible. Good security programming principles are
the best way to ensure security.

Also, only authenticated users may upload files. So you
would know who uploaded any file to your server.

From my perspective, I still don't see a reason to put
development resources towards this when there are so many
other things that need to be worked on. We still have a
great deal of user-friendliness issues to work out when it
comes to media.

But if someone from the open source community wants to work
on it as an optional feature, then we could include it in
the project.

--John

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Greis - 2006-09-19

Logged In: YES
user_id=1578309

This is a great discussion and I really appreciate the back
and forth. I'm learning a lot and it's great to dig in to
some of these questions and issues that have applicability
beyond this specific "feature" request.

Follow-up on a few of the points:

- "There are also things besides images that need to be
linked to people."

Absolutely. Since you can store MIME-type information, you
can potentially make available many forms of information
under "media", and they would be stored along with
information about how to "access" them. Might that not make
it easier, down the road, to allow a user to launch another
program to use or update information (images, video, voice,
text)? The items are also nicely encapsulated in the
database where they can't be accessed outside of the program.

2. "scalability
Having implemented blob storage for images before, I would
rather not have a bloated and slow database. Eventually all
storage goes back to the file system and the hard drive even
if it is in the database. "

That makes sense and was why I was thinking that the
full-scale "images" should be in their own table - they
won't actually be accessed that often.

3. FTP access

I'm not sure I understand why this is a problem. Couldn't
you just continue to use whatever you use now to get the
files onto the system and then run the procedure that puts
them into the database?

That eliminates the entire problem of figuring out the
hierarchy and naming convention for the files. Having
loaded a small number of images into PhpGedView, I would
like nothing better than to load images without having to
worry about creating unique file names for them.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Conlon - 2006-09-19

Logged In: YES
user_id=67152

1. unified access control
sure. I was thinking you meant Apache Alias directives.
But this now involves a php file read into $blob as opposed
to sql_select into $blob. Memory use is roughly the same.
I need to think more about this, as it seems like taking
image control away from the app opens up potential database
inconsistencies.

2. scalability
yes. but the filename is part of the http header, so as long
as it is metadata in a table, it doesn't change for the client.

4. Security
Ahh, but the web server (say user apache) now has write
access, and I presume execute so it can create a new
directory. So I have to be sure there are no potential
flaws in the php code I'm running. But now I see why you
keep the uploads out of the web root.

Resources.
My first post was "how do I get involved?" Can you tell me
whether there is an architecture in place for 'mods' since
the ideal approach would be to have a media_store mod that
just supplanted to file system store.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Finlay - 2006-09-19

Logged In: YES
user_id=300048

> Resources.
> My first post was "how do I get involved?" Can you tell me
> whether there is an architecture in place for 'mods' since
> the ideal approach would be to have a media_store mod that
> just supplanted to file system store.
>

PGV does not have an abstraction for where uploads are
stored. The changes to implement one would be minimal since
there are only a few plages where files can be uploaded from.

PGV does have an architecture for plugins or modules similar
to the *Nuke CMS systems. Others have successfully ported
Gallery and other image manipulation programs over to PGV
modules. It would probably be a good idea to build on an
existing open source app like Gallery to do this. You can
get the latest PGV and the Gallery mod from the SVN.

I haven't looked at Gallery for some time, so I don't know
if they store the images in the DB. When I last used
gallery they did not use a DB but that was several years ago.

--John

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Use MySQL blob-based tables for images

Group

Searches

Help

#657 Use MySQL blob-based tables for images

Discussion