Re: [Pagekit-users] Character set or parsing issue?

Thanks, Boris.

It's a fairly large application, so using the "trick" to detect the 
encoding that the browser is using would work, but it might be a bit of a 
pain to build into the entire app.

Actually, using Encode::encode and encoding to UTF8 before outputting via 
$model->fillinform() seemed to work and "fix" the output.  I'm not sure that 
this is the "proper" way of doing things though.  Rather than modifying the 
entire app and each "fillinform()" call, I was considering modifying the 
PageKit code that calls HTML::FillInForm and wrapping the Encode::encode 
call in there.  I'm not sure if this is a great way to go, or if it is safe in 
general.  Actually, I don't pretend to be all that good when it comes to 
character sets and encoding in general.
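
To make the idea concrete, here is a minimal sketch of the kind of wrapper 
I mean.  The helper name encode_values_utf8 and the flat hash of form data 
are made up for illustration; this is not PageKit code, and the place where 
PageKit actually calls HTML::FillInForm may look quite different.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of PageKit): return a copy of a flat hash
# of form data with every defined, non-reference value encoded to UTF-8
# octets. Undefined values and references are passed through unchanged.
sub encode_values_utf8 {
    my (%fdat) = @_;
    my %out;
    for my $key ( keys %fdat ) {
        my $value = $fdat{$key};
        $out{$key} = ( defined $value && !ref $value )
            ? Encode::encode( 'UTF-8', $value )
            : $value;
    }
    return %out;
}

# Example: "Test" followed by U+00E1, as it might come from the database.
my %encoded = encode_values_utf8( name => 'Test' . chr(0xe1) );
```

One caveat with this approach: if a value already holds UTF-8 octets 
rather than characters, encoding it again would double-encode it, so it 
probably only makes sense where the incoming data is known to be character 
data.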

What are your thoughts on this?

Thanks,
Russell

----- Original Message ----- 
From: "Boris Zentner" <bz...@2b...>
To: "Russell D. Weiss" <rw...@in...>
Cc: <pag...@li...>
Sent: Thursday, June 15, 2006 5:32 PM
Subject: Re: [Pagekit-users] Character set or parsing issue?

Hi,

On 14.06.2006 at 22:10, Russell D. Weiss wrote:

> Hello all,
>
> Long time no post :-).
>
> We've encountered the following problem in our PageKit application.  When
> pulling data from a database and using $model->fillinform to populate form
> fields, we're seeing problems when the data contains certain international
> characters, such as "á" -- as well as potentially some others.
>
> Basically, if the word "Testá" is pulled from the database, the HTML form
> field will look like:
>
> <input type="text" value="Test? name="blah">
>
> Obviously, this causes problems, as the value is not terminated properly
> with a quote, and the true value is not shown in the form field.  I tested
> HTML::FillInForm separately and this problem does not appear.  The problem
> may be due to some parsing that PageKit does after it runs the page through
> HTML::FillInForm.
>
> Boris and others, do you have any idea as to what might cause this?
>

I remember that problem. The reason is that you lost the encoding of
your string somewhere. MySQL, for example, always gets this wrong.
PageKit tries to work around the problem by removing the utf8 flag
before passing the data to fillinform, and forcing utf8 on the result.
There might be a bug, or some strings in your page are not in the
proper encoding. Another source of such errors is Apache::Request,
since anything stored there loses the utf8 flag.

I think some of your input is not in your default_input_charset. The
source of the problem is your database or your input params. PageKit
does the right thing, as far as I know, for all sorts of input unless
you mix the charsets.

I have working solutions for any input, but it takes a while to
explain all of them ;-) The first is to convert all my inputs from the
database to default_input_charset with Encode::decode, or to use
Postgres with pg_enable_utf8 ;-) The second is to get the right charset
for form fields, since browsers answer in a different charset
from time to time. The trick is to send a hidden field with a known
character like 'á' and check that first. If it comes back the same as
'á' in latin1, you know the encoding for all the other fields;
otherwise, compare against another charset. The third is to use my own
Apache::Request object to handle the utf8 flag correctly. I can show
an example if you like. It is really hard to handle the charset issue
correctly. If there is a mistake, you get a '?' for the character in
question. The form trick is explained, more or less, in my answers to
this thread on PerlMonks:

   http://www.perlmonks.com/index.pl?node_id=401315
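
A rough sketch of the hidden-field trick, under some assumptions: the
helper name guess_form_charset is made up for illustration, only latin1
and UTF-8 are checked, and the raw field values are given as literal
octets instead of being read from a real request.

```perl
use strict;
use warnings;
use Encode ();

# The known character sent in the hidden field: U+00E1 (a with acute).
my $probe = chr(0xe1);

# Hypothetical helper: given the raw octets the browser sent back for the
# hidden field, guess which charset the browser used for the whole form.
sub guess_form_charset {
    my ($raw) = @_;
    return 'iso-8859-1' if $raw eq Encode::encode( 'iso-8859-1', $probe );
    return 'UTF-8'      if $raw eq Encode::encode( 'UTF-8',      $probe );
    return;    # unknown: compare against further charsets as needed
}

# Here the hidden field came back as the two octets C3 A1, so the other
# fields of the same form are decoded as UTF-8.
my $charset  = guess_form_charset("\xc3\xa1");
my $raw_name = "Test\xc3\xa1";
my $name     = Encode::decode( $charset, $raw_name );
```

After decoding, $name is a character string, so it can be compared with
or concatenated to other decoded data without mixing charsets.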

I know it really is confusing. What version of PageKit do you use? I
remember there was a change to handle more of the wrong cases. Feel free
to ask something more specific and I will try to come up with a better
description ;-)

The basic problem is this:

use Encode;
use DBI;

# setup a test database
my $dbh = DBI->connect( "dbi:SQLite:dbname=/tmp/dbfile",
                         "", "", { PrintError => 0, RaiseError => 1 } );
eval { $dbh->do(q{ CREATE TABLE t_storage ( id INTEGER, str VARCHAR(255) ) }) };
eval { $dbh->do(q{ DELETE FROM  t_storage }) };

# and our test strings
my $str      = 'test' . chr(0xe1);              # latin1 string
my $utf8_str = decode( 'iso-8859-1', $str );    # same string, but with the utf8 flag set

compare( "compare  \$str, \$utf8_str:\n", $str, $utf8_str );

# serializing the data into a database removes the utf8 flag; only postgres
# can handle this correctly, on request
$dbh->do( q{ INSERT INTO t_storage VALUES ( 1, ? ) }, {}, $str );
my ($str_from_db) =
   $dbh->selectrow_array(q{ SELECT str FROM t_storage WHERE id = 1 });
$dbh->do( q{ INSERT INTO t_storage VALUES ( 2, ? ) }, {}, $utf8_str );
my ($utf8_str_from_db) =
   $dbh->selectrow_array(q{ SELECT str FROM t_storage WHERE id = 2 });

# compare again
compare( "compare \$str_from_db, \$utf8_str_from_db:\n",
          $str_from_db, $utf8_str_from_db );

# compare again
compare( "compare \$utf8_str, \$utf8_str_from_db:\n",
          $utf8_str, $utf8_str_from_db );

{
   use bytes;
   print "compare binary \$utf8_str, \$utf8_str_from_db:\n";
   print $utf8_str eq $utf8_str_from_db ? "same" : "different", $/, $/;
}

#

########
## Subs
########
sub compare {
   print shift;
   my ( $s1, $s2 ) = @_;

   # compare
   {
     use bytes;
     print length $s1, $/;    # length $str
     print length $s2, $/;    # length $utf8_str
   }

   # surprise for most people
   print $s1 eq $s2 ? "same" : "different", $/, $/;
}

> Thanks,
> Russell
>
>
>
> _______________________________________________
> Pagekit-users mailing list
> Pag...@li...
> https://lists.sourceforge.net/lists/listinfo/pagekit-users

--
Boris
