Thread: [sqlmap-users] Back-end DBMS charset encoding
From: Miroslav S. <mir...@gm...> - 2011-01-17 14:52:09
Hi all.

I have a general question for all the pentesters out there who retrieve data from sites with "funny" charset encodings (Russian, Chinese, ...).

What should be the general "consensus" for data retrieval:

A) assume that the back-end DBMS uses the "utf8" charset encoding, or
B) treat retrieved data as having the same encoding as the page itself, or
C) find out the proper collation in use and work with that one? (I am not a fan of this one :), or
D) don't care (some people tend to use mixed collations, which is quite romantic)

Also, I would like to ask you all to try out the latest revision against cases that could be problematic and report your impressions.

Kind regards
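To make the difference between options A and B concrete, here is a minimal Python sketch (illustrative only, not sqlmap code) of what happens when raw bytes coming from a CP-1251 page are decoded under the "assume utf8" rule versus the page's own declared encoding:

# A minimal sketch: why "assume utf8" (option A) breaks on a CP-1251 page,
# while reusing the page's declared encoding (option B) round-trips cleanly.
raw = "Иван".encode("cp1251")        # bytes as they would arrive from a CP-1251 page

try:
    print(raw.decode("utf-8"))       # option A: raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc)

print(raw.decode("cp1251"))          # option B: prints 'Иван', decoded correctly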
From: Miroslav S. <mir...@gm...> - 2011-01-18 11:11:38
Hi mitchell,

Thank you for your answer - I thought that nobody would reply :)

We have done some serious work in this area over the last few days and would like to get it "stabilized". Please report any "strange" behavior in this area if you encounter it.

Kind regards

--
Miroslav Stampar

E-mail / Jabber: miroslav.stampar (at) gmail.com
Mobile: +385921010204 (HR 0921010204)
PGP Key ID: 0xB5397B1B
Location: Zagreb, Croatia
From: Miroslav S. <mir...@gm...> - 2011-01-20 11:30:47
I am sending you all an update on this area, together with some screenshots. For example:

"latin for latin blind" means: blind inference used, retrieving latin data via a latin page (and a latin connection to the back-end DBMS)
"latin for utf8 error" means: error-based technique used, retrieving utf8 data via a latin page (and a latin connection to the back-end DBMS)
"utf8 for latin error" means: error-based technique used, retrieving latin data via a utf8 page (and a utf8 connection to the back-end DBMS)
...

All the data that you see as '???' is lost irreversibly: in those cases utf8 data was retrieved over a latin connection/page, and the two are inherently incompatible. There is nothing we can do there, as the connection charset is hard-coded in the web page's code - e.g. mysql_set_charset("latin1", $link). So, all in all, sqlmap is doing a great job in this area right now :)

P.S. There was a really nasty problem when the -o switch was used (the --null-connection part): the page encoding was simply reset to 'utf8', which could potentially lead to messy results. Fixed in the last commit.

--
Miroslav Stampar

E-mail / Jabber: miroslav.stampar (at) gmail.com
Mobile: +385921010204 (HR 0921010204)
PGP Key ID: 0xB5397B1B
Location: Zagreb, Croatia
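The irreversibility is easy to reproduce outside of MySQL. The following Python sketch (an analogy for the server-side conversion, not the actual MySQL code path) pushes Cyrillic text through a latin1-only channel, the way a hard-coded mysql_set_charset("latin1", $link) connection would carry it:

# Unmappable characters collapse to '?', and once that happens no client-side
# charset guessing can bring the original data back.
stored = "Привет"                                         # what sits in the utf8 table

over_latin1 = stored.encode("latin-1", errors="replace")  # what a latin1 connection can carry
print(over_latin1)                                        # b'??????' - information already gone

print(over_latin1.decode("latin-1"))                      # '??????' - identical to what sqlmap receives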
From: mitchell <mit...@tu...> - 2011-01-18 11:20:09
Will do :)

# mitchell
From: mitchell <mit...@tu...> - 2011-01-18 11:27:57
Hi Miroslav,

In, say, 80% of the cases I dealt with on Bulgarian sites, the data in the database used the same encoding as the encoding announced on the webpage, usually CP-1251. The rest use UTF-8.

# mitchell
From: Miroslav S. <mir...@gm...> - 2011-01-19 14:26:00
Hi all.

As I was really interested in this issue, I had to set up a testing environment to find out what is going on :)

I chose the simplest (disposable) testing environment: XAMPP, with
two tables: users_utf8 & users_latin
two vulnerable GET pages: get_int_utf8.php & get_int_latin.php

Conclusion, and my answer to the original question ("What should be the general 'consensus' for data retrieval?"): priority among all charsets goes to the encoding of the web page, for three reasons:

1) The connection from the web server to the back-end DBMS will almost certainly be set to a charset "compatible" with the one used by the page itself, which means that all data flowing from the DBMS to the web server is automatically converted to the connection's charset.

2) Once the web server replies with the data, any data that is not compatible with its current character set will in most cases simply have the problematic characters replaced with '?' (as in the latin1 -> utf8 case). For the "error" and "union" techniques this is a big problem, because the data is irreversibly lost.

3) Finding out the "proper" collation is futile, in the sense that in MySQL, for example, a collation can be attached to almost everything (column, table, connection, user, ...), and there is no magic bullet for determining the final collation of the retrieved data in a time-constrained manner. (A small sketch of how many of these knobs MySQL exposes follows below.)

One interesting thing that should be pointed out: you will most probably still run into character-set problems with retrieved data here and there, for one obvious reason. The web page's connection to the back-end DBMS dictates the character set used for retrieved data, and in SQL injection attacks we "violently" reuse that connection for different tables with different character sets/collations that were most probably never meant to be "compatible" with the web page itself - hence information is lost irreversibly during the conversion process.

Kind regards

--
Miroslav Stampar

E-mail / Jabber: miroslav.stampar (at) gmail.com
Mobile: +385921010204 (HR 0921010204)
PGP Key ID: 0xB5397B1B
Location: Zagreb, Croatia
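To illustrate point 3, here is a rough Python sketch (hypothetical host and credentials; pymysql is assumed as the client library, not something the thread prescribes) that only lists the charset- and collation-related server variables MySQL keeps at the session level - and that is before per-database, per-table and per-column settings are even considered:

import pymysql

# hypothetical connection parameters - adjust to a disposable test setup
conn = pymysql.connect(host="127.0.0.1", user="test", password="test", database="testdb")
with conn.cursor() as cur:
    # client, connection, database, results, server, ... charsets can all differ
    cur.execute("SHOW VARIABLES LIKE 'character_set%'")
    for name, value in cur.fetchall():
        print(name, "=", value)
    # and collation is configurable on yet another axis
    cur.execute("SHOW VARIABLES LIKE 'collation%'")
    for name, value in cur.fetchall():
        print(name, "=", value)
conn.close()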
From: Miroslav S. <mir...@gm...> - 2011-01-19 14:30:47
Addendum: the simplest explanation for "priority among all charsets goes to the encoding of the web page" is that, as we need to choose one, let it be the most obvious one :)

--
Miroslav Stampar

E-mail / Jabber: miroslav.stampar (at) gmail.com
Mobile: +385921010204 (HR 0921010204)
PGP Key ID: 0xB5397B1B
Location: Zagreb, Croatia
From: Miroslav S. <mir...@gm...> - 2011-01-19 15:33:28
An update on the current status of sqlmap in this area:

1) If you are using ERROR- or UNION-based injections, the results retrieved from the given page are presented using the page's encoding. If you dump tables whose collation/encoding differs from the page's, you will most probably get the occasional '???'. That happens because the web server's connection to the back-end DBMS tries to convert the retrieved data to the page's charset; when the two charsets are incompatible, the web server will most likely fall back to replacing the problematic characters with something like '?'. sqlmap cannot do anything about this, because we cannot (in normal cases) change the web server's connection charset/encoding (e.g. mysql_set_charset("latin1", $link)).

2) In the other, "blind" cases (BOOLEAN, TIMED, STACKED), where characters are retrieved bit by bit, the web server's encoding is, as of the last commit, used for proper decoding of the inferenced integer (see the sketch below). The latest tests show a great improvement here.

Best regards

--
Miroslav Stampar

E-mail / Jabber: miroslav.stampar (at) gmail.com
Mobile: +385921010204 (HR 0921010204)
PGP Key ID: 0xB5397B1B
Location: Zagreb, Croatia
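A minimal Python sketch of the decoding idea from point 2 (illustrative only - the function name and values are made up for this example, it is not sqlmap's actual routine): bitwise inference yields an integer per character, and interpreting that integer with the page's encoding, rather than assuming ASCII/UTF-8, recovers the right character on e.g. a CP-1251 page.

def decode_inferred(value, page_encoding):
    # 'value' is the single byte value inferred bit by bit
    # (e.g. via ORD()/ASCII() comparisons in the injected queries)
    return bytes([value]).decode(page_encoding, errors="replace")

inferred = 0xC8                                # what a CP-1251 back end hands back for the letter 'И'
print(decode_inferred(inferred, "cp1251"))     # 'И' - page encoding applied, correct character
print(decode_inferred(inferred, "latin-1"))    # 'È' - wrong charset assumption, wrong character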