Thread: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

[cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-14 19:46:54

Hi Guys,

I've run into a little conundrum. Here's the problem:

I want to be able to parse optional username and password in an HTTP
url as in 'http://user:password@host/' with Boost.Spirit2x (the one in
Boost trunk). So far every attempt has either given me compile-time
errors or, once those are fixed, put the host in the user field. Here
is my grammar so far:

                bool ok = phrase_parse(
                        start_, end_,
                        (
                         lit("//")
                         >> -lexeme[*(char_ - ':')]
                         >> -lexeme[':' >> *(char_ - '@')]
                         >> -lexeme['@']
                         >> +(char_ - '/')
                         >> -lexeme['/' >> *(char_ - '?')]
                         >> -lexeme['?' >> *(char_ - '#')]
                         >> -lexeme['#' >> *char_]
                        ),
                        space,
                        result
                        );
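
One possible reading of the symptom, sketched in plain C++ rather than Spirit: Qi's `*` is greedy and a sequence does not backtrack into it, so `*(char_ - ':')` swallows everything up to the first ':' (or the end of input) before the host parser gets a turn. The helpers `take_until` and `greedy_user` below are hypothetical and only emulate the first two steps of the grammar; they are not how Spirit is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: consume characters up to (not including) `stop`,
// the way a greedy *(char_ - stop) would. Advances `pos` past what it ate.
inline std::string take_until(const std::string& in, std::size_t& pos, char stop) {
    const std::size_t start = pos;
    while (pos < in.size() && in[pos] != stop) ++pos;
    return in.substr(start, pos - start);
}

// Emulates only lit("//") followed by the optional user part.
inline std::string greedy_user(const std::string& rest, std::size_t& pos) {
    pos = 0;
    if (rest.compare(0, 2, "//") == 0) pos = 2;   // lit("//")
    return take_until(rest, pos, ':');            // -lexeme[*(char_ - ':')]
}
```

With "//www.boost.org" as input, the user part comes back as "www.boost.org" and the position is already at the end of the input, so nothing is left for the host.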

I have committed the failing tests and the grammar to the repository
(revision 149, in branches/urllib-dean).

Any Spirit2x users out there willing to lend a hand? Thanks in advance.

-- 
Dean Michael Berris
blog.cplusplus-soup.com | twitter.com/mikhailberis
linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Kim G. <kim...@gm...> - 2009-08-14 20:21:58

Hi Dean,

I struggled with this when I did it with Spirit + Phoenix; I don't
know how Spirit 2 differs. The only way I could find was to keep a
buffer variable containing the possible user info, and then commit to
it once an @ was found.
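
A minimal sketch of that buffer-and-commit idea, written as a hand-rolled scanner in plain C++ instead of Spirit + Phoenix. The `authority` struct and `split_authority` function are names made up for this illustration, and the input is assumed to be the text after the leading "//".

```cpp
#include <cassert>
#include <string>

// Hypothetical result type; not part of cpp-netlib.
struct authority {
    std::string user_info;  // stays empty when no '@' is seen
    std::string host;
};

// Collect characters into a buffer. The buffer is committed as user info
// only if an '@' turns up; otherwise it was the host all along.
inline authority split_authority(const std::string& input) {
    authority result;
    std::string buffer;
    bool committed = false;
    for (char c : input) {
        if (c == '/') break;               // end of the authority part
        if (c == '@' && !committed) {
            result.user_info = buffer;     // commit: the buffer was user info
            buffer.clear();
            committed = true;
        } else {
            buffer += c;
        }
    }
    result.host = buffer;                  // whatever remains is the host
    return result;
}
```

For "user:password@host/" this yields user info "user:password" and host "host"; for "www.boost.org" the user info stays empty.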

See my confusion in action here:
http://cpp-netlib.svn.sourceforge.net/viewvc/cpp-netlib/branches/uri/boost/network/uri.hpp?revision=143&view=markup

I don't know if that helps at all, but maybe you can find inspiration
somehow... It looks like the Spirit 2 grammar has an entirely
different form, so I don't really see how it ties into the Spirit 1
model.

Cheers,
- Kim


Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-15 15:24:11

Hi Kim!

On Sat, Aug 15, 2009 at 4:21 AM, Kim Gräsman<kim...@gm...> wrote:
>
> I know I struggled with this when I did it with Spirit + Phoenix, I
> don't know how Spirit 2 is different. The only way I could find was to
> keep a buffer variable containing the possible user info, and then
> commit to it once an @ was found.
>

Yeah, that's one way.

The other way I was struggling with was doing a "longest match"
à la regex, where if you find an @ character, what you've seen before
it is something you deal with differently.

I tried doing something with the 'lexeme' parser with multiple nested
lexemes -- this seemed to have worked, except that I can't seem to do
the parsing grammar correctly. I've avoided trying to create my own
parser type and just try and do everything in-lined to keep it simple,
but it proves to be a pretty hard thing to do.

> See my confusion in action here:
> http://cpp-netlib.svn.sourceforge.net/viewvc/cpp-netlib/branches/uri/boost/network/uri.hpp?revision=143&view=markup
>
> I don't know if that helps at all, but maybe you can find inspiration
> somehow... It looks like the Spirit 2 grammar has an entirely
> different form, so I don't really see how it ties into the Spirit 1
> model.
>

Thanks for the link, yes I see the approach that seems to work --
however I'm not very keen on using Spirit 1 anymore at the moment
having seen that the performance and expressiveness of Spirit 2x seems
to be better. For instance, it's more efficient not having to use
Phoenix and just have direct storage for assigning resulting values.

Maybe you want to have a hand at Spirit 2x, and translating the logic
you have there but without having to use Phoenix explicitly? Maybe you
can express it as a normal "longest match" parser? :D

Thanks again Kim. :)

-- 
Dean Michael Berris

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Divye K. <div...@gm...> - 2009-08-15 21:12:30

Hi Dean,
    I went through your code and some of the documentation. While
tracing the code flow, I found that the string "http://www.boost.org"
was being passed using the range represented by (start_, end_). I
couldn't find where the string "http:" was being struck off from that
range.

>
>                bool ok = phrase_parse(
>                        start_, end_,
>                        (
>                         lit("//")
>                         >> -lexeme[*(char_ - ':')]
>                         >> -lexeme[':' >> *(char_ - '@')]
>                         >> -lexeme['@']
>                         >> +(char_ - '/')
>                         >> -lexeme['/' >> *(char_ - '?')]
>                         >> -lexeme['?' >> *(char_ - '#')]
>                         >> -lexeme['#' >> *char_]
>                        ),
>                        space,
>                        result
>                        );
>

There is nothing before the lit("//"), so the first lexeme is probably
picking up the "http" and the userinfo is getting all the rest of the URL,
"://www.boost.org" (as there is no @ around). Unfortunately, I don't have an
updated Boost installation to test this out (no Spirit 2 just yet).
Why the grammar is ignoring the lit("//") is a mystery to me.

Hope this helps somewhat (or I might be completely off track on this).

Sincerely,
Divye

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-16 04:27:28

Hi Divye,

On Sun, Aug 16, 2009 at 5:12 AM, Divye Kapoor<div...@gm...> wrote:
>
>     I went through your code and some of the documentation. However, while
> tracing the code flow, i was able to determine that the string
> "http://www.boost.org" was being passed using the range represented by
> (start_, end_). I couldn't find where the string "http:"  was being struck
> off from that range.

Actually, there are two places that do the parsing:

  - boost/network/uri/detail/url_parser.hpp -- function parse_url<>(...)
  - boost/network/uri/http_url.hpp -- function parse_special<>(...)

The 'http' is parsed by the function parse_url, which takes the scheme
('http') and the scheme-specific part ('//www.boost.org') and then
delegates the special parsing of the scheme-specific part to
parse_special. So by the time it reaches parse_special, the range
[start_,end_) is just '//www.boost.org'.

The problem becomes that because of the grammar I already have in
there, www.boost.org seems to be parsed as the user instead of the
host. Basically I need something regex-like:

  //~([user]:[password]@)[host]~(:[port])

(where '~' denotes optional).
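
For comparison, the regex-like shape above can be written down literally with std::regex (C++11) instead of Spirit. This is only a sketch: the `http_authority` struct and `parse_http_authority` function are hypothetical names, the port is assumed to be digits only, and user and password must appear together, as in the pattern above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type for illustration; not a cpp-netlib class.
struct http_authority {
    std::string user, password, host, port;
};

// Matches  //~([user]:[password]@)[host]~(:[port])  plus an optional path.
inline bool parse_http_authority(const std::string& input, http_authority& out) {
    static const std::regex pattern(
        R"(//(?:([^:@/]*):([^@/]*)@)?([^:/@]+)(?::(\d+))?(?:/.*)?)");
    std::smatch m;
    if (!std::regex_match(input, m, pattern)) return false;
    out.user = m[1].str();      // empty when the optional group did not match
    out.password = m[2].str();
    out.host = m[3].str();
    out.port = m[4].str();
    return true;
}
```

The regex engine backtracks out of the optional user/password group when no '@' follows, which is exactly the behaviour the greedy Spirit sequence lacks.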

Right now I'm trying a lot of things with a "longest-match" kind of
parser, maybe having lexemes of lexemes.

[snip]
>
> As there is nothing before the lit("//"). Probably, the first lexeme is
> picking up the "http" and the userinfo is getting all the rest of the URL
> ://www.boost.org (as there is no @ around). Unfortunately, I don't have an
> updated boost installation to test this out just yet (no Spirit 2 just yet).
> Why the grammar is ignoring the lit("//") is a mystery to me.
> Hope this helps somewhat (or I might be completely off track on this).

You might want to check out the latest boost trunk and let me know if
you get any farther with testing things out. :)

-- 
Dean Michael Berris

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-16 05:27:08

Update: I think I got it! :) Please check out the source in
branches/urllib-dean -- I think it just has something to do with
understanding how to use the new primitive parsers in Spirit 2x.

Thanks to those who responded and gave me an idea of how to go about things. :)

-- 
Dean Michael Berris

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: John P. F. <jf...@ov...> - 2009-08-16 14:40:10

Sorry to chime in late as always,

I just dumped the in-progress work that I was doing into 
http_integration_jf. This was done a number of months ago, prior to my 
life-forcing break from this project. The only thing that is functional
is the uri parser. This spirit implementation is inspired by Braden
McDaniel's uri-grammar. The gist of the design is that there was a main
class which did structural grammar checking of the url, and then a
family of re-usable grammar classes which corresponded to the http
components. To see this in action: build the tests under
libs/uri/test. For now I hope this lends some insight. I should get
around to cleaning this branch up in the near future.

John


Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-16 16:00:48

Hey John!

On Sun, Aug 16, 2009 at 6:26 PM, John P. Feltz<jf...@ov...> wrote:
> Sorry to chime in late as always,

No worries, better late than never. ;-)

>
> I just dumped the in-progress work that I was doing into
> http_integration_jf. This was done a number of months ago, prior to my
> life-forcing break from this project. The only thing that is functional
> is the uri parser. This spirit implementation is inspired by Braden
> McDaniel's uri-grammar. The gist of the design is that there was a main
> class which did structural grammar checking of the url, and then a
> family of re-usable grammar classes which corresponded to the http
> components. To see this in action: build the tests under
> libs/uri/test. For now I hope this lends some insight. I should get
> around to cleaning this branch up in the near future.
>

Cool! Are you using Boost.Spirit 2x? I haven't been looking at these
changes closely.

What I've already started doing is have a base URL class from which
all specific URL families (HTTP, FTP, etc.) will derive. I've
based my implementation on RFC 1738. There's a two-step parsing
approach I use which first does a generic parse that separates the scheme
from the scheme-specific part, then invokes a 'parse_special' function
that parses the scheme-specific part.

The basic_url<tags::default_> implementation is a bare basic_url<>
that just supports the protocol(...) and rest(...) functions. The
specialization of the basic_url<...> for the HTTP urls is
basic_url<tags::http> -- and the parsing specific to HTTP URLs is
encapsulated in parse_special<traits::string<tags::http>::type,
tags::http>(...). This allows anyone to create a specialization of
basic_url<...> for the special parsing of FTP, "mailto", etc.
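
A rough sketch of how such a tag-specialized design could look. Everything here lives in a made-up `sketch` namespace and is an assumption for illustration: the real branch exposes protocol(...) and rest(...) as functions taking the URL and uses Spirit for the parsing, whereas this version uses member functions and std::string::find to stay self-contained.

```cpp
#include <cassert>
#include <string>

namespace sketch {
namespace tags { struct default_ {}; struct http {}; }

// Generic URL: step one only, scheme versus scheme-specific part.
template <class Tag>
class basic_url {
public:
    explicit basic_url(const std::string& raw) {
        const std::string::size_type colon = raw.find(':');
        if (colon != std::string::npos) {
            protocol_ = raw.substr(0, colon);
            rest_ = raw.substr(colon + 1);
        }
    }
    const std::string& protocol() const { return protocol_; }
    const std::string& rest() const { return rest_; }
private:
    std::string protocol_, rest_;
};

// HTTP specialization: reuses the generic step, then runs its own
// scheme-specific parse (step two) on rest().
template <>
class basic_url<tags::http> : public sketch::basic_url<tags::default_> {
public:
    explicit basic_url(const std::string& raw)
        : sketch::basic_url<tags::default_>(raw) {
        parse_special(rest());
    }
    const std::string& host() const { return host_; }
private:
    // Simplified stand-in for the real parse_special: host only.
    void parse_special(const std::string& rest) {
        if (rest.compare(0, 2, "//") != 0) return;
        const std::string::size_type end = rest.find('/', 2);
        host_ = rest.substr(
            2, end == std::string::npos ? std::string::npos : end - 2);
    }
    std::string host_;
};
}  // namespace sketch
```

Adding another scheme would mean adding another tag and another specialization, without touching the generic first step.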

Maybe we can merge the work together in a branch just for the URL
parsing, then make the http_message implementation use the new URL
library instead of the ad hoc implementation that it's using at the
moment? Personally I really want to be using Spirit 2x because I also
intend to use Karma for the HTTP Message generation/encoding for MIME
messages.

Of course that's a lot of work down the road, but the current
(not-so-well-tested) implementation seems to be able to distinguish
between HTTP and HTTPS ports. From there we should be able to write
the stuff that allows the HTTP client to create its own connections
based on the protocol(http_message.url()) -- if it's "https" then use
the ssl::socket and if it's just "http" use the normal tcp::socket.
That needs to be ironed out and refactored into a separate logic for
connection handling.
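
The protocol-based choice could be isolated in something as small as the following sketch. The `transport` enum, `endpoint_choice` struct and `choose_endpoint` function are hypothetical names; the actual ssl::socket and tcp::socket wiring is left out, and only the well-known default ports (80 for HTTP, 443 for HTTPS) are filled in.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration; not part of cpp-netlib.
enum class transport { plain_tcp, ssl, unsupported };

struct endpoint_choice {
    transport kind;        // which kind of socket the client should open
    unsigned short port;   // default port for the scheme
};

// Map the URL's protocol to a transport and its default port.
inline endpoint_choice choose_endpoint(const std::string& protocol) {
    if (protocol == "http")  return {transport::plain_tcp, 80};
    if (protocol == "https") return {transport::ssl, 443};
    return {transport::unsupported, 0};
}
```

Keeping this decision in one function is one way to separate connection handling from the rest of the client, as suggested above.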

-- 
Dean Michael Berris

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: John P. F. <jf...@ov...> - 2009-08-16 17:22:55

Dean Michael Berris wrote:
> Cool! Are you using Boost.Spirit 2x? I haven't been looking at these
> changes closely.
>   
To be truthful I haven't even bothered to determine that. Spirit 2 was,
and still is, ambiguous to me. I simply chose to base that work off the
boost_139 spirit docs.

> Of course that's a lot of work down the road, but the current
> (not-so-well-tested) implementation seems to be able to distinguish
> between HTTP and HTTPS ports. From there we should be able to write
> the stuff that allows the HTTP client to create its own connections
> based on the protocol(http_message.url()) -- if it's "https" then use
> the ssl::socket and if it's just "http" use the normal tcp::socket.
> That needs to be ironed out and refactored into a separate logic for
> connection handling.
>
>   
That seems rational. Actually, after a stint of researching some Java 
and Python based networking libraries, I myself have come to the conclusion
that presenting the user with a configurable connection object for a 
particular protocol is preferred, in addition to a client facade for 
common use-cases. As a side note, I have also come to the conclusion 
that a mailing list is not my preferred forum for this sort of 
discussion, which is better suited by collaborative specifications and 
conferencing. I'm curious as to what the opinions of the other 
developers are on this.

John

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-17 12:26:55

On Sun, Aug 16, 2009 at 10:22 PM, John P. Feltz<jf...@ov...> wrote:
>
> Dean Michael Berris wrote:
>>
>> Cool! Are you using Boost.Spirit 2x? I haven't been looking at these
>> changes closely.
>>
>
> To be truthful I haven't even bothered to determine that. Spirit 2 was,
> and still is, ambiguous to me. I simply chose to base that work off the
> boost_139 spirit docs.
>

Ah, okay. Well it should be alright -- it should be Spirit 2 if it's
Boost 1.39. Although I may be wrong.

>> Of course that's a lot of work down the road, but the current
>> (not-so-well-tested) implementation seems to be able to distinguish
>> between HTTP and HTTPS ports. From there we should be able to write
>> the stuff that allows the HTTP client to create its own connections
>> based on the protocol(http_message.url()) -- if it's "https" then use
>> the ssl::socket and if it's just "http" use the normal tcp::socket.
>> That needs to be ironed out and refactored into a separate logic for
>> connection handling.
>>
>>
> That seems rational. Actually, after a stint of researching some Java
> and Python based networking libraries, I myself have come to the conclusion
> that presenting the user with a configurable connection object for a
> particular protocol is preferred, in addition to a client facade for
> common use-cases.

Right. But my only reservation against this is that that's too much
work for the user.

I want to be able to do something like:

http::request normal("http://www.boost.org");
http::request https("https://www.boost.org");
http::client c;
http::response normal_response = c.get(normal);
http::response https_response = c.get(https);

And it should "just work".

> As a side note, I have also come to the conclusion
> that a mailing list is not my preferred forum for this sort of
> discussion, which is better suited by collaborative specifications and
> conferencing. I'm curious as to what the opinions of the other
> developers are on this.
>

While we're on the subject, I don't like writing documents, which
explains why I can't get myself to put together a roadmap document. ;)
Nor do I like writing specification documents -- I feel that's a waste
of my time.

I'd rather show client code that works and hide the plumbing so that I
(and everyone else working on the project) can just "make it work"
without having to burden the client (or the person reading the
documentation) with too many details.

Although for our sake, I think we need a coherent place to put the
information in -- so that we don't just put the details in mailing
list archives. However, I am not the best person to write that
document; although I feel like I should be the one doing it. :|

At any rate, I agree that mailing lists aren't the best means for
ironing out specifications or design documents -- however I feel
discussions can be best held here about the approach. It's (for me)
the medium of least resistance as far as collaboration goes. I don't
mind a Wiki page that says what we mean to say in one place, but
before we put anything up to a Wiki I think there should be some sort
of discussion that we can keep going on a mailing list -- then later
we can lift the results of the discussion into a Wiki.

This has worked for me in my time as a developer, because it kills two
birds with one stone -- the rationale is ironed out in the mailing
list while the outcome is put in the Wiki.

I hope this makes sense. :-)

-- 
Dean Michael Berris

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Allister L. S. <all...@gm...> - 2009-08-17 08:26:21

Hi everyone,

On Sun, Aug 16, 2009 at 4:22 PM, John P. Feltz <jf...@ov...> wrote:

> As a side note, I have also come to the conclusion
> that a mailing list is not my preferred forum for this sort of
> discussion, which is better suited by collaborative specifications and
> conferencing. I'm curious as to what the opinions of the other
> developers are on this.
>

Do we all have Google Wave accounts?  You might find it very useful for
collaborating on specs :-)

Cheers,
Allister

Re: [cpp-netlib-devel] Help! Grammar for parsing HTTP URLs

From: Dean M. B. <mik...@gm...> - 2009-08-17 12:28:01

On Mon, Aug 17, 2009 at 4:26 PM, Allister Levi
Sanchez<all...@gm...> wrote:
>
> Do we all have Google Wave accounts?  You might find it very useful for
> collaborating on specs :-)
>

Oh, do you have one? How do you get one?

I'd like to try it out first hand too -- maybe in lieu of that we use
Google Documents first?

-- 
Dean Michael Berris