Re: [Parseperl-discuss] [PATCH] preserve newlines
From: Adam K. <ad...@ph...> - 2006-10-05 05:41:54
Chris Dolan wrote:
> On Oct 4, 2006, at 11:43 PM, Adam Kennedy wrote:
>
>> Unfortunately, I don't only work with real documents.
>
> Sorry, I was unclear. By "real documents" I meant anything that the
> Tokenizer can emit. I do not believe it is possible for the tokenizer
> to emit a PPI::Token::HereDoc where the _terminator_line is non-null
> and the line for the token lacks a newline. The only way that's
> possible is via generated content. The latter is what I meant by
> non-real.

Hmm... possibly if the terminator is the last line in the program.

PPI::Document->new( \"<<FOO\ninside\nFOO" );

>> The issue wasn't so much making it easier on the programmer, as
>> making it sane. If I didn't localise the newlines, it would become
>> a major gotcha, and an enormous source of bugs.
>
> Hmm, enormous? You are obviously much more familiar with the gotchas
> than I am, but my patch wasn't that hard to write and works well with
> the existing test suite. Perhaps my confidence is unwarranted?

Let's imagine that we do native newlines. What is the first thing that
people are going to do when they do minor generation? Most likely,
they'll end up using native newlines.

What I don't want to happen is for everyone to make this mistake once
(the gotcha) and then have to go and do extra things just to get back
to normal behaviour.

>> I see a couple of solutions.
>>
>> Firstly, I really want to keep things localised internally.
>>
>> So pre-scanning the document text (which we do anyway for unicode
>> checks, or at least we used to before the latin-1 improvements) to
>> pick up 100% unix/win32/mac, storing that newline type in a top
>> level document accessor, and then writing back out to the same
>> type, is probably ok.
>
> We'd have to change the code in add_element, remove, and replace to
> correct the newlines on entry for new tokens. That calls for a
> set_newline method on PPI::Element and PPI::Node (and
> PPI::Token::HereDoc).

Hmm... or we just do it automatically. When you add, scan upwards to
the document, find the right thing, and ... yeah, I get what you are
saying.

>> That leaves us only with the case of mixed newlines. Personally,
>> outside of binary files I am not aware of ANY cases in which mixed
>> newlines in a text file are allowed, even in __DATA__ segments.
>
> Well, certainly my goal is to get rid of the mixed newlines! That's
> why I was writing a Perl::Critic policy against that. :-)

Well, you know it's possible to detect mixed newlines from the raw
source, right? Does critic have access to the original file/string?

>> In THAT case, perhaps we either localise, or we flatten to the
>> first newline in the file.
>>
>> I'd be happy to implement that as a first step towards full native
>> mixed newlines, as the functionality seems fairly containable.
>>
>> It also matches what some of the better editors do... localise
>> internally, but remember and save out as the same input type.
>
> Not Emacs. It picks one newline to work on and treats the others
> like binary. So, if you're in \n mode and there is a single \r\n in
> the file, you see a "^M" character at the end of that line.

Well, I mean for the non-mixed case, they deal with it internally.
Emacs et al doing binary items inside a text file is one example of
what I mean by handling it wrong.
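Coming back to the raw-source point above: a minimal sketch of that kind
of newline scan might look something like the following. The subroutine
name and return values are illustrative only, not an existing PPI or
Perl::Critic API.

  use strict;
  use warnings;

  # Classify the line endings in a raw source string.
  # Returns 'unix', 'win32', 'mac', 'mixed', or undef if the source
  # contains no newlines at all.
  sub newline_style {
      my ($source) = @_;
      my %seen;
      # Match \015\012 first so a CRLF pair is not counted twice.
      while ( $source =~ /(\015\012|\015|\012)/g ) {
          $seen{ $1 eq "\015\012" ? 'win32'
               : $1 eq "\012"     ? 'unix'
               :                    'mac' } = 1;
      }
      my @styles = keys %seen;
      return undef   if @styles == 0;
      return 'mixed' if @styles > 1;
      return $styles[0];
  }

Something like that result is what would end up in the document-level
accessor mentioned above, so serialisation can write back out in the
same style it read in.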
>> But I'm honestly not aware of ANYTHING that handles mixed newlines
>> properly. I've done tons of unix/win32/mac cross-over work and I've
>> seen just about every screwed up case there is.
>>
>> Even Dreamweaver, which inspired PPI in the first place, doesn't
>> handle fucked up broken newlines.
>>
>> So we'd have to invent the solution.
>
> You must be a step ahead of me or something. Doesn't "handle" simply
> mean serialization and de-serialization for PPI? Maybe I've spent
> too much time just in the Perl::Critic case?

There are two levels here. One is the parser/serializer, the second is
the API layer.

If we provide add_before, for example, and it's mixed, what do we do?
Say it's a mixed unix/mac file and we're running on Win32: do we insert
with the Win32 newlines and end up with a three-platform mix?

We need to define the "correct" behaviour for working with mixed
newlines... and I don't know that anyone else has done this yet.

>> And I'm still not (yet) convinced that native mixed newlines is the
>> answer... if only because how the hell do we guarantee round-trip
>> safety for them? If we do it, it needs to be 100%.
>
> I feel like I must be missing some crucial point. For the read-only
> case, isn't round-trip safety just ensuring that we spit out exactly
> what the Tokenizer took in? With the exception of the HereDoc stuff
> already mentioned, we've already achieved that with my patch, I think.
>
> That leaves just the generated code case to worry about. In that
> case, we decide on a dominant line ending, like Emacs does, and
> ensure that all added tokens inherit that line ending. If the
> generator *wants* to make mixed newlines, that's the only really hard
> case, and that can be worked around with set_newline.

I don't think anyone ever really WANTS to make mixed newlines. As I
said, I don't know of any legitimate real-world cases of this.

Adam K
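For reference, the here-doc-at-end-of-file case from the top of the
thread reduces to a straight round-trip check. This is only a sketch
using the documented PPI::Document->new and serialize calls; whether
the output matches for a source with no trailing newline is exactly
the open question.

  use strict;
  use warnings;
  use PPI;

  # Here-doc whose terminator is the last line of the program and has
  # no trailing newline.
  my $source = "<<FOO\ninside\nFOO";

  my $doc = PPI::Document->new( \$source )
      or die "Parse failed: " . PPI::Document->errstr;

  # Round-trip safety means serialize() returns the input byte-for-byte.
  my $output = $doc->serialize;
  print $output eq $source ? "round-trip ok\n"
                           : "round-trip FAILED\n";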