Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Le 2012-07-17 23:08, John Reiser a écrit :
> On 07/17/2012 10:19 AM, Rafaël Carré wrote:
>> Hello,
>>
>> Le 2012-07-16 00:09, John Reiser a écrit :
>>> Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm
>>> in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad
>>> for shared libraries.  Some environments demand that .text actually be
>>> read-only all the time, yet MC_put_o_16_arm etc require that the addresses
>>> be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.)
>>> Even in those environments which permit the dynamic linker to modify the
>>> .text segment, the runtime cost of doing the relocation can be noticeable.
>>>
>>> The attached patch rewrites the linkage, discarding the tables of addresses
>>> in favor of tables of offsets.  All transfers are local within each individual
>>> function, so there can be no interference by processing that occurs
>>> after assembly, such as link-time re-ordering (even of individual functions.)
>>>
>>> -- John Reiser
>>>
>>>
>>> libmpeg2.patch
>>>
>>>
>>> Index: libmpeg2/motion_comp_arm_s.S
>>> ===================================================================
>>> --- libmpeg2/motion_comp_arm_s.S	(revision 1205)
>>> +++ libmpeg2/motion_comp_arm_s.S	(working copy)
>>> @@ -29,9 +29,13 @@
>>>  	pld [r1]
>>>          stmfd sp!, {r4-r11, lr} @ R14 is also called LR
>>>  	and r4, r1, #3
>>> -	adr r5, MC_put_o_16_arm_align_jt
>>> -	add r5, r5, r4, lsl #2
>>> -	ldr pc, [r5]
>>> +	ldrb r4, [pc, r4]
>>> +	add pc, pc, r4, lsl #2
>>
>> Is this instruction available on all ARM variants?
> 
> The "add pc, pc, r4, lsl #2" has the same _form_ as the replaced
>     "add r5, r5, r4, lsl #2".
> The patched code will assemble correctly for all variants where the
> unpatched code will assemble correctly.
> In particular, all ARM CPU back to at least armv4 have both instructions
> in ARM mode.  The code also executes correctly in ARM mode on armv4 and later.
> Using armv5tel I ran "make check" successfully against all the streams
> when the working directory was  libmpeg2/trunk/test .
> 
> The unpatched file motion_comp_arm_s.S uses
>         stmfd sp!, {r4-r11, lr} @ R14 is also called LR
> which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2.
> Thus we don't need to consider _any_ Thumb variants for the patch.
> 
> However, *just* for the sake of complete analysis:
> ----- begin analysis of Thumb modes; *NOT NEEDED* by patch
> The unpatched "add r5, r5, r4, lsl #2" does not exist in Thumb ("Thumb1"),
> but does exist in Thumb2.  It is not available on armv5t, but is available
> on all higher armv?t CPU because Thumb2 is very desirable and not
> too expensive (in any of chip area, power, licensing fees.)
> 
> The remaining question is whether "add pc, pc, r4, lsl #2" executes correctly
> in Thumb2.  What value is read from register r15, as input to the 'add'?
> My reference is:
> 
>   ARM Architecture Reference Manual, ARM DDI 0100E, July 2000
>   Section 6.1 About the Thumb instruction set
> 
>   When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15
>   is written, bit[0] is IGNORED and    bits[31:1] are written to the PC.
>   Depending on how it is used, the value of the PC is either the address
>   of the instruction plus 4 or is UNPREDICTABLE.
> 
> Because the Thumb sequence
>   L99:
>      mov lr, pc
>      b.n  foo
>   L100:
> may be used to record a continuation address (it sets r14 to &L100, which is
> (4 + &L99)), then I believe that the value fetched from r15 is 4+(&opcode & ~1),
> and not UNPREDICTABLE.
> This also agrees with the exposed 3-stage pipelining of original ARM, where
> the value fetched from r15 is is two words (2*2 in Thumb mode, 2*4 in ARM
> mode) ahead.
> 
> So, in Thumb2 mode I believe that the value fetched from r15 by
> "add pc, pc, r4, lsl #2" is the address of the byte which immediately follows
> the 'add' instruction, namely byte 0 of the table.  However, the address that
> the patch wants is the address of the "0:" just _beyond_ the table.  Therefore
> *IF* we want the same code to be correct for both ARM and Thumb2 at the same time,
> then we must use another register to handle the different value fetched from r15
> by Thumb2 vs ARM:
> 
> 	adr r5, 0f
>         ldrb r4, [r5, r4]
>         add pc, r5, r4, lsl #2
> 0:
>         .byte (MC_put_o_16_arm_align0 - 0b)>>2
>         .byte (MC_put_o_16_arm_align1 - 0b)>>2
>         .byte (MC_put_o_16_arm_align2 - 0b)>>2
>         .byte (MC_put_o_16_arm_align3 - 0b)>>2
> 
> In Thumb2 mode only (not Thumb1, not ARM), the sequence
> 	adr r5, 0f
> 	tbb [r5, r4]  # Table Branch Byte
> 0:
> 	.byte ...
> is equivalent, and shorter by 4 bytes and faster by two [?] cycles.
> ----- end analysis of Thumb modes; *NOT NEEDED* by patch
> 
> 
>>
>> I have to ask because I found some restrictions on:
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html
> 
> I see a background watermark "Superseded" on that page.  Also, the document is
> for Thumb1 and not Thumb2.  Therefore I believe that the document does not apply,
> because the unpatched "add r5, r5, r4, lsl #2" would require Thumb2.  [Remember
> also that the unpatched code won't assemble for Thumb2 anyway because of the
> references to r11 and lr in the 'stml '.]
> 
>>
>> Although here it should be the form "ADD Rd, Rn, #imm8m" which works
>> everywhere.

Thanks for the details, patch committed to SVN.