From: Rafaël C. <fu...@vi...> - 2012-07-18 09:15:00
On 2012-07-17 23:08, John Reiser wrote:
> On 07/17/2012 10:19 AM, Rafaël Carré wrote:
>> Hello,
>>
>> On 2012-07-16 00:09, John Reiser wrote:
>>> Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm
>>> in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad
>>> for shared libraries. Some environments demand that .text actually be
>>> read-only all the time, yet MC_put_o_16_arm etc require that the addresses
>>> be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.)
>>> Even in those environments which permit the dynamic linker to modify the
>>> .text segment, the runtime cost of doing the relocation can be noticeable.
>>>
>>> The attached patch rewrites the linkage, discarding the tables of addresses
>>> in favor of tables of offsets. All transfers are local within each individual
>>> function, so there can be no interference by processing that occurs
>>> after assembly, such as link-time re-ordering (even of individual functions.)
>>>
>>> -- John Reiser
>>>
>>>
>>> libmpeg2.patch
>>>
>>>
>>> Index: libmpeg2/motion_comp_arm_s.S
>>> ===================================================================
>>> --- libmpeg2/motion_comp_arm_s.S (revision 1205)
>>> +++ libmpeg2/motion_comp_arm_s.S (working copy)
>>> @@ -29,9 +29,13 @@
>>> pld [r1]
>>> stmfd sp!, {r4-r11, lr} @ R14 is also called LR
>>> and r4, r1, #3
>>> - adr r5, MC_put_o_16_arm_align_jt
>>> - add r5, r5, r4, lsl #2
>>> - ldr pc, [r5]
>>> + ldrb r4, [pc, r4]
>>> + add pc, pc, r4, lsl #2
>>
>> Is this instruction available on all ARM variants?
>
> The "add pc, pc, r4, lsl #2" has the same _form_ as the replaced
> "add r5, r5, r4, lsl #2".
> The patched code will assemble correctly for all variants where the
> unpatched code will assemble correctly.
> In particular, all ARM CPUs back to at least armv4 have both instructions
> in ARM mode. The code also executes correctly in ARM mode on armv4 and later.
> Using armv5tel I ran "make check" successfully against all the streams
> when the working directory was libmpeg2/trunk/test .
>
> The unpatched file motion_comp_arm_s.S uses
> stmfd sp!, {r4-r11, lr} @ R14 is also called LR
> which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2.
> Thus we don't need to consider _any_ Thumb variants for the patch.
>
> However, *just* for the sake of complete analysis:
> ----- begin analysis of Thumb modes; *NOT NEEDED* by patch
> The unpatched "add r5, r5, r4, lsl #2" does not exist in Thumb ("Thumb1"),
> but does exist in Thumb2. It is not available on armv5t, but is available
> on all higher armv?t CPUs because Thumb2 is very desirable and not
> too expensive (in any of chip area, power, licensing fees.)
>
> The remaining question is whether "add pc, pc, r4, lsl #2" executes correctly
> in Thumb2. What value is read from register r15, as input to the 'add'?
> My reference is:
>
> ARM Architecture Reference Manual, ARM DDI 0100E, July 2000
> Section 6.1 About the Thumb instruction set
>
> When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15
> is written, bit[0] is IGNORED and bits[31:1] are written to the PC.
> Depending on how it is used, the value of the PC is either the address
> of the instruction plus 4 or is UNPREDICTABLE.
>
> Because the Thumb sequence
> L99:
> mov lr, pc
> b.n foo
> L100:
> may be used to record a continuation address (it sets r14 to &L100, which is
> (4 + &L99)), I believe that the value fetched from r15 is 4+(&opcode & ~1),
> and not UNPREDICTABLE.
> This also agrees with the exposed 3-stage pipelining of original ARM, where
> the value fetched from r15 is two instructions (2*2 bytes in Thumb mode,
> 2*4 bytes in ARM mode) ahead.
>
> So, in Thumb2 mode I believe that the value fetched from r15 by
> "add pc, pc, r4, lsl #2" is the address of the byte which immediately follows
> the 'add' instruction, namely byte 0 of the table. However, the address that
> the patch wants is the address of the "0:" just _beyond_ the table. Therefore
> *IF* we want the same code to be correct for both ARM and Thumb2 at the same time,
> then we must use another register to handle the different value fetched from r15
> by Thumb2 vs ARM:
>
> adr r5, 0f
> ldrb r4, [r5, r4]
> add pc, r5, r4, lsl #2
> 0:
> .byte (MC_put_o_16_arm_align0 - 0b)>>2
> .byte (MC_put_o_16_arm_align1 - 0b)>>2
> .byte (MC_put_o_16_arm_align2 - 0b)>>2
> .byte (MC_put_o_16_arm_align3 - 0b)>>2
>
> In Thumb2 mode only (not Thumb1, not ARM), the sequence
> adr r5, 0f
> tbb [r5, r4] @ Table Branch Byte
> 0:
> .byte ...
> is equivalent, and shorter by 4 bytes and faster by two [?] cycles.
> ----- end analysis of Thumb modes; *NOT NEEDED* by patch
>
>
>>
>> I have to ask because I found some restrictions on:
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html
>
> I see a background watermark "Superseded" on that page. Also, the document is
> for Thumb1 and not Thumb2. Therefore I believe that the document does not apply,
> because the unpatched "add r5, r5, r4, lsl #2" would require Thumb2. [Remember
> also that the unpatched code won't assemble for Thumb2 anyway because of the
> references to r11 and lr in the 'stmfd'.]
>
>>
>> Although here it should be the form "ADD Rd, Rn, #imm8m" which works
>> everywhere.
Thanks for the details; patch committed to SVN.
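
For anyone following the r15 arithmetic in the quoted analysis, here is a minimal sketch in Python that models it. It is not the committed assembly. The load address, the handler offsets and every function name are made up for illustration, and the Thumb case assumes 4-byte encodings and the "address plus 4" reading of r15 argued in the quoted text.

```python
# Illustrative model only; addresses and names below are hypothetical.
ARM, THUMB = "arm", "thumb"


def pc_read(instr_addr, mode):
    """Value read from r15 by the instruction at instr_addr.

    ARM mode: instruction address + 8.
    Thumb: (address & ~1) + 4, per the reasoning quoted in this thread.
    """
    if mode == ARM:
        return instr_addr + 8
    return (instr_addr & ~1) + 4


def layout(load_base):
    """Hypothetical addresses, assuming 4-byte encodings in both modes."""
    ldrb = load_base        # ldrb r4, [pc, r4]
    add = ldrb + 4          # add  pc, pc, r4, lsl #2
    table = add + 4         # four .byte entries
    label0 = table + 4      # "0:" just beyond the table
    # Made-up, word-aligned handler addresses standing in for align0..align3.
    handlers = [label0 + off for off in (0x00, 0x34, 0x74, 0xB4)]
    return ldrb, add, table, label0, handlers


def offset_table(label0, handlers):
    """Entries of the form .byte (handler - 0b)>>2."""
    entries = [(h - label0) >> 2 for h in handlers]
    assert all(0 <= e <= 255 for e in entries), "offset must fit in a byte"
    return entries


def branch_target(base, entries, index):
    """Result of: add pc, base, r4, lsl #2   with r4 = entries[index]."""
    return base + (entries[index] << 2)


ldrb, add, table, label0, handlers = layout(0x1000)
entries = offset_table(label0, handlers)

# ARM mode: the ldrb sees the table base, the add sees the "0:" label.
arm_table_base = pc_read(ldrb, ARM)
arm_targets = [branch_target(pc_read(add, ARM), entries, i) for i in range(4)]

# Thumb model: the add sees byte 0 of the table, so every target is 4 short.
thumb_targets = [branch_target(pc_read(add, THUMB), entries, i) for i in range(4)]

# "adr r5, 0f" variant: the base is the label itself, independent of mode.
adr_targets = [branch_target(label0, entries, i) for i in range(4)]

# The byte offsets do not change when the code is loaded elsewhere,
# so nothing in .text needs to be relocated.
_, _, _, label0_moved, handlers_moved = layout(0x8000)
entries_moved = offset_table(label0_moved, handlers_moved)
```

In this model the ARM-mode targets coincide with the handler addresses, the Thumb-model targets land 4 bytes early, and the variant that takes its base from "adr r5, 0f" hits the handlers regardless of what r15 reads as.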