From: Rafaël C. <fu...@vi...> - 2012-07-18 09:15:00
|
Le 2012-07-17 23:08, John Reiser a écrit : > On 07/17/2012 10:19 AM, Rafaël Carré wrote: >> Hello, >> >> Le 2012-07-16 00:09, John Reiser a écrit : >>> Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm >>> in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad >>> for shared libraries. Some environments demand that .text actually be >>> read-only all the time, yet MC_put_o_16_arm etc require that the addresses >>> be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.) >>> Even in those environments which permit the dynamic linker to modify the >>> .text segment, the runtime cost of doing the relocation can be noticeable. >>> >>> The attached patch rewrites the linkage, discarding the tables of addresses >>> in favor of tables of offsets. All transfers are local within each individual >>> function, so there can be no interference by processing that occurs >>> after assembly, such as link-time re-ordering (even of individual functions.) >>> >>> -- John Reiser >>> >>> >>> libmpeg2.patch >>> >>> >>> Index: libmpeg2/motion_comp_arm_s.S >>> =================================================================== >>> --- libmpeg2/motion_comp_arm_s.S (revision 1205) >>> +++ libmpeg2/motion_comp_arm_s.S (working copy) >>> @@ -29,9 +29,13 @@ >>> pld [r1] >>> stmfd sp!, {r4-r11, lr} @ R14 is also called LR >>> and r4, r1, #3 >>> - adr r5, MC_put_o_16_arm_align_jt >>> - add r5, r5, r4, lsl #2 >>> - ldr pc, [r5] >>> + ldrb r4, [pc, r4] >>> + add pc, pc, r4, lsl #2 >> >> Is this instruction available on all ARM variants? > > The "add pc, pc, r4, lsl #2" has the same _form_ as the replaced > "add r5, r5, r4, lsl #2". > The patched code will assemble correctly for all variants where the > unpatched code will assemble correctly. > In particular, all ARM CPU back to at least armv4 have both instructions > in ARM mode. The code also executes correctly in ARM mode on armv4 and later. > Using armv5tel I ran "make check" successfully against all the streams > when the working directory was libmpeg2/trunk/test . > > The unpatched file motion_comp_arm_s.S uses > stmfd sp!, {r4-r11, lr} @ R14 is also called LR > which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2. > Thus we don't need to consider _any_ Thumb variants for the patch. > > However, *just* for the sake of complete analysis: > ----- begin analysis of Thumb modes; *NOT NEEDED* by patch > The unpatched "add r5, r5, r4, lsl #2" does not exist in Thumb ("Thumb1"), > but does exist in Thumb2. It is not available on armv5t, but is available > on all higher armv?t CPU because Thumb2 is very desirable and not > too expensive (in any of chip area, power, licensing fees.) > > The remaining question is whether "add pc, pc, r4, lsl #2" executes correctly > in Thumb2. What value is read from register r15, as input to the 'add'? > My reference is: > > ARM Architecture Reference Manual, ARM DDI 0100E, July 2000 > Section 6.1 About the Thumb instruction set > > When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15 > is written, bit[0] is IGNORED and bits[31:1] are written to the PC. > Depending on how it is used, the value of the PC is either the address > of the instruction plus 4 or is UNPREDICTABLE. > > Because the Thumb sequence > L99: > mov lr, pc > b.n foo > L100: > may be used to record a continuation address (it sets r14 to &L100, which is > (4 + &L99)), then I believe that the value fetched from r15 is 4+(&opcode & ~1), > and not UNPREDICTABLE. > This also agrees with the exposed 3-stage pipelining of original ARM, where > the value fetched from r15 is is two words (2*2 in Thumb mode, 2*4 in ARM > mode) ahead. > > So, in Thumb2 mode I believe that the value fetched from r15 by > "add pc, pc, r4, lsl #2" is the address of the byte which immediately follows > the 'add' instruction, namely byte 0 of the table. However, the address that > the patch wants is the address of the "0:" just _beyond_ the table. Therefore > *IF* we want the same code to be correct for both ARM and Thumb2 at the same time, > then we must use another register to handle the different value fetched from r15 > by Thumb2 vs ARM: > > adr r5, 0f > ldrb r4, [r5, r4] > add pc, r5, r4, lsl #2 > 0: > .byte (MC_put_o_16_arm_align0 - 0b)>>2 > .byte (MC_put_o_16_arm_align1 - 0b)>>2 > .byte (MC_put_o_16_arm_align2 - 0b)>>2 > .byte (MC_put_o_16_arm_align3 - 0b)>>2 > > In Thumb2 mode only (not Thumb1, not ARM), the sequence > adr r5, 0f > tbb [r5, r4] # Table Branch Byte > 0: > .byte ... > is equivalent, and shorter by 4 bytes and faster by two [?] cycles. > ----- end analysis of Thumb modes; *NOT NEEDED* by patch > > >> >> I have to ask because I found some restrictions on: >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html > > I see a background watermark "Superseded" on that page. Also, the document is > for Thumb1 and not Thumb2. Therefore I believe that the document does not apply, > because the unpatched "add r5, r5, r4, lsl #2" would require Thumb2. [Remember > also that the unpatched code won't assemble for Thumb2 anyway because of the > references to r11 and lr in the 'stml '.] > >> >> Although here it should be the form "ADD Rd, Rn, #imm8m" which works >> everywhere. Thanks for the details, patch committed to SVN. |