Andrew Bird - 2014-04-25

Attached is the DOS executable. Run it as 'fbench 100000' and check the 'scaled value' it prints.

fbench.exe


Bart Oldeman - 2014-04-25

Interesting: the simulator is faster than the JIT here. As this is an FPU benchmark there may be some expensively-emulated FPU instructions in tight loops. I'll have a look.

For CPU benchmarks you could have a look also at emulators.com, e.g. here:
http://www.emulators.com/docs/nx11_flags.htm


Andrew Bird - 2014-04-25

Hi Bart,
Thanks for the link, I found its content very interesting. I'm wondering if some sort of automated test benchmark should be added to Dosemu. Do you think there's any value in me adding one? Would the tests have to be compiled at build time, or is it acceptable to have DOS executables shipped as is?

Thanks,

Andrew

[support-requests:#264] CPU benchmarks

Status: open
Group: v1.0_(example)
Created: Fri Apr 25, 2014 01:16 PM UTC by Andrew Bird
Last Updated: Fri Apr 25, 2014 01:18 PM UTC
Owner: nobody

Hi there,
I compiled fbench (http://www.fourmilab.ch/fbench/fbench.html) for Linux (with GCC) and DOS (with Turbo C++) and compared the run times in various Dosemu modes. I did this because at present I run Dosemu on 64-bit hardware with a 32-bit kernel so that I can use the cpuemu off mode. I hope that at some point CPU emulation will perform well enough that running the machine in 64-bit mode becomes practical.

It seems that emulation is still up to roughly two orders of magnitude slower than cpuemu off.

On my AMD 5600+ (2.8GHz) (32bit kernel) (lower time is better):

Linux

Native: 0.0052

DOS (Git branch devel)

cpuemu off : 0.0065
cpuemu vm86: 0.7100
cpuemu full: 0.7000
cpu_emu vm86sim: 0.4000
cpu_emu fullsim: 0.4000

DOS (Git branch simx86-no-mprotect with self merged devel)

cpuemu off : 0.0060
cpuemu vm86: 0.6700
cpuemu full: 0.6700
cpu_emu vm86sim: 0.4000
cpu_emu fullsim: 0.3900

Changing the hog threshold made no difference.
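
For reference, the timings above can be turned into slowdown factors relative to native. This is a minimal Python sketch using only the devel-branch numbers quoted in the post; the names NATIVE, DEVEL and slowdown are made up for illustration.

```python
# Timings copied from the post above (Git branch devel); lower is better.
NATIVE = 0.0052
DEVEL = {
    "off": 0.0065,
    "vm86": 0.7100,
    "full": 0.7000,
    "vm86sim": 0.4000,
    "fullsim": 0.4000,
}


def slowdown(native, timings):
    """Return each mode's time as a multiple of the native time."""
    return {mode: t / native for mode, t in timings.items()}


factors = slowdown(NATIVE, DEVEL)
for mode, factor in factors.items():
    print(f"{mode:>8}: {factor:6.1f}x")
```

On these numbers cpuemu off is about 1.25x native, the JIT modes (vm86, full) come out at roughly 135x, and the simulator modes at roughly 77x.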


Bart Oldeman - 2014-04-27

There is a problem with the fwait instruction -- it was incorrectly marked as being interpreted, but in effect forced the JIT to recompile over and over again. (Note: fwait should cause an FPU exception if appropriate but FPU exceptions are not correctly implemented by cpu_emu at the moment).

Could you try the attached patch?

As for test cases, there is one adapted from QEMU in src/tests, which needs to be compiled by DJGPP. Of course any more tests are always welcome as long as the license is ok. It would be best not to have the binaries shipped with dosemu source code.

tables.c.diff

- Andrew Bird - 2014-04-27
  
  Hi Bart,
  Here are a couple of runs of my as yet unfinished (ab)use of the Python
  unittest suite. A value of 1 means native speed; any other value is the
  slowdown factor relative to native, i.e. 2x is half speed.
  
  devel branch (0ff16be5bc75f8e07ed20d54a4b8e782258a43d1) without patch
  
  bash-4.2$ python test/test_bench.py
  TestFbench ... FAIL
  
  ======================================================================
  FAIL: TestFbench
  
  CPUEMU
  off : OK target = 1.5x, result = 1.2x
  vm86 : FAIL target = 75.0x, result = 132.0x
  full : FAIL target = 75.0x, result = 136.0x
  vm86sim : OK target = 150.0x, result = 80.0x
  fullsim : OK target = 150.0x, result = 80.0x
  
  Ran 1 test in 119.546s
  
  FAILED (failures=1)
  
  devel branch (0ff16be5bc75f8e07ed20d54a4b8e782258a43d1) with patch
  
  bash-4.2$ python test/test_bench.py
  TestFbench ... ok
  
  ======================================================================
  PASS: TestFbench
  
  CPUEMU
  off : OK target = 1.5x, result = 1.2x
  vm86 : OK target = 75.0x, result = 24.0x
  full : OK target = 75.0x, result = 24.0x
  vm86sim : OK target = 150.0x, result = 80.0x
  fullsim : OK target = 150.0x, result = 80.0x
  
  Ran 1 test in 64.919s
  
  OK
  
  So it looks good to me, timing is now only 24x slower than native, whereas it
  was 132x/136x before. Well done!
  
  Regarding the benchmarking:
  I figured including binaries would be a problem. Is there such a thing as
  a C cross compiler that runs on Linux and produces DOS binaries?
  Does the no-binaries rule also apply to FreeDOS files such as command.com and
  kernel.sys? Currently I'm running dosemu from the development directory
  without installing it. For each test I create a tmp-c directory and populate
  it with a clean autoexec.bat and config.sys, the FreeDOS files, and the
  dosemu-derived tools.
  
  Thanks,
  
  Andrew
  

Hi Bart,
I'm not sure whether you were able to read the test results (I notice you haven't applied your patch to git), so I'll repost them here rather than via email, where perhaps I can format them properly. As you can see, your patch really helps: the timing is now only 24x slower than native, whereas it was 132x/136x before. I wonder what's going on with the DJGPP-compiled version hitting near-native speeds in the vm86 CPUEMU mode. Or perhaps it's something the Turbo C++ version does that hurts badly?

TURBO C++ compiled version (attached in initial post) current devel branch

TestFbench ... FAIL

======================================================================
FAIL: TestFbench
----------------------------------------------------------------------
CPUEMU
       off :  OK  target = 1.5x, result = 1.2x
      vm86 : FAIL target = 75.0x, result = 132.0x
      full : FAIL target = 75.0x, result = 136.0x
   vm86sim :  OK  target = 150.0x, result = 80.0x
   fullsim :  OK  target = 150.0x, result = 80.0x

----------------------------------------------------------------------

TURBO C++ compiled version (attached in initial post) current devel branch with your FWAIT patch

TestFbench ... ok

======================================================================
PASS: TestFbench
----------------------------------------------------------------------
CPUEMU
       off :  OK  target = 1.5x, result = 1.2x
      vm86 :  OK  target = 75.0x, result = 24.0x
      full :  OK  target = 75.0x, result = 24.0x
   vm86sim :  OK  target = 150.0x, result = 80.0x
   fullsim :  OK  target = 150.0x, result = 80.0x

----------------------------------------------------------------------

DJGPP GCC 4.9 compiled version (attached here) current devel branch with your FWAIT patch

TestFbench ... ok

======================================================================
PASS: TestFbench
----------------------------------------------------------------------
CPUEMU
       off :  OK  target = 1.5x, result = 1.2x
      vm86 :  OK  target = 75.0x, result = 1.2x
      full :  OK  target = 75.0x, result = 4.0x
   vm86sim :  OK  target = 150.0x, result = 1.2x
   fullsim :  OK  target = 150.0x, result = 61.2x

----------------------------------------------------------------------

fbench.exe.djgpp
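
The target/result lines in the reports above suggest a per-mode pass/fail check. The sketch below is not the actual test/test_bench.py, which is not shown in this thread; it only illustrates, with Python's unittest, how such a check could be written. The names check_modes and TARGETS and the placeholder timings are assumptions (the timings mirror the "with FWAIT patch" Turbo C++ factors, with native set to 1.0).

```python
import unittest

# Targets as they appear in the reports above: a mode passes when its
# slowdown factor relative to native does not exceed the target.
TARGETS = {
    "off": 1.5,
    "vm86": 75.0,
    "full": 75.0,
    "vm86sim": 150.0,
    "fullsim": 150.0,
}


def check_modes(native_time, mode_times, targets):
    """Return (report_lines, failed_modes) for one benchmark run."""
    lines, failed = [], []
    for mode, target in targets.items():
        factor = mode_times[mode] / native_time
        ok = factor <= target
        if not ok:
            failed.append(mode)
        status = "OK" if ok else "FAIL"
        lines.append(
            f"{mode:>10} : {status:^4} target = {target:.1f}x, "
            f"result = {factor:.1f}x"
        )
    return lines, failed


class TestFbenchSketch(unittest.TestCase):
    def test_modes(self):
        # Placeholder timings (hypothetical). A real runner would start
        # DOSEMU once per CPUEMU mode and parse the time fbench prints.
        native = 1.0
        times = {"off": 1.2, "vm86": 24.0, "full": 24.0,
                 "vm86sim": 80.0, "fullsim": 80.0}
        lines, failed = check_modes(native, times, TARGETS)
        self.assertEqual(failed, [], "\n".join(lines))
```

Saved to a file, this can be run with `python -m unittest <file>`. Passing the report lines as the assertion message means a failing run prints the whole per-mode table, much like the FAIL output shown earlier.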

Bart Oldeman - 2014-04-29

Hi,

I just haven't got around to committing the patch yet but I will tonight.
DJGPP with vm86 IS native (DPMI). With "full" it still uses the JIT, and the 4x slowdown is much better than with Turbo C. If you can compile the Turbo C++ version with native FPU (the default is to try emulation first, then native, which involves some self-modifying code), perhaps you will see something better too.


Andrew Bird - 2014-04-30

Hi Bart,
I'm still working on making the benchmark test runner reusable for other tests, and I've taken on board your comments about not shipping binaries.
I didn't get the chance to rebuild the Turbo C++ version with -fp87, but I'm primarily interested in improving the performance of existing programs rather than tweaking new code (not that you implied that!). Is there a way of analysing an EXE, under DOSEMU or otherwise, to determine how many times a particular instruction gets run, sort of like gprof but at the instruction level?


Bart Oldeman - 2014-05-01

Hi Andrew,
attached is a patch to improve performance for the Turbo C version to be similar to DJGPP in JIT mode (from 24x to 4.5x in my test). It's a bit dirty though so I won't apply it to git as is.

Long explanation: the code is full of instructions such as
int 39h (there are 8 interrupts for this, 34h-3Bh; see also Ralf Brown's Interrupt List)
where FP instructions such as fwait and fld would be if -fp87 were used. If there is no coprocessor, the int 39h handler emulates the FP instruction; if there is one, the handler patches the int 39h into (for example) fwait; fld ...

The latter is what happens. The JIT creates a new translation block for every patched int 3x, with jmps in between the blocks. It would be better to have just one block containing many FPU instructions, which is what the attached source patch achieves: it forces retranslation when an "int 3x" is patched.
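
To make the patching step concrete, here is a small Python sketch of the byte rewrite as I understand the convention documented in Ralf Brown's list: the two bytes of "int 34h..3Bh" (CD 3x) become fwait (9B) followed by the matching ESC opcode D8h..DFh. This is an illustration only, not code from DOSEMU or from the Turbo C++ runtime, and the function name patch_fp_shortcut is made up.

```python
FWAIT = 0x9B       # opcode of the fwait instruction
INT_OPCODE = 0xCD  # opcode of "int imm8"


def patch_fp_shortcut(code, offset):
    """Rewrite an 'int 34h..3Bh' at `offset` into fwait + ESC opcode.

    Returns True if the two bytes were patched, False if they are not
    one of the eight floating-point shortcut interrupts.
    """
    if code[offset] != INT_OPCODE:
        return False
    vector = code[offset + 1]
    if not 0x34 <= vector <= 0x3B:
        return False
    code[offset] = FWAIT
    code[offset + 1] = 0xD8 + (vector - 0x34)  # 34h..3Bh -> D8h..DFh
    return True


# "int 39h" followed by a ModRM byte and a 16-bit address.
code = bytearray([0xCD, 0x39, 0x06, 0x34, 0x12])
patch_fp_shortcut(code, 0)
print(" ".join(f"{b:02x}" for b in code))  # 9b dd 06 34 12
```

The patched bytes decode as fwait; fld qword ptr [1234h], which matches the "fwait; fld ..." example above. Because the code bytes change underneath the translator, any block translated from the original int 3x is stale after the patch.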

As for gprof style functionality, no it's not there unless of course you add some code to DOSEMU itself.

As for 16-bit compilers, that is an old issue. OpenWatcom can produce DOS binaries directly from Linux, but many distributions consider its license not free enough.

cd.diff


Hi Bart,
Your patch helped with the Turbo C++ compiled version of fbench: on my hardware I got 5.2x, down from an initial 26.0x, a substantial improvement. Regarding the vm86sim/fullsim timing, is there any possibility of a speed-up? I ask because I have another (integer) benchmark that is really weak there.

Current devel branch - no patch

==============================================================
FAIL: TestFbenchTc
-------------------------------------------------------------- 
CPUEMU
       off :  OK  target <= 2.0x, result = 1.2x
      vm86 : FAIL target <= 2.0x, result = 26.0x
      full : FAIL target <= 5.0x, result = 26.0x
   vm86sim : FAIL target <= 2.0x, result = 79.6x
   fullsim : FAIL target <= 75.0x, result = 78.8x

Current devel branch + cd.diff applied

============================================================== 
FAIL: TestFbenchTc
-------------------------------------------------------------- 
CPUEMU
       off :  OK  target <= 2.0x, result = 1.2x
      vm86 : FAIL target <= 2.0x, result = 5.2x
      full : FAIL target <= 5.0x, result = 5.2x
   vm86sim : FAIL target <= 2.0x, result = 78.0x
   fullsim : FAIL target <= 75.0x, result = 78.0x

Hi Bart,
I retested with your latest devel (c1ddb275b8ca54fe66b8b6144cf0bb5c861d8f76) and these are the results. It's looking a lot better. Have you reached the point yet where there's no low-hanging fruit left?

===============================================================
PASS: TestFbenchDjgpp
---------------------------------------------------------------
CPUEMU
       off :  OK  target <= 2.0x, result = 1.0x
      vm86 :  OK  target <= 2.0x, result = 1.2x
      full :  OK  target <= 5.0x, result = 4.0x
   vm86sim :  OK  target <= 2.0x, result = 1.2x
   fullsim :  OK  target <= 75.0x, result = 61.2x

===============================================================
PASS: TestFbenchTcc
---------------------------------------------------------------
CPUEMU
       off :  OK  target <= 2.0x, result = 1.0x
      vm86 :  OK  target <= 8.0x, result = 4.3x
      full :  OK  target <= 8.0x, result = 4.3x
   vm86sim :  OK  target <= 80.0x, result = 64.7x
   fullsim :  OK  target <= 80.0x, result = 65.0x

---------------------------------------------------------------
Ran 2 tests in 618.609s

OK
