
Speed up your Pascal programs with profiling

Profiling allows you to improve the speed of your programs for comparatively little work.

The idea is to identify so-called hot spots in your program, i.e. those parts where the program spends most of its time.
Once we know where they are, we can make speed improvements to the code where it matters.
There's no point doubling the speed of code that runs 1% of the time, because the difference won't be noticed. On the other hand, doubling the speed of code that runs, say, 20% of the time should save 10% of your program's execution time.

Here, we'll be looking to make p5c run faster. I'm running on Linux, and will be using the callgrind tool of valgrind to collect the profiling data, and kcachegrind to examine it. (These are 2 amazing programs.) You can adapt the techniques here to your own setup. Users of other systems might find alternative tools here

step 1 - build a version of the program under test

Just as in source-level debugging, we need the line numbers and symbol names in the compiled program so that the results can be displayed in terms of program line numbers and function names (or procedure names).

Our program under test is p5c itself, so to avoid confusion we'll create a copy of p5c.pas and call it p5c.t.pas. So p5c is the compiler and p5c.t.pas is the program we are profiling; these are now separate programs.
Replace p5c.t with your program name in what follows.

Put these 2 lines at the top of p5c.t.pas:

{$n+,d- -- force line numbers, omit debug code }
{@@# line 1 "p5c.pas" @@}

This ensures that the intermediate C file knows the line numbers of p5c.pas, and allows gcc to put these into the generated binary image.

Now build the image:

$P5CDIR/pc -g p5c.t.pas

The -g option tells gcc to put (Pascal) line numbers and other symbolic information into the generated image, p5c.t.

step 2 - gather profiling data

You may need to remove old profile data first:

rm -f callgrind.out.*

Now use valgrind to run your program.

It needs a command line of the form valgrind options your-command command-line-args.
In this example, the options specify the callgrind tool.
The command in this example is p5c.t, and its arguments are, respectively, the Pascal source file that we will compile and the gcc intermediate file:

valgrind --tool=callgrind p5c.t iso7185-tests/iso7185pat.pas iso7185-tests/iso7185pat.c

With the bash shell, this can be abbreviated:

valgrind --tool=callgrind p5c.t iso7185-tests/iso7185pat.{pas,c}

Since valgrind simulates p5c.t instruction by instruction, this could take a while. The payback is that the profile is exact rather than sampled.

We could run again, compiling different source files, and keep several sets of results if we wanted to.

step 3 - examine the results

Use kcachegrind for this. The results of the previous test or tests are in files called callgrind.out.xxxx, where xxxx is an id number (i.e. the PID of the profiled run).

kcachegrind callgrind.out.*

kcachegrind displays the results in its main window.

It's helpful to select "Source File" from the drop-down box in the top left window, then select p5c.t.pas to hide data for library calls etc.
Then select "Source Code" and "Call Graph" in the tabs on the 2 right-hand windows.
Finally, select "% Relative" on the toolbar.

The bottom left window shows the detailed profiling data.
We can see that the program spends 99.99% of its time in main() and its children, then 99.74% in a procedure called programme (shown as programme_1), and so on, in order of decreasing time spent.

Now click on the "Self" tab at the top of this window.

Now we see how much time is spent in each procedure or function itself, excluding time spent in the functions it calls.

It shows that 46% of the time is spent in a single function called lcase. Click on that line.
The top right window shows the source code of the lcase function, and the bottom right window shows its call graph.

Similarly, 35% of the time is spent in a function called strequvf.

Both of these are very small functions, and improvements to their efficiency will have a dramatic effect on the compile speed of p5c.

But wait! We also see that lcase is called 42 million times, and the call graph shows that it is called almost always from strequvf.

Given that p5c.t was compiling source code about 130 kilobytes long, we would expect to be able to reduce the number of times lcase is called as well, so there are even more efficiency gains to be had.

Now deselect "% Relative" and note the total instruction count for main: the whole program used 1,476,946,509 instructions.
Later we'll compare this to the improved version of p5c.

step 4 - optimise the hot spots

This is the thinking bit.

In this example, lcase is a function that downcases a character. The obvious optimisation is to replace the function with an array, so that, for example, lcase['C'] = 'c' and lcase['+'] = '+'.

strequvf compares two strings, one fixed-length, the other variable-length:

 { compare variable length id string to fixed }
  function strequvf(a: strvsp; var b: idstr): boolean;
  var m: boolean; i, j: integer; c: char;
  begin
    m := true; j := 1;
    for i := 1 to maxids do begin
      c := ' '; if a <> nil then begin c := a^.str[j]; j := j+1 end;
      if lcase(c) <> lcase(b[i]) then m := false;
      if j > varsqt then begin a := a^.next; j := 1 end
    end;
    strequvf := m
  end;

The for loop iterates maxids times (i.e. 250 times), even if the strings differ in their first character.
Also, the if j > varsqt ... statement can be moved inside the if a <> nil ... statement.
Similarly, the char variable c needs to be down-cased only inside that if statement.

This suggests the following optimised code:

  { compare variable length id string to fixed }
  function strequvf(a: strvsp; var b: idstr): boolean;
  var m: boolean; i, j: integer; c: char;
  begin
    m := true; j := 1;
    i := 1;
    while m and (i <= maxids) do begin
      c := ' ';
      if a <> nil then begin
        c := lcase(a^.str[j]);
        j := j+1;
        if j > varsqt then begin a := a^.next; j := 1 end
      end;
      if c <> lcase(b[i]) then m := false;
      i := i+1
    end; {while}
    strequvf := m
  end;

Similar reasoning can be applied to the strequri function.

As a bonus, Pemberton's book suggests optimising nextch by skipping all the leading blanks in a line. See note 3 for the endofline procedure here.

step 5 - how much faster is it now?

Applying the above changes to the test file and rerunning the profile test gives the following results.

The total count is now down to 153,583,862 instructions - almost a 10x improvement, so yes, this is definitely a big speed-up for very little effort.
(This is almost too good to be true; I hope I haven't made a mistake somewhere.)

The versions of p5c.pas used before and after profiling are attached below. The file pv.sh is my bash shell script that automates the above steps (except the thinking bits).

Notes:
1. It's best to profile only mature code. Code likely to evolve will just result in your profiling work being undone.
2. There is a risk involved in optimising known working code. OTOH, looking carefully at it could reveal unknown bugs.
3. valgrind & kcachegrind show just how powerful the gcc ecosystem is. By using C as its intermediate code, p5c is able to hook into this wider ecosystem.

Posted by Trevor Blight 2017-07-25
