Request for Comment: SFRONT Optimizations.

From: Robin Davies (rerdavies@msn.com)
Date: Thu Mar 09 2000 - 02:34:23 EST

Next message: Michael J McGonagle: "Re: sfront 0.57 03/08/00 released"
Previous message: John Lazzaro: "sfront 0.57 03/08/00 released"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I'm contemplating making the following changes to the current SFRONT sources on an experimental basis, to see whether they improve the performance of SFRONT-generated code. If I can get them to work, I'll let John Lazzaro know how I make out, and discuss folding the results into the SFRONT distribution. Step 1, however, is to try it and see if it works.

I'm not sure whether SFRONT was written for x86 compilers or not; however, there are some definite structural improvements that could be made to the output that would substantially boost performance on x86 platforms. Here are the specific problems that come to mind.

The Problem
------------------

SFRONT output doesn't optimize well on the MSVC compiler.

(1) Floats have a distinct drawback on x86 platforms: they're actually SLOWER than doubles. The reason is that float and double arithmetic operations are performed identically in the floating-point unit. However, *storing* FPU results to memory as floats takes longer, since the internal FPU values must be truncated, and that costs a couple of extra cycles during the store (3 cycles to store a float, 1 to store a double when fully pipelined).
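
A minimal sketch of how point (1) could be tried out, assuming the generated code routes every sample value through a single typedef so that float and double builds can be timed against each other. The names `sample_t`, `USE_FLOAT_SAMPLES`, and `onepole` are hypothetical and do not come from the SFRONT sources.

```c
/* Hypothetical sketch: these names are illustrative, not SFRONT output. */
#ifdef USE_FLOAT_SAMPLES
typedef float sample_t;
#else
typedef double sample_t;    /* assumed default for the x86 experiment */
#endif

/* One-pole lowpass step: y += a * (x - y). The filter state lives in *y,
 * so each call stores one sample_t back to memory. */
static sample_t onepole(sample_t *y, sample_t x, sample_t a)
{
    *y += a * (x - *y);
    return *y;
}
```

Building the same file with and without `-DUSE_FLOAT_SAMPLES` would give two binaries that differ only in the storage type, which is the comparison the paragraph above is about.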

(2) The state variable scheme used by sfront works very poorly on an x86 architecture, possibly slowing compiled code by as much as a factor of 10 in common cases. State variables must be loaded before being operated on by the FPU (2 cycles), and then MUST be stored back (another 2 cycles, plus probable additional penalties if immediately followed by a load). Furthermore, the MSVC compiler is not able to determine that loads and stores from the state array are not aliasing, so it is unable to perform out-of-order execution to take advantage of the superscalar architecture of late-model x86s, nor to keep working temporary variables in the FPU. As a result, SFRONT-generated code takes an inordinate amount of time loading and storing values.

(3) The practice of performing a single A cycle across all instruments, instead of doing a K-cycle's worth of A cycles for each instrument, denies the compiler opportunities to optimize temporary variable usage in the FPU, and in most cases will deny the compiler an opportunity to fully optimize the superscalar execution of code.
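
The two scheduling orders in point (3) can be sketched as follows. This is an illustration only: `NUM_INSTR`, `SAMPLES_PER_K`, `apass`, and the trace buffer are hypothetical, and the sketch assumes the instruments do not read each other's output sample by sample within a K cycle.

```c
#include <string.h>

/* Hypothetical sizes, chosen small so the call order is easy to read. */
enum { NUM_INSTR = 2, SAMPLES_PER_K = 3 };

static char trace[64];  /* records which instrument ran, in call order */

/* Stand-in for one instrument's audio-rate pass: appends 'A' or 'B'. */
static void apass(int instr)
{
    size_t n = strlen(trace);
    trace[n] = (char)('A' + instr);
    trace[n + 1] = '\0';
}

/* One A cycle across all instruments, repeated for each sample. */
static void run_interleaved(void)
{
    for (int s = 0; s < SAMPLES_PER_K; ++s)
        for (int instr = 0; instr < NUM_INSTR; ++instr)
            apass(instr);
}

/* A whole K cycle's worth of A cycles for one instrument at a time. */
static void run_blocked(void)
{
    for (int instr = 0; instr < NUM_INSTR; ++instr)
        for (int s = 0; s < SAMPLES_PER_K; ++s)
            apass(instr);
}
```

In the blocked order, each instrument's inner loop touches only its own state, which is what would give a compiler the chance to keep that state in registers across samples.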

The Solution
------------------

(1) Convert the state variables from arrays to explicit structure members.

This change allows the compiler to determine that memory loads and stores are not aliased. As a result, temporary variables can be held in the FPU while performing complex operations, and the compiler can use lazy write-back and out-of-order stores to memory (or not write values back at all).
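
A minimal sketch of the difference between the two state layouts, using a made-up sawtooth oscillator. None of these names (`OSC_PHASE`, `OscState`, `osc_apass_array`, `osc_apass_struct`) are SFRONT identifiers, and whether a particular compiler generates better code for the second form is the claim under test, not something the sketch demonstrates.

```c
/* Hypothetical sketch; names are illustrative, not SFRONT's. */

/* Array form: every state variable is a slot in one shared array. */
enum { OSC_PHASE, OSC_INCR, OSC_OUT, OSC_NSTATE };

static void osc_apass_array(double *state)
{
    state[OSC_PHASE] += state[OSC_INCR];
    if (state[OSC_PHASE] >= 1.0)
        state[OSC_PHASE] -= 1.0;
    state[OSC_OUT] = 2.0 * state[OSC_PHASE] - 1.0;
}

/* Struct form: each state variable is a distinct named member. */
typedef struct {
    double phase;
    double incr;
    double out;
} OscState;

static void osc_apass_struct(OscState *s)
{
    double phase = s->phase + s->incr;  /* working value kept in a local */
    if (phase >= 1.0)
        phase -= 1.0;
    s->phase = phase;                   /* one store per member per pass */
    s->out = 2.0 * phase - 1.0;
}
```

Both functions compute the same values; only the way the state is addressed differs.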

(2) Convert SFRONT output to generate an entire K-cycle's worth of A values for each instrument.

In case this isn't clear, what I want to do is this:

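    void Piano_Evaluate() {
        int i = kSamplesPerControlCycle;
        Piano_kpass();          /* one control-rate pass per K cycle */
        while (i & 3) {         /* peel off samples until i is a multiple of 4 */
            Piano_apass();
            --i;
        }
        while (i != 0) {        /* remaining samples, unrolled four at a time */
            Piano_apass();
            Piano_apass();
            Piano_apass();
            Piano_apass();
            i -= 4;
        }
    }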

Why the unrolled loops? Experimental optimizations on a similar project reveal that loop unrolling alone gains an additional 37% in execution speed on a complex instrument, because it gives the compiler further opportunities to optimize.

Also it should be said that I do intend to inline the kpass and apass code, although they're represented as macros here.
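
The loop above can be compiled and checked with stand-in passes that only count their invocations. The stub bodies, the counters, and the value of `kSamplesPerControlCycle` are assumptions made for this sketch; real kpass and apass code would be generated by SFRONT.

```c
/* Hypothetical stand-ins so the loop structure can be exercised. */
enum { kSamplesPerControlCycle = 10 };  /* assumed value, not from SFRONT */

static int kpass_calls = 0;
static int apass_calls = 0;

static void Piano_kpass(void) { ++kpass_calls; }
static void Piano_apass(void) { ++apass_calls; }

static void Piano_Evaluate(void)
{
    int i = kSamplesPerControlCycle;
    Piano_kpass();
    while (i & 3) {             /* 10 -> 9 -> 8: two leftover samples */
        Piano_apass();
        --i;
    }
    while (i != 0) {            /* 8 samples, four per iteration */
        Piano_apass();
        Piano_apass();
        Piano_apass();
        Piano_apass();
        i -= 4;
    }
}
```

With a control cycle of 10 samples, the first loop runs twice and the second loop runs twice, so each call to `Piano_Evaluate` performs exactly one kpass and ten apasses.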

(3) Eliminate *all* calls during kpass and apass operations.

Typically, the penalty for a procedure call is considerably worse than just the execution time of the call. An optimizing compiler has to write all temporary results back to memory, and must also commit the contents of the FPU and the EAX, EBX, ECX, and EDX registers. This denies the compiler substantial optimization opportunities.

The solution is to inline all function calls (except for recursive ones... shudder).
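
A minimal sketch of what inlining a helper could look like in C99, assuming a table-lookup opcode that calls a small interpolation routine. `lin_interp` and `table_read` are hypothetical names, and `static inline` is only a request that the compiler is free to ignore; a code generator could instead paste the helper's body directly at each call site.

```c
/* Hypothetical helper, marked static inline so the compiler may expand it
 * at the call site instead of emitting a call. */
static inline double lin_interp(double a, double b, double frac)
{
    return a + frac * (b - a);
}

/* Reads a wrapped table at a fractional position, 0 <= pos < len. */
static double table_read(const double *table, int len, double pos)
{
    int i = (int)pos;                       /* integer part of the index */
    double frac = pos - (double)i;          /* fractional part */
    int j = (i + 1 < len) ? i + 1 : 0;      /* next entry, wrapping at the end */
    return lin_interp(table[i], table[j], frac);
}
```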

From what I've seen of generated and compiled code using MSVC, this sort of code-shuffling should yield a 10x or better performance increase.

Comments, anyone? In particular, I'd be interested in impressions from people who are familiar with the gcc optimizer. Do these problems show up with gcc as well, or is this just an MSVC issue? I think not.

Regards,

Robin Davies


This archive was generated by hypermail 2b29 : Mon Jan 28 2002 - 12:03:53 EST