VEX compiler and tools

by **ODerin** » Sat Feb 25, 2006 3:41 pm

Hello,

When compiling the imgpipe benchmark for a two-cluster machine with 16 registers, the compilation never finishes (specifically, it keeps compiling jpeg/jcmaster.c). We stopped it manually after one day. With 32 or more registers there is no problem and compilation ends successfully. Below you can find the fmmdump and the compilation command.

The reason may be that, for this particular code, it is possible at some point to fill the entire issue width with operations that could together use issuewidth*3 = 24 registers. Some more registers may be needed for intercluster copy operations. In the end, we may need more than 32 registers in total. Still, by inserting extra loads and stores, or by giving up on 100% usage of the issue width, the compiler should be able to compile the code for an even smaller number of registers.
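The back-of-envelope estimate above can be written out as a small sketch. It assumes the `KIND: name value` entry layout of the fmmdump posted further down, takes the post's figure of 3 registers per operation at face value, and treats the `$r` entries as the general-purpose register files. The function names are hypothetical and not part of any VEX tool.

```python
import re

# Excerpt of the fmmdump from this post (only the entries used here).
FMMDUMP = """
RES: IssueWidth 8
RES: IssueWidth.0 4
RES: IssueWidth.1 4
REG: $r0 16
REG: $b0 8
REG: $b1 8
REG: $r1 16
"""

def parse_fmmdump(text):
    """Return {(kind, name): value} for every 'KIND: name value' entry."""
    entries = re.findall(r"(RES|DEL|REG):\s+(\S+)\s+(\d+)", text)
    return {(kind, name): int(value) for kind, name, value in entries}

def naive_register_demand(issue_width, regs_per_op=3):
    """Estimate from the post: every issued operation touches 3 distinct registers."""
    return issue_width * regs_per_op

machine = parse_fmmdump(FMMDUMP)
demand = naive_register_demand(machine[("RES", "IssueWidth")])
# Assumption: "$r" entries are general-purpose files, "$b" entries are branch files.
general_regs = sum(
    value for (kind, name), value in machine.items()
    if kind == "REG" and name.startswith("$r")
)
print("naive demand:", demand, "general registers:", general_regs)
```

For this configuration the naive demand comes out below the number of general registers, yet the compilation still does not finish, which is consistent with the observation that issuewidth*3 is not a reliable safe bound.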

Should we wait longer for the compiler to finish? Is there a way to derive a safe number of registers that guarantees there won't be such infinite compile times? For the given configuration, for example, issuewidth*3 doesn't work as an estimate of this safe number.

This is the compilation command that doesn't finish:

    /opt/vex/FC4/bin/cc -O3 -H3 -prefetch -DVEX_RESTRICT -DJAMMED -width 2 -fmm=auto.mm -fmmdump -c -o jpeg/jcmaster.o jpeg/jcmaster.c

This is the fmmdump of the mentioned configuration:

    RES: IssueWidth 8
    RES: MemLoad 8
    RES: MemStore 8
    RES: MemPft 1
    RES: IssueWidth.0 4
    RES: Alu.0 4
    RES: Mpy.0 2
    RES: CopySrc.0 1
    RES: CopyDst.0 1
    RES: Memory.0 1
    RES: IssueWidth.1 4
    RES: Alu.1 4
    RES: Mpy.1 2
    RES: CopySrc.1 1
    RES: CopyDst.1 1
    RES: Memory.1 1
    DEL: AluR.0 0
    DEL: Alu.0 0
    DEL: CmpBr.0 1
    DEL: CmpGr.0 0
    DEL: Select.0 0
    DEL: Multiply.0 1
    DEL: Load.0 2
    DEL: LoadLr.0 3
    DEL: Store.0 0
    DEL: Pft.0 0
    DEL: Asm1L.0 0
    DEL: Asm2L.0 0
    DEL: Asm3L.0 0
    DEL: Asm4L.0 0
    DEL: Asm1H.0 1
    DEL: Asm2H.0 1
    DEL: Asm3H.0 1
    DEL: Asm4H.0 1
    DEL: CpGrGR.0 1
    DEL: CpGrBr.0 1
    DEL: CpBrGr.0 0
    DEL: CpGrLr.0 2
    DEL: CpLrGr.0 0
    DEL: Spill.0 0
    DEL: Restore.0 2
    DEL: RestoreLr.0 3
    DEL: AluR.1 0
    DEL: Alu.1 0
    DEL: CmpBr.1 1
    DEL: CmpGr.1 0
    DEL: Select.1 0
    DEL: Multiply.1 1
    DEL: Load.1 2
    DEL: LoadLr.1 3
    DEL: Store.1 0
    DEL: Pft.1 0
    DEL: Asm1L.1 0
    DEL: Asm2L.1 0
    DEL: Asm3L.1 0
    DEL: Asm4L.1 0
    DEL: Asm1H.1 1
    DEL: Asm2H.1 1
    DEL: Asm3H.1 1
    DEL: Asm4H.1 1
    DEL: CpGrGR.1 1
    DEL: CpGrBr.1 1
    DEL: CpBrGr.1 0
    DEL: CpGrLr.1 2
    DEL: CpLrGr.1 0
    REG: $r0 16
    REG: $b0 8
    REG: $b1 8
    REG: $r1 16

Thank you,

Onur

by **frb** » Sun Mar 05, 2006 10:39 pm

Your analysis is probably correct. What happens is that you're defining a machine that is a bit out of balance, and the compiler chokes. The first thing I would try is reducing the unrolling factor. You're using -H3, which is pretty aggressive unrolling (I don't remember the amount, but probably a lot). Try -H1 (or omit the -Hx flag), and if you have manual unrolling pragmas, reduce them until compilation succeeds. If it still doesn't work after that, let me know, and I'll take a closer look.
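One way to automate this suggestion is to retry the compile with progressively less aggressive unrolling and give up on any attempt that exceeds a time budget. The following is a minimal sketch, not a VEX feature: it reuses only the flags that appear in this thread (`-H3`, `-H1`, or no `-H` flag), the 600-second budget is an arbitrary assumption, and `build_cmd` and `first_config_that_finishes` are hypothetical helper names.

```python
import subprocess

# Command from the original post, minus the -H flag, which is varied below.
BASE_CMD = [
    "/opt/vex/FC4/bin/cc", "-O3", "-prefetch", "-DVEX_RESTRICT", "-DJAMMED",
    "-width", "2", "-fmm=auto.mm", "-fmmdump",
    "-c", "-o", "jpeg/jcmaster.o", "jpeg/jcmaster.c",
]

def build_cmd(h_flag):
    """Insert the unrolling flag right after -O3, or leave it out when h_flag is None."""
    if h_flag is None:
        return list(BASE_CMD)
    return BASE_CMD[:2] + [h_flag] + BASE_CMD[2:]

def first_config_that_finishes(h_flags=("-H3", "-H1", None),
                               timeout_s=600, run=subprocess.run):
    """Try each setting in order; return the first command that exits with
    status 0 within the time budget, or None if none of them does."""
    for h_flag in h_flags:
        cmd = build_cmd(h_flag)
        try:
            result = run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # treat a timeout as "compiler is stuck", try the next setting
        if result.returncode == 0:
            return cmd
    return None
```

Calling `first_config_that_finishes()` from the benchmark directory would run the real compiler; the `run` parameter exists only so the loop can be exercised with a stand-in during testing.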

-- Paolo

by **ODerin** » Sun Mar 05, 2006 11:15 pm

It compiles when we relax the -Hx and -Ox flags. But this is something of a compromise on code quality.

by **frb** » Mon Mar 06, 2006 9:20 am

ODerin wrote: It compiles when we relax the -Hx and -Ox flags. But this is something of a compromise on code quality.

Not really. It doesn't make much sense to unroll the code, say, 32 times when unrolling it 4 times exposes enough ILP to take full advantage of the machine. Unfortunately, there's no magic formula that will tell you how much to unroll, which is why compilers have pragmas, flags, etc. to help you guide the unrolling. Relaxing the unrolling aggressiveness is the right approach in the experiment you're running.
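To make the 4-versus-32 comparison concrete, here is a deliberately crude toy model. It is not a formula for choosing an unroll factor and says nothing about how the VEX compiler actually decides. The loop-body numbers (`OPS_PER_ITER`, `LIVE_PER_ITER`) are hypothetical; only the issue width and the register count come from the dump earlier in the thread.

```python
def smallest_filling_unroll(issue_width, independent_ops_per_iter):
    """Smallest power-of-two unroll factor whose unrolled body contains
    at least issue_width independent operations."""
    factor = 1
    while factor * independent_ops_per_iter < issue_width:
        factor *= 2
    return factor

def rough_live_values(unroll_factor, live_values_per_iter):
    """Crude pressure estimate: every unrolled copy keeps its own temporaries live."""
    return unroll_factor * live_values_per_iter

ISSUE_WIDTH = 8     # RES: IssueWidth in the dump
GENERAL_REGS = 32   # $r0 + $r1 in the dump, 16 each
OPS_PER_ITER = 2    # hypothetical loop body
LIVE_PER_ITER = 3   # hypothetical loop body

enough = smallest_filling_unroll(ISSUE_WIDTH, OPS_PER_ITER)
for factor in (enough, 32):
    live = rough_live_values(factor, LIVE_PER_ITER)
    print(f"unroll x{factor}: ~{live} live values vs {GENERAL_REGS} registers")
```

Under these made-up numbers the issue width is already filled at a small unroll factor, while a factor of 32 asks for far more live values than the machine has registers and adds no extra parallelism the machine can use.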

-- Paolo

Infinite compile time
