Some users have noticed a small mismatch between the reported number of executed/stalled cycles and the total you get by summing the cycles reported in the width[] histogram array.
The cause is the way xnop (multi-cycle no-op) cycles are counted: they are not accounted for in any of the width[] elements. Because these can be non-architectural cycles (for example, a scoreboarded machine would ignore the xnop directive), it is hard to get the histogram right in every case.
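To see how large the gap is for a given run, you can sum the histogram yourself and compare it with the reported cycle count. The following is only a minimal sketch: the function names, the parameter names, and the assumption that width[] is available as a plain array of unsigned long counters are all hypothetical, not part of the actual tool output format.

```c
#include <stddef.h>

/* Hypothetical helper: total cycles recorded in the width[] histogram.
 * width[i] is assumed to hold the number of cycles that issued
 * i operations; n is the number of histogram entries. */
unsigned long histogram_cycles(const unsigned long *width, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += width[i];
    return sum;
}

/* Hypothetical helper: cycles in the reported total that do not appear
 * in any width[] bucket. Per the explanation above, these would be the
 * xnop cycles. The result is signed so an unexpected negative gap is
 * visible instead of wrapping around. */
long unaccounted_cycles(unsigned long reported_cycles,
                        const unsigned long *width, size_t n)
{
    return (long)reported_cycles - (long)histogram_cycles(width, n);
}
```

With made-up numbers, a histogram of {120, 300, 250, 200, 100} sums to 970 cycles, so a reported total of 1000 would leave 30 cycles unaccounted for.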
To get a more precise IPC histogram, you can turn off xnop generation (-fno-xnop compiler flag). This forces the compiler to emit architectural empty cycles, which are properly accounted for in the width[] histogram.
However, keep in mind that libraries and initialization routines are precompiled with xnop generation turned on, so code that executes in libraries (as well as crt0.o, etc.) will suffer from the same problem, and a small mismatch is likely to remain.
-- Paolo