PowerPC Twice As Fast As Pentium? BYTEmark vs SPEC - Readers Bite Back!
Our question of whether you trust BYTEmark benchmarks more than SPEC benchmarks generated a lot of mail, and one thing is clear - benchmarks are part faith and alchemy, not pure science. Many variables can affect how a benchmark program will perform, and most programs can only give you a rough approximation of comparative performance. Application benchmarks are in general a better indicator than synthetic benchmarks such as BYTEmark and SPEC, but even application benchmarks can give a somewhat misleading picture of comparative performance depending on what hardware or OS is being stressed. Benchmarks are a crude roadmap, but even a crude roadmap is better than none when you are trying to make a purchasing decision. Consult as many sources of performance information as possible when considering a purchase (we have listed all the sites we know of with benchmark and performance information on our links page) and remember that the only benchmarks that really count are the ones you perform when you run your hardware through the tasks you need to get done. If you are not satisfied, most products come with a 30-day money-back guarantee!
Below we post some of the responses we got to our BYTEmark vs SPEC question:
|Thanks for running some numbers. As you note
at the beginning, SPEC was designed to keep the processor turning.
In other words, these tests are designed not to cause a backlog
in the register-starved and low cache Pentiums and Pentium IIs,
so that the maximum amount of processor time is available. They
go up to, but do not cross, the line where the x86 architecture
has trouble. This is done by keeping the data flow into the processor
at a low level and keeping a steady flow of instructions that
use the same data over and over again.
On the other hand, ByteMark tests are designed to cover a continuum
of cases from the above situation to where even 32 register
systems are stressed. These tests focus not on keeping the processor
going, but on how much traffic can get through to it. Traffic
jams in the feeding system cause down time for the processor.
Real world tests show similarity to both types of tests, sometimes favoring the SPEC and sometimes favoring the ByteMark, but mostly falling somewhere in between.
Another thing to remember is that the weak point of the G3
is the FPU. The G3 is a major improvement over the 603 in the
integer suite, but not in the FPU. It also has improved caching
schemes to support the chip. It is in integer tests, under BYTEmark conditions, that G3s perform at about twice the level of the Pentium II. Floating point tests (like your Mathematica 3.0 tests) generally produce speeds of between 1 and 1.5 times that of the Pentium II at the same MHz. Also, loading and storing of
registers is an "integer" function in Pentiums, but
G3s have a separate execution unit for loading and storing.
Loading and storing not only occurs more frequently for the Pentiums, it also takes a bigger hit for these functions than the G3 does.
In answer to your question of which is a better test, it is
better to do both and understand that these are the extremes.
Each real world application will almost always fall in between,
depending on how much data needs to go to the processor.
The fact remains that under ideal situations, where data flows
easily to the processor, the G3 still gets about 50% better
integer performance and about the same FP performance. When
the data flows increase, the Pentium II's feeding system gets
overwhelmed well before the G3s do. For integer tests, the G3
is about 50% faster during ideal flow situations, and often
more than 100% faster at higher data feeding rates.
A couple more things about your tests you might think about.
Your 300 MHz systems both have 66 MHz front-end buses (a ratio of about 4.5:1), thus memory to processor/cache speeds are the same.
I don't think this is true of your 400 MHz systems. The Pentium
II/400s use a dual 50 MHz bus (similar to the 2x50 MHz bus on
the MACh 5 604e and in the big picture have almost the same
throughput as a 100 MHz bus). These have an effective 4:1 ratio.
The 400 MHz G3 probably still has a 66 MHz bus for a 6:1 ratio, although it may have a 72 MHz bus for a 5:1 ratio. Either way,
flow from memory to the cache/processor is faster on the Pentium
II than the G-3. Thus you can expect the G-3 to take a small
hit for this mis-match.
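Editor's note: the ratios cited above are simply core clock divided by bus clock. Here is a quick sketch of that arithmetic, using the bus speeds the reader assumes (his figures, not confirmed specifications):

    /* Core-to-bus clock ratios for the configurations described above.
     * The bus speeds are the reader's assumptions, not verified specs. */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *system; double core_mhz, bus_mhz; } cfg[] = {
            { "300 MHz G3 / Pentium II, 66 MHz bus",        300.0,  66.0 },
            { "400 MHz Pentium II, ~100 MHz effective bus", 400.0, 100.0 },
            { "400 MHz G3, 66 MHz bus",                     400.0,  66.0 },
            { "400 MHz G3, 72 MHz bus",                     400.0,  72.0 },
        };
        for (int i = 0; i < 4; i++)
            printf("%-45s ratio %4.1f:1\n", cfg[i].system,
                   cfg[i].core_mhz / cfg[i].bus_mhz);
        return 0;
    }

The larger the ratio, the more core cycles go by for each bus cycle - the "small hit" he expects the higher-ratio G3 to take.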
I understand that the next generation systems (Katmai Pentium
IIs and AltiVec G-4s) will have true 100 MHz buses early next
year. But while the Pentium II adds the Katmai New Instructions (for 3-D FP calculations), the G-4 will also have a real FPU (a la the 604e), still better feeding mechanisms (up to 2 MB of L2 cache), AltiVec (superior to KNI) and improved multiprocessor capability. If the G-3s do better than the Pentium IIs now, just wait until February and watch the gap widen, this time particularly on the floating point side.
|Neither benchmark is particularly effective.
Especially for the last generation or two of processors, and
even more so in the future, processor speed is very tightly
linked to memory architecture. Memory bandwidth starvation easily
dominates all other factors when it is present. So for real
world performance results, a benchmark must move reasonably
large amounts of data to and from memory, even when focusing
on just the CPU, because memory bandwidth to CPU is the critical
performance factor right now.
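Editor's note: as a made-up illustration of this point (not code from BYTEmark or SPEC), the same simple loop slows down sharply once its working set outgrows the cache, which is why a benchmark's memory footprint largely determines what it is really measuring:

    /* Sketch: sweep the working set from cache-sized to memory-sized and
     * time a fixed number of strided reads. Once the array exceeds the
     * cache, throughput is set by the memory system, not the CPU core. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t total_reads = 1 << 24;          /* same work at every size */

        for (size_t kb = 32; kb <= 8192; kb *= 2) {  /* 32 KB up to 8 MB */
            size_t n = kb * 1024 / sizeof(long);
            long *a = malloc(n * sizeof(long));
            if (a == NULL) return 1;
            for (size_t i = 0; i < n; i++) a[i] = (long)i;

            volatile long sum = 0;   /* volatile keeps the loop from being optimized away */
            size_t done = 0;
            clock_t t0 = clock();
            while (done < total_reads)
                for (size_t i = 0; i < n && done < total_reads; i += 8, done++)
                    sum += a[i];     /* stride of 8 longs, roughly one cache line */
            clock_t t1 = clock();

            printf("%5lu KB working set: %.2f s\n",
                   (unsigned long)kb, (double)(t1 - t0) / CLOCKS_PER_SEC);
            free(a);
        }
        return 0;
    }

The absolute times will differ wildly from machine to machine, but the shape - flat until the cache size is exceeded, then a sharp slowdown - is the memory-bandwidth effect described above.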
How big is the memory footprint of Bytemark? Is it greater
than the 1 MB of many caches today? More than that, the Bytemark
is extraordinarily sensitive to compiler fiddlings. It is much
more a toy benchmark than SPEC, which is the result of many
years of industry benchmarking experience. Certain SPEC results
require standard compiler optimizations, and all SPEC results
require published compiler settings. This avoids the abuse of
compiler vendors recognizing the benchmark and generating special
code just for it (which would give it a benefit you couldn't
realize in real programs, unless your real programs exactly
correspond to the benchmark). There is a whole benchmarking
FAQ which you should read to get an idea of how people lie using
benchmarks (including Apple).
But is SPEC any better? Yes, in some ways. It has a larger
body of code, which consists of pieces of commonly used programs _under
Unix_. The SPEC body has a lot of rules to safeguard the credibility
of SPEC results.
No, in other ways. SPEC is expensive, so it isn't available
to the general hobbyist. And it only runs under Unix. Whose
SPEC results are you quoting? How many of the G3 SPECs were
produced on PowerMacs? None were produced under MacOS; how do
you know if the speed of the G3 under some flavor of Unix (AIX,
LinuxPPC, MkLinux) corresponds in any way to its speed under
MacOS? And if the SPEC was produced on an IBM workstation, how
can you use results generated with a completely different memory
hierarchy and hardware? If the SPEC is merely estimated, what
the heck does that mean?
Now, since we persist in wanting to compare processors, and
Macs against PCs, we need some method of comparison. Do what
most reputable people do: compare benchmarks, both Bytemarks
and SPEC and whatever else comes along. Note their deficiencies,
and balance those results with the following application comparisons.
Then compare performance in popular applications; try hard to
find parts of the applications that are processor bound. For
example, scrolling is often a poor test, because the scroll
speed is often limited not by the processor but by the application
code (to avoid scrolling too fast).
What are you asking for in a CPU benchmark? Perhaps this: given
equivalent supporting hardware (memory, logic set, etc), which
CPU performs best? This probably varies for different performance
parameters of the supporting hardware. So, then, maybe this:
which CPU has the most potential speed, given ideal supporting
hardware? But what if a CPU requires much more complex support
logic to provide the same speed? What if that level of support
logic is not available, except on reference machines, and most
machines use far inferior support logic?
What if comparing CPU benchmarks is like comparing NBA players
based upon their free-throw ability, or comparing swimmers based
upon their lung capacity, or comparing runners based on the
power of their hamstring muscles? In other words, it produces
results almost totally unrelated to the original question, and
ignores many dominant factors.
Food for thought: no current benchmark measures interface response
time, which is what we perceive as speed. We throw raw performance
at the problem and assume that will solve our problem, while
application and OS developers are making the problem worse.
|There are two ways to evaluate a system: low level & high level.
Low level analysis looks at moving bytes, modifying memory,
and very basic activities like sorting or complex math (integer,
floating point, array...).
High level analysis looks at things like how fast the system
boots, how fast a specific program like Mathematica or Word
runs, or how fast the Browser moves.
I think that the Bytemark is the better low level benchmark.
It seems to test more generic system activities like sorting
& bit twiddling.
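Editor's note: "bit twiddling" here means small, register-bound kernels. A made-up example in that spirit (not actual BYTEmark code) is timing a population count over a buffer:

    /* Sketch of a "bit twiddling" micro-kernel: count the set bits in a
     * buffer of pseudo-random words. This is the flavor of low-level work
     * such benchmarks time; it is not taken from BYTEmark itself. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static unsigned count_bits(unsigned long x)
    {
        unsigned n = 0;
        while (x) {
            x &= x - 1;   /* clear the lowest set bit */
            n++;
        }
        return n;
    }

    int main(void)
    {
        enum { WORDS = 1 << 16, PASSES = 200 };
        unsigned long *buf = malloc(WORDS * sizeof(unsigned long));
        if (buf == NULL) return 1;

        srand(1);         /* fixed seed so runs are repeatable */
        for (int i = 0; i < WORDS; i++)
            buf[i] = (unsigned long)rand() * 2654435761UL;

        unsigned long total = 0;
        clock_t t0 = clock();
        for (int p = 0; p < PASSES; p++)
            for (int i = 0; i < WORDS; i++)
                total += count_bits(buf[i]);
        clock_t t1 = clock();

        printf("%lu bits counted in %.2f s\n",
               total, (double)(t1 - t0) / CLOCKS_PER_SEC);
        free(buf);
        return 0;
    }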
The SPEC benchmark seems more complex than the Bytemark and
I don't trust it as much because it seems so tied to the compiler and algorithms. If I were on the Wintel side I'd dedicate substantial resources to make sure that compilers would be optimized to
deliver maximum performance for these obscure problems. I'd
bet that the Weather Service doesn't use the same compiler &
hardware that this benchmark is run on. How many home users
would use this weather simulation? This specific test, sexy
as it sounds, is irrelevant. I bet that given enough money any
of the tested platforms could be optimized to score highest.
A bad compiler & algorithm could kill even the best hardware.
It's like comparing cars: tell me the horsepower, the torque,
the type of suspension, not how fast driver X could drive it along his favorite country road. I think the Bytemark is the more accurate test of a system's raw potential.
|Hi, I would just like you to know that I believe the BYTEmark
tests are more accurate than the SPEC tests. I own a 333 G3 and
compared it to a Hewlett Packard 333 Pentium MMX and there is
simply no comparison. It is like comparing night and day. Everything from opening windows to performing complex tasks in Photoshop is faster. Anyone who thinks that Wintel boxes are faster, or that there is no noticeable difference, should sit down behind a new G3 and do some work. Not just scroll Word documents or open windows, but actually do some "real" work. I
think that they would find that the G3 processor is far better
than anything Intel is producing, whether it's a PII or MMX.
|...from my real-life experience I can tell you that the Pentium
MMX 200 (64 MB RAM) I use everyday on Win98 is really a dead snail
compared to the 604e 200 MHz (128 MB RAM) on OS 8.5 I use as well
every day. In some cases even Virtual PC on this Mac is faster
than the real Pentium on Windows (Director projectors play faster
on VPC). And since this is a real life comparison between 604e
and a Pentium MMX I guess that it won't be that different when
comparing the next generation processors on both platforms: a
750 with a Pentium II. Actually, I do believe that the performance
increase from 604e to 750 (maybe except for the FPU which seems
to be performing almost equally on 604e and 750) is greater than
from Pentium MMX to Pentium II.
|I don't completely trust either test, but I am more skeptical
about the SPECMark bench.
The reason for my skepticism of the SPECMark suite is the fact that it is easy for Intel to write a compiler which is optimized
for this test suite. Intel has done this.
It has been noted by several magazines how Intel's SPECMark performance numbers suddenly jumped over 20%! SPECMark numbers for the Pentium 60 and 66 MHz processors went up over 20% despite
the fact that they were no longer on the market. So how does
a processor jump 20% in performance even though it is not in
production? Simple, Intel started using an "internal"
compiler which was optimized for the SPECMark suite. It is interesting to note that Intel would not allow anyone to use the compiler or test it against the results of another compiler.
|I wanted to reply to some of the other quotes you posted on your
site, because I believe they were misinformed. I'll try this once,
and then probably let it drop.
Mike Johnson writes: "SPEC was designed to keep the processor turning. In other words, these tests are designed not to cause a backlog in the register-starved and low cache Pentiums and Pentium IIs, so that the maximum amount of processor time is available."
SPEC has been around since long before the Pentiums; it was NOT designed
with the x86 family in mind, nor does it especially favor them.
Its roots are from Unix workstations, VAX machines (and an older
sorry benchmark, the MIP), and Cray supercomputers, far from
the WinTel world.
> "This is done by keeping the data flow into the processor
at a low level > and keeping a steady flow of instructions
that use the same data over > and over again."
This might be a general way of saying "cache-friendly"
or that it doesn't stress the memory hierarchy (L1 cache, backside/L2,
etc) too much. Neither benchmark does that particularly well,
but SPEC isn't worse than Bytemark in this regard. For example,
part of SPEC is the gcc compiler, which can stress the memory
hierarchy more than most programs. "A steady flow of instructions
that use the same data over and over again" is true on
parts of both benchmarks, and false on other parts.
I agree with Mr. Johnson's comments about FPU performance,
and also about the memory buses. In general, though, sharp PC
manufacturers such as Dell tend to incorporate higher speed
buses to memory sooner than Apple. If a 66 MHz bus is all Apple is selling, but Dell already has a 100 MHz bus (such as the Dimension XPS-R systems right now), isn't it fair to use currently shipping systems in the comparison?
"Real world performance" has little in common with
a CPU benchmark; in the real world it is both possible and common
for Pentium systems running Windows to far outperform the G3
PowerMac running MacOS--Microsoft Excel, for instance. The point
I'm making isn't that one system is faster, it's that you don't
use CPU benchmarks to compare systems. If you are trying to
compare CPUs, use CPU benchmarks. If you are comparing complete
computer systems running certain applications, compare the speed
of those applications on those systems.
Similarly, "Wade" and "Kilian" use their
real-world system comparisons to validate a CPU benchmark?
This is nonsensical. Using a CPU benchmark to compare Macs vs.
PCs is misusing the benchmark. Use it to compare Pentiums vs.
PPCs only, and don't generalize the results. It's great that
in the applications they use daily the Macs are faster--that
is all the benchmark you need. Such a result is far more telling
than a toy CPU benchmark.
But don't assume that such experience means that the PPC is
generally faster than the Pentium or Pentium II. Perhaps the
applications or file formats or graphics/movie playing software
were better written for one platform. Perhaps they stress something
other than the processor. Perhaps the observers were noticing
things that have nothing to do with CPU speed at all (user interface
response time, graphics card drawing speed, disk speed, virtual
memory performance, and so on). Perhaps they are more skilled
at optimizing one system over the other, or perhaps they are
willing to spend more on one system than the other. Perhaps
they are so emotionally invested in the advantages of one system
that they minimize its shortcomings and key on its superiorities--ignore
the waits, and harp on the quicknesses. None of this has anything
to do with the performance of the CPU. But it is more relevant
than the CPU speed in making a computer system purchase.
Lastly, the section below my quote and jason h both claim that
SPEC is more sensitive to compiler opts than Bytemark. This
is not true. The "20%" Intel jump on SPEC results
was an Intel mistake, which they admitted and settled claims
for. The Intel reference compiler is available, despite jason
h's claims. If he can pony up the serious cash, he can buy a copy himself.
It is not easy to write a compiler just for SPEC; in fact,
there are several provisions of SPEC which make this difficult,
as I mentioned in my first post. (System specification, using
published commandline switches on a commercially available compiler,
using real application cores, using many different cores, etc.)
On the other hand, if a benchmark is often used to compare systems,
it behooves the compiler writer and hardware manufacturer to
do as well as possible on the benchmark.
A more apt criticism is that _I_ can verify Bytemarks by downloading and running them. Whether my verification is worth anything _as a CPU benchmark_, using a cheap compiler and probably cheap hardware and not being skilled in setting up either system, is another
question. It may well be worth more to me in making a computer
_system_ purchase, though, since I'll be using my cheap compiler
(hoping the programs I use don't though) and cheap hardware
and lack of skill. But that doesn't mean the CPU is slower because
I can't take advantage of it, right?
Or another apt criticism, which the person quoted below me
seems to make, is that SPEC is composed of Unix application
cores. These _are_ commonly used Unix apps, or demanding Unix
apps, but have little to do with what the average PC or Mac
users are running on their boxes. But then, Bytemark isn't too
much more applicable.
Question: SPEC maintains the SPEC benchmark, updating it when
it becomes too small to adequately stress CPUs, protecting it
from outrageous abuses, and so on. Who is maintaining Bytemark?
|The only perfect benchmark is the program you plan to run, because ultimately the only metric which really matters is how much time you spend waiting. If you're one of those few people whose principal waster of time is a program of your own devising,
then you can really test different hardware (and OSs and compilers)
head to head. Similarly if you spend all of your time in Photoshop
or Premiere, that should be your benchmark.
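Editor's note: in that spirit, the crudest useful benchmark is a stopwatch around the job you actually wait on. A minimal sketch - the command you pass it is whatever your real work is; the usage shown is only a placeholder:

    /* Minimal "benchmark the program you actually run" harness: wall-clock
     * whatever command is given on the command line. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s \"command to time\"\n", argv[0]);
            return 1;
        }

        time_t start = time(NULL);
        int status = system(argv[1]);     /* run the real workload */
        time_t end = time(NULL);

        printf("'%s' finished (exit status %d) in %ld seconds\n",
               argv[1], status, (long)(end - start));
        return 0;
    }

Run the same batch job, compile, or render on each machine you are considering, and you have the only number that matters to you.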
Failing such direct tests, the best you can do is figure out
which of the many synthetic benchmarks is the most like your
application. If you do scientific programming, then SPEC is
a decent benchmark, because it's composed of little scientific
problems. For other large computing tasks (where you leave the machine alone for hours), SPEC is probably also a decent benchmark. So when I'm buying workstations I pay attention to SPEC.
ByteMarks are designed more like the little computing tasks
which happen between mouse clicks, so this might be a better
benchmark for "conventional" applications. However
in this regard its UI independence is a liability, because for
typical users it's the time to pull down the menu and redraw
the screen that is the limit, not a flurry of LU decompositions
(solving matrix equations). However if you built a benchmark
with UI, you're no longer testing the processor, but the whole
package (OS, disk, processor, graphics, etc.), so such benchmarks,
MacBench for example, are really only good for comparing machines
with the same OS. There is talk of using the cross-platform nature of Linux as a test, but even then you're talking about different levels of patching and optimization across hardware lines. Furthermore, it's unclear how Linux performance corresponds to, say, that of Word.
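Editor's note: for readers wondering what an "LU decomposition" kernel actually looks like, here is a bare-bones sketch (Doolittle form, no pivoting, purely illustrative - real floating-point benchmarks use larger, tuned, pivoted routines):

    /* LU decomposition of a small matrix: factor A into a lower-triangular L
     * and an upper-triangular U with A = L*U, the core step in solving matrix
     * equations. Illustrative only; no pivoting, so it assumes a well-behaved A. */
    #include <stdio.h>

    #define N 4

    void lu_decompose(double a[N][N], double l[N][N], double u[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                l[i][j] = (i == j) ? 1.0 : 0.0;
                u[i][j] = 0.0;
            }

        for (int i = 0; i < N; i++) {
            for (int j = i; j < N; j++) {          /* row i of U */
                double s = 0.0;
                for (int k = 0; k < i; k++) s += l[i][k] * u[k][j];
                u[i][j] = a[i][j] - s;
            }
            for (int j = i + 1; j < N; j++) {      /* column i of L */
                double s = 0.0;
                for (int k = 0; k < i; k++) s += l[j][k] * u[k][i];
                l[j][i] = (a[j][i] - s) / u[i][i];
            }
        }
    }

    int main(void)
    {
        double a[N][N] = {{4,3,2,1},{3,4,3,2},{2,3,4,3},{1,2,3,4}};
        double l[N][N], u[N][N];
        lu_decompose(a, l, u);
        printf("u[0][0] = %g, l[1][0] = %g\n", u[0][0], l[1][0]);  /* 4 and 0.75 */
        return 0;
    }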
In short, it's easy to say which machine runs program X faster,
whether program X is a real world application or some benchmark
made up to "simulate" a class of real world applications.
But since the answers are different for each X, the notion of
a faster machine depends heavily on what you do with it. There
is no "one size fits all" way to rank machines.
|Benchmarks are very difficult things to compare - it's like comparing
houses or cars. There are so many different things to measure,
and a great deal of honest difference of opinion on what's important.
There's also a long history of fakery, mismeasurement, and puffery.
The SPEC suite of benchmarks is a well-tested, public (more
or less) set that's been run on many, many processors, going
back into the 1980s. It was one of the first benchmark sets
to demand that everyone report all the sub-measurements,
and document exactly what was tested. It's a benchmark of processor
+ compiler mostly, not testing, for example, graphics performance
or disk performance. The best use of SPEC is probably when you're
intending to select a processor for a board you're going to
design, or to get a rough idea of where a processor compares
to many others. SPECmarks can improve by using a better compiler
or memory system. This means that there really isn't one SPECmark
number for a given processor - you really must include the particular
motherboard and compiler used as part of the system that's being
tested. A historical problem with SPEC on Intel has been that
Intel measures CPUs on specially built motherboards, with extremely
expensive memory systems, and a compiler that is used for nothing
but benchmarks. I don't know what they do now. Other vendors
tend to use something more like what a user might buy.
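Editor's note: one way to see why the compiler must be reported alongside the processor is to build the same trivial source twice with different optimization settings (with gcc, for instance, -O0 versus -O2) and compare times on the same machine; the gap is usually large. A sketch:

    /* The identical source, built with different compiler switches, produces
     * different "benchmark" numbers on the same processor - which is why SPEC
     * requires the compiler and switches to be published with the results. */
    #include <stdio.h>
    #include <time.h>

    #define N 100000
    #define PASSES 2000

    int main(void)
    {
        static double a[N], b[N];
        volatile double total = 0.0;      /* volatile: the work cannot be thrown away */

        for (int i = 0; i < N; i++) {
            a[i] = (double)i * 0.5;
            b[i] = (double)(N - i) * 0.25;
        }

        clock_t t0 = clock();
        for (int p = 0; p < PASSES; p++) {
            double s = 0.0;
            for (int i = 0; i < N; i++)
                s += a[i] * b[i];          /* simple dot-product kernel */
            total += s;
        }
        clock_t t1 = clock();

        printf("result %g in %.2f s\n",
               (double)total, (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }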
ByteMarks are a different set that also attempts to measure
CPU performance, typically run on real systems that you can
purchase. It hasn't been as widely tested. The lack of testing
means two things - it may have some flaws that SPEC doesn't,
and the results haven't been cooked by manufacturers as much.
As an analogy, imagine benchmarks on who builds the fastest
car - Fevy, Chord, or Volksiat. Fevy says they have the fastest,
because their engineers have shown on a dynamometer that their
engine is the most powerful. Chord disagrees, saying that their
cars all can go from 0-60 in 4 seconds. Volksiat says both are
wrong, that the better aerodynamics of their cars means that,
even with less horsepower, it can reach a top speed higher than
either Fevy or Chord. Who's right? It depends on what your particular
need is, just as it does with computers. You can tell that all
the cars are fast, probably much faster than every economy car.
On the particular question of whether Pentiums or PowerPCs
are faster, both sets of benchmarks have some value. Both seem
to show that PowerPCs are somewhat faster at the same clock
speed, and that both Pentiums and PowerPCs can run mighty fast.
That's probably all you can confidently say.