PowerPC Twice As Fast As Pentium? BYTEmark vs SPEC - Readers Bite Back!


Our question of whether you trust BYTEmark benchmarks more than SPEC benchmarks generated a lot of mail, and one thing is clear - benchmarks are part faith and alchemy, not all science. There are apparently many variables that can affect how a benchmark program will perform, and most programs can only give you a rough approximation of comparative performance. Application benchmarks are in general a better indicator than synthetic benchmarks such as BYTEmark and SPEC. But even application benchmarks can give a somewhat misleading picture of comparative performance depending on what hardware or OS is being stressed. Benchmarks are a crude roadmap, but even a crude roadmap is better than none when you are trying to make a purchasing decision. Consult as many sources of performance information as possible when considering a purchase (we have listed all the sites we know of with benchmark and performance information on our links page), and remember that the only benchmarks that really count are the ones you perform when you run your hardware through the tasks you need to get done. If you are not satisfied, most products come with a 30-day money-back guarantee!

Below we post some of the responses we got to our BYTEmark vs SPEC question:

Thanks for running some numbers. As you note at the beginning, SPEC was designed to keep the processor turning. In other words, these tests are designed not to cause a backlog in the register-starved and low-cache Pentiums and Pentium IIs, so that the maximum amount of processor time is available. They go up to, but do not cross, the line where the x86 architecture has trouble. This is done by keeping the data flow into the processor at a low level and keeping a steady flow of instructions that use the same data over and over again.

On the other hand, ByteMark tests are designed to cover a continuum of cases, from the above situation to where even 32-register systems are stressed. These tests focus not on keeping the processor going, but on how much traffic can get through to it. Traffic jams in the feeding system cause downtime for the processor.
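
The contrast Mr. Johnson draws can be sketched in code. This is a hypothetical illustration of the two access patterns, not anything from either benchmark suite; note too that in an interpreted language like Python the interpreter overhead swamps the cache and bus effects being described, so only the shape of the data flow is meaningful here.

```python
# Two loops over the same data that differ only in their memory traffic.
# A SPEC-style kernel reuses a small working set that stays hot in cache;
# a streaming kernel touches each element exactly once, so its speed is
# bounded by how fast the memory system can feed the processor.

SMALL = 1024        # a working set that fits easily in L1 cache
LARGE = 1 << 20     # a data set far larger than a typical L2 cache

data = [1] * LARGE

def reuse_sum(a, iters):
    """SPEC-like pattern: the same few cache lines, over and over."""
    total = 0
    for i in range(iters):
        total += a[i % SMALL]   # stays inside the first 1024 elements
    return total

def stream_sum(a):
    """ByteMark's far end: every element fetched from memory once."""
    total = 0
    for x in a:
        total += x
    return total

# Both compute the same answer; only the traffic pattern differs.
print(reuse_sum(data, LARGE))   # 1048576
print(stream_sum(data))         # 1048576
```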

Real-world tests show similarities to both types of tests, sometimes favoring SPEC and sometimes favoring the ByteMark, but mostly falling in between.

Another thing to remember is that the weak point of the G3 is the FPU. The G3 is a major improvement over the 603 in the integer suite, but not in the FPU. It also has improved caching schemes to support the chip. It is in integer tests, under BYTEmark conditions, that G3s perform at about twice the level of the Pentium II. Floating point tests (like your Mathematica 3.0 tests) generally produce speeds of between 1 and 1.5 times that of a Pentium II at the same MHz. Also, loading and storing of registers is an "integer" function on Pentiums, but G3s have a separate execution unit for loading and storing. Loading and storing not only occurs more frequently for the Pentiums, they also take a bigger hit for these functions than G3s do.

In answer to your question of which is a better test, it is better to do both and understand that these are the extremes. Each real world application will almost always fall in between, depending on how much data needs to go to the processor.

The fact remains that under ideal situations, where data flows easily to the processor, the G3 still gets about 50% better integer performance and about the same FP performance. When the data flow increases, the Pentium II's feeding system gets overwhelmed well before the G3's does. For integer tests, the G3 is about 50% faster in ideal flow situations, and often more than 100% faster at higher data feeding rates.

A couple more things about your tests you might think about.

Your 300 MHz systems both have 66 MHz front-side buses (a ratio of about 4.5:1), thus memory-to-processor/cache speeds are the same.

I don't think this is true of your 400 MHz systems. The Pentium II/400s use a dual 50 MHz bus (similar to the 2x50 MHz bus on the Mach 5 604e, and in the big picture almost the same throughput as a 100 MHz bus). These have an effective 4:1 ratio. The 400 MHz G3 probably still has a 66 MHz bus for a 6:1 ratio, although it may have a 72 MHz bus for a roughly 5:1 ratio. Either way, flow from memory to the cache/processor is faster on the Pentium II than the G3. Thus you can expect the G3 to take a small hit for this mismatch.
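
The ratios quoted above are simply core clock divided by bus clock; a quick check of the arithmetic (the clock figures are Mr. Johnson's, not independently verified here):

```python
def core_to_bus_ratio(core_mhz, bus_mhz):
    """Processor-to-front-side-bus clock ratio."""
    return core_mhz / bus_mhz

# 300 MHz parts on a 66 MHz bus: both about 4.5:1
print(round(core_to_bus_ratio(300, 66), 1))   # 4.5
# Pentium II/400 on an effective 100 MHz bus: 4:1
print(round(core_to_bus_ratio(400, 100), 1))  # 4.0
# 400 MHz G3 on a 66 MHz bus: about 6:1
print(round(core_to_bus_ratio(400, 66), 1))   # 6.1
```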

I understand that the next generation systems (Katmai Pentium IIs and AltiVec G4s) will have true 100 MHz buses early next year. But while the Pentium II adds Katmai's New Instructions (for 3-D FP calculations), the G4 will also have a real FPU (a la the 604e), still better feeding mechanisms (up to 2 MB of L2 cache), AltiVec (superior to KNI), and improved multiprocessor capability. If the G3s do better than the Pentium IIs now, just wait until February and watch the gap widen, this time particularly on the floating point side.

Mike Johnson

Neither benchmark is particularly effective.

Especially for the last generation or two of processors, and even more so in the future, processor speed is very tightly linked to memory architecture. Memory bandwidth starvation easily dominates all other factors when it is present. So for real world performance results, a benchmark must move reasonably large amounts of data to and from memory, even when focusing on just the CPU, because memory bandwidth to CPU is the critical performance factor right now.

How big is the memory footprint of Bytemark? Is it greater than the 1 MB of many caches today? More than that, the Bytemark is extraordinarily sensitive to compiler fiddlings. It is much more a toy benchmark than SPEC, which is the result of many years of industry benchmarking experience. Certain SPEC results require standard compiler optimizations, and all SPEC results require published compiler settings. This avoids the abuse of compiler vendors recognizing the benchmark and generating special code just for it (which would give it a benefit you couldn't realize in real programs, unless your real programs exactly correspond to the benchmark). There is a whole benchmarking FAQ which you should read to get an idea of how people lie using benchmarks (including Apple).
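
The footprint question above can be reduced to simple arithmetic. The sizes below are hypothetical illustrations, not Bytemark's actual working sets:

```python
def fits_in_cache(footprint_bytes, cache_bytes=1 * 1024 * 1024):
    """Does a benchmark's working set fit inside the cache?  If it does,
    the test measures the processor and cache, not the memory system."""
    return footprint_bytes <= cache_bytes

# e.g. one pass over 200,000 4-byte integers:
footprint = 200_000 * 4                  # 800,000 bytes
print(fits_in_cache(footprint))          # True: never leaves a 1 MB cache
print(fits_in_cache(32 * 1024 * 1024))   # False: a 32 MB set must stream from memory
```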

But is SPEC any better? Yes, in some ways. It has a larger body of code, made up of pieces of commonly used programs _under Unix_. The SPEC body has a lot of rules to safeguard the credibility of SPEC results.

No, in other ways. SPEC is expensive, so it isn't available to the general hobbyist. And it only runs under Unix. Whose SPEC results are you quoting? How many of the G3 SPECs were produced on PowerMacs? None were produced under MacOS; how do you know if the speed of the G3 under some flavor of Unix (AIX, LinuxPPC, MkLinux) corresponds in any way to its speed under MacOS? And if the SPEC was produced on an IBM workstation, how can you use results generated with a completely different memory hierarchy and hardware? If the SPEC is merely estimated, what the heck does that mean?

Now, since we persist in wanting to compare processors, and Macs against PCs, we need some method of comparison. Do what most reputable people do: compare benchmarks, both Bytemarks and SPEC and whatever else comes along. Note their deficiencies, and balance those results with the following application comparisons. Then compare performance in popular applications; try hard to find parts of the applications that are performance bound. For example, scrolling is often a poor test, because the scroll speed is often limited not by the processor but by the application code (to avoid scrolling too fast).

What are you asking for in a CPU benchmark? Perhaps this: given equivalent supporting hardware (memory, logic set, etc), which CPU performs best. This probably varies for different performance parameters of the supporting hardware. So, then, maybe this: which CPU has the most potential speed, given ideal supporting hardware? But what if a CPU requires much more complex support logic to provide the same speed? What if that level of support logic is not available, except on reference machines, and most machines use far inferior support logic?

What if comparing CPU benchmarks is like comparing NBA players based upon their free-throw ability, or comparing swimmers based upon their lung capacity, or comparing runners based on the power of their hamstring muscles? In other words, it produces results almost totally unrelated to the original question, and ignores many dominant factors.

Food for thought: no current benchmark measures interface response time, which is what we perceive as speed. We throw raw performance at the problem and assume that will solve our problem, while application and OS developers are making the problem worse with each update.

-- MattLangford

There are 2 ways to evaluate a system: Low level & High level.

Low level analysis looks at moving bytes, modifying memory, and very basic activities like sorting or complex math (integer, floating point, array...).

High level analysis looks at things like how fast the system boots, how fast a specific program like Mathematica or Word runs, or how fast the Browser moves.

I think that the Bytemark is the better low level benchmark. It seems to test more generic system activities like sorting & bit twiddling.
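
Kernels of that low-level flavor look roughly like the following sketch. These are generic illustrations of "bit twiddling" and sorting, not BYTEmark's actual code:

```python
def popcount(x):
    """Bit-twiddling kernel: count set bits by clearing the lowest one."""
    n = 0
    while x:
        x &= x - 1   # clears the lowest set bit each pass
        n += 1
    return n

def insertion_sort(a):
    """Sorting kernel: a simple insertion sort (returns a new list)."""
    a = list(a)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:   # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

print(popcount(0b1011001))           # 4
print(insertion_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]
```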

The SPEC benchmark seems more complex than the Bytemark, and I don't trust it as much because it seems so tied to the compiler and algorithms. If I were on the Wintel side I'd dedicate substantial resources to making sure that compilers were optimized to deliver maximum performance on these obscure problems. I'd bet that the Weather Service doesn't use the same compiler & hardware that this benchmark is run on. How many home users would run this weather simulation? This specific test, sexy as it sounds, is irrelevant. I bet that given enough money any of the tested platforms could be optimized to score highest. A bad compiler & algorithm could kill even the best hardware.

It's like comparing cars: tell me the horsepower, the torque, the type of suspension, not how fast driver X could drive it through his favorite country road. I think the Bytemark is the more accurate test of a system's raw potential.

Hi, I would just like you to know that I believe the BYTEmark tests are more accurate than the SPEC tests. I own a 333 G3 and compared it to a Hewlett Packard 333 Pentium MMX and there is simply no comparison. It is literally like comparing night and day. Everything from opening windows to performing complex tasks in Photoshop is faster. Anyone who thinks that Wintel boxes are faster, or that there is no noticeable difference, should sit down behind a new G3 and do some work. Not just scroll Word documents or open windows, but actually do some "real" work. I think that they would find that the G3 processor is far better than anything Intel is producing, whether it's a PII or MMX.


.....from my real-life experience I can tell you that the Pentium MMX 200 (64 MB RAM) I use every day on Win98 is really a dead snail compared to the 604e 200 MHz (128 MB RAM) on OS 8.5, which I also use every day. In some cases even Virtual PC on this Mac is faster than the real Pentium on Windows (Director projectors play faster on VPC). And since this is a real-life comparison between a 604e and a Pentium MMX, I guess it won't be that different when comparing the next generation of processors on both platforms: a 750 with a Pentium II. Actually, I believe that the performance increase from the 604e to the 750 (except maybe for the FPU, which seems to perform almost equally on the 604e and 750) is greater than from the Pentium MMX to the Pentium II.


I don't completely trust either test, but I am more skeptical about the SPECMark bench.

The reason for my skepticism of the SPECMark suite is that it is easy for Intel to write a compiler which is optimized for this test suite. Intel has done this.

It has been noted by several magazines how Intel's SPECMark performance numbers suddenly jumped over 20%! SPECMark numbers for the Pentium 60 and 66 MHz processors went up over 20% despite the fact that they were no longer on the market. So how does a processor jump 20% in performance even though it is not in production? Simple: Intel started using an "internal" compiler which was optimized for the SPECMark suite. It is interesting to note that Intel would not allow ANYONE to use the compiler or test it against the results of another compiler.

jason h

I wanted to reply to some of the other quotes you posted on your site, because I believe they were misinformed. I'll try this once, and then probably let it drop.

Mike Johnson writes, "SPEC was designed to keep the processor turning. In other words, these tests are designed not to cause a backlog in the register-starved and low cache Pentiums and Pentium IIs, so that the maximum amount of processor time is available."

SPEC has been around since long before the Pentiums; it was NOT designed with the x86 family in mind, nor does it especially favor them. Its roots are in Unix workstations, VAX machines (and an older, sorry benchmark, the MIPS), and Cray supercomputers, far from the WinTel world.

"This is done by keeping the data flow into the processor at a low level and keeping a steady flow of instructions that use the same data over and over again."

This might be a general way of saying "cache-friendly" or that it doesn't stress the memory hierarchy (L1 cache, backside/L2, etc) too much. Neither benchmark does that particularly well, but SPEC isn't worse than Bytemark in this regard. For example, part of SPEC is the gcc compiler, which can stress the memory hierarchy more than most programs. "A steady flow of instructions that use the same data over and over again" is true on parts of both benchmarks, and false on other parts.

I agree with Mr. Johnson's comments about FPU performance, and also about the memory buses. In general, though, sharp PC manufacturers such as Dell tend to incorporate higher-speed buses to memory sooner than Apple. If a 66 MHz bus is all Apple is selling, but Dell already has a 100 MHz bus (such as on the Dimension XPS-R systems right now), isn't it fair to use currently shipping boxes?

"Real world performance" has little in common with a CPU benchmark; in the real world it is both possible and common for Pentium systems running Windows to far outperform the G3 PowerMac running MacOS--Microsoft Excel, for instance. The point I'm making isn't that one system is faster, it's that you don't use CPU benchmarks to compare systems. If you are trying to compare CPUs, use CPU benchmarks. If you are comparing complete computer systems running certain applications, compare the speed of those applications on those systems.

Similarly, "Wade" and "Kilian" use their real-world system comparisons to validate a CPU benchmark? This is nonsensical. Using a CPU benchmark to compare Macs vs. PCs is misusing the benchmark. Use it to compare Pentiums vs. PPCs only, and don't generalize the results. It's great that in the applications they use daily the Macs are faster--that is all the benchmark you need. Such a result is far more telling than a toy CPU benchmark.

But don't assume that such experience means that the PPC is generally faster than the Pentium or Pentium II. Perhaps the applications or file formats or graphics/movie playing software were better written for one platform. Perhaps they stress something other than the processor. Perhaps the observers were noticing things that have nothing to do with CPU speed at all (user interface response time, graphics card drawing speed, disk speed, virtual memory performance, and so on). Perhaps they are more skilled at optimizing one system over the other, or perhaps they are willing to spend more on one system than the other. Perhaps they are so emotionally invested in the advantages of one system that they minimize its shortcomings and key on its superiorities--ignore the waits, and harp on the quicknesses. None of this has anything to do with the performance of the CPU. But it is more relevant than the CPU speed in making a computer system purchase.

Lastly, the section below my quote and jason h both claim that SPEC is more sensitive to compiler optimizations than Bytemark. This is not true. The "20%" Intel jump on SPEC results was an Intel mistake, which they admitted and settled claims over. The Intel reference compiler is available, despite jason h's claims. If he can pony up the serious cash, he can buy a copy.

It is not easy to write a compiler just for SPEC; in fact, there are several provisions of SPEC which make this difficult, as I mentioned in my first post. (System specification, using published command-line switches on a commercially available compiler, using real application cores, using many different cores, etc.) On the other hand, if a benchmark is often used to compare systems, it behooves the compiler writer and hardware manufacturer to do as well as possible on the benchmark.

A more apt criticism is that _I_ can verify Bytemarks by downloading and running it. Whether my verification is worth anything _to a CPU benchmark_ using a cheap compiler and probably cheap hardware and not being skilled in setting up either system is another question. It may well be worth more to me in making a computer _system_ purchase, though, since I'll be using my cheap compiler (hoping the programs I use don't though) and cheap hardware and lack of skill. But that doesn't mean the CPU is slower because I can't take advantage of it, right?

Or another apt criticism, which the person quoted below me seems to make, is that SPEC is composed of Unix application cores. These _are_ commonly used Unix apps, or demanding Unix apps, but have little to do with what the average PC or Mac users are running on their boxes. But then, Bytemark isn't too much more applicable.

Question: SPEC maintains the SPEC benchmark, updating it when it becomes too small to adequately stress CPUs, protecting it from outrageous abuses, and so on. Who is maintaining Bytemark?



The only perfect benchmark is the program you plan to run, because ultimately the only metric which really matters is how much time you spend waiting. If you're one of those few people whose principal waster of time is a program of your own devising, then you can really test different hardware (and OSs and compilers) head to head. Similarly, if you spend all of your time in Photoshop or Premiere, that should be your benchmark.
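
A head-to-head test of the program you actually wait on can be as simple as the sketch below, using Python's standard timeit module. The my_workload function is a placeholder for whatever task dominates your day:

```python
import timeit

def my_workload():
    """Stand-in for the task you actually wait on; replace with your own."""
    return sorted(range(1000, 0, -1))

# Run the workload repeatedly on each candidate machine and compare the
# best (least-disturbed) time; the machine worth buying is the one where
# THIS number, for YOUR workload, is smallest.
best = min(timeit.repeat(my_workload, number=100, repeat=5))
print(f"best of 5 runs: {best:.4f} s for 100 iterations")
```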

Failing such direct tests, the best you can do is figure out which of the many synthetic benchmarks is the most like your application. If you do scientific programming, then SPEC is a decent benchmark, because it's composed of little scientific problems. For other large computing tasks (where you leave the machine alone for hours), SPEC is probably also a decent benchmark. So when I'm buying workstations I pay attention to SPEC or even LINPACK.

ByteMarks are designed more like the little computing tasks which happen between mouse clicks, so this might be a better benchmark for "conventional" applications. However, in this regard its UI independence is a liability, because for typical users it's the time to pull down the menu and redraw the screen that is the limit, not a flurry of LU decompositions (solving matrix equations). However, if you built a benchmark with a UI, you're no longer testing the processor, but the whole package (OS, disk, processor, graphics, etc.), so such benchmarks, MacBench for example, are really only good for comparing machines with the same OS. There is talk of using the cross-platformness of Linux as a test, but even then you're talking about different levels of patching and optimization across hardware lines. Furthermore, it's unclear how Linux performance matches, say, that of Word.

In short, it's easy to say which machine runs program X faster, whether program X is a real world application or some benchmark made up to "simulate" a class of real world applications. But since the answers are different for each X, the notion of a faster machine depends heavily on what you do with it. There is no "one size fits all" way to rank machines.


Benchmarks are very difficult things to compare - it's like comparing houses or cars. There are so many different things to measure, and a great deal of honest difference of opinion about what's important. There's also a long history of fakery, mismeasurement, and puffery.

The SPEC suite of benchmarks is a well-tested, public (more or less) set that's been run on many, many processors, going back into the 1980s. It was one of the first benchmark sets to demand that everyone must report all the sub-measurements, and document exactly what was tested. It's a benchmark of processor + compiler mostly, not testing, for example, graphics performance or disk performance. The best use of SPEC is probably when you're intending to select a processor for a board you're going to design, or to get a rough idea of where a processor compares to many others. SPECmarks can improve by using a better compiler or memory system. This means that there really isn't one SPECmark number for a given processor - you really must include the particular motherboard and compiler used as part of the system that's being tested. A historical problem with SPEC on Intel has been that Intel measures CPUs on specially built motherboards, with extremely expensive memory systems, and a compiler that is used for nothing but benchmarks. I don't know what they do now. Other vendors tend to use something more like what a user might buy.
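
For reference, SPEC's composite number is the geometric mean of the per-test speed ratios against a reference machine, which is how those mandatory sub-measurements get folded into one figure. The eight ratios below are made up for illustration:

```python
def spec_composite(ratios):
    """Geometric mean of per-benchmark speed ratios vs. the reference
    machine -- the way SPEC combines sub-measurements into one number."""
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

# Hypothetical ratios for an eight-test suite (reference = 1.0 on each).
# The geometric mean keeps one outlier test from dominating the score.
ratios = [10.2, 12.5, 8.9, 11.0, 9.7, 13.1, 10.8, 9.4]
print(round(spec_composite(ratios), 1))
```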

ByteMarks are a different set that also attempts to measure CPU performance, typically run on real systems that you can purchase. It hasn't been as widely tested. The lack of testing means two things - it may have some flaws that SPEC doesn't, and the results haven't been cooked by manufacturers as much.

As an analogy, imagine benchmarks on who builds the fastest car - Fevy, Chord, or Volksiat. Fevy says they have the fastest, because their engineers have shown on a dynamometer that their engine is the most powerful. Chord disagrees, saying that their cars all can go from 0-60 in 4 seconds. Volksiat says both are wrong, that the better aerodynamics of their cars means that, even with less horsepower, it can reach a top speed higher than either Fevy or Chord. Who's right? It depends on what your particular need is, just as it does with computers. You can tell that all the cars are fast, probably much faster than every economy car.

On the particular question of whether Pentiums or PowerPCs are faster, both sets of benchmarks have some value. Both seem to show that PowerPCs are somewhat faster at the same clock speed, and that both Pentiums and PowerPCs can run mighty fast. That's probably all you can confidently say.

Bill Paulson


Internal Links

  • PowerPC Chip performance compared
  • Check out the Links Page for links to sites with more Processor information
  • More Windows vs Mac Comparisons

External Links

  • PC Magazine takes on the iMac's performance and BYTEmark claims
  • SPEC vs BYTEmark - which is the better, more accurate benchmark program
  • PentiumII vs the G3 / G4 - MaKiDo compares these two chips
  • Motorola - detailed information about the various PowerPC chips
  • BYTE Magazine review of G3 and Pentium II
  • BYTE Magazine review of the G3/266
  • BYTEmark faq
  • PowerPC - source data for the benchmarks above
  • Intel - source data for the benchmarks above
  • SPEC 95 FAQ - all your questions about SPEC 95 answered
  • BYTEmark documentation