An MMX coding example - by Ryan Geiss
---------------------
[COMMENT - 5/26/2001]
Please note that this article is a bit outdated by now, but was kept
because it still serves as a decent primer for using MMX in graphics
programming. It could be said that this code is somewhat unnecessary
now, given DirectDraw's Blt and BltFast functions, or GDI's BitBlt, which
automatically convert between pixel formats for you as optimally as
possible. *However*, some (actually, many) drivers don't properly
implement these color conversions, so if you want a 100% guarantee
it will work, you have to do it manually (Geiss and Drempels do).
Another advantage of doing it yourself is that you can sneak in
some cheap per-pixel post-processing effects!
I also have to say - I renounce my renunciation of the usefulness of MMX
(which appears at the end of this article) - I love MMX to pieces. Without
it, Geiss would be 40% slower and Drempels would be 50% slower! It also
works extremely well as a fast memcpy(), in audio processing, and in just
about any kind of image processing.
[ORIGINAL ARTICLE]
This little intro to MMX is explained via a program called Geiss. You
can get it on the main page.
In Geiss, three separate greyscale "screens" are calculated. Each one
represents a color channel: red, green, and blue. The problem is
displaying this on the screen. This little FAQ goes through the non-MMX
and the MMX ways of doing so. Please remember that the non-MMX discussion
revolves around the code involved to plot ONE pixel; the MMX discussion is
very similar, only it does 4 pixels simultaneously.
The idea is this:
B B B B B B B B R R R R R R R R G G G G G G G G
You have 3 buffers of red, green, and blue values, each value being 8 bits
(1 byte). You want to merge these to put them on the screen. Well,
the 320x240 video mode rarely supports 32-bit color, but rather, only
16-bit. The format for a 16-bit color value (in the hardware) is:
B B B B B G G G G G G R R R R R
It's 5, 6, and 5 bits of each color component. As you can imagine, getting
the 3 bytes into this 2-byte-chawed quantity is a real bitch. Here's the
code:
WORDVALUE = ((blue >> 3) << 11) | ((green >> 2) << 5) | (red >> 3);
This is like dicing each of the three into the following quantities:
B B B B B 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 G G G G G G 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 R R R R R
Then forming the final quantity by OR'ing all three, yielding
B B B B B G G G G G G R R R R R
just as desired. Notice that we've taken the MOST significant
bits of the 8 for each component to use.
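For reference, here is that one-pixel expression as a small C function. This is only a sketch: the name pack_bgr565 and the self-check are made up for illustration and are not from the Geiss source.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one pixel into the 16-bit layout shown above:
   blue in bits 15..11, green in bits 10..5, red in bits 4..0.
   pack_bgr565 is a hypothetical name, not taken from Geiss. */
static uint16_t pack_bgr565(uint8_t red, uint8_t green, uint8_t blue)
{
    uint16_t b = (uint16_t)((blue  >> 3) << 11);  /* top 5 bits of blue  */
    uint16_t g = (uint16_t)((green >> 2) << 5);   /* top 6 bits of green */
    uint16_t r = (uint16_t)(red >> 3);            /* top 5 bits of red   */
    return (uint16_t)(b | g | r);
}

/* A few spot checks of the bit layout. */
static void pack_bgr565_selfcheck(void)
{
    assert(pack_bgr565(255, 255, 255) == 0xFFFF);  /* white: all bits set */
    assert(pack_bgr565(0, 0, 255) == 0xF800);      /* pure blue  */
    assert(pack_bgr565(0, 255, 0) == 0x07E0);      /* pure green */
    assert(pack_bgr565(255, 0, 0) == 0x001F);      /* pure red   */
}
```

Anything below the cut-off is thrown away, so red or blue values 0..7 and green values 0..3 all pack to zero.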
Well, that's 5 bitshifts and 2 or's per pixel, plus a lot of crappy
other stuff like loop counters and gamma correction (it makes Geiss look
ten times better... gives you a "bright white" color zone). Basically,
that's SLOW.
With MMX, you can do better. You can do 4 pixels
at a time, with LESS overhead (due to a convenient way MMX gives
you to do the gamma-correction), and you end up with huge speed
gains.
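Here's how it works. First, load four blue pixels at once, four
green, and four red. An MMX register is 64 bits, or 8 bytes. Our
goal is to turn one MMX register into 4 16-bit color values that can
all be plotted together. We first have to use the PUNPCKHBW (Unpack
High Bytes to Words) instruction, which loads 4 bytes into every
other byte of the MMX register (the high byte of each 16-bit word).
Note that with a memory operand it reads 8 bytes and uses the upper 4,
so the pixels picked up are the ones at offsets 4..7 from the pointer: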
PUNPCKHBW mm0, [eax] (Unpack High-Packed Data/Words)
PUNPCKHBW mm1, [ebx]
PUNPCKHBW mm2, [edx]
MM0: blueblue !$%^@#$! blueblue !$%^@#$! blueblue !$%^@#$! blueblue !$%^@#$!
MM1: greengre !$%^@#$! greengre !$%^@#$! greengre !$%^@#$! greengre !$%^@#$!
MM2: redredre !$%^@#$! redredre !$%^@#$! redredre !$%^@#$! redredre !$%^@#$!
You can probably see what is coming... the same thing as before...
shift all these registers around to align them, OR them, and you
have it. The !$%^@#$! is garbage that was in the register beforehand;
when we shift right to lop off the least significant 2 or 3 bits, we'll
be killing these too and adding nice zeros to the left side.
We're only using 3 of the 8 MMX registers here. Each one holds four 2-byte
quantities; the left byte is the high byte, the right byte is the low one.
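If the unpack is hard to picture, here is a rough C model of it, treating an MMX register as a uint64_t whose byte 0 is the least significant. It only illustrates what the instruction does to the bytes; it is not how Geiss runs it, and the function name simply mirrors the mnemonic.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PUNPCKHBW dst, src on 64-bit values (byte 0 = least significant).
   Only the upper four bytes of each operand are used. Each result word gets
   the old register byte in its low half and the source byte in its high half. */
static uint64_t punpckhbw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t lo = (dst >> (8 * (4 + i))) & 0xFF;  /* leftover "garbage" byte */
        uint64_t hi = (src >> (8 * (4 + i))) & 0xFF;  /* the pixel byte          */
        result |= (lo | (hi << 8)) << (16 * i);
    }
    return result;
}

static void punpckhbw_selfcheck(void)
{
    /* Old register contents: only bytes 0x11..0x44 survive, as garbage. */
    uint64_t old_reg = 0x1122334455667788ULL;
    /* Eight source bytes: only the upper four (0xA1..0xA4) are used. */
    uint64_t pixels  = 0xA1A2A3A4B1B2B3B4ULL;
    assert(punpckhbw(old_reg, pixels) == 0xA111A222A333A444ULL);
}
```

In each result word the pixel byte (0xA1..0xA4) sits on top and the leftover register byte sits underneath, which is the layout the MM0/MM1/MM2 diagrams show.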
Now we do some shifts. MMX allows us to bitshift an entire MMX register
as if it contained 8 bytes, 4 words, 2 dwords, or 1 quadword. We choose
4 words, obviously. Here's the code:
PSRLW mm0, 11 (Packed Shift Right/Logical/Words)
PSRLW mm1, 10
PSRLW mm2, 11
MM0: 00000000 000blueb 00000000 000blueb 00000000 000blueb 00000000 000blueb
MM1: 00000000 00greeng 00000000 00greeng 00000000 00greeng 00000000 00greeng
MM2: 00000000 000redre 00000000 000redre 00000000 000redre 00000000 000redre
Now we've lopped off the least significant bits of the values (this
is equivalent to the initial >> operations in the non-MMX equation) in
just three clock cycles. We now do the leftshift (<<) operations to
move the bits into place for OR'ing:
PSLLW mm0, 11 (Packed Shift Left/Logical/Words)
PSLLW mm1, 5
MM0: blueb000 00000000 blueb000 00000000 blueb000 00000000 blueb000 00000000
MM1: 00000gre eng00000 00000gre eng00000 00000gre eng00000 00000gre eng00000
MM2: 00000000 000redre 00000000 000redre 00000000 000redre 00000000 000redre
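To see why the garbage disappears, here is a hedged C model of the two word shifts, again treating the register as a uint64_t holding four independent 16-bit words. The helper names mirror the mnemonics but are just illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Shift each of the four 16-bit words right by count, filling with zeros. */
static uint64_t psrlw(uint64_t reg, unsigned count)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t w = (uint32_t)((reg >> (16 * i)) & 0xFFFF);
        w = (count < 16) ? (w >> count) : 0;   /* counts above 15 clear the word */
        result |= (uint64_t)w << (16 * i);
    }
    return result;
}

/* Shift each of the four 16-bit words left by count; bits pushed past
   bit 15 are lost and never spill into the neighbouring word. */
static uint64_t psllw(uint64_t reg, unsigned count)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t w = (uint32_t)((reg >> (16 * i)) & 0xFFFF);
        w = (count < 16) ? ((w << count) & 0xFFFF) : 0;
        result |= (uint64_t)w << (16 * i);
    }
    return result;
}

static void word_shift_selfcheck(void)
{
    /* One word: blue byte 0xFF on top, garbage 0x5A underneath. */
    uint64_t mm0 = 0xFF5AULL;
    mm0 = psrlw(mm0, 11);          /* garbage and low 3 blue bits fall off */
    assert(mm0 == 0x001FULL);
    mm0 = psllw(mm0, 11);          /* 5 blue bits back at the top */
    assert(mm0 == 0xF800ULL);
}
```

The right shift by 8+3 is what kills the garbage byte: everything in the low half of the word falls off the end before the left shift puts the surviving bits back in place.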
Now, just like before, you OR the three registers together.
POR mm0, mm1 (Packed Logical OR)
POR mm0, mm2
MM0: bluebgre engredre bluebgre engredre bluebgre engredre bluebgre engredre
See what we have? Four 16-bit color values. Re-written, they look like
MM0: BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR
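Putting the shift and OR steps together, a sketch in plain C might look like the following. It takes the three registers as they look right after the unpack (pixel byte on top of each word, garbage below) and returns the four packed pixels. merge_four is a hypothetical name; each loop pass does to one word what the MMX instructions do to all four at once.

```c
#include <assert.h>
#include <stdint.h>

/* Combine four blue, green and red words into four 16-bit pixels. */
static uint64_t merge_four(uint64_t mm0_blue, uint64_t mm1_green, uint64_t mm2_red)
{
    uint64_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t b = (uint32_t)((mm0_blue  >> (16 * i)) & 0xFFFF);
        uint32_t g = (uint32_t)((mm1_green >> (16 * i)) & 0xFFFF);
        uint32_t r = (uint32_t)((mm2_red   >> (16 * i)) & 0xFFFF);
        b = (b >> 11) << 11;   /* PSRLW 11, then PSLLW 11 */
        g = (g >> 10) << 5;    /* PSRLW 10, then PSLLW 5  */
        r =  r >> 11;          /* PSRLW 11                */
        out |= (uint64_t)(b | g | r) << (16 * i);   /* the two PORs */
    }
    return out;
}

static void merge_four_selfcheck(void)
{
    /* Four white pixels with different garbage in the low bytes. */
    uint64_t white = 0xFF12FF34FF56FF78ULL;
    assert(merge_four(white, white, white) == 0xFFFFFFFFFFFFFFFFULL);
}
```

Whatever garbage sits in the low bytes, the result only depends on the pixel bytes.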
Now we store all 4 pixels to "video memory" (really a buffer) with one
command:
MOVQ [edi], mm0 (64-bit MMX <--> memory transfer)
Then increment our 4 pointers and loop to the top. The whole main loop,
unoptimized, looks like this:
-------------------------------------------------------------------
    // on entry: ecx = number of 4-pixel groups, eax/ebx/edx = blue/green/red
    // source pointers, edi = destination pointer
FourLoop:
    PUNPCKHBW mm0, [eax]   // load & expand the blue, green, & red values
    PUNPCKHBW mm1, [ebx]
    PUNPCKHBW mm2, [edx]
    PADDUSB mm0, mm0       // doubles the brightness, capping at 255, for every value.
    PADDUSB mm1, mm1       // this is the equiv. of the REMAP[] array's effect.
    PADDUSB mm2, mm2
    PSRLW mm0, 8+3         // move each byte into the -lower- part of the word
    PSRLW mm1, 8+2         // also chop off some # of least significant bits
    PSRLW mm2, 8+3
    PSLLW mm0, 11          // shift back into position for combination process
    PSLLW mm1, 5
    POR mm0, mm1
    POR mm0, mm2
    // store result + increment pointers
    MOVQ [edi], mm0
    ADD eax, 4
    ADD ebx, 4
    ADD edx, 4
    ADD edi, 8
    LOOP FourLoop          // decrements ecx and jumps back while it is nonzero
    EMMS                   // after the loop: reset MMX state before any floating-point code
-------------------------------------------------------------------
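For comparison (and as the kind of non-MMX loop worth keeping around), here is a plain C sketch that should produce the same pixel values as the loop above. The function names are invented for this example. Unlike the PUNPCKHBW version, it reads pixels starting at offset 0 of each buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Double a byte, capping at 255 (what PADDUSB reg, reg does to each byte). */
static uint8_t double_saturate(uint8_t v)
{
    return (uint8_t)(v > 127 ? 255 : v * 2);
}

/* Plain C stand-in for the MMX loop: brighten, then pack to 16 bits.
   count must be a multiple of 4, mirroring the 4-pixels-per-pass loop. */
static void convert_pixels(const uint8_t *blue, const uint8_t *green,
                           const uint8_t *red, uint16_t *dst, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i++) {
        unsigned b = double_saturate(blue[i])  >> 3;   /* top 5 bits */
        unsigned g = double_saturate(green[i]) >> 2;   /* top 6 bits */
        unsigned r = double_saturate(red[i])   >> 3;   /* top 5 bits */
        dst[i] = (uint16_t)((b << 11) | (g << 5) | r);
    }
}
```

A mid-grey input of 64 in all three channels doubles to 128 and packs to 0x8410; anything from 128 up saturates to white, 0xFFFF.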
The great thing is that there are *plenty* of optimizations to the
above code... most involve interleaving instructions so they can
execute simultaneously (in the two pipes) and rearranging
the code so that there are no stalls.
... so that's about it. The key points that save time are in the mass
data transfers (stores & loads to/from memory) and in the reduced
number of instructions (1/4 as many). Also, before, the "gamma
correction" was done by an array of 256 bytes that remapped the byte.
The array made values 0..127 appear twice as bright, and 128..255 appear
"max'ed out" at 255. Well, with MMX, you can just add the register to
itself in "saturation mode," where it will cap at 255 if it overflows.
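A quick way to convince yourself the two are the same: build the table as described and compare it against a saturating add of each byte with itself. This is a sketch; the table contents follow the description above, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t REMAP[256];

/* Fill the lookup table: 0..127 map to twice the value, 128..255 map to 255. */
static void build_remap(void)
{
    for (int i = 0; i < 256; i++)
        REMAP[i] = (uint8_t)(i < 128 ? i * 2 : 255);
}

/* Unsigned saturating byte add, the per-byte behavior of PADDUSB. */
static uint8_t add_saturate_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Adding a byte to itself with saturation should match the table everywhere. */
static void remap_selfcheck(void)
{
    build_remap();
    for (int i = 0; i < 256; i++)
        assert(add_saturate_u8((uint8_t)i, (uint8_t)i) == REMAP[i]);
}
```

The table costs one memory lookup per byte; the saturating add handles all 8 bytes of a register in one instruction with no memory access.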
I never thought I'd find a use for MMX, but here it is. I'm truly
amazed. I get about 36-37 fps without the MMX, and about 49 with it!
(And this is not the only process going on... the three buffers are also
being crunched each frame, and that's expensive alone.) Go MMX!
-Ryan M. Geiss
BONUS
---------------------------------------------------------------------------------
This code will check to see if MMX is supported by the CPU. You should run this
code once at startup, then save the resulting bool. And remember: it's always a
good idea to keep your pre-MMX loops around so that you still support non-MMX
CPUs, or in case you need to port your code to a new architecture.
#include <windows.h>   // DWORD, EXCEPTION_EXECUTE_HANDLER (MSVC, 32-bit x86 only)

bool CheckMMXTechnology()
{
    bool retval = true;
    DWORD RegEDX = 0;

    __try {
        __asm {
            mov eax, 1          // CPUID function 1: feature flags
            cpuid
            mov RegEDX, edx
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        retval = false;
    }

    if (!retval) return false;  // processor does not support CPUID

    if (RegEDX & 0x800000)      // bit 23 is set for MMX technology
    {
        __try { __asm emms }    // try executing the MMX instruction "emms"
        __except(EXCEPTION_EXECUTE_HANDLER) { retval = false; }
    }
    else
        return false;           // processor supports CPUID but does not support MMX technology

    // if retval is false here, it means the processor has MMX technology but
    // floating-point emulation is on; so MMX technology is unavailable
    return retval;
}
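One possible way to do the "run once, save the bool" part, sketched in portable C. Everything here (have_mmx, choose_path, the fake detector) is hypothetical scaffolding for illustration; in a real program the detector passed in would be something like CheckMMXTechnology.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* -1 = not checked yet, 0 = no MMX, 1 = MMX available. */
static int g_mmx_state = -1;

/* Run the detector the first time only, then return the remembered answer. */
static bool have_mmx(bool (*detect)(void))
{
    assert(detect != NULL);
    if (g_mmx_state < 0)
        g_mmx_state = detect() ? 1 : 0;
    return g_mmx_state == 1;
}

typedef enum { PATH_SCALAR, PATH_MMX } blit_path;

/* Pick which conversion loop to run: the MMX one or the pre-MMX fallback. */
static blit_path choose_path(bool (*detect)(void))
{
    return have_mmx(detect) ? PATH_MMX : PATH_SCALAR;
}

/* Stand-in detector for this sketch: counts its calls and reports "no MMX". */
static int g_detect_calls = 0;
static bool fake_detect_no_mmx(void)
{
    g_detect_calls++;
    return false;
}
```

Calling choose_path(fake_detect_no_mmx) any number of times returns PATH_SCALAR and only runs the detector once.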
This document copyright (c)1998+ Ryan M. Geiss.