John Tsiombikas firstname.lastname@example.org
20 May 2021
Last update: 22 May 2021
It all started on the 486 retro-pc, where one day I decided it would be fun to install Debian GNU/Linux 3.0 (woody) from 20something floppies. Well in truth, it was mainly to see if the ISA NIC I bought off ebay actually works, because the DOS packet drivers I found online have failed me, but it was also a bit of fun, so why not?
So I did. I installed debian, played around a bit, installed git from source (git didn't exist in 2002 and therefore is not in the woody repos), pulled and compiled termtris with gcc 2.95 and played some tetris on the console. The NIC worked fine. Then I decided to see if 16 MB of RAM is enough to run X, which it is... barely. The X server itself took up about 5 MB, the window manager (fvwm) just 500kb, it swapped a bit when running anything else, but all in all not bad. I even ran a modern web browser remotely on my main PC and had it appear on the 486 X server (X11 is awesome, those who would prefer wayland to finally become usable after a decade of sucking, so they can switch to it are woefully ignorant), in essence turning the 486 into a fine X terminal.
Drawing speed was not impressive though. Moving windows even with rubber banding was visibly redrawing very slowly. The main problem of course was that I was running X with the generic "vesa" driver, because as luck would have it the XFree86 "cirrus" driver supports Cirrus Logic graphics chips starting from GD-5430, while my Cirrus Logic VLB card has a GD-5429 chip.
This left me wondering: what's the upper bound on drawing speed on the framebuffer of this computer, and how much do all the device-independent graphics niceties of X11 impact drawing performance? Since the same machine also has DOS installed, I thought it would be fun to write a cross-platform benchmark, which simply throws pixels on screen as fast as possible under X11 and under DOS (so bare metal, DOS neither knows nor cares about graphics hardware), and compare the performance.
First I wrote the X11 version. I wanted the drawing to be the bottleneck, and
not any graphics processing, so I went for something extremely simple: write an
XOR pattern (with varying offsets to make it more interesting) continuously onto
the screen and measure framerate. The fastest way I know of for drawing pixels
under X11 is using the X shared memory extension or XSHM. It works similarly
to drawing with XPutImage, but instead of serializing all the pixels and
sending them through the X socket, you create and map a shared memory buffer to
draw pixels into, you instruct the X server to also map the same shared memory
buffer, and tell it to use the pixels in that buffer for drawing by issuing an
XShmPutImage request.
As expected, the performance was abysmal. Drawing 640x480 pixels in 16 bits-per-pixel (bpp) mode resulted in a pathetic 1.1 frames per second. I'm sure I could do better in DOS.
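The XOR pattern itself is only a few lines. The following is a minimal sketch of such a fill loop, not the benchmark's actual code: the function name, the grayscale mapping and the fixed 5:6:5 pixel layout are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit RGB into a 16bpp 5:6:5 pixel. The 5:6:5 layout is an
 * assumption; real code should use the masks reported by the video mode. */
static uint16_t pack565(unsigned r, unsigned g, unsigned b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Fill one frame with an XOR pattern (hypothetical helper).
 * pitch is the scanline length in pixels, offs is meant to be advanced
 * by the caller every frame to animate the pattern. */
void draw_xor_frame(uint16_t *fb, int width, int height, int pitch, unsigned offs)
{
    for (int y = 0; y < height; y++) {
        uint16_t *row = fb + (size_t)y * pitch;
        for (int x = 0; x < width; x++) {
            unsigned v = (((unsigned)x ^ (unsigned)y) + offs) & 0xff;
            row[x] = pack565(v, v, v);
        }
    }
}
```

The same loop can write into an XShm buffer, a mapped fbdev framebuffer or a VBE linear framebuffer, which is what makes it usable as a cross-platform test.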
MS-DOS has no notion of graphics. Programs wanting to draw pretty pictures under DOS had to manipulate the video hardware directly. This makes it quite fun to code, but also a good test of the performance ceiling when it comes to drawing pixels on screen, since there's absolutely nothing between us and the video memory to add performance overhead.
All SVGA graphics cards on PC-compatibles support a common standard for using high-resolution and high color depth video modes (beyond the standard VGA modes), which is called VESA BIOS Extensions or VBE. This is exactly what the generic "vesa" XFree86 driver used for drawing on GNU/Linux earlier, and that's what we'll have to use under DOS.
Using VBE is based around issuing real-mode (16bit) software interrupts, which
are handled by the video BIOS to perform actions or return information about
available video modes. Calling a 16bit interrupt handler from 32bit protected
mode is not as simple as emitting an int opcode. If we were really running on
bare metal we'd need to set aside a buffer for the register state in and out of
the interrupt, switch to real mode, set all the registers to whatever is
contained in the register state buffer, trigger the interrupt, then save the
registers back to the buffer, switch back to protected mode, and return to our
regular code. For an implementation of all this take a look at my pcboot
project, and the accompanying bare metal programming article.
Under DOS we could handle all the protected mode setup ourselves, but the easier
way is to rely on a 3rd party "dos extender" like DOS/4GW which comes with
watcom compilers, or CWSDPMI which is used by DJGPP (GCC port for DOS). These
extenders implement an API called DPMI (DOS Protected Mode Interface), which
provides a number of protected mode interrupts for various useful functions,
including calling real mode interrupts, and mapping physical memory (which will
come in handy shortly).
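To make the "register state buffer" idea concrete, here is a rough sketch of the 50-byte register structure that the DPMI "Simulate Real Mode Interrupt" call (0300h) takes, plus two small helpers. The struct and function names are made up for this example; with DJGPP you would normally use the equivalent `__dpmi_regs` type and `__dpmi_int()` from `<dpmi.h>` rather than rolling your own.

```c
#include <stdint.h>

/* Register state passed in and out of a simulated real mode interrupt.
 * Field order follows the DPMI real mode call structure; the names
 * are hypothetical. */
struct dpmi_regs {
    uint32_t edi, esi, ebp, reserved;
    uint32_t ebx, edx, ecx, eax;
    uint16_t flags;
    uint16_t es, ds, fs, gs;
    uint16_t ip, cs, sp, ss;
} __attribute__((packed));

/* Convert a real mode segment:offset pair to a linear address. */
uint32_t real_to_linear(uint16_t seg, uint16_t offs)
{
    return ((uint32_t)seg << 4) + offs;
}

/* Clear the register state and fill in the registers a typical BIOS
 * call needs: a function number in ax and a buffer pointer in es:di. */
void prepare_int_regs(struct dpmi_regs *regs, uint16_t ax, uint16_t es, uint16_t di)
{
    struct dpmi_regs zero = {0};
    *regs = zero;
    regs->eax = ax;
    regs->es = es;
    regs->edi = di;
}
```

Leaving ss:sp zeroed asks the DPMI host to provide a real mode stack for the duration of the call, which is usually what you want.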
Back to VBE: to query video hardware information, we have to call the "Get SVGA
Information" function by setting ax to 4f00h, pointing es:di to a low memory
buffer (allocated through the DPMI call 0100h "Allocate DOS Memory Block") big
enough to hold the information, and raising interrupt 10h (the video BIOS
interrupt). The call returns with our buffer full of information, including a
list of all the available video mode numbers. Then for each one of the available
video modes we need to call VBE function 4f01h (Get SVGA Mode Information) which
returns in a similar manner, all the necessary details for the video mode like
resolution, color depth, bytes per scanline, pixel packing masks, whether it
supports a linear framebuffer, and so on. If it does support a linear
framebuffer, it also specifies the physical address of that framebuffer, which
we can then map into virtual memory with DPMI call 0800h "Physical Address
Mapping", to be able to write pixels which will immediately appear on screen.
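The pixel packing information matters because 16bpp modes are not always 5:6:5. As a small sketch, assuming the mask sizes and field positions have already been copied out of the mode information block into a struct of our own (the struct and function names here are hypothetical), packing a pixel looks something like this:

```c
#include <stdint.h>

/* Per-channel bit count and bit position, as reported by
 * "Get SVGA Mode Information" (hypothetical container struct). */
struct pixel_format {
    int rbits, rshift;
    int gbits, gshift;
    int bbits, bshift;
};

/* Pack 8-bit RGB components into a pixel for the given format,
 * by keeping the top bits of each channel. */
uint32_t pack_pixel(const struct pixel_format *fmt, unsigned r, unsigned g, unsigned b)
{
    return ((uint32_t)(r >> (8 - fmt->rbits)) << fmt->rshift) |
           ((uint32_t)(g >> (8 - fmt->gbits)) << fmt->gshift) |
           ((uint32_t)(b >> (8 - fmt->bbits)) << fmt->bshift);
}
```

A 5:6:5 mode would be described as `{5, 11, 6, 5, 5, 0}` and a 5:5:5 mode as `{5, 10, 5, 5, 5, 0}`.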
At this point I hit a snag however. The Cirrus Logic graphics card on the 486 does not support VBE 2.0, which is a prerequisite for using a linear framebuffer, but rather VBE 1.2, which only allows access to the framebuffer through a movable 64kb window at physical address a0000h. My VBE display code does have a fallback for this eventuality, but since that code path was relatively untested, it failed to work correctly.
Update: I fixed the VBE 1.2 fallback code, and added the 486 results. See the results table at the end.
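For reference, the arithmetic behind windowed access is simple. This sketch is not the actual fallback code from the benchmark, and its names are made up; it just shows how a byte offset into the framebuffer maps to a window position and an offset inside the window. It assumes the window granularity reported by the mode is no larger than the window itself.

```c
#include <stdint.h>

/* Where a framebuffer byte lands when accessed through the window. */
struct win_pos {
    uint32_t bank;      /* window position, in granularity units */
    uint32_t offset;    /* byte offset from the start of the window */
};

/* granularity is the window granularity in bytes (64kb on many cards).
 * The window itself is moved with VBE function 4f05h. */
struct win_pos locate_in_window(uint32_t fb_offset, uint32_t granularity)
{
    struct win_pos pos;
    pos.bank = fb_offset / granularity;
    pos.offset = fb_offset % granularity;
    return pos;
}
```

At 640x480 16bpp a frame is 614400 bytes, so with a 64kb window a full redraw has to move the window at least nine times per frame, which is part of why a linear framebuffer is so much nicer to work with.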
Resolving to come back to fix that bug later, I decided to try the relative comparison first on a different computer which also happens to have both DOS and GNU/Linux installed: a Pentium 3 700MHz which I'm using as a late 90s win98 retro-pc, equipped in an astounding overkill with both an nvidia Geforce2 MX and a 3dfx Voodoo2, to cover a wide spectrum of late 90s/early 2000s usage (support for OpenGL, Direct3D and Glide). On a windows 98 system there are three ways to run DOS programs: from within windows itself as a vm86 task, by exiting windows through "Restart in MS-DOS mode" in the start menu, and by booting directly to DOS by pressing F8 during boot to pop up a boot menu, and selecting "Command prompt only".
So I compiled my benchmark with DJGPP under windows for convenience, then quit windows to run in pure DOS, and avoid any vm86 I/O overhead. At 640x480 16bpp on the pentium3 the benchmark ran at 79.2 fps. Not bad at all.
Then I rebooted to see what performance we can get on GNU/Linux under X with Xshm. It ran at 23 fps... I expected some performance hit when drawing under X, but 23 fps? That's horrible bordering on tragic. It's worth noting that the X server on that machine is running with the nouveau driver, because the proprietary nvidia driver no longer supports the geforce2 mx. The nouveau project is a worthy effort, providing a free software alternative to the proprietary nvidia drivers, and supporting really old cards abandoned by nvidia, but it's not famed for its performance unfortunately.
So at this point I decided to see how much of this horrendous performance is due to the X abstractions and the nouveau inefficiencies in handling the hardware, by porting the benchmark to fbdev.
Aside from running an X server with all its hardware-specific drivers, another
way to put graphics on screen with Linux is by using the framebuffer device
or fbdev. Fbdev is a brilliantly elegant interface: you open a device file
(/dev/fb0), use a number of ioctls for getting video mode information and
changing video modes (where possible), and mmap it to gain access to the
framebuffer and write pixels directly to it. Handling keyboard input this time
is as simple as putting the terminal in raw mode and reading from stdin.
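The whole open/ioctl/mmap sequence fits in one short function. What follows is a minimal sketch rather than the benchmark's actual code: the `struct fbdev` wrapper and the function name are invented for this example, while the ioctls and structures come from `<linux/fb.h>`.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fb.h>

/* Everything we need to know to draw (hypothetical wrapper struct). */
struct fbdev {
    int fd;
    void *pixels;               /* mapped framebuffer memory */
    size_t size;                /* size of the mapping in bytes */
    int width, height, bpp;
    int pitch;                  /* bytes per scanline */
    unsigned long phys_addr;    /* physical framebuffer address */
};

/* Open a framebuffer device, query the current mode and map it.
 * Returns 0 on success, -1 on failure. */
int fbdev_open(struct fbdev *fb, const char *path)
{
    struct fb_var_screeninfo vinfo;
    struct fb_fix_screeninfo finfo;

    if ((fb->fd = open(path, O_RDWR)) == -1)
        return -1;

    if (ioctl(fb->fd, FBIOGET_VSCREENINFO, &vinfo) == -1 ||
        ioctl(fb->fd, FBIOGET_FSCREENINFO, &finfo) == -1) {
        close(fb->fd);
        return -1;
    }

    fb->width = vinfo.xres;
    fb->height = vinfo.yres;
    fb->bpp = vinfo.bits_per_pixel;
    fb->pitch = finfo.line_length;
    fb->phys_addr = finfo.smem_start;
    fb->size = finfo.smem_len;

    fb->pixels = mmap(0, fb->size, PROT_READ | PROT_WRITE, MAP_SHARED, fb->fd, 0);
    if (fb->pixels == MAP_FAILED) {
        close(fb->fd);
        return -1;
    }
    return 0;
}
```

After a successful `fbdev_open(&fb, "/dev/fb0")`, writing to `fb.pixels` puts pixels straight on screen. Changing the video mode would go through `FBIOPUT_VSCREENINFO`, which not every driver honors.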
After porting the benchmark to fbdev (now building on GNU/Linux builds both an x11 and an fbdev binary), I ran it again at 640x480 16bpp, and it ran at a much more respectable 41.3 fps. That's quite a difference from the measly 23 fps under X, which leads me to believe that nouveau might be partially at fault here, but also quite a difference from the manly 79.2 fps under DOS!
Attempting to figure out why there is such a huge discrepancy, I went back to DOS to try different resolutions and color depths to see if the performance is consistent, but first I ran the baseline 640x480 16bpp test again to make sure. And luckily I did, because to my amazement, it ran at 43.9 fps instead of the previously recorded 79.2! What? Well for one, that's much more in line with what I would expect as a difference between DOS and Linux: very close, with a slight edge to DOS due to the bare metal nature of the operation. 43.9 and 41.3 are perfectly reasonable results. But how come I got that astronomically larger framerate earlier? What changed?
Well, it turns out the first measurement was taken after I exited windows 98 by selecting "Restart in MS-DOS mode" from the start menu, while this time I booted directly to DOS from the F8 boot menu. How can that make a difference? I rebooted, started win98 normally again, and quit back to DOS. Ran the benchmark... and 79.2 fps!!!
A light bulb went off; vague memories of cache strategies and MTRRs popped to mind. The windows nvidia driver must be changing some caching property which results in much faster access to the framebuffer, and that setting remains when exiting back to DOS. I must try to implement that myself, and see if I can duplicate the higher performance without starting windows first.
Cache behavior on pentium pro and later processors is controlled by a set of Model-Specific Registers (MSR) in the CPU called Memory-Type Range Registers or MTRRs. After a bit of research on the web, it turns out the optimum strategy for writing to framebuffer memory is to enable "write-combining" where multiple writes are accumulated in a combining buffer, and written out all at once in a single high-speed burst to the graphics card.
Adding the code to set the framebuffer address range memory type to "write
combining" is as simple as grabbing the physical framebuffer address (which we have
already since we had to map it into virtual memory), calculating a "mask" which defines
the range of 4k pages of memory where the write combining type should apply, and
setting them both through a couple of wrmsr instructions. It's slightly more
complicated than that, because we need to find an unused MTRR, and also ideally
query whether the current processor supports MTRRs at all, but close enough.
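The base and mask calculation is the only part with any subtlety, so here is a sketch of it in isolation. It assumes a power-of-two range size with a base aligned to that size, and takes the physical address width as a parameter (36 bits on the Pentium 3). The function and macro names are made up for this example; the register numbers and bit positions are the architectural ones for variable-range MTRRs.

```c
#include <stdint.h>

#define MTRR_TYPE_WRCOMB    1               /* write-combining memory type */
#define MTRR_MASK_VALID     (1ULL << 11)    /* enables the base/mask pair */
#define MSR_PHYSBASE(n)     (0x200 + 2 * (n))
#define MSR_PHYSMASK(n)     (0x201 + 2 * (n))

/* Compute the values to write to MSR_PHYSBASE(n) and MSR_PHYSMASK(n)
 * to mark [base, base + size) as write-combining.
 * Returns -1 if the range can't be described by a single MTRR. */
int mtrr_wrcomb_values(uint64_t base, uint64_t size, int phys_bits,
        uint64_t *physbase, uint64_t *physmask)
{
    uint64_t addr_mask = (1ULL << phys_bits) - 1;

    /* size must be a power of two and at least one 4k page */
    if (size < 4096 || (size & (size - 1)) != 0)
        return -1;
    /* base must be aligned to the size of the range */
    if (base & (size - 1))
        return -1;

    *physbase = (base & addr_mask) | MTRR_TYPE_WRCOMB;
    *physmask = (~(size - 1) & addr_mask) | MTRR_MASK_VALID;
    return 0;
}
```

For a 128mb range at d0000000h this gives d0000001h for the base register and ff8000800h for the mask register. The two results then go into the first free register pair with `wrmsr`, where a pair counts as free when its mask register has the valid bit clear.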
A slight complication is that rdmsr and wrmsr are privileged instructions,
which can only be executed from ring 0 (supervisor mode). By default DOS
extenders run our code in ring 3 (user mode), and if we attempt to execute
either instruction it would lead to a general protection exception. This is
obviously not a factor when running truly on bare metal, but it's also easy to
solve with djgpp and cwsdpmi, by swapping out the default cwsdpmi executable
with the alternative cwsdpr0.exe which runs our code in ring 0 instead.
Having written the necessary code to change the memory type for the framebuffer range to write-combining, I rebooted to pure DOS to see the results of my efforts. And predictably the framerate shot up to exactly the same levels as it was after exiting windows 98, and after the nvidia driver had done the same thing to accelerate framebuffer writes for us: 79.2 fps.
Now if you remember, we left fbdev write performance at about the same level as
it was under DOS before we set up write-combining (about 43 fps). Therefore it
stands to reason that the particular framebuffer device driver (nouveaufb again)
fails to do so and we'll have to force its hand. Linux provides a convenient
facility for managing MTRRs through the proc filesystem, by writing to
/proc/mtrr. The framebuffer physical address and size are available through
fbdev ioctls, so the simplest experiment is to print those values from our
program, then set the MTRRs manually with a command of the form:
echo "base=0xd000000 size=0x8000000 type=write-combining" >/proc/mtrr, and
re-run our test to see the vast improvement.
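Instead of typing the command by hand, the program could print the exact line for its own framebuffer. A small sketch, assuming the physical address and size have already been obtained from the fbdev ioctls (the function name is made up):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format the line to be written to /proc/mtrr for marking a range as
 * write-combining. Returns the number of characters produced, as
 * reported by snprintf. */
int format_mtrr_cmd(char *buf, size_t bufsz, uint64_t base, uint64_t size)
{
    return snprintf(buf, bufsz,
            "base=0x%" PRIx64 " size=0x%" PRIx64 " type=write-combining\n",
            base, size);
}
```

Writing the resulting string to /proc/mtrr requires root privileges, and the kernel rejects ranges that are not suitably sized and aligned.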
Unfortunately it made no difference whatsoever, and after further investigation
(actually printing the /proc/mtrr file first) it looks like the appropriate
range is already marked as write-combining without our intervention. I can't
believe that the similar framerates are a coincidence, but I have not yet
figured out if there's something else which overrides the MTRR setting. I'll
have to investigate further.
First of all we dispelled all the magic and attained the best possible performance for writing to the framebuffer when running on DOS or bare metal. Manipulating the MTRRs to set the framebuffer range as write-combining is well worth the trouble, as it nearly doubles performance in this simple test!
The X drawing result is troubling. I can't believe there is so much inherent overhead in XShm drawing that it deteriorates the framerate as much as it seems to do between that method and fbdev access on the same system. I think it's quite possible that with a faster X driver, like the proprietary nvidia driver, the gap would be much narrower. Unfortunately I can't test that hypothesis on that machine, but it's the next thing I want to try for sure.
Update: Out of curiosity I tried running the benchmark on a modern computer: my 2013 macbook pro retina, with an i5 3230M and an integrated i915 GPU. The results were much closer to what I would expect from a well-optimized current X driver. Of course on such a modern machine, any overheads incurred by the X server would be inconsequential, but in fact the performance is slightly better under X11 than under fbdev as can be seen in the results table below.
Also as I mentioned earlier I want to try and see if on other machines, with a different fbdev driver, Linux fbdev performance is closer to the bare metal upper limit with write-combining. I don't see any reason why the difference should be that large.
Finally I intend to expand the benchmark, both in scope (different graphics tests with different bottlenecks), but also in platform support. The next targets which would be fun to test I think, would be performance under windows with DirectDraw and GDI drawing.
Results (framerate, higher is better):
| Machine | X11 (XShm) | fbdev | DOS | Notes |
|---|---|---|---|---|
| Pentium3 / Geforce2 mx | 23.1 | 41.3 | 79.2 | tested at 640x480 16bpp |
| 486 / GD-5429 | 1.9 | ? | 2.9 | tested at 640x480 16bpp |
| i5 3230M / i915 | 97 | 92 | ? | tested at 1920x1200 32bpp |