Multi-platform experiments in framebuffer drawing performance

John Tsiombikas nuclear@mutantstargoat.com

20 May 2021
Last update: 22 May 2021

Prelude

It all started on the 486 retro-pc, where one day I decided it would be fun to install Debian GNU/Linux 3.0 (woody) from 20-something floppies. In truth, it was mainly to see if the ISA NIC I bought off eBay actually worked, because the DOS packet drivers I found online had failed me, but it was also a bit of fun, so why not?

So I did. I installed Debian, played around a bit, installed git from source (git didn't exist in 2002 and therefore is not in the woody repos), pulled and compiled termtris with gcc 2.95 and played some tetris on the console. The NIC worked fine. Then I decided to see if 16 MB of RAM is enough to run X, which it is... barely. The X server itself took up about 5 MB, the window manager (fvwm) just 500 KB, and it swapped a bit when running anything else, but all in all not bad. I even ran a modern web browser remotely on my main PC and had it appear on the 486 X server (X11 is awesome; those who would prefer wayland to finally become usable after a decade of sucking, so they can switch to it, are woefully ignorant), in essence turning the 486 into a fine X terminal.

Drawing speed was not impressive though. Even when moving windows with rubber banding, redrawing was visibly slow. The main problem of course was that I was running X with the generic "vesa" driver, because as luck would have it the XFree86 "cirrus" driver supports Cirrus Logic graphics chips starting from the GD-5430, while my Cirrus Logic VLB card has a GD-5429 chip.

This left me wondering what the upper bound on framebuffer drawing speed is on this computer, and how much all the device-independent graphics niceties of X11 impact drawing performance. Since the same machine also has DOS installed, I thought it would be fun to write a cross-platform benchmark, which simply throws pixels on screen as fast as possible under X11 and under DOS (so bare metal; DOS neither knows nor cares about graphics hardware), and compare the performance.

X11 benchmark

First I wrote the X11 version. I wanted the drawing to be the bottleneck, and not any graphics processing, so I went for something extremely simple: write an XOR pattern (with varying offsets to make it more interesting) continuously onto the screen and measure the framerate. The fastest way I know of for drawing pixels under X11 is using the X shared memory extension or XShm. It works similarly to drawing with XPutImage, but instead of serializing all the pixels and sending them through the X socket, you create and map a shared memory buffer to draw pixels into, you instruct the X server to also map the same shared memory buffer, and tell it to use the pixels in that buffer for drawing by issuing an XShmPutImage call.
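To make the XOR pattern concrete, here is a minimal sketch of what filling one 16bpp frame could look like. It is not the benchmark's actual code: the function name, the 5:6:5 pixel layout and the grey-level mapping are assumptions made for illustration. The same routine would work on any pixel buffer, whether it is an XShm segment or a mapped framebuffer.

```c
#include <stdint.h>

/* Fill a width x height frame of 16bpp pixels with an XOR pattern.
 * pitch is the number of bytes per scanline, and xoffs/yoffs shift the
 * pattern so that it changes from frame to frame.
 * Assumes a 5:6:5 RGB pixel layout (hypothetical, for illustration).
 */
void draw_xor_frame16(void *fb, int width, int height, int pitch,
                      int xoffs, int yoffs)
{
    int x, y;
    unsigned char *rowptr = fb;

    for (y = 0; y < height; y++) {
        uint16_t *row = (uint16_t *)rowptr;
        for (x = 0; x < width; x++) {
            /* 8-bit intensity from the classic x ^ y pattern */
            unsigned int v = ((x + xoffs) ^ (y + yoffs)) & 0xff;
            /* pack the same intensity into all three channels */
            row[x] = (uint16_t)(((v >> 3) << 11) | ((v >> 2) << 5) | (v >> 3));
        }
        rowptr += pitch;
    }
}
```

Calling it once per frame with increasing offsets, and counting frames over a fixed time interval, is enough to get a framerate figure.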

As expected, the performance was abysmal. Drawing 640x480 pixels in 16 bits-per-pixel (bpp) mode resulted in a pathetic 1.1 frames per second. I'm sure I could do better in DOS.

DOS (bare metal) benchmark

MS-DOS has no notion of graphics. Programs wanting to draw pretty pictures under DOS had to manipulate the video hardware directly. This makes it quite fun to code, but also a good test of the performance ceiling when it comes to drawing pixels on screen, since there's absolutely nothing between us and the video memory to add performance overhead.

All SVGA graphics cards on PC-compatibles support a common standard for using high-resolution and high color depth video modes (beyond the standard VGA modes), which is called VESA BIOS Extensions or VBE. This is exactly what the generic "vesa" XFree86 driver used for drawing on GNU/Linux earlier, and that's what we'll have to use under DOS.

Using VBE is based around issuing real-mode (16-bit) software interrupts, which are handled by the video BIOS to perform actions or return information about available video modes. Calling a 16-bit interrupt handler from 32-bit protected mode is not as simple as emitting an int opcode. If we were really running on bare metal we'd need to set aside a buffer for the register state in and out of the interrupt, switch to real mode, set all the registers to whatever is contained in the register state buffer, trigger the interrupt, then save the registers back to the buffer, switch back to protected mode, and return to our regular code. For an implementation of all this take a look at my pcboot project, and the accompanying bare metal programming article. Under DOS we could handle all the protected mode setup ourselves, but the easier way is to rely on a 3rd party "DOS extender" like DOS/4GW, which comes with the Watcom compilers, or CWSDPMI, which is used by DJGPP (GCC port for DOS). These extenders implement an API called DPMI (DOS Protected Mode Interface), which provides a number of protected mode interrupts for various useful functions, including calling real mode interrupts, and mapping physical memory (which will come in handy shortly).

Back to VBE: to query video hardware information, we have to call the "Get SVGA Information" function by setting ax to 4f00h, pointing es:di to a low memory buffer (allocated through the DPMI call 0100h "Allocate DOS Memory Block") big enough to hold the information, and raising interrupt 10h (the video BIOS interrupt). The call returns with our buffer full of information, including a list of all the available video mode numbers. Then for each one of the available video modes we need to call VBE function 4f01h (Get SVGA Mode Information), which returns, in a similar manner, all the necessary details for the video mode, like resolution, color depth, bytes per scanline, pixel packing masks, whether it supports a linear framebuffer, and so on. If it does support a linear framebuffer, it also specifies the physical address of that framebuffer, which we can then map into virtual memory with DPMI call 0800h "Physical Address Mapping", to be able to write pixels which will immediately appear on screen.
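Once the mode details are known, using them is straightforward. The following sketch shows how the pixel packing information and the bytes-per-scanline value might be applied. The structure is a simplified stand-in with made-up field names, not the actual VBE mode information block layout, and it assumes no color channel is wider than 8 bits.

```c
#include <stdint.h>

/* Simplified, hypothetical subset of the per-mode details */
struct mode_info {
    int width, height, bpp;
    int pitch;              /* bytes per scanline */
    int rbits, rshift;      /* red channel size and bit position */
    int gbits, gshift;
    int bbits, bshift;
};

/* Pack 8-bit r,g,b components into a pixel value for the given mode */
uint32_t pack_pixel(const struct mode_info *m, int r, int g, int b)
{
    uint32_t rv = (uint32_t)(r >> (8 - m->rbits)) << m->rshift;
    uint32_t gv = (uint32_t)(g >> (8 - m->gbits)) << m->gshift;
    uint32_t bv = (uint32_t)(b >> (8 - m->bbits)) << m->bshift;
    return rv | gv | bv;
}

/* Byte offset of pixel (x,y) from the start of the framebuffer.
 * The pitch is used instead of width * bytes-per-pixel, because
 * scanlines may be padded.
 */
long pixel_offset(const struct mode_info *m, int x, int y)
{
    return (long)y * m->pitch + (long)x * ((m->bpp + 7) / 8);
}
```

For a typical 640x480 16bpp mode with a 5:6:5 layout, the shifts would be 11, 5 and 0, and the pitch would usually be 1280 bytes, though neither is guaranteed, which is why the values have to be queried.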

Comparing performance

At this point I hit a snag however. The Cirrus Logic graphics card on the 486 does not support VBE 2.0, which is a prerequisite for using a linear framebuffer, but rather VBE 1.2, which allows access to the framebuffer through a movable 64 KB window at physical address a0000h. My VBE display code does have a fallback for this eventuality, but since it was relatively untested, it failed to work correctly.
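For illustration, here is a rough sketch of how a frame can be pushed through such a movable window. It is not the fallback code from the benchmark: the names are made up, set_bank is a hypothetical callback standing in for the VBE window switching call, and the window granularity is assumed to be equal to the 64 KB window size, which is not true for every card.

```c
#include <string.h>

#define WIN_SIZE 65536L   /* 64 KB window; granularity assumed equal to it */

struct bank_chunk {
    int bank;          /* which 64 KB bank the chunk falls in */
    long win_offset;   /* byte offset inside the window */
    long len;          /* bytes that fit before the window ends */
};

/* Work out the largest chunk that can be written at fb_offset
 * without crossing a window boundary.
 */
struct bank_chunk next_chunk(long fb_offset, long remaining)
{
    struct bank_chunk c;
    c.bank = (int)(fb_offset / WIN_SIZE);
    c.win_offset = fb_offset % WIN_SIZE;
    c.len = WIN_SIZE - c.win_offset;
    if (c.len > remaining) c.len = remaining;
    return c;
}

/* Copy a whole frame to video memory, one window-full at a time.
 * window points to wherever the a0000h window is accessible.
 */
void blit_banked(unsigned char *window, const unsigned char *frame,
                 long frame_bytes, void (*set_bank)(int))
{
    long offs = 0;
    while (offs < frame_bytes) {
        struct bank_chunk c = next_chunk(offs, frame_bytes - offs);
        set_bank(c.bank);
        memcpy(window + c.win_offset, frame + offs, (size_t)c.len);
        offs += c.len;
    }
}
```

A 640x480 16bpp frame is 614400 bytes, so under these assumptions every frame takes ten window switches: nine full 64 KB banks and a partial tenth.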

Update: I fixed the VBE 1.2 fallback code, and added the 486 results. See the results table at the end.

Resolving to come back to fix that bug later, I decided to try the relative comparison first on a different computer which also happens to have both DOS and GNU/Linux installed: a Pentium 3 700MHz which I'm using as a late 90s win98 retro-pc, equipped in an astounding overkill with both an nvidia Geforce2 MX and a 3dfx Voodoo2, to cover a wide spectrum of late 90s/early 2000s usage (support for OpenGL, Direct3D and Glide). On a windows 98 system there are three ways to run DOS programs: from within windows itself as a vm86 task, by exiting windows through "Restart in MS-DOS mode" in the start menu, and by booting directly to DOS by pressing F8 during boot to pop up a boot menu and selecting "Command prompt only".

So I compiled my benchmark with DJGPP under windows for convenience, then quit windows to run in pure DOS, and avoid any vm86 I/O overhead. At 640x480 16bpp on the pentium3 the benchmark ran at 79.2 fps. Not bad at all.

Then I rebooted to see what performance we can get on GNU/Linux under X with XShm. It ran at 23 fps... I expected some performance hit when drawing under X, but 23 fps? That's horrible bordering on tragic. It's worth noting that the X server on that machine is running with the nouveau driver, because the proprietary nvidia driver no longer supports the geforce2 mx. The nouveau project is a worthy effort, providing a free software alternative to the proprietary nvidia drivers, and supporting really old cards abandoned by nvidia, but it's not famed for its performance unfortunately.

So at this point I decided to see how much of this horrendous performance is due to the X abstractions and the nouveau inefficiencies in handling the hardware, by porting the benchmark to fbdev.

Linux framebuffer device

Aside from running an X server with all its hardware-specific drivers, another way to put graphics on screen with Linux is by using the framebuffer device or fbdev. Fbdev is a brilliantly elegant interface: you open a device file (/dev/fb0), use a number of ioctls for getting video mode information and changing video modes (where possible), and mmap it to gain access to the framebuffer and write pixels directly to it. Handling keyboard input this time is as simple as putting the terminal in raw mode and reading from stdin.
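The open/ioctl/mmap sequence described above can be sketched roughly as follows. The struct and function names are made up for this example, and error handling is kept to the bare minimum. The ioctl requests and the fb_var_screeninfo / fb_fix_screeninfo structures come from the Linux <linux/fb.h> header.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/fb.h>

struct fbdev {
    int fd;
    void *pixels;                   /* mapped framebuffer memory */
    size_t size;                    /* size of the mapping in bytes */
    struct fb_var_screeninfo var;   /* resolution, color depth, bitfields */
    struct fb_fix_screeninfo fix;   /* scanline length, physical address */
};

/* Open and map a framebuffer device. Returns 0 on success, -1 on failure. */
int fbdev_open(struct fbdev *fb, const char *path)
{
    fb->fd = open(path, O_RDWR);
    if (fb->fd == -1) return -1;

    if (ioctl(fb->fd, FBIOGET_VSCREENINFO, &fb->var) == -1 ||
        ioctl(fb->fd, FBIOGET_FSCREENINFO, &fb->fix) == -1) {
        close(fb->fd);
        return -1;
    }

    fb->size = fb->fix.smem_len;
    fb->pixels = mmap(0, fb->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fb->fd, 0);
    if (fb->pixels == MAP_FAILED) {
        close(fb->fd);
        return -1;
    }
    return 0;
}

void fbdev_close(struct fbdev *fb)
{
    munmap(fb->pixels, fb->size);
    close(fb->fd);
}
```

After a successful fbdev_open(&fb, "/dev/fb0"), the resolution is in fb.var.xres and fb.var.yres, the color depth in fb.var.bits_per_pixel, and the bytes per scanline in fb.fix.line_length. Anything written through fb.pixels shows up on screen.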

After porting the benchmark to fbdev (now building on GNU/Linux builds both an x11 and an fbdev binary), I ran it again at 640x480 16bpp, and it ran at a much more respectable 41.3 fps. That's quite a difference from the measly 23 fps under X, which leads me to believe that nouveau might be partially at fault here, but also quite a difference from the manly 79.2 fps under DOS!

DOS round 2

Attempting to figure out why there is such a huge discrepancy, I went back to DOS to try different resolutions and color depths to see if the performance is consistent, but first I ran the baseline 640x480 16bpp test again to make sure. And lucky I did, because to my amazement, it ran at 43.9 fps instead of the previously recorded 79.2! What? Well for one, that's much more in line with what I would expect as a difference between DOS and Linux: very close, with a slight edge to DOS due to the bare metal nature of the operation. 43.9 and 41.3 are perfectly reasonable results. But how come I got that astronomically larger framerate earlier? What changed?

Well, it turns out the first measurement was taken after I exited windows 98 by selecting "Restart in MS-DOS mode" from the start menu, while this time I booted directly to DOS from the F8 boot menu. How can that make a difference? I rebooted, started win98 normally again, and quit back to DOS. Ran the benchmark... and 79.2 fps!!!

A light bulb went off; vague memories of cache strategies and MTRRs popped to mind. The windows nvidia driver must be changing some caching property which results in much faster access to the framebuffer, and that setting remains when exiting back to DOS. I must try to implement that myself, and see if I can duplicate the higher performance without starting windows first.

Caching and Memory-Type Range Registers

Cache behavior on pentium pro and later processors is controlled by a set of Model-Specific Registers (MSRs) in the CPU called Memory-Type Range Registers or MTRRs. After a bit of research on the web, it turns out the optimum strategy for writing to framebuffer memory is to enable "write-combining", where multiple writes are accumulated in a combining buffer, and written out all at once in a single high-speed burst to the graphics card.

Adding the code to set the framebuffer address range memory type to "write combining" is as simple as grabbing the physical framebuffer address (which we have already since we had to map it into virtual memory), calculating a "mask" which defines the range of 4k pages of memory where the write combining type should apply, and setting them both through a couple of wrmsr instructions. It's slightly more complicated than that, because we need to find an unused MTRR, and also ideally query whether the current processor supports MTRRs at all, but close enough.
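As a rough illustration of the base and mask calculation, here is a sketch that computes the two 64-bit values for one variable-range MTRR pair. It is not the code from the benchmark, and the names are made up. It assumes a 36-bit physical address space, as on P6-family processors, and it only handles ranges that a single MTRR can describe: a power-of-two size of at least 4k, with the base aligned to that size.

```c
#include <stdint.h>

/* MSR numbers and bit fields for variable-range MTRRs */
#define MSR_MTRRCAP             0xfe    /* bits 0-7: number of variable MTRRs */
#define MSR_MTRR_PHYSBASE(n)    (0x200 + 2 * (n))
#define MSR_MTRR_PHYSMASK(n)    (0x201 + 2 * (n))
#define MTRR_TYPE_WC            1       /* write-combining memory type */
#define MTRR_MASK_VALID         (1ULL << 11)
#define PHYS_ADDR_BITS          36      /* assumption: P6-family CPU */

struct mtrr_pair {
    uint64_t base;      /* value for the PHYSBASE register */
    uint64_t mask;      /* value for the PHYSMASK register */
};

/* Compute the register values which mark [addr, addr + size) as
 * write-combining. Returns 0 on success, or -1 if the range can't be
 * described by a single variable-range MTRR.
 */
int mtrr_wc_range(uint64_t addr, uint64_t size, struct mtrr_pair *out)
{
    uint64_t addr_mask = (1ULL << PHYS_ADDR_BITS) - 1;

    if (size < 4096 || (size & (size - 1)) || (addr & (size - 1))) {
        return -1;
    }
    out->base = (addr & addr_mask) | MTRR_TYPE_WC;
    out->mask = (~(size - 1) & addr_mask) | MTRR_MASK_VALID;
    return 0;
}
```

Each value would then be split into its low and high 32 bits and written with wrmsr to the PHYSBASE and PHYSMASK registers of a free pair, which is one whose PHYSMASK valid bit is clear.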

A slight complication is that rdmsr and wrmsr are privileged instructions, which can only be executed from ring 0 (supervisor mode). By default DOS extenders run our code in ring 3 (user mode), and if we attempt to execute either instruction it would lead to a general protection exception. This is obviously not a factor when running truly on bare metal, but it's also easy to solve with djgpp and cwsdpmi, by swapping out the default cwsdpmi executable with the alternative cwsdpr0.exe which runs our code in ring 0 instead.

DOS round 3

Having written the necessary code to change the memory type for the framebuffer range to write-combining, I rebooted to pure DOS to see the results of my efforts. And predictably the framerate shot up to exactly the same levels as it was after exiting windows 98, and after the nvidia driver had done the same thing to accelerate framebuffer writes for us: 79.2 fps.

Back to GNU/Linux and fbdev

Now if you remember, we left fbdev write performance at about the same level as it was under DOS before we set up write-combining (about 43 fps). Therefore it stands to reason that the particular framebuffer device driver (nouveaufb again) fails to do so and we'll have to force its hand. Linux provides a convenient facility for managing MTRRs through the proc filesystem, by writing to /proc/mtrr. The framebuffer physical address and size are available through fbdev ioctls, so the simplest experiment is to print those values from our program, set the MTRRs manually with a command of the form: echo "base=0xd000000 size=0x8000000 type=write-combining" >/proc/mtrr, and then re-run our test to see the vast improvement.
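The same request could also be issued from inside the program instead of the shell. The sketch below is only an illustration with made-up function names. The base and size would come from the smem_start and smem_len fields returned by the FBIOGET_FSCREENINFO ioctl, and writing to /proc/mtrr requires root privileges.

```c
#include <stdio.h>

/* Format a write-combining request in the form /proc/mtrr expects.
 * Returns the length of the formatted line, like snprintf.
 */
int mtrr_proc_line(char *buf, size_t bufsz, unsigned long base,
                   unsigned long size)
{
    return snprintf(buf, bufsz, "base=0x%lx size=0x%lx type=write-combining\n",
                    base, size);
}

/* Ask the kernel to mark the range as write-combining.
 * Returns 0 on success, -1 on failure.
 */
int mtrr_proc_write(unsigned long base, unsigned long size)
{
    char line[128];
    FILE *fp;
    int len = mtrr_proc_line(line, sizeof line, base, size);

    if (len < 0 || len >= (int)sizeof line) return -1;
    if (!(fp = fopen("/proc/mtrr", "w"))) return -1;
    if (fputs(line, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    /* the write may only reach the kernel when the stream is flushed */
    return fclose(fp) == 0 ? 0 : -1;
}
```

Reading /proc/mtrr back before and after the write is a simple way to check whether the kernel accepted the request.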

Unfortunately it made no difference whatsoever, and after further investigation (actually printing the /proc/mtrr file first) it looks like the appropriate range is already marked as write-combining without our intervention. I can't believe that the similar framerates are a coincidence, but I have not yet figured out if there's something else which overrides the MTRR setting. I'll have to investigate further.

Conclusions so far

First of all we dispelled all the magic and attained the best possible performance for writing to the framebuffer when running on DOS or bare metal. Manipulating the MTRRs to set the framebuffer range as write-combining is well worth the trouble, as it nearly doubles performance in this simple test!

The X drawing result is troubling. I can't believe there is so much inherent overhead in XShm drawing that it deteriorates the framerate as much as it seems to between that method and fbdev access on the same system. I think it's quite possible that with a faster X driver, like the proprietary nvidia driver, the gap would be much narrower. Unfortunately I can't test that hypothesis on that machine, but it's the next thing I want to try for sure.

Update: Out of curiosity I tried running the benchmark on a modern computer: my 2013 macbook pro retina, with an i5 3230M and an integrated i915 GPU. The results were much closer to what I would expect from a well-optimized current X driver. Of course on such a modern machine, any overheads incurred by the X server would be inconsequential, but in fact the performance is slightly better under X11 than under fbdev as can be seen in the results table below.

Also as I mentioned earlier I want to try and see if on other machines, with a different fbdev driver, Linux fbdev performance is closer to the bare metal upper limit with write-combining. I don't see any reason why the difference should be that large.

Finally I intend to expand the benchmark, both in scope (different graphics tests with different bottlenecks), but also in platform support. The next targets which would be fun to test I think, would be performance under windows with DirectDraw and GDI drawing.

Results (framerate, higher is better):

Machine                  XShm   fbdev   DOS    Tested at
Pentium3 / Geforce2 mx   23.1   41.3    79.2   640x480 16bpp
486 / GD-5429            1.9    ?       2.9    640x480 16bpp
i5 3230M / i915          97     92      ?      1920x1200 32bpp