So you want to write a GUI framework

Through several recent discussions of GUI programming in Rust, I have been left with the impression that the term ‘GUI’ means significantly different things to different people.

I would like to try and clarify this point somewhat, first by describing some of the different things that people refer to as GUI frameworks/toolkits, and then by exploring in detail the necessary components of one of these, the classic desktop GUI framework.

Although this post is not especially specific to Rust, it does have its genesis in Rust: it is largely informed by my experience working on Druid, a Rust GUI toolkit of the desktop variety.

Once we have a shared understanding of the problem, we will be better situated to talk about the status of this work in Rust, which will be the topic of a follow-up post.

What we talk about when we talk about GUI

A GUI framework can be a lot of different things, with different use cases and different deployment targets. A framework intended for building embedded applications is not going to also trivially work on the desktop; a framework for building desktop applications is not going to trivially work on the web.

Regardless of the specifics, there is one major dividing line to recognize, and this is whether or not a framework is expected to integrate closely into an existing platform or environment.

On one side of this line, then, are tools for building games, embedded applications, and (to a lesser degree) web apps. In this world, you are responsible for providing almost everything your applications will need, and you will be interacting closely with the underlying hardware: accepting raw input events, and outputting your UI to some sort of buffer or surface. (The web is different; here the browser vendors have done that integration work for you.)

On the other side of this line are tools for building traditional desktop applications. In this world, you must integrate tightly into a large number of existing platform APIs, design patterns, and conventions, and it is this integration that is the source of most of your design complexity.

Games and embedded GUIs

Before we start digging into all the integrations expected of a desktop application framework, let’s talk briefly about the first case.

Games and GUI for embedded applications (think of the infotainment system in the back of a taxi, or the interface on a medical device) are different from desktop GUIs in a number of ways, most of which can be thought of in terms of system integration: games and embedded applications don’t have to do as much of it. In general, a game or an embedded application is a self-contained world; there is a single ‘window’, and the application is responsible for drawing everything in it. The application doesn’t need to worry about menus or sub-windows; it doesn’t need to worry about the compositor, or integrating with the platform’s IME system. Although they maybe should, they often don’t support complex scripts. They can ignore rich text editing. They likely don’t need to support font enumeration or fallback. They often ignore accessibility.

Of course, they do have additional challenges of their own. Embedded applications have to think much more carefully about resource constraints, and may need to avoid allocation altogether. When they do need features like complex scripts or text input, they have to implement these features on their own, without being able to rely on anything provided by the system.

Games are similar, and additionally have their own unique performance concerns and considerations that I am not qualified to talk about in any real detail.

Games and embedded are certainly interesting domains. Embedded in particular is a place where I think Rust GUI could really make a lot of sense, for many of the same reasons that Rust generally has a strong value proposition for embedded use.

It is unlikely, however, that a project that is intended for game or embedded development is going to tackle the whole list of capabilities we expect in desktop applications.

Anatomy of a ‘native desktop application’

The principal distinguishing feature of a desktop application is its close integration into the platform. Unlike a game or an embedded application, a desktop application is expected to interoperate intimately with the host OS, as well as with other software.

I’d like to try and go through some of the major required integration points, and some of the possible approaches available for providing them.

Windowing

An application has to instantiate and manage windows. The API should allow for customization of window appearance and behaviour, including things like whether the window is resizeable, whether it has a titlebar, etc. The API should allow for multiple windows, and it should also support modal and child-windows in a way that respects platform conventions. This means supporting both application-modal windows (for instance alerts that steal focus from the entire application until dealt with) as well as window-modal windows (an alert that steals focus from a given window until dealt with). Modal windows are used to implement a large number of common features, including open/save dialogs (which may be special-cased by the platform) alerts, confirmation dialogs, as well as standard UI elements such as combo boxes and other drop-down menus (think a list of completions for a text field).

The API must allow subwindows to be positioned precisely, relative to the position of the parent window. For instance in the case of a combo box, when showing the list of options you may wish to draw the currently selected item at the same baseline position used when the list is closed, as in macOS:

Figure: an NSPopUpButton.

Similarly, there needs to be an API that provides information about screens and the positions of windows within them, so that a combo box can be positioned appropriately to use available space: if the box is at the bottom of the screen it should position the popup above itself, and otherwise below.
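To make the placement rule concrete, here is a minimal sketch of the above-or-below decision. The `Rect` type, the `place_popup` function, and the top-left-origin coordinate system are all assumptions made for illustration; real platforms differ in their coordinate conventions, and a real implementation would also need to handle multiple screens and horizontal overflow.

```rust
/// A rectangle in screen coordinates. This sketch assumes the origin is at
/// the top-left and that y increases downward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

impl Rect {
    pub fn bottom(&self) -> f64 {
        self.y + self.height
    }
}

/// Returns the top-left origin for a popup of height `popup_height` attached
/// to the control at `anchor`, on a screen whose usable area is `screen`.
///
/// The popup goes below the control by default. It flips above only when it
/// does not fit below and there is more room above than below.
pub fn place_popup(anchor: Rect, popup_height: f64, screen: Rect) -> (f64, f64) {
    let space_below = screen.bottom() - anchor.bottom();
    let space_above = anchor.y - screen.y;
    let y = if popup_height <= space_below || space_below >= space_above {
        anchor.bottom()
    } else {
        anchor.y - popup_height
    };
    (anchor.x, y)
}
```

Even this toy version needs the screen's usable area as an input, which is why the windowing API has to expose screen geometry in the first place.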

Tabs

You’re also going to want to support tabs. You should be able to drag a tab out of a tab group to create a new window, as well as drag tabs between windows. Ideally you would like to use the platform’s native tabbing infrastructure, but… that’s complicated. The browsers all roll their own implementations, and this is probably for a good reason. You would like to respect the user’s preferences around tabs (macOS lets the user choose to open new windows as tabs, system-wide) but that will be an additional complication. I forgive you if you skip it, but if your framework sees much use you’re going to get someone reporting it as a bug every month until you die, and they aren’t wrong.

Figure: the difference in appearance between “native” tabs (in Safari) and the custom implementations in Chrome and Firefox.

Menus

Closely related to windows are menus; a desktop application should respect platform conventions around window and application menus. On Windows (the operating system family) menus are a component of the window. On macOS, the menu is a property of the application, which is updated to reflect the commands available for the active window. On Linux, things are slightly less clear cut. If you’re using GTK then there are both window and application menus, although the latter are deprecated. If you’re directly targeting X11 or Wayland, you’ll need to implement menus on your own, and you can theoretically do whatever you want, although the easy path is Windows-style window menus.

Generally there are explicit conventions around what menus you should provide, and what commands should be present in them; a well-behaved desktop application should respect these conventions.

Painting

To draw the content of your app, you need (at least) a basic 2D graphics API. This should provide the ability to fill and stroke paths (with colors, including transparency, as well as with radial and linear gradients), to lay out text, to draw images, to define clip regions, and to apply transformations. Ideally your API also provides a few more advanced features such as blend modes and blurs, for things like drop shadows.

These APIs exist, in subtly different forms, on the various platforms. On macOS, there is CoreGraphics; on Windows, Direct2D; and on Linux, Cairo. One approach, then, is to present a common API abstraction over top of these platform APIs, puttying over the rough edges and filling in the gaps. (This is the approach we have currently taken, with the piet library.)
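As a rough illustration of what a common abstraction over platform drawing APIs might look like, here is a deliberately tiny sketch. It is not piet's actual API: the `RenderContext` trait, the `Recorder` backend, and `draw_button` are hypothetical names invented for this example, and a real abstraction would also need paths, gradients, text, images, and transforms.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Color { pub r: u8, pub g: u8, pub b: u8, pub a: u8 }

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect { pub x: f64, pub y: f64, pub width: f64, pub height: f64 }

/// The drawing operations the framework exposes. Each platform backend
/// (hypothetically: one per native 2D API) would implement this trait.
pub trait RenderContext {
    fn save(&mut self);
    fn restore(&mut self);
    fn clip(&mut self, rect: Rect);
    fn fill_rect(&mut self, rect: Rect, color: Color);
    fn stroke_rect(&mut self, rect: Rect, color: Color, width: f64);
}

/// A drawing command, as captured by the `Recorder` backend below.
#[derive(Debug, Clone, PartialEq)]
pub enum Command {
    Save,
    Restore,
    Clip(Rect),
    FillRect(Rect, Color),
    StrokeRect(Rect, Color, f64),
}

/// A backend that records commands instead of drawing them, which is handy
/// for tests and stands in for a real platform backend in this sketch.
#[derive(Debug, Default)]
pub struct Recorder { pub commands: Vec<Command> }

impl RenderContext for Recorder {
    fn save(&mut self) { self.commands.push(Command::Save); }
    fn restore(&mut self) { self.commands.push(Command::Restore); }
    fn clip(&mut self, rect: Rect) { self.commands.push(Command::Clip(rect)); }
    fn fill_rect(&mut self, rect: Rect, color: Color) {
        self.commands.push(Command::FillRect(rect, color));
    }
    fn stroke_rect(&mut self, rect: Rect, color: Color, width: f64) {
        self.commands.push(Command::StrokeRect(rect, color, width));
    }
}

/// Widget code is written once against the trait, not against a platform.
pub fn draw_button(ctx: &mut impl RenderContext, bounds: Rect) {
    let fill = Color { r: 230, g: 230, b: 230, a: 255 };
    let border = Color { r: 120, g: 120, b: 120, a: 255 };
    ctx.save();
    ctx.clip(bounds);
    ctx.fill_rect(bounds, fill);
    ctx.stroke_rect(bounds, border, 1.0);
    ctx.restore();
}
```

The point of the design is that `draw_button` never learns which backend it is talking to; the hard part, which this sketch skips entirely, is making the backends agree on what each call actually produces.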

This does have its downsides. These APIs are different enough (especially in trickier areas, such as text) that designing a good abstraction can be challenging, and requires some jumping through hoops. Subtly different platform behaviour can cause rendering irregularities.

It would be simpler to just use the same renderer everywhere. One option might be something like Skia, the rendering engine used in Chrome and Firefox. This has the advantage of portability and consistency, at the cost of binary size and compile time; a Rust binary using the skia-safe crate has a baseline size of about 17M for a release build. (My methodology wasn’t great for this, but I think it’s a reasonable baseline.)

Skia is still a fairly traditional software renderer, although it does now have significant GPU support. Ultimately, though, the most exciting prospects are those that move even more of the rendering task to the GPU.

An initial challenge here is the diversity of APIs for GPU programming, even for identical hardware. The same physical GPU can be interfaced with via Metal on Apple platforms, DirectX on Windows, and Vulkan on many other platforms. Making code portable across these platforms requires either duplicate implementations, some form of cross compilation, or else an abstraction layer. The problem with these latter cases is that it is genuinely hard to write an abstraction that provides adequate control of advanced GPU features (such as the compute capabilities) across subtly different low-level APIs.

Once you’ve figured out how you want to talk to the hardware, you then need to figure out how to efficiently and correctly rasterize 2D scenes on the GPU. This is also probably more complicated than you might initially suspect. Since GPUs are good at drawing 3D scenes, and since 3D scenes seem “more complicated” than 2D scenes, it may feel like a natural conclusion that GPUs should handle 2D trivially. They do not. The rasterization techniques used in 3D are poorly suited to 2D tasks like clipping to vector paths or antialiasing, and those that produce the best results have the worst performance. Worse, these traditional techniques can start to perform very badly in 2D once there are lots of blend groups or clip regions involved, since each needs its own temporary buffer and draw call.

There is some promising new work (such as piet-gpu) that uses compute shaders and can draw scenes in the 2D imaging model with smoothly consistent performance. This is an area of active research. One potential limitation is that compute shaders are a relatively new feature, and are only available in GPUs made in the last five-or-so years. Other renderers, including WebRender as used by Firefox, use more traditional techniques and have wider compatibility.

In any case, you have options, all with various trade-offs, and none of them clearly the winner.

Animation

Oh, also: whatever approach you choose, you are going to also need to provide an ergonomic, performant animation API. It’s worth thinking about this early; it will be annoying to try and add it in later.

Text

Regardless of how you paint, you are going to need to render text. A GUI framework should at the very least support rich text, complex scripts, and text layout (including things like line breaking, alignment, and justification, and ideally things like line-breaking within arbitrary paths). You need to support emoji. You also need to support text editing, including support for right-to-left and BiDi. Suffice it to say that this is a very large undertaking. Realistically, you have two options: either you bundle HarfBuzz, or you use the platform text APIs: CoreText on macOS, DirectWrite on Windows, and likely Pango + HarfBuzz on Linux. There are a few other alternatives, including some promising Rust projects (such as Allsorts, rustybuzz, and swash) but none of these are quite complete enough to fully replace HarfBuzz or the platform text APIs just yet.

The compositor

2D graphics are a major part of the drawing that might be done by a desktop application, but they are not the only part. There are two other common cases worth mentioning: video, and 3D graphics. In both of these cases, we want to be able to take advantage of available hardware: for video, the hardware H.264 decoder, and for 3D the GPU. What this comes down to is instructing the operating system to embed a video or 3D view in some region of our window, and this means interacting with the compositor. The compositor is the component of the operating system that is responsible for taking display data from various sources (different windows from different programs, video playback, GPU output) and assembling it into a coherent picture of your desktop.

Perhaps the best way to think about why this matters to us is to think about interactions with scrolling. If you have a scrollable view, and that view contains a video, you would like to have the video move in sync with the view’s other content when the view is scrolled. This is harder than it sounds. You can’t just define a region of your window and embed a video in it; you need to somehow tell the OS to move the video in sync with your scrolling.

Web views

Let’s not forget these: sooner or later, someone is going to want to display some HTML (or an actual website!) within their application. We’d really rather not bundle an entire browser engine to accomplish this, but making use of a platform webview also implicates the compositor and overall significantly complicates our lives. Maybe your users don’t really need that web view after all? In any case, something to think about.

Handling input

Once you have figured out how to manage windows and how you are going to draw your content, you need to handle user input. We can roughly divide input into pointer, keyboard, and other, where other is stuff like joysticks, gamepads, and other HID devices. We will ignore this last category, except to say that this would be nice to have, but doesn’t need to be a priority. Finally, there are input events that originate from system accessibility features; we will deal with these when we talk about accessibility.

For both pointer and keyboard events, there is a relatively easy approach, and then there is a principled, correct approach that is significantly harder to get right.

Pointer input

For pointer events, the easy approach is to present an API that sends mouse events, and then sends trackpad events in a way that makes them look like mouse events: ignoring multiple touches, pressure, or other features of touch gestures that do not have obvious analogs to the mouse. The hard approach is to implement some equivalent of the web’s PointerEvent API, where you are able to fully represent information on multi-touch (both from a trackpad as well as a touch-sensitive display) and stylus input events.

Doing pointer events the easy way is…okay, assuming you can also provide events for common trackpad gestures like pinch-to-zoom and two-finger-scroll, without which your framework is going to immediately frustrate many users. And while the number of applications that need or want to do advanced gesture recognition or which expect to handle stylus input is fairly low, they certainly exist, and a desktop application framework that does not support these cases is fundamentally limited.
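To show what the easy approach throws away, here is a small sketch. The `PointerEvent` struct is loosely modelled on the fields of the web's PointerEvent, but the Rust types and the `to_mouse_event` function are hypothetical, invented only for this illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PointerKind {
    Mouse,
    Touch,
    Pen,
}

/// The "hard way": one event type that can describe every kind of pointer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PointerEvent {
    /// Distinguishes simultaneous contacts, for example two fingers.
    pub pointer_id: u32,
    pub kind: PointerKind,
    pub x: f64,
    pub y: f64,
    /// Normalized pressure in the range 0.0 to 1.0.
    pub pressure: f32,
    /// True for the first contact of a multi-touch interaction.
    pub is_primary: bool,
}

/// The "easy way": everything is made to look like a mouse.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MouseEvent {
    pub x: f64,
    pub y: f64,
}

/// Flattens a pointer event into a mouse event. Secondary touches are
/// dropped, and the pressure and device kind are discarded.
pub fn to_mouse_event(event: &PointerEvent) -> Option<MouseEvent> {
    if event.is_primary {
        Some(MouseEvent { x: event.x, y: event.y })
    } else {
        None
    }
}
```

Everything that `to_mouse_event` discards is exactly what a drawing application or a gesture recognizer needs, which is why a framework that only offers `MouseEvent` is limited for those uses.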

Keyboard input

The situation is worse for keyboard input, in two ways: the hard approach is harder to do, and doing it the ‘easy way’ is fundamentally limiting. Going the easy route means your framework is essentially useless for much of the world’s population.

The easy way, for keyboard input, is very easy: the keys of a keyboard are generally associated with a character or string, and when the user presses a key, you can take that string and smush it in at the cursor position in the active text field. This works reasonably well for unilingual English text, and slightly-less-well-but-at-least-sort-of for general Latin-1 languages plus scripts that behave similarly to Latin, such as Greek or Cyrillic or Turkish. Unfortunately (but not coincidentally) a large number of programmers mostly just type ASCII, but much of the world does not. Serving these users requires integrating with the platform text input and IME system, a problem that has the unfortunate property of being both fundamentally necessary and incredibly fiddly.
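For reference, the easy way really is only a few lines. This is a minimal sketch with a hypothetical `TextField` type; it assumes the cursor is a byte offset that always sits on a UTF-8 character boundary.

```rust
/// A bare-bones text field for the "easy way" of handling key presses.
pub struct TextField {
    pub text: String,
    /// Byte offset of the cursor. It must lie on a character boundary,
    /// otherwise `String::insert_str` will panic.
    pub cursor: usize,
}

impl TextField {
    /// Inserts the string associated with a key press at the cursor and
    /// moves the cursor past it.
    pub fn insert_at_cursor(&mut self, input: &str) {
        self.text.insert_str(self.cursor, input);
        self.cursor += input.len();
    }
}
```

Note what is missing: the field never tells anyone what it contains or where it is on screen, and nothing can go back and revise text that has already been inserted. Those are the capabilities an IME needs.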

IME stands for Input Method Editor, and is a catch-all term for the platform-specific mechanisms that convert keyboard events into text. This process is fairly trivial for most European languages and scripts, where at most you may need to insert an accented vowel, but it is much more complicated for the East Asian languages (Chinese, Japanese, and Korean, or collectively, CJK) as well as for various other complex scripts.

Figure: using a Japanese IME on macOS.

Let’s stick to CJK, for the purpose of this example. In these scripts, keyboard events do not correspond directly to input; instead keyboard events are composed together into input text as you type, but that text may change significantly between keystrokes, and the changes can affect not just the current character but also text that has previously been entered.

This is complicated in a number of ways. Firstly, it means that the interaction between a given text field and the IME is bidirectional: the IME needs to be able to modify the contents of the textbox, but it also needs to be able to query the current contents of the textbox, in order to have the appropriate context with which to interpret events. Similarly, it needs to be notified of changes in the cursor position or selection state; the same key-press may produce different output based on the surrounding text. Secondly, we also need to keep the IME up-to-date on the position of the textbox on the screen, since the IME often presents a ‘candidate’ window of possible inputs for the active sequence of keyboard events. Finally (and not like actually finally, just that I’m three thousand words into this and not nearly done yet) implementing IME in a cross-platform way is significantly complicated by the differences in the underlying platform APIs: macOS requires editable text fields to implement a protocol, and then lets the text field handle accepting and applying changes from the IME, whereas the Windows API uses a lock and release mechanism. Designing an abstraction over both of these approaches is an additional layer of complexity.
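The bidirectional relationship described above can be sketched as a trait that a text field implements and an IME drives. This is only an illustration: `ImeTarget` and `SimpleField` are hypothetical names, the trait is far smaller than any real platform protocol, and it leaves out the on-screen geometry the IME needs for its candidate window.

```rust
use std::ops::Range;

/// What a hypothetical IME needs from a text field: it can read state
/// (for context) and it can rewrite the in-progress composition.
pub trait ImeTarget {
    /// The full contents, so the IME can inspect surrounding text.
    fn text(&self) -> &str;
    /// The current selection as a byte range; empty means a plain cursor.
    fn selection(&self) -> Range<usize>;
    /// The byte range of text that is still being composed, if any.
    fn composition_range(&self) -> Option<Range<usize>>;
    /// Replaces the in-progress composition (or starts one at the cursor).
    fn set_composition(&mut self, text: &str);
    /// Accepts the composition as final text.
    fn commit_composition(&mut self);
}

pub struct SimpleField {
    text: String,
    cursor: usize,
    composition: Option<Range<usize>>,
}

impl SimpleField {
    /// Creates a field with the cursor at the end of `text`.
    pub fn new(text: &str) -> Self {
        SimpleField { text: text.to_string(), cursor: text.len(), composition: None }
    }
}

impl ImeTarget for SimpleField {
    fn text(&self) -> &str {
        &self.text
    }

    fn selection(&self) -> Range<usize> {
        self.cursor..self.cursor
    }

    fn composition_range(&self) -> Option<Range<usize>> {
        self.composition.clone()
    }

    fn set_composition(&mut self, new_text: &str) {
        // Unlike the naive approach, this may rewrite text that was
        // already inserted by earlier keystrokes.
        let range = self.composition.clone().unwrap_or(self.cursor..self.cursor);
        let start = range.start;
        self.text.replace_range(range, new_text);
        let end = start + new_text.len();
        self.composition = Some(start..end);
        self.cursor = end;
    }

    fn commit_composition(&mut self) {
        self.composition = None;
    }
}
```

A composition session then looks like a series of `set_composition` calls, each replacing the previous candidate text, followed by a single `commit_composition` when the user accepts a candidate.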

There’s one additional complication related to text input: on macOS, you need to support the Cocoa Text System, which allows the user to specify system-wide keybindings that can issue a variety of text editing and navigation commands.

To summarize: handling input correctly is a lot of work, and if you don’t do it your framework is basically a toy.

Accessibility

A desktop application framework has to support native accessibility APIs, and should ideally do this in a way that does not require special thought or work from the application developer. Accessibility is a catchall term for a large number of assistive technologies, the most crucial being support for screen readers and assisted navigation. Screen reader support means interoperating with platform APIs that describe the structure and contents of your application, and assisted navigation means providing a method of moving between elements on the screen linearly, allowing elements to be highlighted, described and activated in turn using a keyboard or joystick.
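One way to picture both halves of this is a tree that describes the UI, plus a traversal that flattens it into a linear order for navigation. The `AccessNode` type and `focus_order` function below are hypothetical and only illustrate the shape of the problem; the real platform accessibility APIs are much richer and differ from one another considerably.

```rust
/// A node in a simplified description of the UI, of the kind a screen
/// reader might consume.
pub struct AccessNode {
    pub role: &'static str,
    pub label: String,
    /// Whether assisted navigation should stop on this element.
    pub focusable: bool,
    pub children: Vec<AccessNode>,
}

impl AccessNode {
    pub fn new(role: &'static str, label: &str, focusable: bool, children: Vec<AccessNode>) -> Self {
        AccessNode { role, label: label.to_string(), focusable, children }
    }
}

/// Returns the focusable nodes in depth-first document order, which is the
/// order linear navigation would visit them in this sketch.
pub fn focus_order(root: &AccessNode) -> Vec<&AccessNode> {
    let mut out = Vec::new();
    collect(root, &mut out);
    out
}

fn collect<'a>(node: &'a AccessNode, out: &mut Vec<&'a AccessNode>) {
    if node.focusable {
        out.push(node);
    }
    for child in &node.children {
        collect(child, out);
    }
}
```

If the framework builds this tree from its own widget tree, application developers get a reasonable baseline without doing anything, which is the "no special thought or work" goal described above.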

In addition to these core features, your framework should also respect the user’s system-level preferences regarding things like text size, reduced color contrast, and reduced animation. Related, but not accessibility, exactly: you would like to support dark mode, as well as things like a user-chosen accent colour.

Internationalization and Localization

Your framework should support internationalization. The most obvious component of this is localization of strings, but it also includes things like mirroring interfaces in right-to-left locales. Additionally, information like times, dates, currency units, calendar units, names, sequences, and general formatting of numerical data should respect the user’s locale. If this is not a problem you have thought about before, then it is almost certainly more complicated than you imagine. But don’t worry: there’s a standard. All you need to do is implement it.

Other common features

Copy/paste & drag-and-drop: These overlap, although drag-and-drop is more complicated. For copy/paste, you want to support not just text, but also other standard formats, and additionally you need to support user-defined formats. For paste, you need to let the user inspect the clipboard, see the available formats, and retrieve the data. Fun fact: on macOS and Windows the API to retrieve data from the clipboard is synchronous, and on X11 it is async. Have fun. For drag and drop, hopefully you can reuse some of the work you did when you reimplemented window tabs?
Printing: printing? Who needs printing?? Well: your users, unfortunately. Don’t worry, it’s probably not that hard.
App resumption and window restoration: you’re going to want to remember where the user’s windows were, and put them back when you relaunch. I hope they didn’t unplug a monitor.
Assets and app packaging: You’re going to want to let the user bundle up their application. This means doing things like generating your app’s manifest, validating required assets like app icons, and localization data, and making these things available at runtime per the conventions of the target platform.
Async: You do have nice ergonomic async support, don’t you?
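To pick one item from this list, the unplugged-monitor case in window restoration can be sketched as follows. The `Rect` type and `restore_frame` function are hypothetical, and the sketch assumes the first entry in `screens` is the primary screen; a real implementation would also care about partially visible windows, display scaling, and platform-specific restoration APIs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

impl Rect {
    /// True if the two rectangles overlap by a non-zero area.
    pub fn intersects(&self, other: &Rect) -> bool {
        self.x < other.x + other.width
            && other.x < self.x + self.width
            && self.y < other.y + other.height
            && other.y < self.y + self.height
    }
}

/// Decides where to put a window whose frame was saved in a previous
/// session. If the saved frame still overlaps some screen it is kept as is.
/// Otherwise the window is moved to the origin of the first screen (assumed
/// to be the primary one) and shrunk to fit. Returns `None` when there are
/// no screens at all.
pub fn restore_frame(saved: Rect, screens: &[Rect]) -> Option<Rect> {
    if screens.iter().any(|screen| saved.intersects(screen)) {
        return Some(saved);
    }
    let primary = screens.first()?;
    Some(Rect {
        x: primary.x,
        y: primary.y,
        width: saved.width.min(primary.width),
        height: saved.height.min(primary.height),
    })
}
```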

And other less common features

In addition to all of the features that are shared across most desktop environments, there are also platform-specific features to be thought about. Some of these are stylistic things, like APIs to add transparency or vibrancy to some part of your window; others are things like support for adding a menu bar extra, working with task bar extensions, Quick Look, implementing a control panel item, or any number of other things. Your framework should at least make these things possible. At the very least, you should provide opportunities for the user to drop down and work with the platform APIs directly, so that they have some escape hatch available for when they really need to achieve something that you haven’t foreseen (or gotten around to yet).

Putting it all together

That feels like a reasonable place to stop; there are certainly things I’ve overlooked, but I hope I’ve touched on the most significant ones. Once you have an idea of the things you need to support and implement, you can start thinking about how to fit it all together.

Designing cross-platform APIs

One of the more subtle and interesting challenges of designing your GUI framework is designing the API. Here, you face a very particular problem: you are attempting to design an API that provides a common interface for a set of underlying platform APIs that are fundamentally different.

A nice example is around your application’s menus. As mentioned earlier, Linux and Windows generally expect a menu bar to exist on your app’s individual windows, whereas macOS has a single menu bar that is a component of the desktop environment, and which becomes your application menu when your application is active.

To handle this naively, you might have separate ‘application’ and ‘window’ menus, and then you might have conditional code to update one or the other based on conditional compilation or runtime checks. This ends up being a lot of duplicate code, however, and it will be easy to get wrong. In this particular case, I think there is a fairly clear, fairly simple API that works on both platforms. In your framework, you treat menus as being a property of the window: on Windows and Linux this is actually the case, so that’s fine, and then on macOS you set the application menu to be the menu of the currently active window, changing it as needed when windows gain or lose active status.
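Here is a minimal sketch of that design. All of the names (`Menu`, `WindowId`, `MenuStyle`, `MenuManager`) are hypothetical; the point is only that application code always sets a per-window menu, and the framework decides what that means on each platform.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Menu {
    pub items: Vec<String>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct WindowId(pub u32);

/// How the platform presents menus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MenuStyle {
    /// Each window shows its own menu bar (Windows, Linux).
    PerWindow,
    /// One shared menu bar for the whole application (macOS).
    PerApplication,
}

pub struct MenuManager {
    style: MenuStyle,
    menus: HashMap<WindowId, Menu>,
    active: Option<WindowId>,
}

impl MenuManager {
    pub fn new(style: MenuStyle) -> Self {
        MenuManager { style, menus: HashMap::new(), active: None }
    }

    /// The only call application code needs: menus belong to windows.
    pub fn set_menu(&mut self, window: WindowId, menu: Menu) {
        self.menus.insert(window, menu);
    }

    /// Called by the framework when a window becomes active.
    pub fn window_activated(&mut self, window: WindowId) {
        self.active = Some(window);
    }

    /// The menu to install in the shared menu bar, if the platform has one.
    pub fn application_menu(&self) -> Option<&Menu> {
        match self.style {
            MenuStyle::PerWindow => None,
            MenuStyle::PerApplication => self.active.and_then(|id| self.menus.get(&id)),
        }
    }

    /// The menu to attach to a specific window's own menu bar, if any.
    pub fn window_menu(&self, window: WindowId) -> Option<&Menu> {
        match self.style {
            MenuStyle::PerWindow => self.menus.get(&window),
            MenuStyle::PerApplication => None,
        }
    }
}
```

With this shape, the platform difference lives in one `match` inside the framework rather than in conditional code scattered through every application.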

This is a fairly clean example, and many other APIs are not so clear cut. In general, designing these cross-platform APIs is a process of carefully reading through and experimenting with the platform-specific APIs, and then trying to identify the set of shared features and functionality that you can express in the abstraction above; and when no cleanly shared set of features exists, it means coming up with some other API that can at least be implemented in terms of what is provided by the platform.

The seduction of the web view

All of this platform complexity, with all of its subtle design flaws, missing documentation, and mysterious bugs, has already been worked around successfully by a few major cross-platform GUI frameworks: the major browsers, Chrome, Firefox, and (increasingly) Edge. (Safari doesn’t need to worry about this, because it isn’t cross-platform.)

The browsers have had to figure all of this out: the child windows, the text input, the accessibility, the font fallback, the compositor, the performant painting, the drag and drop…it’s all there.

If you’d like to do something cross-platform, then, there is a very natural and very understandable impulse to reach for web technologies, either by creating a real web app that runs in the browser, or else by leaning on the browser engine and using it to render your UI in a native window, à la Electron. This does come with obvious drawbacks, particularly around performance (on various axes, such as application size and memory consumption) as well as ‘look and feel’ (on which we’ll expand shortly) but it sure does make life a lot simpler, and the more time I spend working on projects in this space, the more sympathetic I become to folks who choose the browser side of this trade-off.

On “native look and feel”

Something that comes up frequently in discussions of cross-platform GUI work is a collection of things I’ll refer to as “native look and feel”. This is vague, and I think it’s helpful to split it in two: native behaviour and convention, and native appearance (although these can overlap).

Native behaviour refers to many of the things we have already discussed, and some other things besides. Some examples would be scroll behaviour: does your application respect the user’s scroll preferences? Does your application have the same acceleration curves when scrolling as the default platform scroll views? Does your application handle standard system keyboard shortcuts, for instance for maximizing or hiding a window? Does IME work? This extends to other less obvious conventions, as well: does the application store user data in the locations that are conventional to the current platform? Does it use the system file open/save dialogs? Does it show expected menus, containing expected menu items?

These things are more important on some platforms than on others. On the Mac, in particular, getting these behavioural details correct is important: the Mac more than other platforms is designed around specific conventions, and Mac application developers have historically been diligent about respecting these. This in turn has helped create a community of users who value these conventions and are sensitive to them, and breaking from them is bound to upset this cohort. On Windows, things are slightly more relaxed; there has historically been a greater diversity of software on Windows, and Microsoft has never been quite as dogmatic as Apple has been about how an application should look and behave.

Native appearance refers more to how an application looks. Do your buttons look like native buttons? Do they have the same sizing and gradients? Do you more generally use the controls a platform expects for a given interaction, for instance preferring a checkbox on desktop but a toggle on mobile?

This is additionally complicated by the fact that ‘native appearance’ changes between not just platforms but also OS releases, to the point where looking ‘native’ on a given machine would require runtime detection of the OS version.

While all of this is possible, it is starting to add a huge amount of additional work, and for a modestly staffed project this can be hard to justify. For that reason, I am personally forgiving of a project that moves away from trying to do pixel-perfect replication of the platform’s built-in widgets, in favour of just trying to do something tasteful and coherent, while providing the tools necessary for the framework’s users to style things as needed.

Fin

I hope this catalog has helped at least vaguely define the scope of the problem. None of the things I have described here are impossible, but doing them all, and doing them well, is quite a bit of work.

This last point is worth ending on: for this work to be useful, it is not enough that it exist. If you would like people to use your framework, you are going to have to make it attractive to them: providing a good API that is easy to use, that is idiomatic in the host language, that is well documented, and that lets them solve their actual problems.
