Book chapters

Performance Tuning for Tile-Based Architectures

Cover of the OpenGL Insights book

Bruce Merry. Performance tuning for tile-based architectures. In Patrick Cozzi and Christophe Riccio, editors, OpenGL Insights, pages 323–336. CRC Press, 2012.



In this chapter, we will examine tile-based rendering, a particular way to arrange a graphics pipeline that is used in several popular mobile GPUs. We will look at what tile-based rendering is and why it is used and then look at what needs to be done differently to achieve optimal performance. I assume that the reader already has experience with optimizing OpenGL applications and is familiar with the standard techniques, such as reducing state changes, reducing the number of draw calls, reducing shader complexity and texture compression, and is looking for advice that is specific to tile-based GPUs.

Papers — radio astronomy

Faster GPU-based convolutional gridding via thread coarsening

Gridding rates for four polarizations (GTX 980). The numbers inside the bars are the coarsening factors (in U and V), and the number of work-items per work-group (in U and V) in the best case.

BibTeX | publisher’s page | PDF

Bruce Merry. Faster GPU-based convolutional gridding via thread coarsening. Astronomy and Computing, Vol 16, pp 140-145, 2016.


Convolutional gridding is a processor-intensive step in interferometric imaging. While it is possible to use graphics processing units (GPUs) to accelerate this operation, existing methods use only a fraction of the available flops. We apply thread coarsening to improve the efficiency of an existing algorithm, and observe performance gains of up to 3.2× for single-polarization gridding and 1.9× for quad-polarization gridding on a GeForce GTX 980, and smaller but still significant gains on a Radeon R9 290X.

Approximating W projection as a separable kernel

Phase error as a function of image position

BibTeX | publisher’s page | PDF

Bruce Merry. Approximating W projection as a separable kernel. Monthly Notices of the Royal Astronomical Society, 456(2), pp. 1761-1766, 2016.


W projection is a commonly used approach to allow interferometric imaging to be accelerated by fast Fourier transforms, but it can require a huge amount of storage for convolution kernels. The kernels are not separable, but we show that they can be closely approximated by separable kernels. The error scales with the fourth power of the field of view, and so is small enough to be ignored at mid- to high frequencies. We also show that hybrid imaging algorithms combining W projection with either faceting, snapshotting, or W stacking allow the error to be made arbitrarily small, making the approximation suitable even for high-resolution wide-field instruments.

Papers — computer graphics and GPU computing

A performance comparison of sort and scan libraries for GPUs

Benchmark of scan performance

BibTeX | PDF | publisher’s page | software

(electronic version of an article published as Parallel Processing Letters, 25, 4, 2015, 1550007 10.1142/S0129626415500073 © copyright World Scientific Publishing Company

Bruce Merry. A performance comparison of sort and scan libraries for GPUs. Parallel Processing Letters, 25(4), pp. 1550007, Dec 2015.


Sorting and scanning are two fundamental primitives for constructing highly parallel algorithms. A number of libraries now provide implementations of these primitives for GPUs, but there is relatively little information about the performance of these implementations.

We benchmark seven libraries for 32-bit integer scan and sort, and sorting 32-bit values by 32-bit integer keys. We show that there is a large variation in performance between the libraries, and that no one library has both optimal performance and portability.

Moving least-squares reconstruction of large models with GPUs

Reconstruction of Songo Mnara

Bruce Merry, James Gain and Patrick Marais. Moving least-squares reconstruction of large models with GPUs. IEEE Transactions on Visualization and Computer Graphics, 20(2), pp. 249–261, Feb 2014.

BibTeX | paper | appendix | software

This is the author’s version of the work. The preprint can also be obtained from IEEE Xplore.


Modern laser range scanning campaigns produce extremely large point clouds, and reconstructing a triangulated surface thus requires both out-of-core techniques and significant computational power. We present a GPU-accelerated implementation of the Moving Least Squares (MLS) surface reconstruction technique. We believe this to be the first GPU-accelerated, out-of-core implementation of surface reconstruction that is suitable for laser range-scanned data. While several previous out-of-core approaches use a sweep-plane approach, we subdivide the space into cubic regions that are processed independently. This independence allows the algorithm to be parallelized using multiple GPUs, either in a single machine or a cluster. It also allows data sets with billions of point samples to be processed on a standard desktop PC. We show that our implementation is an order of magnitude faster than a CPU-based implementation when using a single GPU, and scales well to 8 GPUs.

Fast in-place binning of laser range-scanned point sets

Colour-coded bins

Bruce Merry, James Gain and Patrick Marais. Fast in-place binning of laser range-scanned point sets. Journal on Computing and Cultural Heritage, 6, 3, Article 14, July 2013.

BibTeX | paper

This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the Journal on Computing and Cultural Heritage, 6, 3, Article 14 (July 2013),


Laser range scanning is commonly used in cultural heritage to create digital models of real-world artefacts. A large scanning campaign can produce billions of point samples — too many to be manipulated in memory on most computers. It is thus necessary to spatially partition the data so that it can be processed in bins or slices. We introduce a novel compression mechanism that exploits spatial coherence in the data to allow the bins to be computed with only 1.01 bytes of I/O traffic for each byte of input, compared to 2 or more for previous schemes. Additionally, the bins are loaded from the original files for processing rather than from a sorted copy, thus minimising disk space requirements. We demonstrate that our method yields performance improvements in a typical point-processing task, while also using little memory and guaranteeing an upper bound on the number of samples held in-core.

Accelerating kd-tree searches for all k-nearest neighbours

Incremental update of the tree path

Bruce Merry, James Gain and Patrick Marais. Accelerating kd-tree searches for all k-nearest neighbours. In Eurographics 2013 — Short Papers. 2013.

BibTeX | paper | tech report

The definitive version is available at


Finding the k nearest neighbours of each point in a point cloud forms an integral part of many point-cloud processing tasks. One common approach is to build a kd-tree over the points and then iteratively query the k nearest neighbors of each point. We introduce a simple modification to these queries to exploit the coherence between successive points; no changes are required to the kd-tree data structure. The path from the root to the appropriate leaf is updated incrementally, and backtracking is done bottom-up. We show that this can reduce the time to compute the neighbourhood graph of a 3D point cloud by over 10%, and by up to 24% when k = 1. The gains scale with the depth of the kd-tree, and the method is suitable for parallel implementation.

Analytic simplification of animated characters

Horse simplified from the rest pose, with the animation-space metric, and using influence simplification

Bruce Merry, Patrick Marais and James Gain. Analytic simplification of animated characters. In Proceedings of the 6th international conference on Computer graphics, virtual reality, visualisation and interaction in Africa (Afrigraph 2009), pages 37–45.

BibTeX | paper | video

This is an author-prepared version of the document, posted by permission of the ACM for your personal use. It is not for redistribution.


Traditionally, levels of detail (LOD) for animated characters are computed from a single pose. Later techniques refined this approach by considering a set of sample poses and evaluating a more representative error metric. A recent approach to the character animation problem, animation space, provides a framework for measuring error analytically. The work presented here uses the animation-space framework to derive two new techniques to improve the quality of LOD approximations.

Firstly, we use an animation-space distance metric within a progressive mesh-based LOD scheme, giving results that are reasonable across a range of poses, without requiring that the pose space be sampled.

Secondly, we simplify individual vertices by reducing the number of bones that influence them, using a constrained least-squares optimisation. This influence simplification is combined with the progressive mesh to form a single stream of simplifications. Influence simplification reduces the geometric error by up to an order of magnitude, and allows models to be simplified further than is possible with only a progressive mesh.

A comparison of linear skinning techniques for character animation

An image from the paper

David Jacka, Ashley Reid, Bruce Merry and James Gain. A comparison of linear skinning techniques for character animation. In AFRIGRAPH ‘07: Proceedings of the 5th international conference on Computer graphics, virtual reality, visualisation and interaction in Africa, pages 177–186.



Character animation is the task of moving a complex, artificial character in a life-like manner. A widely used method for character animation involves embedding a simple skeleton within a character model and then animating the character by moving the underlying skeleton. The character’s skin is required to move and deform along with the skeleton. Research into this problem has resulted in a number of skinning frameworks. There has, however, been no objective attempt to compare these methods.

We compare three linear skinning frameworks that are computationally efficient enough to be used for real-time animation: Skeletal Subspace Deformation, Animation Space and Multi-Weight Enveloping. These create a correspondence between the points on a character’s skin and the underlying skeleton by means of a number of weights, with more weights providing greater flexibility.

The quality of each of the three frameworks is tested by generating the skins for a number of poses for which the ideal skin is known. These generated skin meshes are then compared to the ideal skins using various mesh comparison techniques and human studies are used to determine the effect of any temporal artefacts introduced. We found that SSD lacks flexibility while Multi-Weight Enveloping is prone to overfitting. Animation Space consistently outperforms the other two frameworks.

Animation space: a truly linear framework for character animation

An image from the paper

Bruce Merry, Patrick Marais and James Gain. Animation space: A truly linear framework for character animation. ACM Trans. Graph. 25, 4 (Oct. 2006), 1400–1423.

BibTeX | paper | video

This is an author-prepared version of the document, posted by permission of the ACM for your personal use. It is not for redistribution.


Skeletal subspace deformation (SSD), a simple method of character animation used in many applications, has several shortcomings; the best-known is that joints tend to collapse when bent. We present animation space, a generalization of SSD that greatly reduces these effects and effectively eliminates them for joints that do not have an unusually large range of motion.

While other, more expensive generalizations exist, ours is unique in expressing the animation process as a simple linear transformation of the input coordinates. We show that linearity can be used to derive a measure of average distance (across the space of poses), and apply this to improving parametrizations.

Linearity also makes it possible to fit a model to a set of examples using least-squares methods. The extra generality in animation space allows for a good fit to realistic data, and overfitting can be controlled to allow fitted models to generalize to new poses. Despite the extra vertex attributes, it is possible to render these animation-space models in hardware with no loss of performance relative to SSD.

Normal transformations for articulated models

Visualisation of normal error

Merry B., Marais P. and Gain J. Normal Transformations for Articulated models. In ACM SIGGRAPH 2006 Conference Abstracts and Applications, August 2006.

BibTeX | paper | video | tech report

These are provided for personal use and may not be publicly redistributed.


It is well-established that when a matrix is used to transform a rigid object, the normals should be transformed by the inverse transpose of that matrix. For skeletally animated models, it is common to apply this approach to the blended matrix that animates each vertex. This is only an approximation, as it assumes that the blended matrix is locally constant. We derive a correct method for normal transformation in skeletally animated models, and examine the errors introduced by two approximations.

Compression of dense and regular point clouds

Prediction edges

Merry B., Marais P. and Gain J. Compression of dense and regular point clouds. In AFRIGRAPH 2006: Proceedings of the 4th international conference on Computer graphics, virtual reality, visualisation and interaction in Africa, pages 15–20.

BibTeX | paper | video | software

Merry B., Marais P. and Gain J. Compression of dense and regular point clouds. Computer Graphics Forum 25, 4 (Dec. 2006), 709–716. BibTeX

This is an author-prepared version of the document, posted by permission of the ACM for your personal use. It is not for redistribution.


We present a simple technique for single-rate compression of point clouds sampled from a surface, based on a spanning tree of the points. Unlike previous methods, we predict future vertices using both a linear predictor, which uses the previous edge as a predictor for the current edge, and lateral predictors that rotate the previous edge 90° left or right about an estimated normal.

By careful construction of the spanning tree and choice of prediction rules, our method improves upon existing compression rates when applied to regularly sampled point sets, such as those produced by laser range scanning or uniform tesselation of higher-order surfaces. For less regular sets of points, the compression rate is still generally within 1.5 bits per point of other compression algorithms.

PhD thesis

A linear framework for character animation

Merry, B. A linear framework for character animation. PhD thesis, University of Cape Town, 2007.

BibTeX | thesis | software


Character animation is the process of modelling and rendering a mobile character in a virtual world. It has numerous applications both off-line, such as virtual actors in films, and real-time, such as in games and other virtual environments. There are a number of algorithms for determining the appearance of an animated character, with different trade-offs between quality, ease of control, and computational cost. We introduce a new method, animation space, which provides a good balance between the ease-of-use of very simple schemes and the quality of more complex schemes, together with excellent performance. It can also be integrated into a range of existing computer graphics algorithms.

Animation space is described by a simple and elegant linear equation. Apart from making it fast and easy to implement, linearity facilitates mathematical analysis. We derive two metrics on the space of vertices (the “animation space”), which indicate the mean and maximum distances between two points on an animated character. We demonstrate the value of these metrics by applying them to the problems of parametrisation, level-of-detail (LOD) and frustum culling. These metrics provide information about the entire range of poses of an animated character, so they are able to produce better results than considering only a single pose of the character, as is commonly done.

In order to compute parametrisations, it is necessary to segment the mesh into charts. We apply an existing algorithm based on greedy merging, but use a metric better suited to the problem than the one suggested by the original authors. To combine the parametrisations with level-of-detail, we require the charts to have straight edges. We explored a heuristic approach to straightening the edges produced by the automatic algorithm, but found that manual segmentation produced better results. Animation space is nevertheless beneficial in flattening the segmented charts; we use least squares conformal maps (LSCM), with the Euclidean distance metric replaced by one of our animation-space metrics. The resulting parametrisations have significantly less overall stretch than those computed based on a single pose.

Similarly, we adapt appearance preserving simplification (APS), a progressive mesh-based LOD algorithm, to apply to animated characters by replacing the Euclidean metric with an animation-space metric. When using the memoryless form of APS (in which local rather than global error is considered), the use of animation space for computations reduces the geometric errors introduced by LOD decomposition, compared to simplification based on a single pose. User tests, in which users compared video clips of the two, demonstrated a statistically significant preference for the animation-space simplifications, indicating that the visual quality is better as well. While other methods exist to take multiple poses into account, they are based on a sampling of the pose space, and the computational cost scales with the number of samples used. In contrast, our method is analytic and uses samples only to gather statistics.

The quality of LOD approximations by improved further by introducing a novel approach to LOD, influence simplification, in which we remove the influences of bones on vertices, and adjust the remaining influences to approximate the original vertex as closely as possible. Once again, we use an animation-space metric to determine the approximation error. By combining influence simplification with the progressive mesh structure, we can obtain further improvements in quality: for some models and at some detail levels, the error is reduced by an order of magnitude relative to a pure progressive mesh. User tests showed that for some models this significantly improves quality, while for others it makes no significant difference.

Animation space is a generalisation of skeletal subspace deformation (SSD), a popular method for real-time character animation. This means that there is a large existing base of models that can immediately benefit from the modified algorithms mentioned above. Furthermore, animation space almost entirely eliminates the well-known shortcomings of SSD (the so-called “candy-wrapper” and “collapsing elbow” effects). We show that given a set of sample poses, we can fit an animation-space model to these poses by solving a linear least-squares problem.

Finally, we demonstrate that animation space is suitable for real-time rendering, by implementing it, along with level-of-detail rendering, on a PC with a commodity video card. We show that although the extra degrees of freedom make the straightforward approach infeasible for complex models, it is still possible to obtain high performance; in fact, animation space requires fewer basic operations to transform a vertex position than SSD. We also consider two methods of lighting LOD-simplified models using the original normals: tangent-space normal maps, an existing method that is fast to render but does not capture dynamic structures such as wrinkles; and tangent maps, a novel approach that encodes animation-space tangent vectors into textures, and which captures dynamic structures. We compare the methods both for performance and quality, and find that tangent-space normal maps are at least an order of magnitude faster, while user tests failed to show any perceived difference in quality between them.


The animation-space models used for testing are available below. The custom file format is described here. The files are further compressed with bzip2. The cyberdemon model is omitted for copyright reasons. Please refer to the thesis for a description of where the models come from.

The models include LOD records, computed using animation-space APS and influence simplification.

The cylinder video briefly demonstrates the project. It shows a bendy tube, and two ways of simplifying it. In the first half of the video, a naïve algorithm is used that doesn’t consider the joint in the tube, and how badly things can go wrong. In the second half, a smarter algorithm is used that takes the joint into account. The other videos show 100 copies of the model under animation, and were used in user testing. A small amount of LOD is in use (nominal tolerance of 1.5 pixels).


You can download the source code for the tools I developed during my thesis. This includes a collection of C++ tools for manipulating models in various ways, a renderer, and assorted scripts for Blender. See the file README in the tarball for more information.

You can also download a Blender script that applies Catmull-Clark subdivision to a mesh, while preserving seams and vertex groups. It currently still has limitations (in particular, creases are ignored). Blender 2.41 is required (I haven’t tested it on recent Blender versions, so it might not work at all).

Papers — other

Performance Analysis of Sandboxes for Reactive Tasks

Bruce Merry. Performance Analysis of Sandboxes for Reactive Tasks. Olympiads in Informatics, Vol 4 (2010), pages 87–94.

BibTeX | paper


Security mechanisms for programming contests introduce some overhead into time measurements. When large numbers of system calls are made, as is common in reactive tasks with processes communicating over pipes, this may significantly distort timing results. We compared the performance and consistency of two sandboxes based on different security mechanisms. We found that in-kernel security has negligible effect on measured run-times, while ptrace-based security can add overhead of around 75%. We also found that ptrace-based security on a dual-core CPU adds far greater overhead as well as producing highly variable results unless CPU affinity is used.

Using a Linux Security Module for contest security

Bruce Merry. Using a Linux Security Module for contest security. Olympiads in Informatics, Vol 3 (2009), pages 67–73.

BibTeX | paper


The goal of a programming contest grading system is to take unknown code and execute it on test data. Since the code is frequently buggy and potentially malicious, it is necessary to run the code in a restricted environment to prevent it from damaging the grading system, bypassing resource constraints, or stealing information in order to obtain a better score.

We present some background on methods to construct such a restricted environment. We then describe how the South African Computer Olympiad has used a Linux Security Module to implement a restricted environment, as well as the limitations of our solution.

Challenges in Running a Computer Olympiad in South Africa

Bruce Merry, Marco Gallotta, Carl Hultquist. Challenges in Running a Computer Olympiad in South Africa. Olympiads in Informatics, Vol 2 (2008), pages 105–114.

BibTeX | paper


Information and Communication Technology (ICT) in South Africa lags behind that of the developed world, which poses challenges in running the South African Computer Olympiad (SACO). We present the three-round structure of the SACO as it is run today, focusing on the challenges it faces and the choices we have made to overcome them. We also include a few statistics and some historical background to the Olympiad, and sample questions may be found in the appendix.