Graphics Enthusiast and Performance Devotee

Gunnar Sletta

Looking at Throughput in Qt Quick

Posted by Gunnar Sletta on October 2, 2014 at 2:35 PM

I already talked about the swap test, which helps us determine if velvet graphics are indeed possible. The second step I usually take is to benchmark how much graphics we can put on screen before things start to stutter.


If we focus on pure graphics for now, there are two things worth looking at: fill rate, which is the system's ability to put pixels on screen, and the number of draw calls, which is the number of times we can tell the system to draw something.


Fill rate

When working with software graphics, fill rate is usually the biggest obstacle. The CPU needs to process millions of pixels per frame, potentially both reading and writing. The memory bus will be heavily taxed, and this will eat from the performance budget of the rest of the application. When you have a dedicated GPU however, fill rate is usually not the problem. I’m speaking now in the context of UIs, as both industrial 3D applications and games can easily push the GPU beyond its limits.
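
To put a number on "millions of pixels per frame", here is a back-of-the-envelope sketch. The function name, the 32-bit pixel format and the choice to ignore source/texture reads are my assumptions for illustration, not something measured in this post.

```python
def fill_bandwidth_bytes_per_second(width, height, fps=60, layers=1,
                                    bytes_per_pixel=4, blended=False):
    """Rough destination memory traffic for filling `layers` screens per frame.

    Assumes 32-bit pixels by default. An opaque fill only writes the
    destination; a blended fill reads it first and then writes it, so it
    touches the destination twice. Source (texture) reads are not counted.
    """
    pixels_per_frame = width * height * layers
    accesses_per_pixel = 2 if blended else 1
    return pixels_per_frame * accesses_per_pixel * bytes_per_pixel * fps


# One opaque 1920x1080 layer at 60 fps is roughly half a gigabyte per second.
print(fill_bandwidth_bytes_per_second(1920, 1080) / 1e6, "MB/s")
```

Every extra fullscreen layer adds the same amount again, which is why a software renderer runs out of memory bandwidth long before a GPU does.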


A couple of things affect our pixel throughput:


Blending

Doing source-over alpha blending requires that we take our source pixel and mix it with the destination pixel. The destination pixel is both read and written, so there is a bit of extra work involved. In addition, some GPUs can do further optimizations like hidden surface removal and early z-test. Such tricks are great for overdraw performance, but they only work when blending is turned off (as in glDisable(GL_BLEND)). On a desktop with a discrete graphics card, chances are this will never be a problem, but for an onboard laptop or mobile/embedded GPU, it just might.
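
As a reminder of what the blend stage computes per pixel, here is a minimal sketch of source-over. It assumes premultiplied alpha and float channels in the range 0 to 1; the function name is mine.

```python
def source_over(src, dst):
    """Blend one premultiplied RGBA pixel over another.

    Both pixels are (r, g, b, a) tuples of floats in [0, 1] whose colour
    channels are already multiplied by alpha (an assumption of this
    sketch). `dst` is needed as an input, which is the extra read that
    opaque drawing avoids.
    """
    inv_alpha = 1.0 - src[3]
    return tuple(s + d * inv_alpha for s, d in zip(src, dst))


# Half-transparent red over opaque blue.
print(source_over((0.5, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0, 1.0)))  # (0.5, 0.0, 0.5, 1.0)
```

With an opaque source the `inv_alpha` term is zero and the destination drops out entirely, which is what allows the GPU to skip reading it when blending is disabled.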


Textures

Another aspect that can greatly reduce the throughput of an application is textures. Textures are memory blobs and need to be fetched from graphics memory. The GPU doesn’t always have a large cache, so a lot of cycles are spent just fetching texture memory. Working with small textures or small regions of a large texture will in most cases be cheap, while working with fullscreen images will take its toll.


Shader Complexity

The part of the GPU pipeline that decides the color of a pixel is called a fragment shader. There are other shader stages, like the vertex shader, but the ratio of pixels to vertices is usually so high that the fragment shader is the one that ends up counting. With user interfaces, the majority of fragment shaders are quite simple. The Qt Quick Scene Graph will typically alternate between “colored pixel”, “textured pixel” and “distance field textured pixel”. They translate into a few GPU instructions each. Compared to 3D graphics, we don’t have to deal with lights, normals, bump mapping, shadows and all the other goodness that modern 3D has to offer. There are a couple of exceptions though. For instance anything involving blurring, like Qt Graphical Effects’ GaussianBlur or DropShadow, requires a lot of texture samples to produce a single output pixel. It is very likely that an onboard laptop chip is not capable of running live Gaussian Blur with 32 samples at 60 fps.


So all in all, assuming that the underlying OpenGL stack works properly with vsynced swap and all, raw GPU throughput will usually not be the biggest problem (for user interfaces). Let's look at some numbers.


Overdraw Benchmark 

The benchmark can be found here. It creates a fullscreen Qt Quick window and draws a variety of stuff into it. The goal is to see how far we can go with a specific testcase while sustaining a perfect 60 fps. Skipping one frame every 2-3 seconds is in this case considered a failure. Even though we're testing raw graphics performance, I use Qt Quick because it is easy to put things together, and for the testcases I've written, the delta between raw OpenGL and the OpenGL that the scene graph produces is small enough not to impact the results. When it does, I will make a comment about it.
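
To make the pass/fail criterion concrete, here is a small sketch of how skipped frames could be counted from a list of frame timestamps. This is not how the benchmark itself does it; the function name and the input format (milliseconds, one entry per presented frame) are assumptions.

```python
def count_skipped_frames(timestamps_ms, refresh_hz=60.0):
    """Count vsync intervals that passed without a new frame.

    `timestamps_ms` holds the presentation time of each frame in
    milliseconds. Each gap is rounded to a whole number of refresh
    periods; a gap of two periods means one frame was skipped.
    """
    period = 1000.0 / refresh_hz
    skipped = 0
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        periods = round((current - previous) / period)
        skipped += max(0, periods - 1)
    return skipped


# The gap between 33.3 ms and 66.7 ms spans two refresh periods.
print(count_skipped_frames([0.0, 16.7, 33.3, 66.7, 83.3]))  # 1
```

Under the rule above, a run would count as a failure as soon as this number grows by one every few seconds.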

 

Note: If you try the benchmark, you will see that it does skip frames while increasing and reducing complexity. I'll get back to this in a later post.


I've run the benchmarks on the following hardware: 

    • Desktop Computer: Intel i7-3770K @ 3.50GHz, 24 GB RAM, NVidia GT-210, 1920x1080 screen, Ubuntu 14.04, proprietary driver. Tested with kwin w/o compositor. Also tested with compositor and with unity.
    • MacBook Pro: Early 2011, 4 GB RAM, Intel HD 3000 or AMD Radeon HD 6750M, 1680x1050 screen, OS X 10.9.5
    • iPad Retina Mini: A7, PowerVR G6430, 2048x1536 screen, iOS 7.1
    • Jolla: Snapdragon 400 1.4 GHz dual-core, Adreno 305, 540x960 screen, Sailfish OS u9 (public RC)



What we can see in the graphs confirms what I already talked about. Solid opaque fills are generally extremely cheap; even the mobile GPUs can do 40-50 fullscreen rectangles on top of each other. The same goes for opaque textured fills, though we should keep in mind that the scene graph renders opaque content front-to-back with the z buffer enabled, so GPUs that implement early-z will end up only having to read the front-most texture. Blending is worse, with blended textures being the most costly, though still decent. It is, however, something that is worth taking note of. Take, for instance, the following example based on Qt Quick Controls:


  


The left image is how it is rendered normally, and the right image is rendered using QSG_VISUALIZE=overdraw in the environment. As can be seen from the visualization, the background is opaque (green) and can be ignored. Then there are the three group boxes stacked on top of each other, each rendered as a separate blended (red) texture. If this or a similar pattern of background stacking was used on a machine that has either a low-end GPU or a GPU that doesn't match its screen size, it would eat a large part of the performance budget. And this is before we start adding the actual application content.


The graphs also seem to indicate that both the kwin and unity compositors are quite bad for performance. I could maybe tolerate a 10-20% drop for going through the composition step, but what I'm measuring does not seem right. If anybody knows what's up with that, please give me a ping.


I should also mention that the iPad didn't start skipping frames when I reached 30 opaque textures. It ran out of memory and was killed!

 

When running similar benchmarks previously, I have seen embedded chips with overdraw performance as low as 1.5. That means the application can fill the screen 1.5 times before the application starts stuttering. In terms of content, that means a background image and a few icons and some text. Not a lot of flexibility. When working with such a system, the application will need to be very careful about what gets drawn if a sustained 60 fps UI is desirable.
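
A quick way to see how little an overdraw budget of 1.5 buys is to add up the area of everything drawn in a frame. The sketch below does that; the screen size and item sizes in the example are made up for illustration.

```python
def overdraw_factor(screen_size, rects):
    """Total drawn pixels divided by the number of screen pixels.

    `rects` is a list of (width, height) sizes of everything drawn in a
    frame. Clipping against the screen and any early-z savings are
    ignored, so the result is an upper bound.
    """
    screen_w, screen_h = screen_size
    drawn = sum(w * h for w, h in rects)
    return drawn / (screen_w * screen_h)


# Hypothetical 800x480 UI: background image, eight 96x96 icons, one text block.
items = [(800, 480)] + [(96, 96)] * 8 + [(400, 200)]
print(round(overdraw_factor((800, 480), items), 2))  # 1.4
```

A background, a handful of icons and one block of text already lands at about 1.4, so a single additional fullscreen layer would blow the budget.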


What about more complex shaders?

I mentioned that complex shaders would be problematic, especially on the less powerful GPUs. To test this, I incorporated the GaussianBlur from QtGraphicalEffects into the benchmark and tested how many samples I could have before it started to stutter. Now, the Gaussian blur is implemented as a two-pass algorithm. That means that in addition to doing a lot of texture sampling, we’ll also be rendering the equivalent of the screen several times per frame. First the content is rendered into an FBO, which is the source for the first pass. Then it is blurred in one direction into a second FBO. Finally the second FBO is blurred in the other direction onto the screen.
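
The cost of the steps described above can be estimated with simple arithmetic. This sketch assumes that the sample count means texture fetches per pixel per blur pass, and it counts only the two blur passes as sampling work; both are my simplifications.

```python
def two_pass_blur_cost(width, height, samples):
    """Fullscreen render passes and texture fetches per frame.

    Three fullscreen passes: source into FBO, horizontal blur into a
    second FBO, vertical blur onto the screen. Only the two blur passes
    are counted as texture fetches (`samples` fetches per pixel each).
    """
    pixels = width * height
    return {"passes": 3, "fetches": pixels * 2 * samples}


def single_pass_blur_fetches(width, height, samples):
    """Fetches if the same kernel were applied as one full 2D kernel."""
    return width * height * samples * samples


print(two_pass_blur_cost(1920, 1080, 32)["fetches"])   # 132710400
print(single_pass_blur_fetches(1920, 1080, 32))        # 2123366400
```

Splitting the blur into two passes cuts the fetches by a factor of 16 at 32 samples, but on a 1080p screen it still means well over a hundred million texture fetches per frame, 60 times per second.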


 

The Jolla managed 55-ish fps with 2 and 3 samples and i7/kwin/composited maxed out at 30 fps, which is why they are marked with 0. Neither managed to sustain a velvet frame rate. The only chip that managed to run with a high sample count was the discrete graphics chip on the MacBook. This is in line with what is expected: as the complexity of the per-pixel operation grows, so do the requirements on the graphics hardware. What we can read from this is that these kinds of operations need to be used with some care, applied to smaller areas or otherwise in moderation.


There are alternatives though. For instance, it is possible to do fast blurring even on low-end hardware using a combination of downscaling, simplified blurring and upscaling, such as this. For drop shadows, prefer to use pre-generated ones. Only use a live drop shadow if the area is small and there are few instances, or if you know you’ll be running on a high-end chip. Keep in mind that cheating looks equally good. It just runs faster!
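
To illustrate why the combination of downscaling, simplified blurring and upscaling is so much cheaper, here is a rough fetch count. The cost model (one fetch per output pixel for each resampling step, two blur passes) and the parameter values in the example are assumptions, not taken from any particular implementation.

```python
def blur_fetches(width, height, samples, downscale=1):
    """Texture fetches per frame for a two-pass blur at reduced resolution.

    downscale=4 means the blur runs on a (width/4) x (height/4) copy.
    The downscale and upscale steps are each counted as one fetch per
    output pixel; real implementations will differ.
    """
    small_w = max(1, width // downscale)
    small_h = max(1, height // downscale)
    blur = small_w * small_h * 2 * samples
    if downscale == 1:
        return blur
    resample = small_w * small_h + width * height
    return blur + resample


full = blur_fetches(1920, 1080, samples=32)
cheap = blur_fetches(1920, 1080, samples=8, downscale=4)
print(full, cheap, round(full / cheap))  # 132710400 4276800 31
```

An 8-tap kernel on a quarter-size image covers about the same screen area as a 32-tap kernel at full size, at roughly one thirtieth of the fetches in this model.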


Number of Draw Calls 

The other factor I mentioned up top as worth looking at is the number of draw calls. Compared to software graphics, this is where hardware-based APIs are much worse. In many ways, the problem to solve is the inverse. With software graphics, at least when looking at QPainter with its software rasterizer, draw calls are cheap and the impact of state changes in the graphics pipeline is small. With OpenGL, and DirectX for that matter, pipeline changes and draw calls can be quite expensive. As part of the benchmarks, I’ve created a bunch of unique ShaderEffect items. The scene graph renderer cannot batch these, so they will all be scheduled using one glDrawElements() call per item. Let's look at some numbers:



One conclusion to draw from this is that without any form of abstraction layer to handle batching, it is going to be hard to do complex controls, such as a table, using a hardware-accelerated API. In fact, this is one of the primary reasons we’ve been pushing the raster engine for the widgets stack in Qt. If we compare this to items that do batch, one of Qt’s autotests will create 100,000 rectangles and can translate them at 60 fps, so the difference is quite significant.

 

Something else to take note of is that the scene graph is in no way perfect. There are several types of QML-level operations which will force items to be rendered one by one. ShaderEffect is one. Item::clip is another. The following is a visualization of the “Itemviews” page of the Qt Quick Controls “Gallery” example using QSG_VISUALIZE=batches in the environment:



 

At first glance this looks a bit like a Christmas tree. If we look beyond that, we see that the majority of the list is drawn in three separate colors. That means that the various background and text elements of the list view have been batched together. Running the same page with QSG_RENDERER_DEBUG=render, we can see from the console output that 109 nodes were compressed to 8 batches. If we added clipping to those list entries, those 109 nodes would be drawn using separate glDrawXxx() calls and eat quite a bit out of our performance budget.
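
The effect of batching and clipping on the draw call count can be illustrated with a toy model. This is deliberately naive and is not the Qt Quick renderer's actual algorithm: it assumes the best case where all unclipped nodes sharing a material merge into one batch, and that every clipped node needs its own draw call. The material names are hypothetical.

```python
def count_draw_calls(nodes):
    """Best-case draw call count for a list of (material, clipped) nodes.

    Unclipped nodes that share a material are assumed to merge into a
    single batch. Every clipped node is assumed to be drawn on its own.
    """
    batched_materials = {material for material, clipped in nodes if not clipped}
    clipped_nodes = sum(1 for _, clipped in nodes if clipped)
    return len(batched_materials) + clipped_nodes


# 109 nodes spread over 8 hypothetical materials, first unclipped, then clipped.
unclipped = [("material%d" % (i % 8), False) for i in range(109)]
clipped = [(material, True) for material, _ in unclipped]
print(count_draw_calls(unclipped), count_draw_calls(clipped))  # 8 109
```

The numbers mirror the 109 nodes and 8 batches reported above; a single property on the delegate is enough to go from the first figure to the second.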

 

Another thing that breaks batching is large images, as these do not fit into the scene graph’s built-in texture atlas. If you are curious whether the application’s images are atlased, run the application with QSG_ATLAS_OVERLAY=1 and look for tinted images.
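
The three environment variables used in this post are set without spaces around the equals sign. Here is a small helper sketch for launching an application with them; the helper names and the "./myapp" path are hypothetical, and the list of visualization modes is what I know the Qt 5 scene graph to accept.

```python
import os
import subprocess

# Modes accepted by QSG_VISUALIZE in the Qt 5 scene graph renderer.
VISUALIZE_MODES = {"overdraw", "batches", "clip", "changes"}


def scenegraph_debug_env(visualize=None, renderer_debug=None,
                         atlas_overlay=False, base=None):
    """Return a copy of the environment with scene graph debug variables set."""
    env = dict(os.environ if base is None else base)
    if visualize is not None:
        if visualize not in VISUALIZE_MODES:
            raise ValueError("unknown QSG_VISUALIZE mode: %r" % visualize)
        env["QSG_VISUALIZE"] = visualize
    if renderer_debug is not None:
        env["QSG_RENDERER_DEBUG"] = renderer_debug
    if atlas_overlay:
        env["QSG_ATLAS_OVERLAY"] = "1"
    return env


def run_with_debug(command, **options):
    """Run `command` (a list of arguments) with the debug environment applied."""
    return subprocess.call(command, env=scenegraph_debug_env(**options))
```

For example, `run_with_debug(["./myapp"], visualize="batches")` shows the batch coloring, and `run_with_debug(["./myapp"], renderer_debug="render")` prints the node and batch counts to the console.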

 

Closing thoughts

 

Benchmarks are one thing and the real world is another, but keeping basic rules in mind based on findings in benchmarks can greatly help the overall performance of the resulting application. It is one of the premature optimizations that do pay off. If application performance starts slipping, it can take a lot of work to get it back.

Categories: Graphics, Qt, Benchmarks

11 Comments

Reply Federico
8:30 AM on October 4, 2014 
Great post! Explains a lot of what impacts performance. Blended content makes a lot of difference when it comes to animation and performance. It isn't very clear which primitives are actually blended or opaque; the group boxes in the example look like rectangles with rounded edges and no transparency, at least at first sight. Are rounded edges that makes them become opaque?
Is there any opaque content with non-rectangular shape?
Reply Nils
9:25 AM on October 4, 2014 
Thanks for the article. It's a welcome addition to http://doc-snapshot.qt-project.org/qt5-5.4/qtquick-visualcanvas-scenegraph-renderer.html, which is sadly quite hidden but a must read for every QtQuick 2 user.

Not really scene graph related but because you mention it in the batching example: IMHO the current QtQuick.Controls TableView implementation is sub-optimal. Compared with QtWidgets TreeView it's a total failure regarding performance and also memory usage. This is not a fault of the scene graph, TableView relies way too much on Javascript - just check the code that tries to find out which type of model is used! All the control's logic and delegate caching should be implemented in a C++ backend.
Reply Gunnar Sletta
9:49 AM on October 4, 2014 
Federico: "Are rounded edges that makes them become opaque?" No, they are blended because the entire control is rendered as one big texture. You might notice that the label doesn't stand out as a separate element either :) The way most controls are implemented in the default desktop style is by using the internal StyleItem, which opens a QPainter on a QImage, then uses QStyle to render the control, then the image is uploaded as a texture and cached. All elements rendered via StyleItem will be blended. Very functional, but not optimal.

Federico: "Is there any opaque content with non-rectangular shape?" Using the scene graph API, you can create any shape you like and have it be opaque. https://github.com/qtproject/playground-scenegraph/tree/master/shapes implements arbitrary vector paths, for instance. However, if there is antialiasing involved, then we also need to do blending. The shapes that are opaque by default are Rectangle, Image and BorderImage, when antialiasing is disabled.

Nils: Yeah, I agree that complex controls should be implemented using C++. JS is fast, but C++ is both faster and more memory efficient.
Reply Nils
9:54 AM on October 4, 2014 
@gunnar: regarding non-opaque images: Do you think it would make sense to add some kind of backgroundColor property to QtQuick's Image item for pre-blending transparent images? In most real life applications controls like an IconButton add quite a lot of transparent images to the scene graph although the background color is known (button face or toolbar color) and rarely changes,
Reply Gunnar Sletta
2:54 PM on October 4, 2014 
@Nils: You can actually do this already today using a ShaderEffect element. Since 5.4, thanks to Michael Brasser's changes, shader effects which specify supportsAtlasTextures and which otherwise have the same uniform properties will support batching. This is how you would do it (code untested):

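// OpaqueTintedIcon.qml
import QtQuick 2.4

ShaderEffect {
    // Accept atlas textures so that instances with identical uniforms can batch (Qt 5.4+).
    supportsAtlasTextures: true
    // The output is fully opaque, so blending can be turned off.
    blending: false
    property color background
    property Image image
    fragmentShader: "
        uniform lowp sampler2D image;
        uniform lowp vec4 background;
        varying highp vec2 qt_TexCoord0;
        void main() {
            lowp vec4 p = texture2D(image, qt_TexCoord0);
            // Assumes premultiplied alpha in the texture: source-over onto the opaque background.
            gl_FragColor = p + background * (1.0 - p.a);
        }
    "
}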

I'm not sure it will result in much difference though. Icons are usually pretty small, after all. For medium to large sized images, it could be a big deal, of course.
Reply Jens
6:52 AM on October 7, 2014 
Nils says...
Not really scene graph related but because you mention it in the batching example: IMHO the current QtQuick.Controls TableView implementation is sub-optimal. Compared with QtWidgets TreeView it's a total failure regarding performance and also memory usage. This is not a fault of the scene graph, TableView relies way too much on Javascript - just check the code that tries to find out which type of model is used! All the control's logic and delegate caching should be implemented in a C++ backend.

Just a comment regarding TableView performance. Make sure you compare using recent versions. If you compare the raw performance of TableView as it is in 5.3.2, it performs no worse than a _similarly_ complex ListView. In fact it generally performs much better due to the implicit item re-use we introduced in 5.3. The actual cache overhead does not register on a profiler at all, so moving this code to C++ would not have measurably improved the performance. We did quite a bit of profiling to get this faster in 5.3, including making use of the tools Gunnar mentioned above, and the performance has already been vastly improved from the initial release.

There is unfortunately no _single_ way to make it perform much faster right now, judging by the actual profiling data, and it is somewhat unfair to compare it to older item views, as supporting dynamic properties and binding evaluation during scrolling is a lot more demanding than merely reflecting a set of static data. Unfortunately binding evaluation and object creation do not magically go away by moving these things to C++. The Qt Quick compiler largely does this for us already. That said, we will certainly focus on making this perform even better in the future, including optimizing things using C++ where it matters.

Make sure to compare performance using release builds of Qt though, as debug builds are massively slower. As far as memory is concerned, the heaviest overhead right now is the cache from pixmap-based background items. This should be a static memory cost. By setting a custom row delegate, you can reduce this cost significantly, as that is where the pixmap generation happens.
Reply Nils
5:32 PM on October 8, 2014 
@jens: Yes, performance has certainly improved.

My main issues are with memory. If you apply this patch http://pastebin.com/39XKPJK5 to the Gallery example and start the timer you can watch unbounded growth of memory.

Using the "mark generation" feature in Instruments it seems as if there is a leak in Qt signal/slot connection management.
Reply Jens
5:39 AM on October 13, 2014 
Nils says...
@jens: Yes, performance has certainly improved.
My main issues are with memory. If you apply this patch http://pastebin.com/39XKPJK5 to the Gallery example and start the timer you can watch unbounded growth of memory.

Did you already file a bug report for this? It's the first time I have heard about this but certainly not impossible that there is a memory issue.
Reply Nils
9:32 AM on October 13, 2014 
Jens says...
Did you already file a bug report for this?


Just did: QTBUG-41899.
Reply Craig Matsuura
3:41 PM on November 21, 2015 
Great article and benchmarks. I'm looking forward to running the benchmarks on our embedded device to see what we have to work with. Your info is invaluable and greatly appreciated.
Reply Federico
2:30 PM on March 3, 2017 
Hi, I'm very new to Qt and Qt Quick and I have a question. I'm encountering very bad slowdowns with Qt even when rendering very simple scenes. I tested on Windows, Qt 5.8, with an old GPU (Firepro 2660, and other Intel integrated GPUs). By running the "customgeometry" example and setting the window to fullscreen (less than Full HD) the framerate drops to less than 5 fps. Now even if the GPU is very slow it should easily render the scene at 60+ fps in a normal OpenGL application. I also noticed that at fullscreen the GPU memory utilization goes beyond 200 MB just for an empty scene. This basically means terrible performance on anything with less than 512 MB of VRAM at Full HD. What is happening behind the scenes that uses all this VRAM and causes the slowdown? Thank you very much.