|Posted by firstname.lastname@example.org on October 11, 2014 at 4:05 AM|
I mentioned before how blurring is a pretty expensive operation so it got me thinking. For dynamic content, we need to do the full thing, and what we have in Qt Graphical Effects is decent, but for static images there are techniques to do it a lot faster.
We can load images from disk, do a bit of processing, and then be able to dynamically animate from sharp to heavily blurred with close to the same rendering performance as blending a normal image, and with minimal memory overhead. Inspired by articles like this and this, I implemented something I think can be quite useful in Qt Quick. This is done by using custom mipmaps with mipmap bias. A QML module implementing this feature can be found here.
Below are two screenshots from the test application. The blur ratio is adjusted to keep focus on either the droids in the foreground (topmost) or on the walkers in the background (lowermost).
Mipmapping is normally used to improve performance and reduce aliasing in downscaled textures. In our case, we want to use the mipmap levels to render blurred images instead. Look at the following example:
The original image is on the left. Towards the right are the mipmap levels 1, 2 and 3 for this image, generated using glGenerateMipmaps(). The lower row shows the mipmap levels in their original size and the topmost shows the mipmap levels scaled back up to the size of the original image. Using mipmap bias, an optional argument to the texture sampling function, we can pick a higher mipmap level. When combined with GL_LINEAR_MIPMAP_LINEAR, we can interpolate both in x and y and also between the mipmap levels, seamlessly moving from sharp to coarse. However, the default mipmaps give quite blocky results.
The mipmap levels above are created automatically using glGenerateMipmaps(). It is possible to specify each level manually as well. In the following image, I’ve created mipmap levels by scaling the image down with linear sampling (basically 2x2 box sampling), then applied a 3x3 box blur. The process is repeated for each level based on the previous image. It is not a true gaussian blur, but it comes out quite decent.
The result is that the mipmap levels are now blurred instead of blocky. Using these mipmap levels and the mipmap bias, we can now select an arbitrary blurryness with a single texture2D() call in the fragment shader. The resulting performance is equivalent to that of blending a normal image, which means this approach is fully doable on mobile GPUs. The mipmap levels adds about 1/3 to the memory consumption, which is quite acceptable.
Please note that because of the manual downscaling and box blur, the initial texture upload is significantly slower than a plain upload, so it needs be done in controlled places, like a loading screen or during application startup, to avoid hickups in the animation. For a very controlled setup, one could imagine generating all the mipmap levels offline and just upload them in the application. That would save a bit of time, but makes it a bit less flexible. Note also that mipmapping does not mix with a texture atlas, so each image needs to be a separate texture, which means the blurred images do not batch in the scene graph renderer. That means that while we can have several large images being blurred, we cannot have hundreds or thousands of small ones in a scene and expect to sustain a velvet 60 fps.
So, if you want to incorporate animated blurs of static images into your application or game, have a look at BlurredImage.
|Posted by email@example.com on October 2, 2014 at 2:35 PM|
I already talked about the swap test which helps us determine if velvet graphics is indeed possible. The second step I usually take is to benchmark how much graphics we can put on screen before things starts to stutter.
If we focus on pure graphics for now, there are two things worth looking at. Fillrate, which is the systems ability to put pixels on screen; and number of draw calls, which is the number of times we can tell the system to draw something.
When working with software graphics, fill rate is usually the biggest obstacle. The CPU needs to process millions of pixels per frame, potentially both reading and writing. The memory bus will be heavily taxed, and this will eat from the performance budget of the rest of the application. When you have a dedicated GPU however, fill rate is usually not the problem. I’m speaking now in the context of UIs, as both industrial 3D applications and games can easily push the GPU beyond its limits.
A couple of things affect our pixel throughput:
Doing source over alpha blending requires that we take our source pixel and mix it with the destination pixel. The destination pixel is both read and written. So there is a bit of extra work involved. In addition, some GPUs can do even further optimizations like hidden surface removal and early z-test. Such tricks are great for overdraw performance, but they only work when blending is turned off (as in glDisable(GL_BLEND)). On a desktop with a discrete graphics card, chances are this will never be a problem, but for an onboard laptop or mobile/embedded GPU, it just might.
Another aspect that can greatly reduce the throughput of an application is textures. Textures are memory blobs and need to be fetched from graphics memory. The GPU doesn’t always have a large cache, so that means that a lot cycles are spent just fetching texture memory. Working with small textures or small regions of a large texture will in most cases be cheap, while working with full screen images will take its toll.
The part of the GPU pipeline that decides the color of a pixel is called a fragment shader. There are other shader stages, like the vertex shader, but the ratio of pixels to vertices is usually so high that the fragment shader is the one that ends up counting. With user interfaces, the majority of fragment shaders are quite simple. The Qt Quick Scene Graph will typically alternate between “colored pixel”, “textured pixel” and “distance field textured pixel”. They translate into a few GPU instructions each. Compared to 3D graphics, we don’t have to deal with lights, normals, bump mapping, shadows and all the other goodness that modern 3D has to offer. There are a couple of exceptions though. For instance anything involving blurring, like Qt Graphical Effects’ GaussianBlur or DropShadow, requires a lot of texture samples to produce a single output pixel. It is very likely that an onboard laptop chip is not capable of running live Gaussian Blur with 32 samples at 60 fps.
So all in all, assuming that the underlying OpenGL stack works properly with vsynced swap and all, raw GPU throughput will usually not be the biggest problem (for user interfaces). Lets look at some numbers.
The benchmark can be found here. It creates a fullscreen Qt Quick window and draws variety of stuff into it. The goal is to see how far we can go with a specific testcase while sustaining a perfect 60 fps. Skipping one frame every 2-3 seconds is in this case considered a failure. Even though we're testing raw graphics performance I use Qt Quick because it is easy to put things together, and for the testcases I've written, the delta between raw OpenGL and what the OpenGL that the scene graph produces is small enough to not impact the results. When they do, I will make a comment about it.
Note: If you try the benchmark, you will see that it does skip frames while increasing and reducing complexity. I'll get back to this in a later post.
I've run the benchmarks on the following hardware:
What we can see in the graphs confirms what I already talked about. Solid opaque fills are generally extremly cheap, even the mobile GPUs can do 40-50 fullscreen rectangles on top of each other. Same with opaque textured fills, though we should keep in mind here that the scene graph renders opaque content front-to-back with z buffer enabled, so GPUs that implement early-z will end up only having to read the front-most texture. Blending is worse with blended textures being the most costly, though still decent. It is however, something that is worth taking note of. Take, for instance, the following example based on Qt Quick Controls:
The left image is how it is rendered normally, and the right image is rendered using QSG_VISUALIZE = overdraw in the environment. As can be seen from the visualization, the background is opaque (green) and can be ignored. Then are the three group boxes stacked on top of each other. Each rendered as a separate blended (red) texture. If this or a similar pattern of background stacking was used on a machine that has either a lowend GPU or a GPU that doesn't match its screen size, it would eat a large part of the performance budget. And this is before we start adding the actual application content.
The graphs also seem to indicate that both kwin and unity compositors are quite bad for performance. I could maybe tolerate a 10-20% drop for going through the composition step, but what I'm measuring does not seem right. If anybody knows what's up with that, please give me ping.
I should also mention that the iPad didn't start skipping frames when I reached 30 opaque textures. It ran out of memory and was killed!
When running similar benchmarks previously, I have seen embedded chips with overdraw performance as low as 1.5. That means the application can fill the screen 1.5 times before the application starts stuttering. In terms of content, that means a background image and a few icons and some text. Not a lot of flexibility. When working with such a system, the application will need to be very careful about what gets drawn if a sustained 60 fps UI is desirable.
What about more complex shaders
I mentioned that complex shaders would be problematic, especially on the less powerful GPUs. To test this, I incorporated the GaussianBlur from QtGraphicalEffects into the benchmark and tested how many samples I could have before it started to stutter. Now, the gaussian blur is implemented as a two pass algorithm. That means that in addition to doing a lot of texture sampling, we’ll also be rendering the equivalent of the screen several times per frame. First into the FBO which will be the source for the first pass. Then blur in one direction into a second FBO. Then blur the second FBO in the other direction onto the screen.
The Jolla managed 55-ish fps with 2 and 3 samples and i7/kwin/composited maxed out at 30 fps, which is why they are marked with 0. Neither managed to sustain a velvet frame rate. The only chip that managed to run with a high sample count, was the discrete graphics chip on the MacBook. This is in line with what is expected, as the complexity of the per-pixel operation grows, so does the requirement for the graphics hardware. What we can read from this is that these kinds of operations needs to be taken into use with some care, applied to smaller areas or otherwise in moderation.
There are alternatives though. For instance, it is possible to do fast blurring even on lowend using a combination of downscaling, simplified blurring and upscaling, such as this. For drop shadows, prefer to use pre-generated ones. Only use live drop shadow if the area is small and there are few instances or you know you’ll be running on a high end chip. Keep in mind that cheating looks equally good. It just runs faster!
Number of Draw Calls
The other factor I mentioned up top which was worth looking at was the number draw calls. When compared to software graphics, this is where hardware based APIs are much worse. In many ways, the problem to solve is the inverse. With software graphics, at least when looking at QPainter with its software rasterizer, draw calls are cheap and the impact of state changes in the graphics pipeline are small. With OpenGL, and DirectX for that matter, pipeline changes and draw calls can be quite bad. As part of the benchmarks, I’ve created a bunch of unique ShaderEffect items, the scene graph renderer can not batch these, so they will all be scheduled using one glDrawElements() call per item. Lets look at some numbers:
One conclusion to draw from this is that without any form of abstraction layer to handle batching, it is going to be hard to do complex controls, such as a table, using a hardware accelerated API. In fact, this is one of the primary reasons we’ve been pushing the raster engine for the widgets stack in Qt. If we compare this to items that do batch, one of Qt’s autotests will create 100.000 rectangles and can translate them at 60 fps, so the difference is quite significant.
Something else to take note of is that the scene graph is in no way perfect. There are several types of QML level operations which will force items to be rendered one by one. ShaderEffects is one. Item::clip is another. The following is a visualization of Qt Quick Control’s “Gallery” example’s “Itemviews” page using QSG_VISUALIZE = batches in the environment:
At first glance this looks a bit like a christmas tree. If we look beyond that, we see that the majority of the list is drawn in three separate colors. That means that the various background and text elements of the list view have been batched together. Running the same page with QSG_RENDERER_DEBUG = render, we can see from the console output that 109 nodes were compressed to 8 batches. If we added clipping to those list entries, those 109 nodes would be drawn using separate glDrawXxx() calls and eat quite a bit out of our performance budget.
Another thing that breaks batching is large images, as these do not fit into the scene graph’s built-in texture atlas. If you are curious if the application’s images are atlassed, run the application with QSG_ATLAS_OVERLAY = 1 and look for tinted images.
Benchmarks are one thing and the real world is another, but keeping basic rules in mind based on findings in benchmarks can greatly help the overall performance of the resulting application. It is one of the premature optimizations that do pay off. If application performance starts slipping, it can take a lot of work to get it back..
|Posted by firstname.lastname@example.org on September 20, 2014 at 4:55 AM|
This is a test that I usually perform as a first step when encountering a performance issue on a new machine. It establishes whether or not velvet animations are possible at all. It is super simple and works pretty much everywhere. My equivalent of glxgears, you might say. The idea is to show a fullscreen rectangle and alternate the color between red and blue (or any combo of your choice) as fast as swapBuffers() allows. If the visual output is a shimmering, slightly uncomfortable, almost solid pink, then swapping works as it should.
People having issues with flashing light should avoid this particular test.
Because of a phenomenon we call persistence of vision, the red and blue color will stay in our retina for a fraction of a second after we switch to the alternate color. As long as the color is changed at fast and regular intervals we will observe a solid, shimmering, pink. If the color is changing at uneven intervals, it becomes really obvious. In theory this should be a matter of setting swap interval to 1, but it doesn’t always work. Perhaps more so on Linux.
If there is one or more horizontal lines across the screen, this is an indication of tearing. If the screen flashes with frames in which you can clearly see the red and blue colors, it is because frames are being dropped or because frames are rendered too fast. Check the driver and/or the compositor for vsync settings. Disabling the compositor is another thing that might help. There are a number of things that can be tweaked. Generally, I've found proprietary NVidia and AMD drivers produce solid results out of the box.
If the “QML (via animation system)” test fails and the others pass, you might be looking at a case of timer driver animations. On Linux, we’re using the basic render loop for mesa drivers. Run it again with QSG_INFO=1 in the environment and see if it prints out “basic render loop”. This was chosen some time ago as a sensible fallback in case we didn’t know what we were dealing with. Our primary worry was spinning at 100% when the GL stack was not throttling us, so safer to tick by timers to draw only every 16 ms. As you probably know, ticking animations based on timers does not lead to smooth results, so an alternative is to specify QSG_RENDER_LOOP=windows (yeah, windows… just accept it) in the environment. This render loop will tick animations in sync with buffer swapping and will look much smoother.
|Posted by email@example.com on September 15, 2014 at 4:10 AM|
I wanted to start this out with a definition of what is good enough and why. I already covered some of this some time ago here, but I wanted to tidy it up a bit and have it here for completeness, in a framework agnostic fashion.
In real life, objects will move, accelerate and decelerate, but very few things can teleport. So for our brain to perceive natural motion, objects on the screen need to follow the same rules. Their velocity, acceleration and deceleration needs to be steady. Say an object moves with 1 pixel per millisecond and our display refreshes at 60 Hz (meaning how fast the display is capable of updating the content on screen). For its motion to be steady, the object should move exactly 16.6 pixels per frame and the frames would have to be presented to the viewer at exactly 16.6 ms apart.
Both of these factors are important. If the object moves between 10-22 pixels per frame and is presented at 16.6 ms apart (with the help of vsync) it will look bad. If the objects are moving with 16.6 pixels per frame, but skips every ten frames, it will also look bad. When the application is producing 60 frames per second (fps), it will line up with the capacity of the display and it will be velvet.
- Ok, so my application runs at 50-something fps That is pretty good, isn’t it?
- Sorry, but no..
Assuming a 60 Hz display, if the application runs at 59 fps, this means that once per second, the velvet motion will stop. Some people see this, some don’t, but we all notice it. Put something that runs at 59 next to something that runs at 60 (assuming a refresh rate of 60) and anyone can tell you that 60 fps feels more right.
For almost a full second the application has been tricking your brain into thinking that it is observing objects moving about naturally. Then suddenly, they teleport. The illusion stops... Running close to the display’s frame rate is almost worse, because it is almost there so the shattering of the illusion is that much greater. If you cannot hit exactly 60, aim for a higher swap interval. It does mean that the observant and trained eye will be able to see discrete frames, but at least the motion will be steady. For a 60 Hz display, animations running at 30 fps will look better than animations running at 40 or 59 fps.
- Ok, but my application runs at 79 FPS, isn’t that awesome?
- Nope, equally bad..
Being able to run with a very high frame rate can serve a purpose. It gives an indication as to how much slack there is in the system before things turn sour and go below the display's frame rate. It also gives you an indication as to how high the CPU load will be and what the power consumption is once you fix it to run at the display's frame rate.
However, it does mean that it is not smooth. Either frames are being presented faster than the frame rate of the display, which leads to tearing. Or the frame is simply discarded, which means that the movement in pixels per visible frame is not steady.
Also, in most cases, this number is an average over time. A number of 79 could mean that most frames rendered in 12 ms, but there could be the odd frame rendering in 30 ms, leading to an overall high frame rate, but very uneven movement. Because animations anyway look crap at this frame rate, you don’t notice, so you are actually damaging the end result.
If you want to measure slack, disable the CPU and GPU scaling governors, run the application at exactly the display's frame rate and measure the CPU load and aim for as low as possible. This will let you catch bad single frames while measuring the live sustained performance of the application.
- So, is 60 fps the ultimate framerate?
- No, but it is good enough and often the best option available.
Though there exists higher frame rate devices, the majority of consumer devices, be it mobile phones, tablets, TV sets or laptops, tend to aim for either 50 Hz or 60 Hz, so these are the numbers to work with. When we, some time in the future, standardize on 100 Hz, 120 Hz or higher, the difference will be the animation and motion equivalent to retina displays. In the meantime, we'll have to make the best of what we have.
So, to sum it up:
Aim to run the application at exactly the frame rate of the target display. Not a single frame higher and not a single frame lower. If that is not possible after redesigning for performance and applying all micro optimizations, then aim for a higher swap interval. For instance, by halving the frame rate.