|Posted by firstname.lastname@example.org on October 11, 2014 at 4:05 AM|
I mentioned before how blurring is a pretty expensive operation so it got me thinking. For dynamic content, we need to do the full thing, and what we have in Qt Graphical Effects is decent, but for static images there are techniques to do it a lot faster.
We can load images from disk, do a bit of processing, and then be able to dynamically animate from sharp to heavily blurred with close to the same rendering performance as blending a normal image, and with minimal memory overhead. Inspired by articles like this and this, I implemented something I think can be quite useful in Qt Quick. This is done by using custom mipmaps with mipmap bias. A QML module implementing this feature can be found here.
Below are two screenshots from the test application. The blur ratio is adjusted to keep focus on either the droids in the foreground (topmost) or on the walkers in the background (lowermost).
Mipmapping is normally used to improve performance and reduce aliasing in downscaled textures. In our case, we want to use the mipmap levels to render blurred images instead. Look at the following example:
The original image is on the left. Towards the right are the mipmap levels 1, 2 and 3 for this image, generated using glGenerateMipmap(). The lower row shows the mipmap levels in their original size and the upper row shows the mipmap levels scaled back up to the size of the original image. Using mipmap bias, an optional argument to the texture sampling function, we can pick a higher mipmap level. When combined with GL_LINEAR_MIPMAP_LINEAR, we can interpolate both in x and y and also between the mipmap levels, seamlessly moving from sharp to coarse. However, the default mipmaps give quite blocky results.
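To make the interpolation between levels concrete, here is a minimal Python sketch of what trilinear filtering does with a fractional level of detail. It assumes the image is drawn at 1:1, so the base level is 0 and the bias alone selects the level. The function name and the per-level input values are made up for illustration; a real GPU does all of this in hardware.

```python
def sample_with_bias(level_values, bias):
    """Blend between two neighbouring mipmap levels.

    level_values[i] is the (already bilinearly filtered) value that
    mipmap level i returns at the texture coordinate being sampled.
    bias is the fractional level of detail, assuming a base level of 0.
    """
    # Clamp the requested level to the levels that actually exist.
    lod = min(max(float(bias), 0.0), len(level_values) - 1)
    lower = int(lod)
    upper = min(lower + 1, len(level_values) - 1)
    t = lod - lower
    # Linear interpolation between the two nearest levels.
    return level_values[lower] * (1.0 - t) + level_values[upper] * t
```

A bias of 0 returns the sharp level 0 value, a bias of 0.5 returns a 50/50 mix of levels 0 and 1, and anything beyond the last level is clamped to the coarsest one.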
The mipmap levels above are created automatically using glGenerateMipmap(). It is possible to specify each level manually as well. In the following image, I’ve created mipmap levels by scaling the image down with linear sampling (basically 2x2 box sampling), then applied a 3x3 box blur. The process is repeated for each level based on the previous image. It is not a true Gaussian blur, but it comes out quite decent.
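The downscale-then-blur process can be sketched in a few lines of Python. This is an illustration, not the code from the QML module: it works on a single-channel image stored as a list of rows, clamps at the edges, and all function names are hypothetical.

```python
def downscale_2x2(img):
    """Halve width and height by averaging each 2x2 block (box sampling)."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(h // 2, 1), max(w // 2, 1)
    out = []
    for y in range(new_h):
        y0, y1 = min(2 * y, h - 1), min(2 * y + 1, h - 1)
        row = []
        for x in range(new_w):
            x0, x1 = min(2 * x, w - 1), min(2 * x + 1, w - 1)
            row.append((img[y0][x0] + img[y0][x1] +
                        img[y1][x0] + img[y1][x1]) / 4.0)
        out.append(row)
    return out


def box_blur_3x3(img):
    """Average every pixel with its 8 neighbours, clamping at the edges."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out


def build_blurred_mip_chain(img):
    """Return [level 0, level 1, ...] down to 1x1.

    Each level is built from the previous one: downscale, then blur.
    """
    levels = [img]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(box_blur_3x3(downscale_2x2(levels[-1])))
    return levels
```

Each resulting level would then be uploaded as its own mipmap level instead of letting the driver generate it.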
The result is that the mipmap levels are now blurred instead of blocky. Using these mipmap levels and the mipmap bias, we can now select an arbitrary blurriness with a single texture2D() call in the fragment shader. The resulting performance is equivalent to that of blending a normal image, which means this approach is fully doable on mobile GPUs. The mipmap levels add about 1/3 to the memory consumption, which is quite acceptable.
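The 1/3 figure follows from each level having a quarter of the texels of the one before it, so the extra levels add 1/4 + 1/16 + 1/64 + ... of the base size, which converges to 1/3. A small Python check, with a hypothetical function name and an arbitrary example size:

```python
def mip_chain_texels(width, height):
    """Total number of texels in a full mipmap chain down to 1x1."""
    total = 0
    w, h = width, height
    while True:
        total += w * h
        if w == 1 and h == 1:
            break
        # Each level halves both dimensions, never going below 1.
        w, h = max(w // 2, 1), max(h // 2, 1)
    return total


base = 1024 * 1024
overhead = mip_chain_texels(1024, 1024) / base - 1.0
print("mipmap overhead: %.4f" % overhead)
```

For a 1024x1024 texture the overhead comes out just under one third of the base image.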
Please note that because of the manual downscaling and box blur, the initial texture upload is significantly slower than a plain upload, so it needs to be done in controlled places, like a loading screen or during application startup, to avoid hiccups in the animation. For a very controlled setup, one could imagine generating all the mipmap levels offline and just uploading them in the application. That would save a bit of time, but makes it a bit less flexible. Note also that mipmapping does not mix with a texture atlas, so each image needs to be a separate texture, which means the blurred images do not batch in the scene graph renderer. That means that while we can have several large images being blurred, we cannot have hundreds or thousands of small ones in a scene and expect to sustain a velvet 60 fps.
So, if you want to incorporate animated blurs of static images into your application or game, have a look at BlurredImage.
|Posted by email@example.com on October 2, 2014 at 2:35 PM|
I already talked about the swap test which helps us determine if velvet graphics is indeed possible. The second step I usually take is to benchmark how much graphics we can put on screen before things start to stutter.
If we focus on pure graphics for now, there are two things worth looking at: fill rate, which is the system's ability to put pixels on screen, and number of draw calls, which is the number of times we can tell the system to draw something.
When working with software graphics, fill rate is usually the biggest obstacle. The CPU needs to process millions of pixels per frame, potentially both reading and writing. The memory bus will be heavily taxed, and this will eat from the performance budget of the rest of the application. When you have a dedicated GPU however, fill rate is usually not the problem. I’m speaking now in the context of UIs, as both industrial 3D applications and games can easily push the GPU beyond its limits.
A couple of things affect our pixel throughput:
Doing source over alpha blending requires that we take our source pixel and mix it with the destination pixel. The destination pixel is both read and written. So there is a bit of extra work involved. In addition, some GPUs can do even further optimizations like hidden surface removal and early z-test. Such tricks are great for overdraw performance, but they only work when blending is turned off (as in glDisable(GL_BLEND)). On a desktop with a discrete graphics card, chances are this will never be a problem, but for an onboard laptop or mobile/embedded GPU, it just might.
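As a reference for the extra work blending involves, here is source-over written out for a single pixel in Python. The sketch assumes premultiplied alpha (color channels already multiplied by alpha), which is an assumption on my part and not something stated above; the function name is hypothetical.

```python
def source_over(src, dst):
    """Blend one premultiplied (r, g, b, a) pixel over another.

    Every channel needs the destination to be read, scaled by the
    inverse source alpha, added to the source and written back.
    """
    inverse_alpha = 1.0 - src[3]
    return tuple(s + d * inverse_alpha for s, d in zip(src, dst))
```

When the source alpha is 1.0, the destination contributes nothing. That is why opaque content can skip the destination read and be drawn with blending turned off.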
Another aspect that can greatly reduce the throughput of an application is textures. Textures are memory blobs and need to be fetched from graphics memory. The GPU doesn’t always have a large cache, so a lot of cycles are spent just fetching texture memory. Working with small textures or small regions of a large texture will in most cases be cheap, while working with full screen images will take its toll.
The part of the GPU pipeline that decides the color of a pixel is called a fragment shader. There are other shader stages, like the vertex shader, but the ratio of pixels to vertices is usually so high that the fragment shader is the one that ends up counting. With user interfaces, the majority of fragment shaders are quite simple. The Qt Quick Scene Graph will typically alternate between “colored pixel”, “textured pixel” and “distance field textured pixel”. They translate into a few GPU instructions each. Compared to 3D graphics, we don’t have to deal with lights, normals, bump mapping, shadows and all the other goodness that modern 3D has to offer. There are a couple of exceptions though. For instance anything involving blurring, like Qt Graphical Effects’ GaussianBlur or DropShadow, requires a lot of texture samples to produce a single output pixel. It is very likely that an onboard laptop chip is not capable of running live Gaussian Blur with 32 samples at 60 fps.
So all in all, assuming that the underlying OpenGL stack works properly with vsynced swap and all, raw GPU throughput will usually not be the biggest problem (for user interfaces). Let’s look at some numbers.
The benchmark can be found here. It creates a fullscreen Qt Quick window and draws a variety of stuff into it. The goal is to see how far we can go with a specific testcase while sustaining a perfect 60 fps. Skipping one frame every 2-3 seconds is in this case considered a failure. Even though we're testing raw graphics performance I use Qt Quick because it is easy to put things together, and for the testcases I've written, the delta between raw OpenGL and the OpenGL that the scene graph produces is small enough to not impact the results. When it does, I will make a comment about it.
Note: If you try the benchmark, you will see that it does skip frames while increasing and reducing complexity. I'll get back to this in a later post.
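One way to turn "sustaining a perfect 60 fps" into a number is to look at the time between consecutive frames and count how many refresh intervals were missed. The Python sketch below is an illustration of that idea only. It is not how the benchmark itself detects skipped frames, and the function name, the timestamp format (milliseconds) and the 60 Hz default are assumptions.

```python
def count_skipped_frames(timestamps_ms, refresh_hz=60.0):
    """Count refresh intervals that passed without a new frame.

    timestamps_ms holds the time, in milliseconds, at which each
    frame was presented.
    """
    interval = 1000.0 / refresh_hz
    skipped = 0
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        # A delta of roughly one interval is a normal frame, roughly
        # two intervals means one frame was skipped, and so on.
        missed = int(round((current - previous) / interval)) - 1
        if missed > 0:
            skipped += missed
    return skipped
```

With this measure, a run would count as a failure as soon as the returned value grows by one every couple of seconds.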
I've run the benchmarks on the following hardware:
What we can see in the graphs confirms what I already talked about. Solid opaque fills are generally extremely cheap; even the mobile GPUs can do 40-50 fullscreen rectangles on top of each other. The same goes for opaque textured fills, though we should keep in mind here that the scene graph renders opaque content front-to-back with the z-buffer enabled, so GPUs that implement early-z will end up only having to read the front-most texture. Blending is worse, with blended textures being the most costly, though still decent. It is, however, something that is worth taking note of. Take, for instance, the following example based on Qt Quick Controls:
The left image is how it is rendered normally, and the right image is rendered using QSG_VISUALIZE=overdraw in the environment. As can be seen from the visualization, the background is opaque (green) and can be ignored. Then there are the three group boxes stacked on top of each other, each rendered as a separate blended (red) texture. If this or a similar pattern of background stacking was used on a machine that has either a low-end GPU or a GPU that doesn't match its screen size, it would eat a large part of the performance budget. And this is before we start adding the actual application content.
The graphs also seem to indicate that both kwin and unity compositors are quite bad for performance. I could maybe tolerate a 10-20% drop for going through the composition step, but what I'm measuring does not seem right. If anybody knows what's up with that, please give me a ping.
I should also mention that the iPad didn't start skipping frames when I reached 30 opaque textures. It ran out of memory and was killed!
When running similar benchmarks previously, I have seen embedded chips with overdraw performance as low as 1.5. That means the application can fill the screen 1.5 times before the application starts stuttering. In terms of content, that means a background image and a few icons and some text. Not a lot of flexibility. When working with such a system, the application will need to be very careful about what gets drawn if a sustained 60 fps UI is desirable.
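To put an overdraw factor into perspective, it can be converted into pixels per second. The Python helpers below are a back-of-the-envelope sketch; the function names and the 1280x720 screen size in the example are assumptions chosen for illustration and are not taken from the benchmark.

```python
def pixels_per_second(width, height, overdraw, fps=60):
    """Pixels that must be filled each second at a given overdraw factor."""
    return width * height * overdraw * fps


def max_overdraw(fill_rate, width, height, fps=60):
    """How many times the screen can be filled per frame for a given
    fill rate (in pixels per second) while still hitting fps."""
    return fill_rate / float(width * height * fps)


# A 1280x720 screen with an overdraw budget of 1.5 at 60 fps.
budget = pixels_per_second(1280, 720, 1.5)
print("fill budget: %.1f Mpixels/s" % (budget / 1e6))
```

Every blended layer that covers the whole screen uses up one full unit of that overdraw budget.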
What about more complex shaders?
I mentioned that complex shaders would be problematic, especially on the less powerful GPUs. To test this, I incorporated the GaussianBlur from QtGraphicalEffects into the benchmark and tested how many samples I could have before it started to stutter. Now, the Gaussian blur is implemented as a two-pass algorithm. That means that in addition to doing a lot of texture sampling, we’ll also be rendering the equivalent of the screen several times per frame: first into the FBO which will be the source for the first pass, then a blur in one direction into a second FBO, and finally a blur of the second FBO in the other direction onto the screen.
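The cost of the three steps can be tallied with a rough Python model. It counts only the pixels written and the texture fetches made by the two blur passes, ignores whatever the source content itself costs to render, and uses a hypothetical function name and an arbitrary 100x100 area in the example.

```python
def two_pass_blur_cost(width, height, samples):
    """Rough per-frame cost of a two-pass blur over width x height pixels."""
    area = width * height
    return {
        # Source into FBO, horizontal pass into FBO, vertical pass to screen.
        "pixels_written": 3 * area,
        # Each blur pass reads `samples` texels per output pixel.
        "texture_fetches": 2 * samples * area,
        # A single-pass 2D kernel of the same width, for comparison.
        "single_pass_fetches": samples * samples * area,
    }


cost = two_pass_blur_cost(100, 100, 32)
print(cost["pixels_written"], cost["texture_fetches"])
```

Splitting the blur into two passes keeps the fetch count linear in the sample count, but the price is filling the blurred area three times per frame.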
The Jolla managed 55-ish fps with 2 and 3 samples and i7/kwin/composited maxed out at 30 fps, which is why they are marked with 0. Neither managed to sustain a velvet frame rate. The only chip that managed to run with a high sample count was the discrete graphics chip on the MacBook. This is in line with what is expected: as the complexity of the per-pixel operation grows, so does the requirement for the graphics hardware. What we can read from this is that these kinds of operations need to be used with some care, applied to smaller areas or otherwise in moderation.
There are alternatives though. For instance, it is possible to do fast blurring even on low-end hardware using a combination of downscaling, simplified blurring and upscaling, such as this. For drop shadows, prefer to use pre-generated ones. Only use live drop shadow if the area is small and there are few instances or you know you’ll be running on a high-end chip. Keep in mind that cheating looks equally good. It just runs faster!
Number of Draw Calls
The other factor I mentioned up top which was worth looking at was the number of draw calls. When compared to software graphics, this is where hardware-based APIs are much worse. In many ways, the problem to solve is the inverse. With software graphics, at least when looking at QPainter with its software rasterizer, draw calls are cheap and the impact of state changes in the graphics pipeline is small. With OpenGL, and DirectX for that matter, pipeline changes and draw calls can be quite bad. As part of the benchmarks, I’ve created a bunch of unique ShaderEffect items. The scene graph renderer cannot batch these, so they will all be scheduled using one glDrawElements() call per item. Let’s look at some numbers:
One conclusion to draw from this is that without any form of abstraction layer to handle batching, it is going to be hard to do complex controls, such as a table, using a hardware accelerated API. In fact, this is one of the primary reasons we’ve been pushing the raster engine for the widgets stack in Qt. If we compare this to items that do batch, one of Qt’s autotests will create 100,000 rectangles and can translate them at 60 fps, so the difference is quite significant.
Something else to take note of is that the scene graph is in no way perfect. There are several types of QML level operations which will force items to be rendered one by one. ShaderEffect is one. Item::clip is another. The following is a visualization of the “Itemviews” page of the Qt Quick Controls “Gallery” example, using QSG_VISUALIZE=batches in the environment:
At first glance this looks a bit like a Christmas tree. If we look beyond that, we see that the majority of the list is drawn in three separate colors. That means that the various background and text elements of the list view have been batched together. Running the same page with QSG_RENDERER_DEBUG=render, we can see from the console output that 109 nodes were compressed to 8 batches. If we added clipping to those list entries, those 109 nodes would be drawn using separate glDrawXxx() calls and eat quite a bit out of our performance budget.
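The effect of clipping on the draw call count can be illustrated with a deliberately simplified toy model in Python. It is not the scene graph renderer's batching algorithm: it ignores draw order, overlap and texture atlases, and simply assumes that all batchable items sharing a material end up in one draw call while every non-batchable item gets its own. The names, the three materials and the item counts are hypothetical.

```python
def count_draw_calls(items):
    """items is a list of (material, batchable) tuples.

    Toy model: one draw call per distinct material among batchable
    items, plus one draw call for every item that cannot be batched.
    """
    batched_materials = set()
    unbatched = 0
    for material, batchable in items:
        if batchable:
            batched_materials.add(material)
        else:
            unbatched += 1
    return len(batched_materials) + unbatched


# 30 list entries, each with a background, an icon and a text item.
entries = [("background", True), ("icon", True), ("text", True)] * 30
print(count_draw_calls(entries))

# The same entries when nothing can be batched, e.g. because of clipping.
clipped = [(material, False) for material, _ in entries]
print(count_draw_calls(clipped))
```

In this model the batched list needs a handful of draw calls regardless of its length, while the unbatched one grows by one draw call per item.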
Another thing that breaks batching is large images, as these do not fit into the scene graph’s built-in texture atlas. If you are curious whether the application’s images are atlassed, run the application with QSG_ATLAS_OVERLAY=1 and look for tinted images.
Benchmarks are one thing and the real world is another, but keeping basic rules in mind based on findings in benchmarks can greatly help the overall performance of the resulting application. It is one of the premature optimizations that do pay off. If application performance starts slipping, it can take a lot of work to get it back.
|Posted by firstname.lastname@example.org on September 20, 2014 at 4:55 AM|
This is a test that I usually perform as a first step when encountering a performance issue on a new machine. It establishes whether or not velvet animations are possible at all. It is super simple and works pretty much everywhere. My equivalent of glxgears, you might say. The idea is to show a fullscreen rectangle and alternate the color between red and blue (or any combo of your choice) as fast as swapBuffers() allows. If the visual output is a shimmering, slightly uncomfortable, almost solid pink, then swapping works as it should.
People who are sensitive to flashing lights should avoid this particular test.
Because of a phenomenon we call persistence of vision, the red and blue color will stay in our retina for a fraction of a second after we switch to the alternate color. As long as the color is changed at fast and regular intervals we will observe a solid, shimmering pink. If the color is changing at uneven intervals, it becomes really obvious. In theory this should be a matter of setting swap interval to 1, but it doesn’t always work. Perhaps more so on Linux.
If there are one or more horizontal lines across the screen, this is an indication of tearing. If the screen flashes with frames in which you can clearly see the red and blue colors, it is because frames are being dropped or because frames are rendered too fast. Check the driver and/or the compositor for vsync settings. Disabling the compositor is another thing that might help. There are a number of things that can be tweaked. Generally, I've found proprietary NVidia and AMD drivers produce solid results out of the box.
If the “QML (via animation system)” test fails and the others pass, you might be looking at a case of timer-driven animations. On Linux, we’re using the basic render loop for Mesa drivers. Run it again with QSG_INFO=1 in the environment and see if it prints out “basic render loop”. This was chosen some time ago as a sensible fallback in case we didn’t know what we were dealing with. Our primary worry was spinning at 100% CPU when the GL stack was not throttling us, so it was safer to tick by timers and draw only every 16 ms. As you probably know, ticking animations based on timers does not lead to smooth results, so an alternative is to specify QSG_RENDER_LOOP=windows (yeah, windows… just accept it) in the environment. This render loop will tick animations in sync with buffer swapping and will look much smoother.