.. _doc_gpu_optimization:

GPU optimization
================

Introduction
~~~~~~~~~~~~

The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. And finally, bottlenecks also occur on the GPU itself.

Where bottlenecks occur in rendering is highly hardware-specific.
Mobile GPUs in particular may struggle with scenes that run easily on desktop.

Understanding and investigating GPU bottlenecks is slightly different to the
situation on the CPU. This is because, often, you can only change performance
indirectly by changing the instructions you give to the GPU. Also, it may be
more difficult to take measurements. In many cases, the only way of measuring
performance is by examining changes in the time spent rendering each frame.
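
For example, a quick way to watch for such changes from within a running
project is to print the measured frame rate (and the implied frame time) once
per second. This is only a rough sketch; it assumes the script is attached to
a node in the scene tree so that ``_process()`` runs every frame::

    extends Node

    var _time_accum = 0.0

    func _process(delta):
        _time_accum += delta
        if _time_accum >= 1.0:
            _time_accum = 0.0
            var fps = Engine.get_frames_per_second()
            if fps > 0:
                # Average frame time in milliseconds over the last second.
                print("FPS: %.0f (%.2f ms per frame)" % [fps, 1000.0 / fps])
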
Draw calls, state changes, and APIs
===================================

.. note:: The following section is not relevant to end-users, but is useful to
          provide background information that is relevant in later sections.

Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
way that is preferred by the driver and GPU, we can greatly increase
performance.

Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
reduce these instructions to a bare minimum and group together similar objects
as much as possible so they can be rendered together, or with the minimum number
of these expensive state changes.

2D batching
~~~~~~~~~~~

In 2D, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of them on the screen. This is why 2D *batching*
is used. Multiple similar items are grouped together and rendered in a batch,
via a single draw call, rather than making a separate draw call for each item.
In addition, this means state changes, material and texture changes can be kept
to a minimum.

For more information on 2D batching, see :ref:`doc_batching`.

3D batching
~~~~~~~~~~~

In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch together several objects into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes in real-time is prohibitively expensive. The costs of joining them quickly
exceed any benefits as the number of triangles per mesh grows. A much better
alternative is to **join meshes ahead of time** (static meshes in relation to each
other). This can either be done by artists, or programmatically within Godot.
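
As a rough illustration of the programmatic approach, static sibling
MeshInstances could be merged into one mesh with :ref:`SurfaceTool
<class_SurfaceTool>`. This is only a sketch, not an official merging utility:
it assumes every child uses a single surface and that they can all share one
material (which could then be set through ``material_override`` on the result)::

    extends Spatial

    # Merge all MeshInstance children of this node into a single MeshInstance.
    func merge_children():
        var st = SurfaceTool.new()
        for child in get_children():
            if child is MeshInstance:
                # Append the child's first surface, transformed into this
                # node's local space, so relative placement is preserved.
                st.append_from(child.mesh, 0, child.transform)
                child.queue_free()
        var merged = MeshInstance.new()
        merged.mesh = st.commit()
        add_child(merged)
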
There is also a cost to batching together objects in 3D. Several objects
rendered as one cannot be individually culled. An entire city that is off-screen
will still be rendered if it is joined to a single blade of grass that is on
screen. Thus, you should always take objects' location and culling into account
when attempting to batch 3D objects together. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for large
numbers of distant or low-poly objects.

For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.

Reuse Shaders and Materials
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Godot renderer is a little different to what is out there. It's designed to
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job at reusing materials that need similar
shaders. If custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:

- **Reusing Materials:** The fewer different materials in the
  scene, the faster the rendering will be. If a scene has a huge number
  of objects (in the hundreds or thousands), try reusing the materials.
  In the worst case, use atlases to decrease the number of texture changes.
- **Reusing Shaders:** If materials can't be reused, at least try to reuse
  shaders. Note: shaders are automatically reused between
  SpatialMaterials that share the same configuration (features
  that are enabled or disabled with a check box) even if they have different
  parameters.

If a scene has, for example, ``20,000`` objects with ``20,000`` different
materials each, rendering will be slow. If the same scene has ``20,000``
objects, but only uses ``100`` materials, rendering will be much faster.
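
As a sketch of what reuse can look like in practice, many MeshInstances can
share a single material resource instead of each carrying its own copy. The
resource path and the ``"rocks"`` node group below are hypothetical::

    extends Node

    # Load one material once and assign it to every MeshInstance in a group,
    # so they can all be drawn with the same GPU state.
    const SHARED_MATERIAL = preload("res://materials/rock.tres")

    func apply_shared_material():
        for node in get_tree().get_nodes_in_group("rocks"):
            if node is MeshInstance:
                node.material_override = SHARED_MATERIAL
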
Pixel cost versus vertex cost
=============================

You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.

On a modern PC and console, vertex cost is low. GPUs originally only rendered
triangles. This meant that every frame:

1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.

Nowadays, all this is handled inside the GPU, greatly increasing performance.
3D artists usually have a misleading intuition about polycount performance,
because 3D DCCs (such as Blender, Max, etc.) need to keep geometry in CPU
memory so it can be edited, which lowers their performance. Game engines rely
on the GPU more, so they can render many triangles much more efficiently.

On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.

To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
when the same pixel on the screen is rendered more than once. Imagine a
town with several buildings. GPUs don't know what is visible and what is hidden
until they draw it. For example, a house might be drawn and then another house
in front of it (which means rendering happened twice for the same pixel). PC
GPUs normally don't care much about this and just add more pixel processors to
the hardware to increase performance (which also increases power consumption).

Using more power is not an option on mobile, so mobile devices use a technique
called *tile-based rendering*, which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.

Additionally, tile-based rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance as all the other cells must wait for it to complete before
displaying the frame.

To summarize, don't worry about vertex count on mobile, but
**avoid concentration of vertices in small parts of the screen**.
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
avoid having triangles smaller than the size of a pixel on screen.
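
A simple manual approach to LOD is to toggle the visibility of a high-detail
and a low-detail child mesh depending on the distance to the camera. The node
names and the distance threshold below are only placeholders::

    extends Spatial

    const LOD_DISTANCE = 30.0  # Switch distance in world units (placeholder).

    onready var high_detail = $HighDetailMesh
    onready var low_detail = $LowDetailMesh

    func _process(_delta):
        var camera = get_viewport().get_camera()
        if camera == null:
            return
        var far_away = global_transform.origin.distance_to(
                camera.global_transform.origin) > LOD_DISTANCE
        high_detail.visible = not far_away
        low_detail.visible = far_away
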
Pay attention to the additional vertex processing required when using:

- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)

Pixel/fragment shaders and fill rate
====================================

In contrast to vertex processing, the costs of fragment (per-pixel) shading have
increased dramatically over the years. Screen resolutions have increased (the
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
screen, that is 27x the area), but also the complexity of fragment shaders has
exploded. Physically-based rendering requires complex calculations for each
fragment.

You can test whether a project is fill rate-limited quite easily. Turn off
V-Sync to prevent capping the frames per second, then compare the frames per
second when running with a large window, to running with a very small window.
You may also benefit from similarly reducing your shadow map size if using
shadows. Usually, you will find the FPS increases quite a bit using a small
window, which indicates you are to some extent fill rate-limited. On the other
hand, if there is little to no increase in FPS, then your bottleneck lies
elsewhere.
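
A rough sketch of running this test from a script (instead of resizing the
window by hand) might look like the following; the window size is arbitrary::

    extends Node

    # Disable V-Sync so the frame rate is not capped, then shrink the window.
    # If the FPS printed afterwards rises sharply, the project is likely
    # fill rate-limited.
    func run_fill_rate_test():
        OS.vsync_enabled = false
        OS.window_size = Vector2(320, 180)

    func _process(_delta):
        print(Engine.get_frames_per_second())
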
You can increase performance in a fill rate-limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
<class_SpatialMaterial>`), or reducing the number and size of textures used.

**When targeting mobile devices, consider using the simplest possible shaders
you can reasonably afford to use.**

Reading textures
~~~~~~~~~~~~~~~~

The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Also, consider that filtering may slow it
down further (trilinear filtering between mipmaps, and averaging). Reading
textures is also expensive in terms of power usage, which is a big issue on
mobile devices.

**If you use third-party shaders or write your own shaders, try to use
algorithms that require as few texture reads as possible.**

Texture compression
~~~~~~~~~~~~~~~~~~~

By default, Godot compresses textures of 3D models when imported using video RAM
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
JPG when stored, but increases performance enormously when drawing large enough
textures.

This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.

In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, compression depends more on
shapes inside the textures, so the artifacts resulting from 2D compression are
more noticeable.

As a warning, most Android devices do not support compression of textures with
transparency (only opaque textures), so keep this in mind.

.. note::

    Even in 3D, "pixel art" textures should have VRAM compression disabled as it
    will negatively affect their appearance, without improving performance
    significantly due to their low resolution.

Post-processing and shadows
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Post-processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.

**Reducing the size of shadowmaps can increase performance**, both in terms of
writing and reading the shadowmaps. On top of that, the best way to improve
performance of shadows is to turn shadows off for as many lights and objects as
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
disabled with only a small visual impact.
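
As a rough sketch of both ideas, the shadow atlas size can be lowered on the
viewport and shadows can be disabled on individual lights from a script. The
``"minor_lights"`` group and the ``20.0`` threshold are placeholders::

    extends Node

    func reduce_shadow_cost():
        # Use a smaller shadow atlas for this viewport (the default is 4096).
        get_viewport().shadow_atlas_size = 2048

        # Only keep shadows on lights that are close to the camera.
        var camera = get_viewport().get_camera()
        if camera == null:
            return
        for light in get_tree().get_nodes_in_group("minor_lights"):
            if light is Light:
                var distance = light.global_transform.origin.distance_to(
                        camera.global_transform.origin)
                light.shadow_enabled = distance < 20.0
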
Transparency and blending
=========================

Transparent objects present particular problems for rendering efficiency. Opaque
objects (especially in 3D) can be essentially rendered in any order and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.

Transparent objects are also particularly bad for fill rate, because every item
has to be drawn even if other transparent objects will be drawn on top of it
later on.

Opaque objects don't have to do this. They can usually take advantage of the
Z-buffer by writing to the Z-buffer only first, then only performing the
fragment shader on the "winning" fragment, the object that is at the front at a
particular pixel.

Transparency is particularly expensive where multiple transparent objects
overlap. It is usually better to keep transparent areas as small as possible to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".

Multi-platform advice
=====================

If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.

In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.
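
One way to layer optional enhancements on top of a common baseline is to check
the platform at startup and only enable expensive extras on desktop. The choice
of MSAA as the "extra" here is only an example::

    extends Node

    func _ready():
        # Treat mobile and web exports as the baseline.
        var is_low_end = OS.get_name() in ["Android", "iOS", "HTML5"]
        if is_low_end:
            get_viewport().msaa = Viewport.MSAA_DISABLED
        else:
            get_viewport().msaa = Viewport.MSAA_4X
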
Mobile/tiled renderers
======================

As described above, GPUs on mobile devices work in dramatically different ways
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
split up the screen into regular-sized tiles that fit into super fast cache
memory, which reduces the number of read/write operations to the main memory.

There are some downsides though. Tiled rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures and post processing.