Vulkan: Bad Practices

I’ve been working with the Vulkan API for quite a while now, so I wanted to compile a non-exhaustive list of things that I find to be bad practices with regard to performance, functionality and validity of Vulkan workloads.

As with most advice, the best thing to do could be to pass it on, or to act upon it, in which case the burden of investigating/profiling these items is on you! Note that they may or may not apply to your choice of API. With that said, here it goes.

Enabling all device features/extensions at all times

The Vulkan API call that creates a logical device is vkCreateDevice, which takes pointers to the features and extensions to enable. While from the spec’s point of view it’s perfectly legal to enable every extension and feature that a Vulkan-capable GPU supports, doing so may lead to performance loss and unnecessary allocation of feature/extension-related data structures in drivers. So, unless you intend to use a feature, don’t enable it! This point is important enough that the spec calls it out:

Some features, such as robustBufferAccess, may incur a run-time performance cost. Application writers should carefully consider the implications of enabling all supported features.
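
For illustration, here is a minimal C sketch of the idea, assuming the queue-create info has already been filled in; the function and feature choices are illustrative, not prescriptive. Everything not explicitly needed stays off:

    #include <vulkan/vulkan.h>

    /* Enable only the features we actually use, instead of mirroring
     * everything the physical device reports as supported. */
    VkDevice create_device(VkPhysicalDevice physicalDevice,
                           const VkDeviceQueueCreateInfo *queueInfo)
    {
        VkPhysicalDeviceFeatures supported;
        vkGetPhysicalDeviceFeatures(physicalDevice, &supported);

        VkPhysicalDeviceFeatures enabled = {0};   /* everything off by default */
        if (supported.samplerAnisotropy)          /* opt in only to what we need */
            enabled.samplerAnisotropy = VK_TRUE;
        /* robustBufferAccess deliberately stays disabled here */

        VkDeviceCreateInfo info = {
            .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
            .queueCreateInfoCount = 1,
            .pQueueCreateInfos = queueInfo,
            .enabledExtensionCount = 0,           /* likewise, no unused extensions */
            .pEnabledFeatures = &enabled,
        };

        VkDevice device = VK_NULL_HANDLE;
        vkCreateDevice(physicalDevice, &info, NULL, &device);
        return device;
    }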

Overdoing command buffer recording

One of the best features that modern, explicit graphics APIs such as D3D12 and Vulkan offer to developers is the ability to truly multi-thread GPU command recording and submission to HW queues.

While that sounds great, using hundreds of thousands of command buffers per frame is not ideal. Memory pressure is one thing; for that, various options are available for short-lived command buffers, albeit with non-trivial overhead, e.g. vkTrimCommandPool and vkResetCommandPool. A bigger point of concern is missed optimization opportunities for drivers. Unlike OpenGL or D3D11 drivers, newer drivers don’t track the active GPU configuration across threads: there is no single global context. N command buffers that program the same heavy-weight HW state in parallel will incur a bigger cost than if a more reasonable number of command buffers were used.
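
One possible shape for the pool-reuse route, sketched below in C; the names and the "command buffer allocated once at startup" structure are my assumptions, not a recipe from any particular engine:

    #include <vulkan/vulkan.h>

    /* Recycle a per-frame (per-thread) command pool instead of creating
     * new pools/buffers every frame. 'cmd' is assumed to have been
     * allocated from 'framePool' once, at startup. */
    void begin_frame_recording(VkDevice device, VkCommandPool framePool,
                               VkCommandBuffer cmd)
    {
        /* Returns every command buffer of the pool to its initial state;
         * cheaper than freeing and re-allocating them. */
        vkResetCommandPool(device, framePool, 0);

        VkCommandBufferBeginInfo beginInfo = {
            .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
            .flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
        };
        vkBeginCommandBuffer(cmd, &beginInfo);
    }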

The other side of the story is submission of such command buffers to HW queues. Queue submission is one of the operations with the highest CPU overhead in a Vulkan driver, as noted by the spec:

Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit as possible.

In many cases, submitting commands will require implicit flushes and synchronization overhead (e.g. with the paging queue, other HW contexts, etc.) which may cause bubbles in the pipeline. So it’s better done as few times as possible, by batching, as the spec suggests.
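
To illustrate the batching point, here is a hedged sketch: the cmds array is assumed to hold the frame’s command buffers, recorded on worker threads, and the whole set goes out in a single vkQueueSubmit:

    #include <vulkan/vulkan.h>

    /* Submit all command buffers recorded for this frame in one call,
     * instead of one vkQueueSubmit per command buffer. */
    void submit_frame(VkQueue queue, const VkCommandBuffer *cmds,
                      uint32_t cmdCount, VkFence frameFence)
    {
        VkSubmitInfo submit = {
            .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
            .commandBufferCount = cmdCount,
            .pCommandBuffers = cmds,
            /* wait/signal semaphores omitted for brevity */
        };
        vkQueueSubmit(queue, 1, &submit, frameFence);
    }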

Overdoing pipeline barriers just to be safe

I am not aware of any validation layers that can detect unnecessary pipeline barriers (which on its own is a pretty difficult thing to achieve anyway), so these bad boys can easily go unnoticed. The presence of an unnecessary pipeline barrier is one thing; setting the source/destination stages simply to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT is another. Again, drivers may need to take additional steps, in vain, to stay conservative, simply because of one unnecessary stage bit or access mask. Note that the same applies to pipeline barrier access scopes.

Don’t keep your barriers as broad as possible just for the sake of it; doing so leads to missed opportunities for drivers to optimize around all sorts of data hazards.
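
As a hypothetical example, consider a buffer written by a compute shader and then consumed as a vertex buffer. The barrier can be scoped to exactly those stages and accesses rather than VK_PIPELINE_STAGE_ALL_COMMANDS_BIT on both sides:

    #include <vulkan/vulkan.h>

    /* A compute shader writes the buffer; the vertex input stage reads it next. */
    void barrier_compute_to_vertex(VkCommandBuffer cmd, VkBuffer vertexBuffer)
    {
        VkBufferMemoryBarrier barrier = {
            .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,
            .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .buffer = vertexBuffer,
            .offset = 0,
            .size = VK_WHOLE_SIZE,
        };
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   /* srcStageMask: only the producer */
            VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,     /* dstStageMask: only the consumer */
            0, 0, NULL, 1, &barrier, 0, NULL);
    }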

Skipping Specialization Constants where applicable

This one may or may not be a trivial thing to do, given the complexity of the shader source -> binary pipelines that some engines/applications employ; however, it’s a very nice feature that SPIR-V enables for us.

By paying the price of creating multiple graphics/compute pipelines with the same configuration, one for each variant of a compile-time constant, you can easily turn dynamic uniform control flow into branch-free code.
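
A minimal sketch of what that looks like on the API side; constant_id = 0 in the shader and the handle parameters are assumptions for illustration:

    #include <vulkan/vulkan.h>

    /* Create one compute pipeline variant per value of a SPIR-V
     * specialization constant (constant_id = 0 in the shader). */
    VkPipeline create_variant(VkDevice device, VkShaderModule module,
                              VkPipelineLayout layout, uint32_t variant)
    {
        VkSpecializationMapEntry entry = {
            .constantID = 0,
            .offset = 0,
            .size = sizeof(variant),
        };
        VkSpecializationInfo specInfo = {
            .mapEntryCount = 1,
            .pMapEntries = &entry,
            .dataSize = sizeof(variant),
            .pData = &variant,
        };
        VkComputePipelineCreateInfo info = {
            .sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO,
            .stage = {
                .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
                .stage = VK_SHADER_STAGE_COMPUTE_BIT,
                .module = module,
                .pName = "main",
                .pSpecializationInfo = &specInfo,
            },
            .layout = layout,
        };
        VkPipeline pipeline = VK_NULL_HANDLE;
        vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
        return pipeline;
    }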

Miscellaneous

  • Don’t presume support for any image format, device feature or extension; query it! (See the sketch after this list.)
  • Don’t wait idle on the whole freaking queue just because you want to know when some set of commands has completed execution; use the finer-grained synchronization mechanisms that the API provides.
  • Be aware of the minimum/maximum limits of whatever you make use of. Where present, these will help you make more informed decisions ahead of time.
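
For the first bullet, here is a minimal sketch of what querying looks like; the format and helper name are just examples:

    #include <vulkan/vulkan.h>
    #include <stdbool.h>

    /* Check whether a format supports linear filtering when sampled from an
     * optimally-tiled image, instead of assuming that it does. */
    bool supports_linear_sampling(VkPhysicalDevice physicalDevice, VkFormat format)
    {
        VkFormatProperties props;
        vkGetPhysicalDeviceFormatProperties(physicalDevice, format, &props);
        return (props.optimalTilingFeatures &
                VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) != 0;
    }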

Disclaimer

None of what I have presented above is endorsed by Intel, Khronos or any other group; these are merely my own personal observations.
