Freedreno
=========

Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
------

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.
There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem"). It is UMA, using
mostly write combined memory but with the ability to map some buffers as cache
coherent with the CPU.

.. toctree::
   :glob:

   freedreno/hw/*

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

   Cluster
       A group of hardware registers, often with multiple copies to allow
       pipelining. There is an M:N relationship between hardware blocks that do
       work and the clusters of registers for the state that hardware blocks use.

   CP
       Command Processor. Reads the stream of state changes and draw commands
       generated by the driver.

   PFP
       Prefetch Parser. Adreno 2xx-4xx CP component.

   ME
       Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands.

   SQE
       a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
       microcode (loaded from Linux) which actually processes the command stream
       and writes to the hardware registers. See `afuc
       <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

   ROQ
       DMA engine used by the SQE for reading memory, with some prefetch buffering.
       Mostly reads in the command stream, but also serves for
       ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

   SP
       Shader Processor. Unified, scalar shader engine. One or more, depending on
       GPU and tier.

   TP
       Texture Processor.

   UCHE
       Unified L2 Cache. 32KB on A330, unclear how big now.

   CCU
       Color Cache Unit.

   VSC
       Visibility Stream Compressor

   PVS
       Primitive Visibility Stream

   FE
       Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
       VFD, VPC.

   VFD
       Vertex Fetch and Decode

   VPC
       Varying/Position Cache? Hardware block that stores shaded vertex data for
       primitive assembly.

   HLSQ
       High Level Sequencer. Manages state for the SPs, batches up PS invocations
       between primitives, is involved in preemption.

   PC_VS
       Cluster where varyings are read from VPC and assembled into primitives to
       feed GRAS.

   VS
       Vertex Shader. Responsible for generating VS/GS/tess invocations

   GRAS
       Rasterizer. Responsible for generating PS invocations from primitives, also
       does LRZ

   PS
       Pixel Shader.

   RB
       Render Backend. Performs both early and late Z testing, blending, and
       attachment stores of output of the PS.

   GMEM
       Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
       attachments during tiled rendering

   LRZ
       Low Resolution Z. A low resolution area of the depth buffer that can be
       initialized during the binning pass to contain the worst-case (farthest) Z
       values in a block, and then used to early reject fragments during
       rasterization.

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode). Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering. Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's access is spread through the cache even if the attachments are the
same size and alignments in address space. This means that the cache must be
flushed and invalidated between memory being used for one attachment and another
(notably depth vs color, but also MRT color).

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE. This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will make read-only
storage image access through it as well). It is not coherent with UCHE (may get
stale results when you ``sam`` after ``stib``), but must get flushed per draw or
something because you don't need a manual invalidate between draws storing to an
image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches, and
instead uses FIFOs in the ROQ to avoid stalls reading from system memory.

Draw states
^^^^^^^^^^^

The SQE is not a fast processor, and tiled rendering means that many draws
won't even be used in many bins. For this reason, since a5xx, state updates can
be batched up into "draw states" that point to a fragment of CP packets. At draw
time, if the draw call is going to actually execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and for any with
an update since the last time they were executed, it executes the corresponding
fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering. This allows a
single command stream to be generated which can be executed in any of the modes,
unlike pre-a6xx where we had to generate separate command lists for the binning
and rendering phases.

Note that this means that the generated draw state has to always update all of
the state you have chosen to pack into that ``GROUP_ID``, since any of your
previous state changes in a previous draw state command may have been skipped.

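As a rough illustration of the behaviour described above, the following Python
sketch models draw-state groups as a toy. It is not the SQE microcode: the
``DrawStateTracker`` class, the mode names, and the representation of a
fragment as a list of strings are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical names for the three rendering modes mentioned in the text.
SYSMEM, BINNING, GMEM = "sysmem", "binning", "gmem"


@dataclass
class DrawStateTracker:
    """Toy model of per-GROUP_ID draw states (not the real SQE logic)."""

    mode: str
    groups: dict = field(default_factory=dict)    # group_id -> (fragment, modes)
    dirty: set = field(default_factory=set)       # groups updated since last run
    executed: list = field(default_factory=list)  # packets that were "executed"

    def set_draw_state(self, group_id, fragment, modes):
        # Only a reference to the fragment is recorded; nothing executes yet.
        self.groups[group_id] = (list(fragment), frozenset(modes))
        self.dirty.add(group_id)

    def draw(self, visible):
        # A draw with nothing visible in this tile executes no fragments,
        # so pending updates stay dirty and may later be overwritten.
        if not visible:
            return
        for group_id in sorted(self.dirty):
            fragment, modes = self.groups[group_id]
            if self.mode in modes:
                self.executed.extend(fragment)
        self.dirty.clear()
        self.executed.append("DRAW")


tracker = DrawStateTracker(mode=GMEM)
tracker.set_draw_state(0, ["blend=A"], {SYSMEM, GMEM})
tracker.draw(visible=False)  # skipped: "blend=A" never executes
tracker.set_draw_state(0, ["blend=B"], {SYSMEM, GMEM})
tracker.set_draw_state(1, ["vsc=on"], {BINNING})
tracker.draw(visible=True)
print(tracker.executed)
```

In this model the fragment containing ``blend=A`` is never executed, which is
why each fragment for a given ``GROUP_ID`` has to carry the complete state for
that group rather than a delta against the previous one.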
Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers. In a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list). To pipeline state updates and drawing, registers
generally have two copies ("contexts") in their cluster, so previous draws can
be working on the previous set of register state while the next draw's state is
being set up. You can find what registers go into which clusters by looking at
:command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section.

As SQE processes register writes in the command stream, it sends them into a
per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to
process their stream of register updates and events independent of each other
(so even with just 2 contexts in a stage, earlier stages can proceed on to later
draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes will stall on the context being done.

During a 3D draw command, SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
  the actual event/drawing.
- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
  done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
  the registers that were dirtied in this context to that one.

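The event sequence in the list above can be modelled with a small Python
sketch. This is only a toy under stated assumptions: the ``Cluster`` class and
the ``RB_BLEND`` register name are hypothetical, "stalling" is represented by
returning ``False``, and the real hardware queues events per cluster rather
than calling methods.

```python
class Cluster:
    """Toy model of one register cluster with banked contexts."""

    def __init__(self, num_contexts=2):
        self.regs = [dict() for _ in range(num_contexts)]
        self.done = [True] * num_contexts  # per-context done/free bit
        self.dirty = set()                 # registers written in the current context
        self.cur = 0

    def write_reg(self, name, value):
        # Register writes stall until the target context is done/free.
        if not self.done[self.cur]:
            return False  # the caller would have to wait and retry
        self.regs[self.cur][name] = value
        self.dirty.add(name)
        return True

    def event_start(self):
        # CP_EVENT_START: clear the done bit, the context is now in use.
        self.done[self.cur] = False

    def context_done(self, ctx):
        # CONTEXT_DONE: the draw using this context has completed.
        self.done[ctx] = True

    def event_end(self):
        # CP_EVENT_END: wait for the next context, then copy dirty state over.
        nxt = (self.cur + 1) % len(self.regs)
        if not self.done[nxt]:
            return False  # next context is still busy with an older draw
        for name in self.dirty:
            self.regs[nxt][name] = self.regs[self.cur][name]
        self.dirty.clear()
        self.cur = nxt
        return True


c = Cluster()
c.write_reg("RB_BLEND", 1)
c.event_start()               # draw 0 uses context 0
rolled = c.event_end()        # context 1 is free, state is copied over
c.write_reg("RB_BLEND", 2)    # lands in context 1 while draw 0 still runs
c.event_start()               # draw 1 uses context 1
blocked = c.event_end()       # context 0 is still busy with draw 0
c.context_done(0)
unblocked = c.event_end()     # now the rollover can happen
```

With two contexts, the third state update has to wait until the first draw
reports ``CONTEXT_DONE``, which is the stall the text describes.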
The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``,
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also, note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

In a2xx-a4xx, there weren't per-stage clusters, and instead there were two
register banks that were flipped between per draw.

Bindless/Bindful Descriptors (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with a6xx, cat5 (texture) and cat6 (image/ssbo/ubo) instructions are
extended to support bindless descriptors.

In the old bindful model, descriptors are separate for textures, samplers,
UBOs, and IBOs (combined descriptor for images and SSBOs), with separate
registers for the memory containing the array of descriptors, and/or different
``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM``
to pre-load the descriptors into cache.

- textures - per-shader-stage

  - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT``
  - state-type: ``ST6_CONSTANTS``
  - state-block: ``SB6_xS_TEX``

- samplers - per-shader-stage

  - registers: ``SP_xS_TEX_SAMP``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_xS_TEX``

- UBOs - per-shader-stage

  - registers: none
  - state-type: ``ST6_UBO``
  - state-block: ``SB6_xS_SHADER``

- IBOs - global across 3d shader stages, separate for compute shader

  - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_IBO`` or ``SB6_CS_IBO`` for compute shaders
  - Note, unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used,
    as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG``
    depending on shader stage.

.. note::
   For the per-shader-stage registers and state-blocks the ``xS`` notation
   refers to per-shader-stage names, e.g. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX``

Textures and IBOs (images) use *basically* the same 64-byte descriptor format
with some exceptions (for example, for IBOs cubemaps are handled as 2D arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images. Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.

In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs compute stage). Each hw descriptor set is an array of descriptors
of configurable size (each descriptor set can be configured for a descriptor
pitch of 8 bytes or 64 bytes). Each descriptor can be of arbitrary format (i.e.
UBOs/IBOs/textures/samplers interleaved); its interpretation by the hw is
determined by the instruction that references the descriptor. Each descriptor
set can contain at least 2^16 descriptors.

The hw is configured with the base address of the descriptor set via an array
of "BINDLESS_BASE" registers, i.e. ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction. The address of the descriptor is calculated as::

   descriptor_addr = (BINDLESS_BASE[n] & ~0x3) +
                     (idx * 4 * (2 << (BINDLESS_BASE[n] & 0x3)))

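The address calculation can be written out as a short Python function. This is
a minimal sketch of the formula only; the function name and the example base
value ``0x1000`` are made up for illustration. Note that the low-bit mask is
applied before the shift, so a pitch code of 0 gives an 8-byte pitch and a
pitch code of 3 gives a 64-byte pitch.

```python
def bindless_descriptor_addr(bindless_base: int, idx: int) -> int:
    """Address of descriptor `idx` in the set described by a BINDLESS_BASE value."""
    pitch_code = bindless_base & 0x3      # descriptor pitch lives in the low bits
    base_addr = bindless_base & ~0x3      # the remaining bits are the base address
    pitch_bytes = 4 * (2 << pitch_code)   # 0 -> 8 bytes, 3 -> 64 bytes
    return base_addr + idx * pitch_bytes


# Hypothetical base address 0x1000 with the two pitch encodings from the text.
addr_small = bindless_descriptor_addr(0x1000 | 0x0, 2)
addr_large = bindless_descriptor_addr(0x1000 | 0x3, 2)
```

Here ``addr_small`` is ``0x1010`` (two 8-byte descriptors past the base) and
``addr_large`` is ``0x1080`` (two 64-byte descriptors past the base).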

.. note::
   Turnip reserves one descriptor set for internal use and exposes the other
   four for the application via the Vulkan API.

Software Architecture
---------------------

Freedreno and Turnip use a shared core for shader compiler, image layout, and
register and command stream definitions. They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU devcoredump
^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported by
the GPU (including its internal hang detection). If a fault in GPU address
space happened, you should expect to find a message from the IOMMU, with the
faulting address and a hardware unit involved:

.. code-block:: console

   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

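When scanning long kernel logs it can help to pull the fields out of such a
message programmatically. The sketch below is based only on the single example
line above; the exact message format is an assumption and may differ between
kernel versions, and ``parse_gpu_fault`` is a hypothetical helper name.

```python
import re

# Pattern derived from the one example message shown above (an assumption).
FAULT_RE = re.compile(
    r"gpu fault: ttbr0=(?P<ttbr0>[0-9a-f]+) iova=(?P<iova>[0-9a-f]+) "
    r"dir=(?P<dir>\w+) type=(?P<type>\w+) source=(?P<source>\S+)"
)


def parse_gpu_fault(line: str):
    """Return the fields of a "gpu fault" message, or None if it doesn't match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["ttbr0"] = int(fields["ttbr0"], 16)
    fields["iova"] = int(fields["iova"], 16)
    # Several hardware units may be reported, separated by "|".
    fields["source"] = fields["source"].split("|")
    return fields


example = ("*** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 "
           "dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)")
fault = parse_gpu_fault(example)
```

The resulting ``fault["iova"]`` can then be compared against buffer addresses
in a decoded command stream.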
On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``. You can cp that file to a
:file:`crash.devcore` to save it, otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't be
taken until then.

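The save-then-free step can be scripted. The following is a minimal sketch,
assuming it runs with enough privileges to read and write the sysfs file; the
``save_devcoredump`` helper and its parameters are hypothetical and exist only
for this example.

```python
import glob
import shutil
from pathlib import Path


def save_devcoredump(dest="crash.devcore",
                     pattern="/sys/devices/virtual/devcoredump/**/data",
                     free=True):
    """Copy the first available devcoredump to `dest`; return its path or None."""
    matches = sorted(glob.glob(pattern, recursive=True))
    if not matches:
        return None
    src = Path(matches[0])
    # Stream until EOF instead of trusting the file size reported by sysfs.
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if free:
        # Writing to the data file frees the core so that a new one can be taken.
        src.write_text("1")
    return str(src)
```

Calling ``save_devcoredump()`` right after a hang would leave the dump in
:file:`crash.devcore`, ready to be decoded.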
Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped. Note that it is expected that this will be
some distance past whatever state triggered the fault, given GPU pipelining, and
will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event. You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

If ``ESTIMATED CRASH LOCATION`` doesn't exist, you can look at ``CP_SQE_STAT``
instead, though this is a last resort and likely won't be helpful.

.. code-block::

   indexed-registers:
     - regs-name: CP_SQE_STAT
       dwords: 51
          PC: 00d7                                <-------------
          PKT: CP_LOAD_STATE6_FRAG
          $01: 70348003    $11: 00000000
          $02: 20000000    $12: 00000022

The ``PC`` value is an instruction address in the current firmware.
You would need to disassemble the firmware
(:file:`/lib/firmware/qcom/aXXX_sqe.fw`) via:

.. code-block:: console

   afuc-disasm -v a650_sqe.fw > a650_sqe.fw.disasm

Now you should search for the ``PC`` value in the disassembly, e.g.:

.. code-block::

   l018: 00d1: 08dd0001  add $addr, $06, 0x0001
         00d2: 981ff806  mov $data, $data
         00d3: 8a080001  mov $08, 0x0001 << 16
         00d4: 3108ffff  or $08, $08, 0xffff
         00d5: 9be8f805  and $data, $data, $08
         00d6: 9806e806  mov $addr, $06
         00d7: 9803f806  mov $data, $03   <------------- HERE
         00d8: d8000000  waitin
         00d9: 981f0806  mov $01, $data

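Searching the disassembly by hand works, but the lookup is easy to script. The
sketch below assumes the line layout shown in the excerpt above (an optional
label, a four-digit hex address, an eight-digit hex encoding, then the
mnemonic); real :command:`afuc-disasm` output may differ, and ``find_pc`` is a
hypothetical helper.

```python
import re

# Matches "l018: 00d1: 08dd0001 add ..." as well as "00d7: 9803f806 mov ...".
LINE_RE = re.compile(r"^\s*(?:\w+:\s+)?([0-9a-f]{4}):\s+[0-9a-f]{8}\s")


def find_pc(disasm_lines, pc, context=2):
    """Return the lines around instruction address `pc` (hex string such as "00d7")."""
    lines = list(disasm_lines)
    for i, line in enumerate(lines):
        m = LINE_RE.match(line)
        if m and int(m.group(1), 16) == int(pc, 16):
            return lines[max(0, i - context): i + context + 1]
    return []


sample = [
    "l018: 00d1: 08dd0001  add $addr, $06, 0x0001",
    "      00d6: 9806e806  mov $addr, $06",
    "      00d7: 9803f806  mov $data, $03",
    "      00d8: d8000000  waitin",
    "      00d9: 981f0806  mov $01, $data",
]
around = find_pc(sample, "00d7", context=1)
```

For the full file, the same function can be fed the lines of
:file:`a650_sqe.fw.disasm` read from disk.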

Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel. Mesa itself doesn't implement a way to stream them out
(though it maybe should!). Instead, we have an interface for the kernel to
capture all submitted command streams:

.. code-block:: console

   cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: console

   echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the system
out of memory doing this, so you probably don't want to enable it during play of
a heavyweight game. Instead, to capture a command stream within a game, you
probably want to cause a crash in the GPU during a frame of interest so that a
single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
enough to cause a fault.

Capturing Hang RD
+++++++++++++++++

The devcore file doesn't contain all submitted command streams, only the hanging
one. Additionally, it is geared towards analyzing the GPU state at the moment of
the crash.

Alternatively, it's possible to obtain the whole submission with all command
streams via ``/sys/kernel/debug/dri/0/hangrd``:

.. code-block:: console

   # Do the cat before the expected hang
   sudo cat /sys/kernel/debug/dri/0/hangrd > logfile.rd

The format of hangrd is the same as in ordinary command stream capture.
``rd_full`` also has the same effect on it.

Replaying Command Stream
^^^^^^^^^^^^^^^^^^^^^^^^

The ``replay`` tool allows capturing and replaying ``rd`` to reproduce GPU
faults. It is especially useful for transient GPU issues since it has a much
higher chance of reproducing them.

Dumping rendering results or even just memory is currently unsupported.

- Replaying command streams requires a kernel with ``MSM_INFO_SET_IOVA`` support.
- The ``rd`` capture must have full snapshots of the memory (``rd_full`` is enabled).

Replaying is done via the ``replay`` tool:

.. code-block:: console

   ./replay test_replay.rd

More examples:

.. code-block:: console

   ./replay --first=start_submit_n --last=last_submit_n test_replay.rd

.. code-block:: console

   ./replay --override=0 --generator=./generate_rd test_replay.rd

Editing Command Stream (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While replaying a fault is useful in itself, modifying the capture to
understand what causes the fault could be even more useful.

``rddecompiler`` decompiles a single cmdstream from ``rd`` into compilable C
source. Given the address space bounds, the generated program creates a new
``rd`` which can be used to override a cmdstream with ``replay``. The generated
``rd`` is not replayable on its own and depends on buffers provided by the
source ``rd``.

The C source can be compiled using :file:`rdcompiler-meson.build` as an example.

The workflow would look like this:

1. Find the number of the cmdstream you want to edit;
2. Decompile it:

   .. code-block:: console

      ./rddecompiler -s %cmd_stream_n% example.rd > generate_rd.c

3. Edit the command stream;
4. Compile it back, see :file:`rdcompiler-meson.build` for the instructions;
5. Plug the generator into cmdstream replay:

   .. code-block:: console

      ./replay --override=%cmd_stream_n% --generator=~/generate_rd

6. Repeat 3-5.

GPU Hang Debugging
^^^^^^^^^^^^^^^^^^

This is not a guide on how to debug hangs, but mostly an enumeration of methods.

Useful ``TU_DEBUG`` (for Turnip) options to narrow down the hang cause:

``sysmem``, ``gmem``, ``nobin``, ``forcebin``, ``noubwc``, ``nolrz``, ``flushall``, ``syncdraw``, ``rast_order``

Useful ``FD_MESA_DEBUG`` (for Freedreno) options:

``sysmem``, ``gmem``, ``nobin``, ``noubwc``, ``nolrz``, ``notile``, ``dclear``, ``ddraw``, ``flush``, ``inorder``, ``noblit``

Useful ``IR3_SHADER_DEBUG`` options:

``nouboopt``, ``spillall``, ``nopreamble``, ``nofp16``

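One way to work through these options is to try them one at a time and note
which of them makes the hang disappear. The sketch below only builds the
environments; how the workload is launched and how a hang is detected is left
to the caller. The ``debug_envs`` helper is hypothetical, and the option list
is copied from the ``TU_DEBUG`` line above.

```python
import os

TU_DEBUG_OPTIONS = ["sysmem", "gmem", "nobin", "forcebin", "noubwc",
                    "nolrz", "flushall", "syncdraw", "rast_order"]


def debug_envs(var="TU_DEBUG", options=TU_DEBUG_OPTIONS, base_env=None):
    """Yield (option, env) pairs, each env enabling exactly one debug option."""
    base = dict(os.environ if base_env is None else base_env)
    for opt in options:
        env = dict(base)  # copy, so that runs don't affect each other
        env[var] = opt
        yield opt, env


variants = list(debug_envs(base_env={}))
```

Each ``env`` can be passed to ``subprocess.run(..., env=env)`` together with
your own reproducer command. Options are tried individually here because some
of them, such as ``sysmem`` and ``gmem``, select opposite behaviours.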
Use Graphics Flight Recorder to narrow down the place which hangs, or
use our own breadcrumbs implementation in case of unrecoverable hangs.

In case of faults, use RenderDoc to find the problematic command. If it's
a draw call, edit the shader in RenderDoc to find out whether the culprit is
the shader. If yes, bisect it.

If editing the shader changes the assembly too much and the issue becomes
unreproducible, try editing the assembly itself via
``IR3_SHADER_OVERRIDE_PATH``.

If the fault or hang is transient, try capturing an ``rd`` and replaying it. If
the issue is reproduced, bisect the GPU packets until the culprit is found.

Do the same if the culprit is not a shader.

The hang recovery mechanism in the kernel is not perfect. In case of
unrecoverable hangs, check whether the kernel is up to date and look for
unmerged patches which could improve the recovery.

GPU Breadcrumbs
+++++++++++++++

Breadcrumbs described below are available only in Turnip.

Freedreno has simpler breadcrumbs: in debug builds it writes breadcrumbs
into ``CP_SCRATCH_REG[6]`` and per-tile breadcrumbs into ``CP_SCRATCH_REG[7]``,
so that they are available in the devcoredump. TODO: generalize Turnip's
breadcrumbs implementation.

This is a simple implementation of breadcrumbs tracking of GPU progress,
intended to be a last resort when debugging unrecoverable hangs.
For best results use Vulkan traces to have a predictable place of hang.

For ordinary hangs, use GFR ("Graphics Flight Recorder") as a more
user-friendly solution.

Our breadcrumbs implementation aims to handle cases where nothing can be done
after the hang. In-driver breadcrumbs also allow more precise tracking since
we could target a single GPU packet.

While breadcrumbs support gmem, try to reproduce the hang in sysmem mode
because it requires far fewer breadcrumb writes and syncs.

Breadcrumbs settings:

.. code-block:: console

   TU_BREADCRUMBS=%IP%:%PORT%,break=%BREAKPOINT%:%BREAKPOINT_HITS%

``BREAKPOINT``
   The breadcrumb starting from which we require explicit ack.
``BREAKPOINT_HITS``
   How many times the breakpoint should be reached for the break to occur.
   Necessary for gmem mode and re-usable cmdbuffers, in both of which
   the same cmdstream could be executed several times.

A typical workflow would be:

- Start listening for breadcrumbs on a remote host:

  .. code-block:: console

     nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

- Start capturing command stream;
- Replay the hanging trace with:

  .. code-block:: console

     TU_BREADCRUMBS=$IP:$PORT,break=-1:0

- Increase hangcheck period:

  .. code-block:: console

     echo -n 60000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

- After the GPU hang, note the last breadcrumb and relaunch the trace with:

  .. code-block:: console

     TU_BREADCRUMBS=%IP%:%PORT%,break=%LAST_BREADCRUMB%:%HITS%

- After the breakpoint is reached, each breadcrumb requires
  explicit ack from the user. This way it's possible to find
  the last packet which didn't hang.

- Find the packet in the decoded cmdstream.

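The listener pipeline in the workflow above turns the incoming byte stream into
``breadcrumb:hits`` lines. The Python sketch below mirrors only that decoding
step and does no networking. It assumes, as the pipeline does, that each
breadcrumb is a 4-byte word whose hex dump reads as the number in stream order;
``decode_breadcrumbs`` is a hypothetical helper.

```python
from collections import Counter


def decode_breadcrumbs(data: bytes):
    """Decode a byte stream into "value:hits" strings, like the xxd/awk pipeline."""
    hits = Counter()
    out = []
    # Only whole 4-byte words are decoded; a trailing partial word is ignored.
    for off in range(0, len(data) - len(data) % 4, 4):
        value = int.from_bytes(data[off:off + 4], "big")
        # Like a[$0]++ in awk, report how often this value was seen before.
        out.append(f"{value}:{hits[value]}")
        hits[value] += 1
    return out


stream = bytes.fromhex("00000001" "00000002" "00000001" "00")
decoded = decode_breadcrumbs(stream)
```

The hit count is what you would plug into ``%HITS%`` when the same breadcrumb
is reached several times, for example once per tile in gmem mode.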
Debugging random failures
^^^^^^^^^^^^^^^^^^^^^^^^^

In most cases random GPU faults and rendering artifacts are caused by some kind
of undefined behaviour that falls under the following categories:

- Usage of a stale reg value;
- Usage of stale memory (e.g. expecting it to be zeroed when it is not);
- Lack of proper synchronization.

Finding instances of stale reg reads
++++++++++++++++++++++++++++++++++++

Turnip has a debug option to stomp the registers with invalid values to catch
the cases where stale data is read.

.. code-block:: console

   MESA_VK_ABORT_ON_DEVICE_LOSS=1 \
   TU_DEBUG_STALE_REGS_RANGE=0x00000c00,0x0000be01 \
   TU_DEBUG_STALE_REGS_FLAGS=cmdbuf,renderpass \
   ./app

.. envvar:: TU_DEBUG_STALE_REGS_RANGE

   The register range in which registers will be stomped. Add ``inverse`` to the
   flags in order for this range to specify which registers NOT to stomp.

.. envvar:: TU_DEBUG_STALE_REGS_FLAGS

   ``cmdbuf``
      stomp registers at the start of each command buffer.
   ``renderpass``
      stomp registers before each renderpass.
   ``inverse``
      changes the meaning of ``TU_DEBUG_STALE_REGS_RANGE`` to
      "regs that should NOT be stomped".

The best way to pinpoint the reg which causes a failure is to bisect the regs
range. When a failure is caused by a combination of several registers,
the ``inverse`` flag may be set to find the reg which prevents the failure.

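
Bisecting the register range can be organised as an ordinary binary search.
The following is a minimal sketch under several assumptions: a single culprit
register lies inside the starting range, the range bounds are treated as
inclusive, and ``fails`` is a caller-supplied (hypothetical) predicate that
runs the application with the given ``TU_DEBUG_STALE_REGS_RANGE`` and reports
whether the failure reproduced.

```python
def stale_regs_range(lo, hi):
    """Format a range like the example above, e.g. 0x00000c00,0x0000be01."""
    return f"0x{lo:08x},0x{hi:08x}"


def bisect_reg_range(lo, hi, fails):
    """Narrow [lo, hi] down to the one register offset for which `fails` holds.

    `fails(lo, hi)` must return True when stomping registers lo..hi
    (assumed inclusive) still reproduces the failure.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(lo, mid):
            hi = mid       # the culprit is in the lower half
        else:
            lo = mid + 1   # otherwise it must be in the upper half
    return lo


# Stand-in predicate for illustration; a real one would launch the application.
culprit = 0x8C01
found = bisect_reg_range(0x0C00, 0xBE01, lambda lo, hi: lo <= culprit <= hi)
```

A real ``fails`` predicate would set
``TU_DEBUG_STALE_REGS_RANGE=stale_regs_range(lo, hi)`` in the environment of
the application and check for a device loss or a rendering mismatch.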