Changeset 74588 in vbox
- Timestamp: Oct 2, 2018 11:44:30 PM
- Location: trunk/src/VBox/VMM
- Files: 2 edited
trunk/src/VBox/VMM/VMMR3/NEMR3Native-win.cpp

```diff
--- trunk/src/VBox/VMM/VMMR3/NEMR3Native-win.cpp	(revision 74517)
+++ trunk/src/VBox/VMM/VMMR3/NEMR3Native-win.cpp	(revision 74588)
@@ -2070,4 +2070,10 @@
  *   packed.
  *
+ *   Update: Somewhere along the security fixes or/and microcode updates this
+ *   summer (2018), performance dropped even more.
+ *
+ *   Update [build 17757]: Some performance improvements here, but they don't
+ *   yet make up for what was lost this summer.
+ *
  *
  * - We need a way to directly modify the TSC offset (or bias if you like).
@@ -2100,4 +2106,160 @@
  *   there is no way to support X2APIC.
  *
+ *
+ * - Not sure if this is a thing, but WHvCancelVirtualProcessor seems to cause
+ *   cause a lot more spurious WHvRunVirtualProcessor returns that what we get
+ *   with the replacement code. By spurious returns we mean that the
+ *   subsequent call to WHvRunVirtualProcessor would return immediately.
+ *
+ *
+ * - There is no API for modifying protection of a page within a GPA range.
+ *
+ *   From what we can tell, the only way to modify the protection (like readonly
+ *   -> writable, or vice versa) is to first unmap the range and then remap it
+ *   with the new protection.
+ *
+ *   We are for instance doing this quite a bit in order to track dirty VRAM
+ *   pages. VRAM pages starts out as readonly, when the guest writes to a page
+ *   we take an exit, notes down which page it is, makes it writable and restart
+ *   the instruction. After refreshing the display, we reset all the writable
+ *   pages to readonly again, bulk fashion.
+ *
+ *   Now to work around this issue, we do page sized GPA ranges. In addition to
+ *   add a lot of tracking overhead to WinHvPlatform and VID.SYS, this also
+ *   causes us to exceed our quota before we've even mapped a default sized
+ *   (128MB) VRAM page-by-page. So, to work around this quota issue we have to
+ *   lazily map pages and actively restrict the number of mappings.
+ *
+ *   Our best workaround thus far is bypassing WinHvPlatform and VID entirely
+ *   when in comes to guest memory management and instead use the underlying
+ *   hypercalls (HvCallMapGpaPages, HvCallUnmapGpaPages) to do it ourselves.
+ *   (This also maps a whole lot better into our own guest page management
+ *   infrastructure.)
+ *
+ *   Update [build 17757]: Introduces a KVM like dirty logging API which could
+ *   help tracking dirty VGA pages, while being useless for shadow ROM and
+ *   devices trying catch the guest updating descriptors and such.
+ *
+ *
+ * - Observed problems doing WHvUnmapGpaRange immediately followed by
+ *   WHvMapGpaRange.
+ *
+ *   As mentioned above, we've been forced to use this sequence when modifying
+ *   page protection. However, when transitioning from readonly to writable,
+ *   we've ended up looping forever with the same write to readonly memory
+ *   VMEXIT. We're wondering if this issue might be related to the lazy mapping
+ *   logic in WinHvPlatform.
+ *
+ *   Workaround: Insert a WHvRunVirtualProcessor call and make sure to get a GPA
+ *   unmapped exit between the two calls. Not entirely great performance wise
+ *   (or the santity of our code).
+ *
+ *
+ * - Implementing A20 gate behavior is tedious, where as correctly emulating the
+ *   A20M# pin (present on 486 and later) is near impossible for SMP setups
+ *   (e.g. possiblity of two CPUs with different A20 status).
+ *
+ *   Workaround: Only do A20 on CPU 0, restricting the emulation to HMA. We
+ *   unmap all pages related to HMA (0x100000..0x10ffff) when the A20 state
+ *   changes, lazily syncing the right pages back when accessed.
+ *
+ *
+ * - WHVRunVirtualProcessor wastes time converting VID/Hyper-V messages to its
+ *   own format (WHV_RUN_VP_EXIT_CONTEXT).
+ *
+ *   We understand this might be because Microsoft wishes to remain free to
+ *   modify the VID/Hyper-V messages, but it's still rather silly and does slow
+ *   things down a little. We'd much rather just process the messages directly.
+ *
+ *
+ * - WHVRunVirtualProcessor would've benefited from using a callback interface:
+ *
+ *     - The potential size changes of the exit context structure wouldn't be
+ *       an issue, since the function could manage that itself.
+ *
+ *     - State handling could probably be simplified (like cancelation).
+ *
+ *
+ * - WHvGetVirtualProcessorRegisters and WHvSetVirtualProcessorRegisters
+ *   internally converts register names, probably using temporary heap buffers.
+ *
+ *   From the looks of things, they are converting from WHV_REGISTER_NAME to
+ *   HV_REGISTER_NAME from in the "Virtual Processor Register Names" section in
+ *   the "Hypervisor Top-Level Functional Specification" document. This feels
+ *   like an awful waste of time.
+ *
+ *   We simply cannot understand why HV_REGISTER_NAME isn't used directly here,
+ *   or at least the same values, making any conversion reduntant. Restricting
+ *   access to certain registers could easily be implement by scanning the
+ *   inputs.
+ *
+ *   To avoid the heap + conversion overhead, we're currently using the
+ *   HvCallGetVpRegisters and HvCallSetVpRegisters calls directly, at least for
+ *   the ring-0 code.
+ *
+ *   Update [build 17757]: Register translation has been very cleverly
+ *   optimized and made table driven (2 top level tables, 4 + 1 leaf tables).
+ *   Register information consists of the 32-bit HV register name, register page
+ *   offset, and flags (giving valid offset, size and more). Register
+ *   getting/settings seems to be done by hoping that the register page provides
+ *   it all, and falling back on the VidSetVirtualProcessorState if one or more
+ *   registers are not available there.
+ *
+ *   Note! We have currently not updated our ring-0 code to take the register
+ *   page into account, so it's suffering a little compared to the ring-3 code
+ *   that now uses the offical APIs for registers.
+ *
+ *
+ * - The YMM and XCR0 registers are not yet named (17083). This probably
+ *   wouldn't be a problem if HV_REGISTER_NAME was used, see previous point.
+ *
+ *   Update [build 17757]: XCR0 is added. YMM register values seems to be put
+ *   into a yet undocumented XsaveState interface. Approach is a little bulky,
+ *   but saves number of enums and dispenses with register transation. Also,
+ *   the underlying Vid setter API duplicates the input buffer on the heap,
+ *   adding a 16 byte header.
+ *
+ *
+ * - Why does VID.SYS only query/set 32 registers at the time thru the
+ *   HvCallGetVpRegisters and HvCallSetVpRegisters hypercalls?
+ *
+ *   We've not trouble getting/setting all the registers defined by
+ *   WHV_REGISTER_NAME in one hypercall (around 80). Some kind of stack
+ *   buffering or similar?
+ *
+ *
+ * - To handle the VMMCALL / VMCALL instructions, it seems we need to intercept
+ *   \#UD exceptions and inspect the opcodes. A dedicated exit for hypercalls
+ *   would be more efficient, esp. for guests using \#UD for other purposes..
+ *
+ *
+ * - Wrong instruction length in the VpContext with unmapped GPA memory exit
+ *   contexts on 17115/AMD.
+ *
+ *   One byte "PUSH CS" was reported as 2 bytes, while a two byte
+ *   "MOV [EBX],EAX" was reported with a 1 byte instruction length. Problem
+ *   naturally present in untranslated hyper-v messages.
+ *
+ *
+ * - The I/O port exit context information seems to be missing the address size
+ *   information needed for correct string I/O emulation.
+ *
+ *   VT-x provides this information in bits 7:9 in the instruction information
+ *   field on newer CPUs. AMD-V in bits 7:9 in the EXITINFO1 field in the VMCB.
+ *
+ *   We can probably work around this by scanning the instruction bytes for
+ *   address size prefixes. Haven't investigated it any further yet.
+ *
+ *
+ * - Querying WHvCapabilityCodeExceptionExitBitmap returns zero even when
+ *   intercepts demonstrably works (17134).
+ *
+ *
+ * - Querying HvPartitionPropertyDebugChannelId via HvCallGetPartitionProperty
+ *   (hypercall) hangs the host (17134).
+ *
+ *
+ *
+ * Old concerns that have been addressed:
  *
  * - The WHvCancelVirtualProcessor API schedules a dummy usermode APC callback
@@ -2116,9 +2278,10 @@
  *   modifications and the extra kernel call.
  *
- *
- * - Not sure if this is a thing, but WHvCancelVirtualProcessor seems to cause
- *   cause a lot more spurious WHvRunVirtualProcessor returns that what we get
- *   with the replacement code. By spurious returns we mean that the
- *   subsequent call to WHvRunVirtualProcessor would return immediately.
+ *   Update: All concerns have addressed in or about build 17757.
+ *
+ *   The WHvCancelVirtualProcessor API is now implemented using a new
+ *   VidMessageSlotHandleAndGetNext() flag (4). Codepath is slightly longer
+ *   than NtAlertThread, but has the added benefit that spurious wakeups can be
+ *   more easily reduced.
  *
  *
@@ -2131,4 +2294,8 @@
  *   what is missing from his point of view in a single kernel call.
  *
+ *   Update: All concerns have been addressed in or about build 17757. Selected
+ *   registers are now available via shared memory and thus HLT should (not
+ *   verified) no longer require a system call to compose the exit context data.
+ *
  *
  * - The WHvRunVirtualProcessor implementation does lazy GPA range mappings when
@@ -2143,127 +2310,5 @@
  *   point).
  *
- *
- * - There is no API for modifying protection of a page within a GPA range.
- *
- *   From what we can tell, the only way to modify the protection (like readonly
- *   -> writable, or vice versa) is to first unmap the range and then remap it
- *   with the new protection.
- *
- *   We are for instance doing this quite a bit in order to track dirty VRAM
- *   pages. VRAM pages starts out as readonly, when the guest writes to a page
- *   we take an exit, notes down which page it is, makes it writable and restart
- *   the instruction. After refreshing the display, we reset all the writable
- *   pages to readonly again, bulk fashion.
- *
- *   Now to work around this issue, we do page sized GPA ranges. In addition to
- *   add a lot of tracking overhead to WinHvPlatform and VID.SYS, this also
- *   causes us to exceed our quota before we've even mapped a default sized
- *   (128MB) VRAM page-by-page. So, to work around this quota issue we have to
- *   lazily map pages and actively restrict the number of mappings.
- *
- *   Our best workaround thus far is bypassing WinHvPlatform and VID entirely
- *   when in comes to guest memory management and instead use the underlying
- *   hypercalls (HvCallMapGpaPages, HvCallUnmapGpaPages) to do it ourselves.
- *   (This also maps a whole lot better into our own guest page management
- *   infrastructure.)
- *
- *
- * - Observed problems doing WHvUnmapGpaRange immediately followed by
- *   WHvMapGpaRange.
- *
- *   As mentioned above, we've been forced to use this sequence when modifying
- *   page protection. However, when transitioning from readonly to writable,
- *   we've ended up looping forever with the same write to readonly memory
- *   VMEXIT. We're wondering if this issue might be related to the lazy mapping
- *   logic in WinHvPlatform.
- *
- *   Workaround: Insert a WHvRunVirtualProcessor call and make sure to get a GPA
- *   unmapped exit between the two calls. Not entirely great performance wise
- *   (or the santity of our code).
- *
- *
- * - Implementing A20 gate behavior is tedious, where as correctly emulating the
- *   A20M# pin (present on 486 and later) is near impossible for SMP setups
- *   (e.g. possiblity of two CPUs with different A20 status).
- *
- *   Workaround: Only do A20 on CPU 0, restricting the emulation to HMA. We
- *   unmap all pages related to HMA (0x100000..0x10ffff) when the A20 state
- *   changes, lazily syncing the right pages back when accessed.
- *
- *
- * - WHVRunVirtualProcessor wastes time converting VID/Hyper-V messages to its
- *   own format (WHV_RUN_VP_EXIT_CONTEXT).
- *
- *   We understand this might be because Microsoft wishes to remain free to
- *   modify the VID/Hyper-V messages, but it's still rather silly and does slow
- *   things down a little. We'd much rather just process the messages directly.
- *
- *
- * - WHVRunVirtualProcessor would've benefited from using a callback interface:
- *
- *     - The potential size changes of the exit context structure wouldn't be
- *       an issue, since the function could manage that itself.
- *
- *     - State handling could probably be simplified (like cancelation).
- *
- *
- * - WHvGetVirtualProcessorRegisters and WHvSetVirtualProcessorRegisters
- *   internally converts register names, probably using temporary heap buffers.
- *
- *   From the looks of things, they are converting from WHV_REGISTER_NAME to
- *   HV_REGISTER_NAME from in the "Virtual Processor Register Names" section in
- *   the "Hypervisor Top-Level Functional Specification" document. This feels
- *   like an awful waste of time.
- *
- *   We simply cannot understand why HV_REGISTER_NAME isn't used directly here,
- *   or at least the same values, making any conversion reduntant. Restricting
- *   access to certain registers could easily be implement by scanning the
- *   inputs.
- *
- *   To avoid the heap + conversion overhead, we're currently using the
- *   HvCallGetVpRegisters and HvCallSetVpRegisters calls directly.
- *
- *
- * - The YMM and XCR0 registers are not yet named (17083). This probably
- *   wouldn't be a problem if HV_REGISTER_NAME was used, see previous point.
- *
- *
- * - Why does VID.SYS only query/set 32 registers at the time thru the
- *   HvCallGetVpRegisters and HvCallSetVpRegisters hypercalls?
- *
- *   We've not trouble getting/setting all the registers defined by
- *   WHV_REGISTER_NAME in one hypercall (around 80). Some kind of stack
- *   buffering or similar?
- *
- *
- * - To handle the VMMCALL / VMCALL instructions, it seems we need to intercept
- *   \#UD exceptions and inspect the opcodes. A dedicated exit for hypercalls
- *   would be more efficient, esp. for guests using \#UD for other purposes..
- *
- *
- * - Wrong instruction length in the VpContext with unmapped GPA memory exit
- *   contexts on 17115/AMD.
- *
- *   One byte "PUSH CS" was reported as 2 bytes, while a two byte
- *   "MOV [EBX],EAX" was reported with a 1 byte instruction length. Problem
- *   naturally present in untranslated hyper-v messages.
- *
- *
- * - The I/O port exit context information seems to be missing the address size
- *   information needed for correct string I/O emulation.
- *
- *   VT-x provides this information in bits 7:9 in the instruction information
- *   field on newer CPUs. AMD-V in bits 7:9 in the EXITINFO1 field in the VMCB.
- *
- *   We can probably work around this by scanning the instruction bytes for
- *   address size prefixes. Haven't investigated it any further yet.
- *
- *
- * - Querying WHvCapabilityCodeExceptionExitBitmap returns zero even when
- *   intercepts demonstrably works (17134).
- *
- *
- * - Querying HvPartitionPropertyDebugChannelId via HvCallGetPartitionProperty
- *   (hypercall) hangs the host (17134).
+ *   Update: All concerns have been addressed in or about build 17757.
  *
  *
@@ -2358,5 +2403,5 @@
  * @subsection sec_nem_win_benchmarks Benchmarks.
  *
- * @subsubsection subsect_nem_win_benchmarks_bs2t1 Bootsector2-test1
+ * @subsubsection subsect_nem_win_benchmarks_bs2t1 17134/2018-06-22: Bootsector2-test1
  *
  * This is ValidationKit/bootsectors/bootsector2-test1.asm as of 2018-06-22
@@ -2467,5 +2512,49 @@
  *
  *
- * @subsubsection subsect_nem_win_benchmarks_w2k Windows 2000 Boot & Shutdown
+ * @subsubsection subsect_nem_win_benchmarks_bs2t1u1 17134/2018-10-02: Bootsector2-test1
+ *
+ * Update on 17134. While expectantly testing a couple of newer builds (17758,
+ * 17763) hoping for some increases in performance, the numbers turn out
+ * to be generally worse than the initial test June test run. So, I went back
+ * to the 1803 (17134) installation and re-tested, finding that the numbers had
+ * somehow turned worse over the last 3-4 months.
+ *
+ *
+ *
+ * Suspects are security updates and/or microcode updates installed since then,
+ * either hitting thread switching and/or hyper-V badly. I'm a bit puzzled why
+ * AMD is affected this badly too.
+ *
+ *
+ * @subsubsection subsect_nem_win_benchmarks_bs2t1u2 17763: Bootsector2-test1
+ *
+ * Some preliminary numbers for build 17763 on the 3.4 GHz AMD 1950X, the second
+ * column will improve we get time to have a look the register page.
+ *
+ * There is a 50% performance loss here compared to the June numbers with
+ * build 17134. The RDTSC numbers hits that it isn't in the Hyper-V core
+ * (hvax64.exe), but something on the NT side.
+ *
+ * @verbatim
+TESTING...                                              WinHv API           Hypercalls + VID    VirtualBox AMD-V
+  32-bit paged protected mode, CPUID                  :      54 145 ins/sec      51 436
+  real mode, CPUID                                    :      54 178 ins/sec      51 713
+  [snip]
+  32-bit paged protected mode, RDTSC                  :  98 927 639 ins/sec 100 254 552
+  real mode, RDTSC                                    :  99 601 206 ins/sec 100 886 699
+  [snip]
+  32-bit paged protected mode, 32-bit IN              :      54 621 ins/sec      51 524
+  32-bit paged protected mode, 32-bit OUT             :      54 870 ins/sec      51 671
+  32-bit paged protected mode, 32-bit IN-to-ring-3    :      54 624 ins/sec      43 964
+  32-bit paged protected mode, 32-bit OUT-to-ring-3   :      54 803 ins/sec      44 087
+  [snip]
+  32-bit paged protected mode, 32-bit read            :      28 230 ins/sec      34 042
+  32-bit paged protected mode, 32-bit write           :      27 962 ins/sec      34 050
+  32-bit paged protected mode, 32-bit read-to-ring-3  :      27 841 ins/sec      28 397
+  32-bit paged protected mode, 32-bit write-to-ring-3 :      27 896 ins/sec      29 455
+ * @endverbatim
+ *
+ *
+ * @subsubsection subsect_nem_win_benchmarks_w2k 17134/2018-06-22: Windows 2000 Boot & Shutdown
  *
  * Timing the startup and automatic shutdown of a Windows 2000 SP4 guest serves
```
trunk/src/VBox/VMM/testcase/NemRawBench-1.cpp

```diff
--- trunk/src/VBox/VMM/testcase/NemRawBench-1.cpp	(revision 74582)
+++ trunk/src/VBox/VMM/testcase/NemRawBench-1.cpp	(revision 74588)
@@ -1258,5 +1258,5 @@
     if (rcExit == 0)
     {
-        printf("tstNem Mini-1: Successfully created test VM...\n");
+        printf("tstNemBench-1: Successfully created test VM...\n");
 
         /*
@@ -1267,5 +1267,5 @@
         mmioTest(cFactor);
 
-        printf("tstNem Mini-1: done\n");
+        printf("tstNemBench-1: done\n");
     }
     return rcExit;
```