Opened 3 years ago
Last modified 3 years ago
#20625 new defect
After hardware upgrade OS/2 does not boot
Reported by: | James Moe | Owned by: | |
---|---|---|---|
Component: | host support | Version: | VirtualBox 6.1.26 |
Keywords: | os/2, hardware | Cc: | |
Guest type: | other | Host type: | Linux |
Description
opensuse tumbleweed 20211012 linux v5.14.9-1-default x86_64 VBox v6.1.26_SUSE r145957
After a hardware upgrade the OS/2 guest has stopped booting. The boot proceeds normally to a point, then simply stops.
Old CPU: AMD Athlon II 4x New CPU: AMD Ryzen 5 5600x
Mainboard Old: asus m5a88-m New: asus tuf gaming b550+
Attachments (1)
Change History (17)
by , 3 years ago
comment:1 by , 3 years ago
I hope it's not caused by software which has trouble with a too fast CPU.
If it is related to tripping over certain CPU feature details you could try letting VirtualBox report a different (older) CPU profile. This works from the command line.
You can list the CPU profiles with
$ VBoxManage list cpu-profiles
Selecting a profile for a certain VM works with
$ VBoxManage modifyvm "vmname" --cpu-profile="profile name"
Note that not all profiles are going to work on a specific host CPU. It does not add features to your CPU. The highest chance of success is with older CPU profiles from the same vendor (or quite old Intel models). Also, remember it is a profile specifying which CPU features to report. It does not have any impact on CPU clock speed.
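A minimal sketch of the above, wrapped so it can be tried without touching a VM. The VM name "vmname" is a placeholder and the DRY_RUN switch is an illustrative helper, not a VirtualBox feature; the profile name is one of the 6.1 names quoted in this ticket. By default the script only prints the command it would run.

```sh
#!/bin/sh
# Sketch: make VirtualBox report an older CPU profile to one guest.
# VM_NAME is a placeholder; PROFILE is one of the 6.1 profile names from this ticket.
VM_NAME="${VM_NAME:-vmname}"
PROFILE="${PROFILE:-Intel Pentium M processor 2.00GHz}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

set_cpu_profile() {
    # $1 = VM name, $2 = CPU profile name
    if [ "$DRY_RUN" = "1" ]; then
        printf 'VBoxManage modifyvm "%s" --cpu-profile "%s"\n' "$1" "$2"
    else
        VBoxManage modifyvm "$1" --cpu-profile "$2"
    fi
}

set_cpu_profile "$VM_NAME" "$PROFILE"
```

The VM has to be powered off for modifyvm to apply. As far as I know, passing "host" as the profile name restores the default behaviour.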
comment:2 by , 3 years ago
The VBoxManage list cpu-profiles command doesn't work in 6.1; it's a feature of the upcoming major release. Sorry. Here is a list of available CPU profile names for 6.1 (in alphabetical order):
"AMD Athlon 64 3200+"
"AMD Athlon 64 X2 Dual Core 4200+"
"AMD FX-8150 Eight-Core"
"AMD Phenom II X6 1100T"
"Hygon C86 7185 32-core"
"Intel 80186"
"Intel 80286"
"Intel 80386"
"Intel 80486"
"Intel 8086"
"Intel Atom 330 1.60GHz"
"Intel Core Duo T2600 2.16GHz"
"Intel Core i5-3570"
"Intel Core i7-2635QM"
"Intel Core i7-3820QM"
"Intel Core i7-3960X"
"Intel Core i7-5600U"
"Intel Core i7-6700K"
"Intel Core2 T7600 2.33GHz"
"Intel Core2 X6800 2.93GHz"
"Intel Pentium 4 3.00GHz"
"Intel Pentium M processor 2.00GHz"
"Intel Pentium N3530 2.16GHz"
"Intel Xeon X5482 3.20GHz"
"Quad-Core AMD Opteron 2384"
"VIA QuadCore L4700 1.2+ GHz"
"ZHAOXIN KaiXian KX-U5581 1.8GHz"
To debug the issue, I would suggest enabling the driver loading messages (Alt-F2 or similar) and letting us know where the boot stops.
comment:3 by , 3 years ago
I remember other aspects of this. (After the upgrade there was a lot happening.)
The os/2 guest had been saved before the upgrade. Afterwards, the saved session ran as expected. It is the boot process that is the issue.
The last line of the boot screen: c:\os2\boot\testcfg.sys
I tried:
- restoring from an older backup
- reducing the execution cap to 30%
comment:4 by , 3 years ago
FWIW, did you also try the current VirtualBox 6.1.28, which supports Linux kernel 5.14?
comment:5 by , 3 years ago
No.
I use the current version on the Tumbleweed repository, which is 6.1.26. It is usually only a week or so after a VBox release that the new version appears.
I have tried to mix the two releases in the past; it did not go well.
comment:6 by , 3 years ago
VBox was upgraded to 6.1.28_SUSE r147628 today.
It made no difference to the failure of os/2 to start.
comment:8 by , 3 years ago
@klaus / @bird: A wild guess:
A few other VirtualBox users upgraded their host to modern AMD CPUs and experienced crashes in their Windows 9x guests. The technical background (without hypervisors involved) is described in "Windows 9x TLB Invalidation Bug" and "TLB and Pagewalk Coherence in x86 Processors". Do you think that OS/2 may have the same bug as Windows 9x?
@jimoe: For a test, disable System > Acceleration > Enable Nested Paging.
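The same setting can also be toggled from the command line. A minimal sketch, assuming the 6.1 spelling of the option (--nestedpaging on|off); other major versions may spell it differently, so check the modifyvm help output first. "vmname" is a placeholder and DRY_RUN is an illustrative helper that defaults to printing the command only.

```sh
#!/bin/sh
# Sketch: switch nested paging on or off for one VM (VM must be powered off).
VM_NAME="${VM_NAME:-vmname}"   # placeholder VM name
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the command, 0 = actually run it

set_nested_paging() {
    # $1 = VM name, $2 = on|off
    case "$2" in
        on|off) ;;
        *) echo "usage: set_nested_paging VM on|off" >&2; return 2 ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        printf 'VBoxManage modifyvm "%s" --nestedpaging %s\n' "$1" "$2"
    else
        VBoxManage modifyvm "$1" --nestedpaging "$2"
    fi
}

set_nested_paging "$VM_NAME" off
```

Re-running with "on" restores the previous setting once the test is done.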
comment:10 by , 3 years ago
Thanks for reporting back! :)
Could you easily test whether OS/2 runs natively on your new hardware? The reason for asking is that the maintainer of the OS/2 Museum hasn't encountered your issue (yet), despite having similar setups. TIA.
comment:11 by , 3 years ago
Does it boot from a storage drive instead of a VM? I have no idea.
comment:12 by , 3 years ago
Just my $0.02 here, but booting an OS/2 build VM on a Threadripper 3990X would crash when the display doctor (SDD) started unless I set the CPU profile to some older Intel CPU (I picked "Intel Pentium M processor 2.00GHz" as I know OS/2 worked on that CPU (old ThinkPad)).
comment:13 by , 3 years ago
I think I've reproduced a related issue here while trying to implement unattended installation of OS/2 guests. TESTCFG.SYS frequently crashes during initialization after it returns from querying APM support from the BIOS in real mode (calling DevHlp 24h to read some DOS variable pointing to APM info). The registers restored from the stack in the epilogue are all wrong, and it finally #GPs on the RETF as there is no valid CS on the phantom stack it's using. When looking at the stack in the debugger, everything seems fine. Will try to track down where this goes south and whether it's specific to this real-mode tripping or not.
comment:14 by , 3 years ago
As mentioned, TESTCFG.SYS ends up causing a BIOS call in real mode. When switching to real mode, an identity-mapped page at virtual address 12000h is installed by tweaking the page directory / tables. When returning to protected mode, these page table and page directory changes are undone after enabling paging, but no CR3 flushing is done afterwards, and that's causing trouble.
VBoxDbg> u 1200:00000153 L 60
1200:00000153 55                        push bp
1200:00000154 8b ec                     mov bp, sp
1200:00000156 b8 00 0a                  mov ax, 00a00h
1200:00000159 8e d8                     mov ds, ax
1200:0000015b 9a 3a 27 00 12            call far 01200h:0273ah
1200:00000160 fa                        cli
1200:00000161 8b 46 04                  mov ax, word [bp+004h]
1200:00000164 25 ff 8f                  and ax, 08fffh
1200:00000167 0d 00 30                  or ax, 03000h
1200:0000016a 89 46 04                  mov word [bp+004h], ax
1200:0000016d 25 ff fd                  and ax, 0fdffh
1200:00000170 50                        push ax
1200:00000171 9d                        popfw
1200:00000172 e8 47 25                  call 026bch
1200:00000175 66 0f 01 16 12 13         lgdt [01312h]
1200:0000017b 66 0f 01 1e 18 13         lidt [01318h]
1200:00000181 66 a1 90 64               mov eax, dword [06490h]
1200:00000185 0f 22 d8                  mov cr3, eax                    ; Modified CR3 loaded prior to enabling paging.
1200:00000188 0f 20 c0                  mov eax, cr0
1200:0000018b 66 0b 06 5b 0d            or eax, dword [00d5bh]
1200:00000190 66 50                     push eax
1200:00000192 0f 22 c0                  mov cr0, eax
1200:00000195 ea 9a 01 00 12            jmp far 01200h:0019ah
1200:0000019a 33 c0                     xor ax, ax
1200:0000019c 8e c0                     mov es, ax
1200:0000019e b8 00 0a                  mov ax, 00a00h
1200:000001a1 8e d8                     mov ds, ax
1200:000001a3 8e 16 5f 0d               mov ss, [00d5fh]
1200:000001a7 e8 dc 24                  call 02686h
1200:000001aa 66 58                     pop eax
1200:000001ac 66 a9 00 00 00 80         test eax, dword 080000000h
1200:000001b2 0f 84 3a 00               je +0003ah (001f0h)
1200:000001b6 66 53                     push ebx                        ; Start of code restoring the page directory and page table
1200:000001b8 b8 60 01                  mov ax, 00160h                  ; it's original state.
1200:000001bb 8e d8                     mov ds, ax
1200:000001bd 67 66 a1 08 70 ed ff      mov eax, dword [0ffed7008h]
1200:000001c4 67 66 8b 1d 28 b9 ec ff   mov ebx, dword [0ffecb928h]
1200:000001cc 67 66 89 18               mov dword [eax], ebx
1200:000001d0 66 33 c0                  xor eax, eax
1200:000001d3 b8 00 12                  mov ax, 01200h
1200:000001d6 66 c1 e8 06               shr eax, 006h
1200:000001da 67 66 03 05 30 b9 ec ff   add eax, dword [0ffecb930h]
1200:000001e2 67 66 8b 1d 2c b9 ec ff   mov ebx, dword [0ffecb92ch]
1200:000001ea 67 66 89 18               mov dword [eax], ebx
1200:000001ee 66 5b                     pop ebx                         ; Restored, but no TLB flushing.
1200:000001f0 b8 28 01                  mov ax, 00128h
1200:000001f3 8e d8                     mov ds, ax
1200:000001f5 80 26 15 00 fd            and byte [00015h], 0fdh
1200:000001fa b8 10 00                  mov ax, 00010h
1200:000001fd 0f 00 d8                  ltr ax                          ; This can come in handy on AMD systems.
1200:00000200 b8 28 00                  mov ax, 00028h
1200:00000203 0f 00 d0                  lldt ax
1200:00000206 9a fc 2c 48 01            call far 00148h:02cfch
1200:0000020b 33 c0                     xor ax, ax
1200:0000020d 8e d8                     mov ds, ax
1200:0000020f 8e c0                     mov es, ax
1200:00000211 8e e0                     mov fs, ax
1200:00000213 8e e8                     mov gs, ax
1200:00000215 c9                        leave
1200:00000216 c3                        retn
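To illustrate why restoring the page table without a flush can go wrong, here is a toy model in shell. It is not OS/2 or VirtualBox code: it assumes a single page-table entry and a single TLB slot, and the names pte, tlb, translate and flush_tlb are made up for the illustration.

```sh
#!/bin/sh
# Toy model: one page-table entry for the page at 12000h and one TLB slot.
pte=""   # what the page table currently says the page maps to
tlb=""   # cached translation; empty means "nothing cached"

translate() {
    # A TLB hit returns the cached value without looking at the page table.
    if [ -z "$tlb" ]; then
        tlb="$pte"        # miss: walk the page table and cache the result
    fi
    result="$tlb"
}

flush_tlb() {
    tlb=""                # stands in for reloading CR3
}

# 1. Switching to real mode: the identity mapping is installed and used.
pte="identity"
translate
echo "during switch:           $result"

# 2. Back in protected mode: page table restored, but no flush.
pte="original"
translate
stale="$result"
echo "after restore, no flush: $stale"

# 3. Same restore, followed by a flush.
flush_tlb
translate
fresh="$result"
echo "after flush:             $fresh"
```

In this model step 2 still returns the identity mapping because the cached entry is never invalidated. Whether a real CPU still holds such an entry at that point depends on its TLB implementation, which would fit the failure only showing up on some hosts.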
Patching the code in the debugger to do a TLB flush fixes the problem (squeezed out 1 byte from the sequence loading 160h into DS, and 5 bytes from an inefficient load of 48h into eax, giving me the 6 bytes needed for a CR3 reload).
VBoxDbg> u 1200:1b8
1200:000001b8 68 60 01                  push 00160h
1200:000001bb 1f                        pop DS
1200:000001bc 67 66 a1 08 70 ed ff      mov eax, dword [0ffed7008h]
1200:000001c3 67 66 8b 1d 28 b9 ec ff   mov ebx, dword [0ffecb928h]
1200:000001cb 67 66 89 18               mov dword [eax], ebx
1200:000001cf 66 31 c0                  xor eax, eax
1200:000001d2 b0 48                     mov AL, 048h
1200:000001d4 67 66 03 05 30 b9 ec ff   add eax, dword [0ffecb930h]
1200:000001dc 67 66 8b 1d 2c b9 ec ff   mov ebx, dword [0ffecb92ch]
1200:000001e4 67 66 89 18               mov dword [eax], ebx
1200:000001e8 0f 20 d8                  mov eax, cr3                    ; Added TLB flush sequence.
1200:000001eb 0f 22 d8                  mov cr3, eax
1200:000001ee 66 5b                     pop ebx
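The numbers in the patch can be checked with a bit of shell arithmetic. This sketch assumes 4-byte page-table entries (classic 32-bit paging), which is my reading of the listing rather than something stated in it; the instruction lengths are counted from the opcode bytes in the two disassemblies.

```sh
#!/bin/sh
# Why "mov ax, 1200h / shr eax, 6" can be replaced by the constant 48h:
# (seg << 4) >> 12 is the page index of the segment base, and each entry is
# assumed to be 4 bytes, so the byte offset is (seg << 4 >> 12) * 4 = seg >> 6.
seg=$(( 0x1200 ))
offset=$(( seg >> 6 ))
printf 'PTE byte offset: %xh\n' "$offset"

# Byte budget of the patch (instruction lengths in bytes).
old_ds=$(( 3 + 2 ))        # mov ax, 160h ; mov ds, ax
new_ds=$(( 3 + 1 ))        # push 160h ; pop ds
old_off=$(( 3 + 3 + 4 ))   # xor eax, eax ; mov ax, 1200h ; shr eax, 6
new_off=$(( 3 + 2 ))       # xor eax, eax ; mov al, 48h
flush=$(( 3 + 3 ))         # mov eax, cr3 ; mov cr3, eax

saved=$(( (old_ds - new_ds) + (old_off - new_off) ))
echo "bytes saved: $saved, bytes needed for CR3 reload: $flush"
```

The saved bytes match the size of the added CR3 reload exactly, which is why the patched code still ends at the same "pop ebx" address (1200:000001ee).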
However, patching OS2KRNL is tedious as the image would need to be uncompressed before patching and recompressed afterwards. There are probably also fixups that need adjusting. So, for AMD hosts, we could intercept the LTR or the LLDT instructions and take a sledgehammer to the TLBs from the VMM. An LTR intercept with flush works here.
There are other things we could use to flush the TLBs from the VMM, as the call to 00148h:02cfch typically triggers a certain amount of I/O exits. However coming up with good heuristics for when to flush and when not to would be difficult, if not impossible.
P.S. I think there might be other copies of this code in SMP kernels.
comment:15 by , 3 years ago
@bird: Kudos for the detailed analysis and sharing it. :)
I do understand the TLB and Pagewalk Coherence issues in general, but I don't know much about OS/2 and nothing about the significance of TESTCFG.SYS and SDD: Do your findings indicate that OS/2 has issues similar to those of Windows 9x, so that "many" OS/2 users with current AMD CPUs will trip over it?
comment:16 by , 3 years ago
I would suggest upgrading to ArcaOS. Among other things, the TESTCFG.SYS has been completely rewritten and does not make any BIOS calls in real mode.
Further, Arca Noae is not aware of any issues such as described here (with any available host CPU, regardless of frequency, with or without nested paging enabled).
Log of failed os/2 boot. Line 1229 is the last entry before turning off the guest.