Opened 13 years ago
Closed 8 years ago
#10535 closed defect (obsolete)
ehci_handle_endpoint_reclaimation panic in Solaris guest VM on VirtualBox 4.1.15
Reported by: | jayce | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 4.1.14 |
Keywords: | solaris usb ehci panic | Cc: | jason.banham@… |
Guest type: | Solaris | Host type: | Mac OS X |
Description
Problem Description
I was trying to attach a Kingston USB memory stick into the guest VM (Solaris 11) when the guest VM panic'd.
Host O/S is MacOS 10.6.8 and running VirtualBox 4.1.15 after recent host O/S panic's upon guest O/S (Linux) reboots. VirtualBox was upgraded to 4.1.15 as per:
Ticket #9897 Synopsis: Frequent panics on Mac OS X when shutting down/starting VMs.
Extension pack is: Oracle VM VirtualBox Extension Pack 4.1.4r77440
... I didn't see a 4.1.15 extension pack.
Core Dump Analysis
The panic string is this:
> $C ffffff0003ea7af0 ddi_get32+0x12() ffffff0003ea7b30 ehci_handle_endpoint_reclaimation+0x91(ffffff0149379000) ffffff0003ea7b70 ehci_intr+0x184(ffffff0149379000, 0) ffffff0003ea7bc0 av_dispatch_autovect+0x74(13) ffffff0003ea7c00 dispatch_hardint+0x33(13, 0) ffffff0003e05a50 switch_sp_and_call+0x13() ffffff0003e05aa0 do_interrupt+0xb6(ffffff0003e05ab0, 1) ffffff0003e05ab0 _interrupt+0xba() ffffff0003e05ba0 mach_cpu_idle+6() ffffff0003e05bd0 cpu_idle+0xb2() ffffff0003e05c00 idle+0x116() ffffff0003e05c10 thread_start+8() From msgbuf we see: 2012 May 9 09:07:13 ffffff0161aa3ac0 WARNING: /pci@0,0/pci8086,265c@b (ehci0): Connecting device on port 1 failed 2012 May 9 09:08:07 ffffff0161ab3e00 panic[cpu0]/thread=ffffff0003ea7c20: 2012 May 9 09:08:07 ffffff0161aa3640 BAD TRAP: type=e (#pf Page fault) rp=ffffff0003ea79c0 addr=ffffff023772af00 2012 May 9 09:08:07 ffffff015b93fa00 2012 May 9 09:08:07 ffffff0167d940c0 sched: 2012 May 9 09:08:07 ffffff0168024a00 #pf Page fault 2012 May 9 09:08:07 ffffff0161aa3100 Bad kernel fault at addr=0xffffff023772af00 2012 May 9 09:08:08 ffffff0168080200 pid=0, pc=0xfffffffffb85ff32, sp=0xffffff0003ea7ab8, eflags=0x10246 2012 May 9 09:08:08 ffffff0161aa3e80 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6b8<xmme,fxsr,pge,pae,pse,de> 2012 May 9 09:08:08 ffffff0161aa3b80 cr2: ffffff023772af00 2012 May 9 09:08:08 ffffff0168080d40 cr3: 38a6000 2012 May 9 09:08:08 ffffff0168080ec0 cr8: c 2012 May 9 09:08:08 ffffff016807e840 2012 May 9 09:08:08 ffffff0168024340 rdi: ffffff014bbe5bc0 rsi: ffffff023772af00 rdx: c 2012 May 9 09:08:08 ffffff016807e300 rcx: ffffff014acfe300 r8: 2ceaf r9: 2ceaf 2012 May 9 09:08:08 ffffff0161aa3400 rax: ffffff023772af00 rbx: ffffff014bb98800 rbp: ffffff0003ea7af0 2012 May 9 09:08:08 ffffff015c7959c0 r10: fffffffffbcb9ab0 r11: fffffffffb82d0bc r12: ffffff0149379000 2012 May 9 09:08:08 ffffff0161aa3d00 r13: ffffff023772af00 r14: ffffff014bb98854 r15: ffffff01466b4c30 2012 May 9 09:08:08 ffffff0161aa31c0 fsb: 0 gsb: fffffffffbc3ebc0 ds: 4b 2012 May 9 09:08:08 ffffff0168080800 es: 4b fs: 0 gs: 1c3 2012 May 9 09:08:09 ffffff0161aa3340 trp: e err: 0 rip: fffffffffb85ff32 2012 May 9 09:08:09 ffffff0168024580 cs: 30 rfl: 10246 rsp: ffffff0003ea7ab8 2012 May 9 09:08:09 ffffff0161ab3d40 ss: 38 2012 May 9 09:08:09 ffffff0161ab3c80 2012 May 9 09:08:09 ffffff0161ab3bc0 ffffff0003ea78e0 unix:die+131 () 2012 May 9 09:08:09 ffffff0161ab3b00 ffffff0003ea79b0 unix:trap+152b () 2012 May 9 09:08:09 ffffff0161ab3a40 ffffff0003ea79c0 unix:cmntrap+e6 () 2012 May 9 09:08:09 ffffff0161ab3980 ffffff0003ea7af0 unix:ddi_getl+12 () 2012 May 9 09:08:09 ffffff0161ab38c0 ffffff0003ea7b30 ehci:ehci_handle_endpoint_reclaimation+91 () 2012 May 9 09:08:09 ffffff0161ab3800 ffffff0003ea7b70 ehci:ehci_intr+184 () 2012 May 9 09:08:09 ffffff0161ab3740 ffffff0003ea7bc0 unix:av_dispatch_autovect+74 () 2012 May 9 09:08:09 ffffff0161ab3680 ffffff0003ea7c00 unix:dispatch_hardint+33 () 2012 May 9 09:08:09 ffffff0161ab35c0 ffffff0003e05a50 unix:switch_sp_and_call+13 () 2012 May 9 09:08:09 ffffff0161ab3500 ffffff0003e05aa0 unix:do_interrupt+b6 () 2012 May 9 09:08:10 ffffff0161ab3440 ffffff0003e05ab0 unix:cmnint+ba () 2012 May 9 09:08:10 ffffff0161ab3380 ffffff0003e05ba0 unix:mach_cpu_idle+6 () 2012 May 9 09:08:10 ffffff0161ab32c0 ffffff0003e05bd0 unix:cpu_idle+b2 () 2012 May 9 09:08:10 ffffff0161ab3200 ffffff0003e05c00 unix:idle+116 () 2012 May 9 09:08:10 ffffff0161ab3140 ffffff0003e05c10 unix:thread_start+8 () 2012 May 9 09:08:10 ffffff0161ab3080 2012 May 9 09:09:03 ffffff0161e85f00 panic[cpu0]/thread=ffffff0003ea7c20: 2012 May 9 09:09:03 ffffff0161e85e40 BAD TRAP: type=e (#pf Page fault) rp=fffffffffbc879d0 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference 2012 May 9 09:09:03 ffffff0161e85d80 2012 May 9 09:09:03 ffffff0161e85cc0 syncing file systems... 2012 May 9 09:09:03 ffffff0161e85c00 done 2012 May 9 09:09:04 ffffff0161e85b40 dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel 2012 May 9 09:09:05 ffffff0161e85a80 NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port => Interesting to see we couldn't attach the device on port 1 > *panic_thread::findstack -v stack pointer for thread ffffff0003ea7c20: ffffff0003ea7730 ffffff0003ea7870 0xffffff0003ea79c0() ffffff0003ea79c0 0x10000000e() ffffff0003ea7af0 ddi_get32+0x12() ffffff0003ea7b30 ehci_handle_endpoint_reclaimation+0x91(ffffff0149379000) ffffff0003ea7b70 ehci_intr+0x184(ffffff0149379000, 0) ffffff0003ea7bc0 av_dispatch_autovect+0x74(13) ffffff0003ea7c00 dispatch_hardint+0x33(13, 0) ffffff0003e05a50 switch_sp_and_call+0x13() ffffff0003e05aa0 do_interrupt+0xb6(ffffff0003e05ab0, 1) ffffff0003e05ab0 _interrupt+0xba() ffffff0003e05ba0 mach_cpu_idle+6() ffffff0003e05bd0 cpu_idle+0xb2() ffffff0003e05c00 idle+0x116() ffffff0003e05c10 thread_start+8() => Some confusion over whether we're running ddi_getl() or ddi_get32() > ddi_get32::dis ddi_get32: movl 0x68(%rdi),%edx ddi_get32+3: cmpl $0xa,%edx ddi_get32+6: jne +0x5 <ddi_get32+0xd> ddi_get32+8: movq %rsi,%rdx ddi_get32+0xb: inl (%dx) ddi_get32+0xc: ret ddi_get32+0xd: cmpl $0xc,%edx ddi_get32+0x10: jne +0x3 <ddi_get32+0x15> ddi_get32+0x12: movl (%rsi),%eax ddi_get32+0x14: ret ddi_get32+0x15: jmp *0x88(%rdi) => So we gell over at instruction offset 0x12 which was: movl (%rsi),%eax > ::regs %rax = 0xffffff023772af00 %r9 = 0x000000000002ceaf %rbx = 0xffffff014bb98800 %r10 = 0xfffffffffbcb9ab0 lwpsleepq+0x3570 %rcx = 0xffffff014acfe300 %r11 = 0xfffffffffb82d0bc dispatch_hardint %rdx = 0x000000000000000c %r12 = 0xffffff0149379000 %rsi = 0xffffff023772af00 %r13 = 0xffffff023772af00 %rdi = 0xffffff014bbe5bc0 %r14 = 0xffffff014bb98854 %r8 = 0x000000000002ceaf %r15 = 0xffffff01466b4c30 %rip = 0xfffffffffb85ff32 ddi_get32+0x12 %rbp = 0xffffff0003ea7af0 %rsp = 0xffffff0003ea7ab8 %rflags = 0x00010246 id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0 status=<of,df,IF,tf,sf,ZF,af,PF,cf> %cs = 0x0030 %ds = 0x004b %es = 0x004b %trapno = 0xe %fs = 0x0000 %gs = 0x01c3 %err = 0x0 The value of %rsi is 0xffffff023772af00, we're trying to dereference that and store it into %eax > 0xffffff023772af00/J mdb: failed to read data from target: no mapping for address 0xffffff023772af00: And it sure looks bogus to me, plus it's the bogus address seen in the panic stack. So this would have been passed upto us via ehci_handle_endpoint_reclaimation() Looking at the illumos source code (which may be slightly different to the Solaris 11 source, which isn't publicly available): 217 /* 218 * ehci_handle_endpoint_reclamation: 219 * 220 * Reclamation of Host Controller (HC) Endpoint Descriptors (QH). 221 */ 222 void 223 ehci_handle_endpoint_reclaimation(ehci_state_t *ehcip) 224 { 225 usb_frame_number_t current_frame_number; 226 usb_frame_number_t endpoint_frame_number; 227 ehci_qh_t *reclaim_qh; 228 > ffffff0149379000::print ehci_state_t { ehci_dip = 0xffffff01466b4c30 ehci_instance = 0 ehci_hcdi_ops = 0xffffff0148486388 ehci_cb_hdl = 0xffffff01466b4e90 ehci_flags = 0x1e ehci_vendor_id = 0x8086 ehci_device_id = 0x265c ehci_rev_id = 0 ehci_caps_handle = 0xffffff014bbe5d00 ehci_capsp = 0xffffff010272a000 ehci_regsp = 0xffffff010272a020 ehci_config_handle = 0xffffff014bbe5e40 ehci_frame_interval = 0 ehci_dma_attr = { . . . (truncated) It's quite a long structure and I'm not 100% familiar with it, so not sure it's worth reviewing it all. Can I look at the device ID? > 0xffffff01466b4c30::print struct dev_info { devi_parent = 0xffffff01466b8688 devi_child = 0 devi_sibling = 0xffffff01466b4960 devi_binding_name = 0xffffff0146690b1c "pciclass,0c0320" devi_addr = 0xffffff014bbf3c00 "b" . . . devi_hw_prop_ptr = 0xffffff014b6b6418 devi_node_name = 0xffffff0148832e48 "pci8086,265c" devi_compat_names = 0xffffff0146690b00 "pci8086,265c.0" Hmmm, seems so. How about the vendor id and device id? 0x8086 Intel Corporation Device Id Chip Description Vendor Id Vendor Name 0x265C USB 2.0 EHCI Controller 0x8086 Intel Corporation Well the structure itself looks reasonably sound from an initial review. So where did we hand over to ddi_get32() ?? ehci_handle_endpoint_reclaimation+0x89: movq %r12,%rsi ehci_handle_endpoint_reclaimation+0x8c: call -0x296d <ehci_deallocate_qh> ehci_handle_endpoint_reclaimation+0x91: movq 0x650(%r13),%r12 Interesting, the dis-assembly shows us calling ehci_deallocate_qh() and not ddi_get32() Taking apart the stack: > ffffff0003ea7730-0x60,100/nap 0xffffff0003ea76d0: 0xffffff0003ea76d0: 0xd1 0xffffff0003ea76d8: 0xd1 0xffffff0003ea76e0: 0xe 0xffffff0003ea76e8: 0xffffff0003ea77b0 0xffffff0003ea76f0: kvseg 0xffffff0003ea76f8: 0xffffff0003ea7778 0xffffff0003ea7700: 0xffffff0003ea7760 0xffffff0003ea7708: avl_find+0x56 0xffffff0003ea7710: vpanic+0x22 0xffffff0003ea7718: 0xfffffffffb957418 0xffffff0003ea7720: 0xffffff0003ea7830 0xffffff0003ea7728: 0xfffffffffb957698 0xffffff0003ea7730: 0xffffff0003ea7870 0xffffff0003ea7738: 0xffffff0003ea79c0 0xffffff0003ea7740: 0xffffff023772af00 0xffffff0003ea7748: 0xffffff0003ea7780 . . . 0xffffff0003ea7a48: 0xffffff014bbe5d00 0xffffff0003ea7a50: 0xffffff0003ea7ac0 0xffffff0003ea7a58: 0x4b 0xffffff0003ea7a60: 0x4b 0xffffff0003ea7a68: 0 0xffffff0003ea7a70: 0x1c3 0xffffff0003ea7a78: 0xe 0xffffff0003ea7a80: 0 0xffffff0003ea7a88: ddi_get32+0x12 0xffffff0003ea7a90: 0x30 0xffffff0003ea7a98: 0x10246 0xffffff0003ea7aa0: 0xffffff0003ea7ab8 0xffffff0003ea7aa8: 0x38 0xffffff0003ea7ab0: 0xffffff0003ea7af0 0xffffff0003ea7ab8: ehci_deallocate_qh+0x57 0xffffff0003ea7ac0: 0xffffffffc002cea0 0xffffff0003ea7ac8: 0xffffff0149379000 0xffffff0003ea7ad0: 0xffffff014bb98800 0xffffff0003ea7ad8: 0x23a007 0xffffff0003ea7ae0: 0xffffff014bb98800 0xffffff0003ea7ae8: 0xffffff0149379000 0xffffff0003ea7af0: 0xffffff0003ea7b30 0xffffff0003ea7af8: ehci_handle_endpoint_reclaimation+0x91 0xffffff0003ea7b00: 0xffffff01466b4c30 0xffffff0003ea7b08: 0xffffff014b6d3298 ... seems we do call ehci_deallocate_qh() before getting to ddi_get32() > ehci_deallocate_qh::dis ehci_deallocate_qh: pushq %rbp ehci_deallocate_qh+1: movq %rsp,%rbp ehci_deallocate_qh+4: subq $0x10,%rsp ehci_deallocate_qh+8: movq %rdi,-0x8(%rbp) ehci_deallocate_qh+0xc: movq %rsi,-0x10(%rbp) ehci_deallocate_qh+0x10: pushq %rbx ehci_deallocate_qh+0x11: pushq %r12 ehci_deallocate_qh+0x13: pushq %r13 ehci_deallocate_qh+0x15: subq $0x8,%rsp ehci_deallocate_qh+0x19: movq %rdi,%r12 ehci_deallocate_qh+0x1c: movq %rsi,%rbx ehci_deallocate_qh+0x1f: movq 0x128(%r12),%rdi ehci_deallocate_qh+0x27: leaq 0x10(%rbx),%rsi ehci_deallocate_qh+0x2b: call +0x3ae00dc <ddi_get32> ehci_deallocate_qh+0x30: andq $0xffffffffffffffe0,%rax ehci_deallocate_qh+0x34: movq %r12,%rdi ehci_deallocate_qh+0x37: movq %rax,%rsi ehci_deallocate_qh+0x3a: call +0x1671 <ehci_qtd_iommu_to_cpu> ehci_deallocate_qh+0x3f: movq %rax,%r13 ehci_deallocate_qh+0x42: testq %r13,%r13 ehci_deallocate_qh+0x45: je +0x35 <ehci_deallocate_qh+0x7c> ehci_deallocate_qh+0x47: movq 0x160(%r12),%rdi ehci_deallocate_qh+0x4f: movq %r13,%rsi ehci_deallocate_qh+0x52: call +0x3ae00b5 <ddi_get32> ehci_deallocate_qh+0x57: movl %eax,%esi ehci_deallocate_qh+0x59: movq %r12,%rdi ... and the stack shows the call to ddi_get32() at ehci_deallocate_qh+0x57 It seems we're looking for an offset from %rbx to get our %rsi for the first call to ddi_get32() then it's the value of %r13 being moved into %esi where we then fall over. In the source I see we're doing this: 1330 first_dummy_qtd = ehci_qtd_iommu_to_cpu(ehcip, 1331 (Get_QH(old_qh->qh_next_qtd) & EHCI_QH_NEXT_QTD_PTR)); Where Get_QH() decodes as: 870 #define Get_QH(addr) ddi_get32(ehcip->ehci_qh_pool_mem_handle, \ 871 (uint32_t *)&addr) ... so this is where the ddi_get32() comes into play. Alas there are a number of calls to Get_QH() in this ehci_deallocate_qh() function.
Summary
I suspect the failure to connect to the device port is causing some form of bogus/NULL address to appear in a structure here and we're tripping over it.
Change History (5)
comment:1 by , 13 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
comment:2 by , 13 years ago
Hello there,
As I pointed out at the beginning of the bug report, I'm running 4.1.15 *because* I was hitting panic's on my Mac when rebooting my guest virtual machine. Since I put that on I haven't had a panic under MacOS 10.6.8.
I would have installed guest extensions 4.1.15 but they do not appear to be available on your download web page. If this is available I'll happily try installing that and seeing if the panic's in the guest OS (Solaris 11) are reproducible.
I know 4.1.15 vbox binaries and the 4.1.14 extension pack are not a good/supported option and I clearly mentioned this at the beginning being open and honest. You can see I performed a fair amount of work looking at the Solaris 11 dump, so I have to say I'm somewhat disappointed this has been closed without anyone actually discussing the matter with me.
If there is a problem in this code area come the next version then potentially you'll have a bunch of Solaris 11 boxes panicing, which I'd like to help prevent. Of course it's highly likely that the version number mismatch is the problem, so if I could install that and the problem goes away then hey we know that was the root cause. It's a pretty quick test and has very little impact on you guys if this is the fix.
On the other hand if I demonstrate the problem again then maybe there is something that needs investigating?
Just trying to work with you guys, see if someone can meet me half way here.
Thanks,
Jason
comment:3 by , 13 years ago
Resolution: | invalid |
---|---|
Status: | closed → reopened |
Could you try the 4.1.15 Mac OS X test build together with this Extension Pack?
comment:4 by , 13 years ago
Hi Frank,
I tried the 4.1.15 extension pack and had mixed results. The virtual machine did appear to hang with a USB stick however X wouldn't start. Technically the gdm service was running and Xorg was running but I had no login box I could see, the display had an odd red/yellow flashing output in the miniature display in the main Virtualbox window.
I had to revert back to 4.1.14 to see if that problem went away and I could get X back - I could. So I'll try upgrading to 4.1.15 (binaries and extension pack) again to see if the problem is reproducible.
Thanks,
Jason.
comment:5 by , 8 years ago
Resolution: | → obsolete |
---|---|
Status: | reopened → closed |
Please reopen if still relevant with a recent VirtualBox release.
Sorry, we do not accept bug reports against mismatched VirtualBox <-> Extension Pack versions. Unless you're running one of the Mountain Lion DPs, there is no point in running a 4.1.15 test build. In fact you should not use a test build unless specifically directed to do so; if you do, expect trouble (and no help).
Please go back to the official release (4.1.14) and the accompanying extension pack. If the error persists, then open a ticket against that combination.