Opened 14 years ago
Closed 13 years ago
#8363 closed defect (fixed)
Vboxmanage snapshot restorecurrent corrupts media registry
Reported by: | Daniel | Owned by: | |
---|---|---|---|
Component: | virtual disk | Version: | VirtualBox 4.0.4 |
Keywords: | Cc: | ||
Guest type: | Windows | Host type: | Windows |
Description
Occasionally, vboxmanage snapshot restorecurrent results in a broken media registry chain: VirtualBox does not register the new disk UUID in the media registry and instead keeps the old UUID (that is, the one from before the restorecurrent command). This results in an unusable machine configuration.
Disk configuration is MultiAttach disk -> Normal disk -> Snapshot disk.
The usage pattern is this:
system("$vbox_exe\vboxmanage.exe controlvm $vm_name poweroff");
wait_svc();
system("$vbox_exe\vboxmanage.exe snapshot $vm_name restorecurrent");
wait_svc();
wait_svc() waits for VBoxSvc.exe to exit. If this pacing mechanism is not used, corruption occurs. This was not happening in 4.0.2; however, there were 3.x versions of VirtualBox where the bug appeared intermittently. The pacing prevents the corruption when only one VirtualBox machine is running.
Where it mostly happens now is concurrent startvm/restorecurrent commands from 2 or more machines at the same time, as we can no longer check for VBoxSvc.exe exiting, because it will always be running when at least one virtual machine is running.
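The pacing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `run` and `svc_is_running` are hypothetical callables supplied by the caller (for example, wrappers around `subprocess.call` and a process-list check), and the timeout values are arbitrary. Only the two vboxmanage subcommands come from the report.

```python
import time


def wait_until_gone(svc_is_running, timeout=60.0, interval=0.5, sleep=time.sleep):
    """Poll until svc_is_running() returns False; return False on timeout."""
    waited = 0.0
    while svc_is_running():
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True


def poweroff_and_restore(vm_name, run, svc_is_running, vboxmanage="vboxmanage"):
    """Power off a VM, then restore its current snapshot, pacing each step.

    `run` receives an argv list; `svc_is_running` reports whether the
    VBoxSVC process is still alive. Both are assumptions of this sketch,
    not VirtualBox APIs.
    """
    steps = (["controlvm", vm_name, "poweroff"],
             ["snapshot", vm_name, "restorecurrent"])
    for args in steps:
        run([vboxmanage] + args)
        if not wait_until_gone(svc_is_running):
            raise TimeoutError("VBoxSVC did not exit after: " + " ".join(args))
```

As the report notes, this kind of pacing only helps when a single machine is in use, because VBoxSVC stays alive for as long as any virtual machine is running.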
Attachments (17)
Change History (97)
comment:1 by , 14 years ago
comment:2 by , 14 years ago
In the powered-off state, a RestoreCurrent + StartVM combination has the same effect :(
comment:3 by , 14 years ago
I recently upgraded 2 independent instances from 3.x to 4.0.4, and am seeing this issue regularly now, though still not every run. It hit both instances within 24 hours of upgrading. Both hosted in Windows 7 Pro 64-bit, with a 32-bit Windows XP guest created previously under VirtualBox 3.x.
I am only running one VM at a time, so concurrent access to Svc isn't the trigger.
Steps to replicate (intermittently):
- Open VirtualBox Manager.
- Restore most recent snapshot. Note that:
- In the C:\VirtualBox\Machines\Whistler\Snapshots directory, the previous transient VDI is deleted and replaced with a new empty VDI with fresh GUID.
- In C:\VirtualBox\Machines\Whistler\Whistler.xml , the previous transient-VDI GUID has been replaced with the same fresh GUID.
- In C:\Users\Dewi\.VirtualBox\VirtualBox.xml , the same replacement should have taken place, but has NOT.
- Open the Virtual Media Manager. Note that the same fresh GUID is correctly in place despite the lack of change in the persistent store VirtualBox.xml.
- Close both windows, and allow some time for shutdown.
- Open VirtualBox Manager. Note that:
- The guest state is now indicated as "Inaccessible", due to a VBOX_E_OBJECT_NOT_FOUND condition.
- The user is prompted to check a problem in the Virtual Media Manager. Now it reflects the state from VirtualBox.xml; it knows about the old erased VDI instead of the new fresh one.
Workarounds:
I have been able to work around this issue by editing VirtualBox.xml manually to compensate for the omitted update.
In cases where I deleted the non-working empty VDI, I have been able to edit Whistler.xml (the guest configuration) to list the parent snapshot VDI in the final/current (non-snapshot) state's <AttachedDevice type="HardDisk"> node. Note the danger here - if you run the machine like this your snapshot will start to drift! - so instead immediately restore the snapshot, which spawns a fresh child VDI.
comment:4 by , 14 years ago
Sorry about the bad WikiFormatting there. It was my first time posting to Trac.
comment:5 by , 14 years ago
I can confirm this happening with Linux OpenSuse 11.3 64-bit host and Windows 7 64-bit guest also.
comment:6 by , 14 years ago
Confirmed. Happens only for 'older VMs', that is for VMs which were created by VBox 3.2.x or older.
comment:7 by , 14 years ago
I also have the same issue with 4.0.4. For me this consistently happens when I restore the current snapshot while powering off the machine and the VirtualBox Manager window has been closed beforehand. When the VirtualBox Manager window is open I don't have any issues at all.
comment:8 by , 14 years ago
I experience this bug with old and new virtual machines. It also happens with the VirtualBox Manager window open, if I close it soon after powering down the virtual machine and restoring it to a snapshot. If I leave the VirtualBox Manager window open for a while before closing it, the bug does not manifest itself.
comment:9 by , 14 years ago
It is most probably a bug related to Vboxsvc.exe caching the virtualbox.xml file, opening up all sorts of race conditions and thread-unsafe behavior. It appears to surface now and then in various forms.
comment:10 by , 14 years ago
Summary: | Vboxmanage snapshot restorecurrent corrupts media registry → Vboxmanage snapshot restorecurrent corrupts media registry => Fixed in SVN |
---|
This bug will be fixed in the next maintenance release.
comment:11 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
follow-up: 13 comment:12 by , 14 years ago
I tried the release version of Vbox 4.0.6, and intermittently faced this issue again.
What I did was:
- Power on Virtual Machine from VirtualBox Manager.
- Work in the VM
- Click on the close button to close the Virtual Machine
- Ticked the checkbox to restore the current snapshot.
- Click on OK.
After that, the VM is unusable. It shows that the guest is "Inaccessible", due to a VBOX_E_OBJECT_NOT_FOUND condition.
Are you 100% sure this bug is fixed in 4.0.6? Or am I facing another issue? Can anybody else still see this problem in 4.0.6?
Host OS: Win7 x64 Guest OS: Various versions of Windows.
Regards
comment:13 by , 14 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Replying to reureu:
Can anybody else still see this problem in 4.0.6?
I can confirm this problem in 4.0.6. In my case if I close and restart VirtualBox Manager, the machine becomes accessible again. But Virtual Media Manager shows strange additional vdis which make it impossible to delete intermediate snapshots.
Host: OpenSUSE 11.3 64-bit Linux, Guest: Windows 7 64-bit
comment:14 by , 14 years ago
Summary: | Vboxmanage snapshot restorecurrent corrupts media registry => Fixed in SVN → Vboxmanage snapshot restorecurrent corrupts media registry |
---|
No, I'm not sure. We thought we fixed it but apparently didn't. This seems to be a duplicate of #8735 (or the other way around). Unfortunately we are still not able to reproduce this problem. We are looking for a simple test case to reproduce this problem. The test case should contain steps like 'create a VM', 'attach a VM to it', 'start the VM', 'take a snapshot', ...
comment:16 by , 14 years ago
I am experiencing the same issue on an Ubuntu 10.10 64bit host system running VirtualBox 4.0.8 r71778. It seems anytime I restore a snapshot, it causes the VM to become Inaccessible. Once I restart VMM, I get the error "Could not find an open hard disk with UUID". I am running multiple VMs at once usually, however I am only restoring one snapshot at a time. I believe I was experiencing the same on 4.0.6 and I hoped the update to 4.0.8 would resolve the issue.
comment:17 by , 14 years ago
Then we again need a scenario to reproduce this issue. Could you reproduce this problem again and then attach the VirtualBox.xml file as well as the settings file of the specific VM to this ticket? Thank you!
comment:18 by , 14 years ago
I've now had this happen twice to me, first time with 4.0.4 (iirc) and now with 4.0.8. Unfortunately it mostly works, so the issue is not reproducible. I can therefore understand that this issue is not easy to fix.
But what you should clean up is the user friendliness disaster this error produces. Having to hack xml files to revive a VM that appears terminally broken, to say it harshly, sucks. What VirtualBox should do is something like Firefox does when reloading a stored session fails. It should show a page that says something like this:
Oops, something seems to be wrong with this VM. I'm sorry but I cannot find the HD image referenced in the currently active snapshot.
What do you want me to do?
1. Try to find the correct image.
2. Restore a working snapshot.
3. Leave everything alone and let you fix it.
For my recent issue, 1 would have essentially worked since the HD snapshot image was there, just not in the central registry (and the place where to put it should be obvious from the snapshot tree). And 2 should always work with least risk as the error apparently occurs when you are trying to go back to the latest snapshot anyway.
Of course this doesn't fix the bug but would make living with it easier for the users...
by , 14 years ago
Attachment: | VirtualBox.xml added |
---|
follow-up: 21 comment:19 by , 14 years ago
(The just-previously-attached Virtualbox.xml goes with this comment)
I can confirm this still happens with 4.0.8 r71778 (host: Ubuntu 10.10 64-bit, guest: Win 7).
There were some strange corruptions in the VirtualBox Manager GUI app, which might help point to what is wrong (see below).
After the problem appeared, VirtualBox.xml and VirtualBox.xml-prev were -identical-, both timestamped at the time of the error.
Here's what I did:
- Shut down guest normally.
- Created snapshot
- Booted guest in safe mode; installed new guest additions including 3D support. Rebooted.
- Guest booted okay, but an app did not start as expected, so I decided to abandon...
- Powered off guest (without saving state)
- Restore snapshot
Almost instantly got the "inaccessible VM" error but the details box was empty (clicking Refresh had no effect). The NAME of the vm shown in the list on the left panel was missing, replaced with "inaccessible...(something)".
- Closed and re-started the VirtualBox Manager app.
The vm NAME was correct now, with "inaccessible" state, and the error detail box contained:
Could not find an open hard disk with UUID {52f51f24-e783-47e8-b9eb-97e6b4ac4ef9}.
Result Code: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)
Component: VirtualBox
Interface: IVirtualBox {d2de270c-1d4b-4c9e-843f-bbb9b47269ff}
(VirtualBox.xml was attached previously. It is the same as VirtualBox.xml-prev.)
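When comparing such an error against the settings files, it can help to pull the missing UUID out of the message programmatically. A minimal Python sketch, assuming the message text has the shape quoted above; the function name is hypothetical and not part of any VirtualBox tool:

```python
import re

# Matches a brace-wrapped UUID such as {52f51f24-e783-47e8-b9eb-97e6b4ac4ef9}.
UUID_RE = re.compile(
    r"\{([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\}"
)


def missing_disk_uuid(message):
    """Return the UUID that follows 'hard disk with UUID', or None if absent."""
    marker = "hard disk with UUID"
    pos = message.find(marker)
    if pos < 0:
        return None
    # Search only after the marker so the interface UUID is not picked up.
    match = UUID_RE.search(message, pos)
    return match.group(1).lower() if match else None
```

The returned value can then be searched for in VirtualBox.xml and in the machine's own settings file to see which of the two still references the disk.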
by , 14 years ago
Attachment: | Windows7_32bit.xml added |
---|
Windows7_32bit.xml (goes with comment 2011-06-06 03:52:04)
comment:20 by , 14 years ago
I noticed that VBox.Log.1 ends with the following error:
00:02:25.229 Changing the VM state from 'DESTROYING' to 'TERMINATED'.
00:02:25.286 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={515e8e8d-f932-4d8e-9f32-79a52aead882} aComponent={Console} aText={The virtual machine is being powered down}, preserve=false
This happened when powering off the guest (without saving state) before attempting to restore the snapshot.
comment:21 by , 14 years ago
For three days I have kept encountering the same issue described by jimav, except my host is Gentoo and my guest is Windows XP; VirtualBox is 4.0.8 r71778. The pattern that causes the issue for me is exactly the one described by jimav, step by step. There’s also a second scenario, but I’m still trying to reproduce it.
This is the latest instance I just ran into.
VirtualBox.xml contains (edited for conciseness):
<HardDisk uuid="{e5cc5a68-…}" location="…/WindowsXP-System.vmdk" type="Normal">
  <HardDisk uuid="{220d3bd4-…}" location="…/Snapshots/{220d3bd4-a098-4788-9429-52f5b92f51ed}.vmdk"/>
</HardDisk>
While the guest’s WindowsXP.vbox (edited too):
<Snapshot uuid="{341ed8d2-…}" name="ok">
  <StorageController type="AHCI" useHostIOCache="false">
    <AttachedDevice><Image uuid="{e5cc5a68-…}"/></AttachedDevice>
  </StorageController>
</Snapshot>
<StorageController type="AHCI" useHostIOCache="false">
  <AttachedDevice><Image uuid="{00bb9640-…}"/></AttachedDevice>
</StorageController>
I edited out an immutable disk image whose child image, being the parent immutable, I couldn’t care less to lose.
The child image {220d3bd4-…} is the one VirtualBox should have deleted when I clicked “Restore snapshot”, I can tell by its size (it’s much bigger than {00bb9640-…}). The child image {00bb9640-…} must therefore be the new one that was prepared to replace {220d3bd4-…} in the snapshot.
So, as I understand it, VirtualBox.xml is wrong, and the .vbox file is correct.
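This kind of mismatch can be checked mechanically. The following is a minimal Python sketch, assuming the two files use the element and attribute names shown in the excerpts above (`HardDisk`/`uuid` in the registry, `Image`/`uuid` in the machine file). The sample XML, its wrapper elements, and its shortened UUIDs are made up for illustration; real settings files carry an XML namespace, which the helper strips.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix of the form '{ns}name'."""
    return tag.rsplit("}", 1)[-1]


def uuids(xml_text, element, attribute="uuid"):
    """Collect the given attribute from every element with that local name."""
    root = ET.fromstring(xml_text)
    return {el.get(attribute) for el in root.iter()
            if _local(el.tag) == element and el.get(attribute)}


def unregistered_images(registry_xml, machine_xml):
    """UUIDs attached in the machine file but absent from the media registry."""
    return uuids(machine_xml, "Image") - uuids(registry_xml, "HardDisk")


# Made-up sample data mirroring the shape of the excerpts above.
REGISTRY = """<MediaRegistry><HardDisks>
  <HardDisk uuid="{aaaa}" location="base.vmdk" type="Normal">
    <HardDisk uuid="{bbbb}" location="Snapshots/{bbbb}.vmdk"/>
  </HardDisk>
</HardDisks></MediaRegistry>"""
MACHINE = """<Machine>
  <Snapshot uuid="{1111}" name="ok">
    <AttachedDevice><Image uuid="{aaaa}"/></AttachedDevice>
  </Snapshot>
  <AttachedDevice><Image uuid="{cccc}"/></AttachedDevice>
</Machine>"""

print(unregistered_images(REGISTRY, MACHINE))  # prints {'{cccc}'}
```

In the sample, `{cccc}` plays the role of the fresh child image that the machine file references but the registry never learned about.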
What’s interesting in the log I’m about to attach, is that it looks like someone (VBoxSVC or VirtualBox?) encountered several E_ACCESSDENIED errors while saving the .vbox file (from the considerations above), while I’m sure I have all necessary access to all the files and folders involved.
by , 14 years ago
Attachment: | VBox-raffaellod.log added |
---|
Log of the session that led to the situation described in comment 21
follow-up: 23 comment:22 by , 14 years ago
Still cannot reproduce. raffaellod and jimav, is it possible that the VBoxSVC daemon crashes at some time? Can you reproduce this screwup every time? If so, could you make an experiment?
- Start the VBoxSVC daemon in a separate terminal window. The VBox GUI will connect to this daemon, otherwise it would start a daemon itself.
- Repeat the steps you described above
Do you experience a crash of VBoxSVC at any time?
comment:23 by , 14 years ago
Replying to frank:
Still cannot reproduce. raffaellod and jimav, is it possible that the VBoxSVC daemon crashes at some time? Can you reproduce this screwup every time? If so, could you make an experiment?
- Start the VBoxSVC daemon in a separate terminal window. The VBox GUI will connect to this daemon, otherwise it would start a daemon itself.
- Repeat the steps you described above
Do you experience a crash of VBoxSVC at any time?
Bingo.
$ /usr/lib/virtualbox/VBoxSVC
******************************************************
Oracle VM VirtualBox XPCOM Server Version 4.0.8-Gentoo
(C) 2008-2011 Oracle Corporation
All rights reserved.
Starting event loop....
[press Ctrl-C to quit]
VBoxNetAdpCtl: ioctl failed for /dev/vboxnetctl: Invalid argument
Informational: VirtualBox object created (rc=NS_OK).
Segmentation fault
$
This is how it happened this time, from host’s power on to VBoxSVC segfault:
- start VBoxSVC
- start VirtualBox
- make a new (and only) snapshot for the guest (my guest has one normal, fixed size VMDK and an immutable, dynamic VDI, both on a no-host-cache AHCI controller)
- start the guest
- work a little bit in the guest, at least launch a few programs, maybe browse Windows Update (I don’t know if it’s necessary, but that’s what I did in my case, for a total of some 10 minutes of activity)
- shut down the guest (cleanly, using shutdown command from within guest)
- merge snapshot, create new one immediately (total snapshots: 1)
- start the guest, but abort it (off by VirtualBox command) while it’s still booting, without restoring the snapshot
- restore the snapshot
- segfault.
One thing I should mention, is that I believe every single time this has happened (that is, at least eight times for me, in the last four days), the guest was not shut down cleanly: either I interrupted it by powering off the virtual machine, or the guest went BSoD (the second scenario I mentioned above).
Also, I can’t seem to reproduce the segfault by just creating a snapshot, starting the guest, aborting it, and restoring the snapshot. There’s something about either previous guest disk activity, or a timing issue with the deletion (merger) and creation of a new snapshot, or the guest going BSoD/being uncleanly shutdown (disk cache?).
All my disk images are on an ext4 filesystem, which (if you didn’t make it work yet, I haven’t checked) means that the virtual machine is actually using the host cache, in spite of my turning it off.
Please tell me what else I can do to help nailing this nasty segfault.
follow-up: 28 comment:24 by , 14 years ago
raffaellod, tried your steps but still unable to reproduce. I've built a similar setup here (1 flat VMDK, normal plus 1 VDI, immutable attached to an AHCI controller) but still no luck in reproducing the segfault.
Re ext4: This bug was only relevant for older kernels (< 2.6.36). As you are using 2.6.38, your VMs are not affected.
Btw, the message VBoxNetAdpCtl sounds interesting but this could be specific to the Gentoo build.
Do you have the debugging symbols available for your VBox binaries? If so then it would help if you would start VBoxSVC with gdb attached (gdb /usr/lib/virtualbox/VBoxSVC) and try to reproduce the segfault again. When the segfault occurs, you should see the gdb prompt. Now enter the gdb command bt and you should get some information about the stack. The whole gdb output would be interesting, please attach it to this ticket. Thanks!
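For anyone who prefers to capture the backtrace unattended, the same gdb session can be driven non-interactively. This is a rough Python sketch, not part of the original instructions: it relies on gdb's `-batch`, `-ex`, and `--args` options, and the helper names and log path are hypothetical.

```python
import subprocess


def gdb_backtrace_argv(binary):
    """argv that runs `binary` under gdb and prints a backtrace once it stops."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", binary]


def run_under_gdb(binary, log_path):
    """Run the binary under gdb, writing all gdb output to log_path."""
    with open(log_path, "w") as log:
        return subprocess.call(gdb_backtrace_argv(binary),
                               stdout=log, stderr=subprocess.STDOUT)
```

For example, `run_under_gdb("/usr/lib/virtualbox/VBoxSVC", "vboxsvc-gdb.log")` would leave the whole gdb output in a file that can be attached to the ticket. If the process exits normally instead of crashing, the `bt` step simply reports that there is no stack.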
comment:25 by , 14 years ago
I too saw VBoxSVC segfault at start-up, but only once.
Also got the following while running a stress-test script:
+ VBoxManage startvm Windows7_32bit
Waiting for the VM to power on...
VBoxManage: error: An unexpected process (PID=0x000012FA) has tried to lock the machine 'Windows7_32bit', while only the process started by LaunchVMProcess (PID=0x00001313) is allowed
VBoxManage: error: Details: code E_ACCESSDENIED (0x80070005), component Machine, interface IMachine, callee nsISupports
Context: "LockMachine(a->session, LockType_Shared)" at line 84 of file VBoxManageControlVM.cpp
This happened twice, not reproducibly. Meanwhile VBoxSVC (run manually) showed no problems.
An instance of the VirtualBox gui was open each time.
The stress-test script (attached below) ran all night without problems at first; then I added some background processes to load the cpu and use up memory, and then errors started happening. That's consistent with a race condition somewhere.
by , 14 years ago
Attachment: | VBOXtest.typescript added |
---|
Another error running the stress-test script attached above
comment:26 by , 14 years ago
Oops, the locking error was my fault because the script tried to poweroff the VM while it was being started by another process (tho the message could be more friendly...)
However, there may be a real bug here because subsequent commands failed the same way for about a minute later, and the VBoxSVC process (running under gdb) said
VBoxNetAdpCtl: ioctl failed for /dev/vboxnetctl: Invalid argument
around that time.
by , 14 years ago
Attachment: | VBOXtest.bash added |
---|
stress-test script (revised to not deliberately create conflicting VBoxManage processes)
comment:27 by , 14 years ago
Okay, I encountered the bug once more, this time with (some) debugging information (whatever “configure --build-debug” did). It took me a lot of effort to reproduce it this time, not sure why.
As you’ll see, there are multiple issues in the several files I’m attaching:
- in order to start any virtual machine, I had to comment out the Assert() at /src/recompiler/VBoxRecompiler.c:264, in REMR3Init();
- then, I got a kernel oops;
- when I finally managed to trigger the segfault, the backtrace was missing file names and line numbers; only the source statement was provided.
I only rebuilt the userspace parts of VirtualBox with --build-debug; I didn’t recompile the kernel modules. Maybe this could be the cause of the kernel oops?
I’m sorry I’m really too tired right now to analyze the situation any further, for now I’ll just submit the logs.
by , 14 years ago
Attachment: | dmesg-raffaellod.log added |
---|
dmesg of VBoxSVC gdb session, includes aforementioned kernel oops
by , 14 years ago
Attachment: | gdb-raffaellod-AssertFailed.log added |
---|
Excerpt of gdb session interrupted because of failing Assert()
by , 14 years ago
Attachment: | VirtualBox-raffaellod-AssertFailed.log added |
---|
VirtualBox session interrupted by failing Assert()
by , 14 years ago
Attachment: | VirtualBox-raffaellod-segfault.log added |
---|
VirtualBox session leading to VBoxSVC segfault
comment:28 by , 14 years ago
Replying to frank:
Btw, the message VBoxNetAdpCtl sounds interesting but this could be specific to the Gentoo build.
Possibly. In any case, vboxnet0 works fine, so it shouldn’t be anything serious.
The message “WARNING: failed to send RELEASE event” (from /src/libs/xpcom18a4/ipc/ipcd/extensions/dconnect/src/ipcDConnectService.cpp:2310) seems to appear sporadically, does that sound like something new?
comment:29 by , 14 years ago
Interrupting (with Control-C) VBoxManage while restoring a snapshot can cause the VBoxSVC process to segfault. While that is clearly a bug (nothing should segfault), it might not be off-topic to this bug. On the other hand, maybe not...
Here is a backtrace. I was running the OSE debug version packaged with Ubuntu 10.10:
[New Thread 0x7ffff00e6700 (LWP 24806)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff00e6700 (LWP 24806)]
0x00000000004b2977 in std::list<ComObjPtr<MediumAttachment>, std::allocator<ComObjPtr<MediumAttachment> > >::remove(ComObjPtr<MediumAttachment> const&) ()
(gdb) bt
#0 0x00000000004b2977 in std::list<ComObjPtr<MediumAttachment>, std::allocator<ComObjPtr<MediumAttachment> > >::remove(ComObjPtr<MediumAttachment> const&) ()
#1 0x00000000004b02d4 in SessionMachine::restoreSnapshotHandler (this=0x7fffe41137e0, aTask=<value optimized out>) at /build/buildd/virtualbox-ose-3.2.8-dfsg/src/VBox/Main/SnapshotImpl.cpp:1889
#2 0x00000000004a6ed7 in SessionMachine::taskHandler (pvUser=0x8279e0) at /build/buildd/virtualbox-ose-3.2.8-dfsg/src/VBox/Main/SnapshotImpl.cpp:1216
#3 0x00007ffff76fe4fc in rtThreadMain (pThread=0x904220, NativeThread=<value optimized out>, pszThreadName=<value optimized out>) at /build/buildd/virtualbox-ose-3.2.8-dfsg/src/VBox/Runtime/common/misc/thread.cpp:679
#4 0x00007ffff773e212 in rtThreadNativeMain (pvArgs=<value optimized out>) at /build/buildd/virtualbox-ose-3.2.8-dfsg/src/VBox/Runtime/r3/posix/thread-posix.cpp:227
#5 0x00007ffff7997971 in start_thread () from /lib/libpthread.so.0
#6 0x00007ffff6bdc92d in clone () from /lib/libc.so.6
#7 0x0000000000000000 in ?? ()
follow-up: 33 comment:30 by , 14 years ago
Unfortunately that 3.2.8 backtrace does not help here as the code is very different from the current 4.0.8 code.
follow-up: 35 comment:31 by , 14 years ago
raffaellod, I assume you compile VirtualBox yourself. You build the package and install the resulting .run package, correct? The correct way of debugging such issues is a bit different:
- Don't do a debug build. A debug build means that all assertions are enabled and even assertions not relevant for the current problem can trigger. Also debug builds run notably slower than normal builds due to the assertions and because all compiler optimizations are disabled.
- Better do ./configure --disable-hardening and rebuild the whole stuff.
- Don't install the .run package but run VirtualBox from the out/..../bin directory directly. This way all binaries have the debugging symbols included and gdb can display a lot more information.
- So do gdb ./VBoxSVC directly from there, and in another terminal do the VM work (start the VM, restore the snapshot, and so on).
- Try to reproduce the VBoxSVC crash. When in gdb, the 'bt' command should show a much better backtrace.
Sorry that I didn't describe this earlier.
comment:32 by , 14 years ago
And yes, the --build-debug was also responsible for the other problems you observed.
comment:33 by , 14 years ago
Replying to frank:
Unfortunately that 3.2.8 backtrace does not help here as the code is very different from the current 4.0.8 code.
Can you supply a 4.0.8 non-OSE build with debug symbols (suitable for ubuntu 64bit hosts)? I'm happy to keep testing but I'm not up to compiling from source.
comment:34 by , 14 years ago
jimav, here are the debug symbols for the 4.0.8 Ubuntu Maverick package. Just install this package and you will get a backtrace with usable debug symbols. If you need a different package, please tell me. Thank you!
comment:35 by , 14 years ago
Replying to frank:
raffaellod, I assume you compile VirtualBox yourself. You build the package and install the resulting .run package, correct?
No, not really. I use Gentoo’s Portage system, which builds from source and then installs what’s in out/release/bin; to debug (in the way that did not give a useful backtrace) I modified the ebuild so that it would add --build-debug and would then install the files from out/debug/bin. I see that that wasn’t very useful, but I had no way of knowing.
This way all binaries have the debugging symbols included and gdb can display a lot more information.
Well, I’ll just revert to the stock Gentoo ebuild, add --disable-hardening to its ./configure call, and tell Portage to keep the sources instead of cleaning up at the end of the build - when I have some time again, possibly tomorrow (your today).
comment:36 by , 14 years ago
I rebuilt VirtualBox once more, this time with -O1 -ggdb, and without --build-debug (I forgot to add --disable-hardening, but I still got a nice backtrace).
This time I also figured out a more precise test case, which so far gives me a very high chance of a segfault:
- take one and only snapshot
- start the guest
- do some network activity (in my case, I always download some 60MB of Windows Update stuff)
- shutdown the guest cleanly
- take one more snapshot (two total)
- start the guest, but abort it after the second resolution change (a couple of seconds after the Windows boot logo, progress bar and text “emerge” fade in from the black background), without restoring to a snapshot
- restore to the second (latest) snapshot
- segfault
If you think it’s relevant, you can try the same network configuration as my guest: nic0 is NAT, nic1 is vboxnet0 with static IP address. They’re not bridged.
by , 14 years ago
Attachment: | gdb-raffaellod-segfault.log added |
---|
gdb session with segfault and backtrace with debug symbols
comment:37 by , 14 years ago
Many thanks raffaellod for your investigation. We are getting closer to the problem but still ... We continue to try to reproduce that issue (still no luck :-(). If you have some more time, could you use valgrind instead of gdb to start the VBoxSVC server? The valgrind tool would show any invalid access which might even happen before the process would segfault.
follow-up: 39 comment:38 by , 14 years ago
Oh, raffaellod, could you again attach your current VirtualBox.xml file together with the .xml file of your VM?
by , 14 years ago
Attachment: | VirtualBox-raffaellod.xml added |
---|
VirtualBox settings (in corrupted state, after segfault)
comment:39 by , 14 years ago
Replying to frank:
Oh, raffaellod, could you again attach your current VirtualBox.xml file together with the .xml file of your VM?
Done. Right now, they’re both in the post-segfault state, I haven’t had time to clean them up. It may be even more interesting for you, though.
I’ll try with valgrind as soon as I have some time. I’ve never used it, though, so if you could give me a few hints on how to do what you’re asking, that would be very helpful. And, do I need to change debug info format generated by gcc?
comment:40 by , 14 years ago
No change is necessary, just replace gdb by valgrind. The debug information is already sufficient; valgrind will use the same information. By default, valgrind works as a memory checker. It will detect invalid memory accesses and accesses to uninitialized memory. When it finds something, it will print a nice backtrace.
follow-up: 42 comment:41 by , 14 years ago
Frank, I ran /usr/lib/virtualbox/VBoxSVC under valgrind, but clients won't connect to it for some reason. It starts up and says "Starting event loop....", but when VBoxManage or the VirtualBox gui is started later, another VBoxSVC process appears (with --auto-shutdown) to service the requests and the one running with valgrind remains silent. Any ideas?
comment:42 by , 14 years ago
Replying to jimav:
Frank, I ran /usr/lib/virtualbox/VBoxSVC under valgrind, but clients won't connect to it for some reason. It starts up and says "Starting event loop....", but when VBoxManage or the VirtualBox gui is started later, another VBoxSVC process appears (with --auto-shutdown) to service the requests and the one running with valgrind remains silent.
Same here. VirtualBox and VBoxManage start their own instance of VBoxSVC, I guess due to valgrind not actually spawning a process named VBoxSVC like gdb does. Is there a way to disable the check, and have them assume that a VBoxSVC is already running?
comment:43 by , 14 years ago
Here's a hard-reproducible snapshot corruption. VBoxSVC under gdb seems fine (no segfault). The sequence is as follows:
- take Backup
- take SNAP1
- take SNAP2
- restore SNAP2
- restore SNAP1
- delete SNAP1 (*fails*)
The delete fails every time saying a snapshot has "more than one child hard disk".
I will attach an archive containing a script to reproduce the bug, and before & after config files from my system.
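The attached script is not reproduced here. Purely as a rough sketch, the sequence listed above could be driven from Python like this. It assumes the `take`, `restore`, and `delete` subcommands of `VBoxManage snapshot`; the VM name and the injectable `run` callable are placeholders of this sketch.

```python
import subprocess

# The sequence from the list above, as (subcommand, snapshot name) pairs.
STEPS = [
    ("take", "Backup"),
    ("take", "SNAP1"),
    ("take", "SNAP2"),
    ("restore", "SNAP2"),
    ("restore", "SNAP1"),
    ("delete", "SNAP1"),  # the step reported to fail
]


def snapshot_commands(vm_name, vboxmanage="VBoxManage"):
    """Build one argv list per step for the given VM."""
    return [[vboxmanage, "snapshot", vm_name, action, name]
            for action, name in STEPS]


def reproduce(vm_name, run=subprocess.call):
    """Run each step in order; return the first argv that exits non-zero."""
    for argv in snapshot_commands(vm_name):
        if run(argv) != 0:
            return argv
    return None
```

Calling `reproduce("SomeVM")` against a scratch VM would return the failing command, which per the report is expected to be the final `delete`.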
by , 14 years ago
Attachment: | VBoxTest_SnapCorruption1.tar added |
---|
tar archive with before & after config files and script to reproduce bug
follow-up: 45 comment:44 by , 14 years ago
jimav, raffaellod, when using valgrind to start VBoxSVC please wait a bit longer until the service is fully started. valgrind makes the process run much slower because of all the sanity checks. Do not start the GUI or VBoxManage before you see the message [press Ctrl-C to quit]. Here it takes about 5 seconds but on slower systems this might take longer.
follow-up: 46 comment:45 by , 14 years ago
Replying to frank:
jimav, raffaellod, when using valgrind to start VBoxSVC please wait a bit longer until the service is fully started. valgrind makes the process run much slower because of all the sanity checks. Do not start the GUI or VBoxManage before you see the message [press Ctrl-C to quit]. Here it takes about 5 seconds but on slower systems this might take longer.
That’s not it. I gave it a full minute, on my Core 2 Duo T8100 (2.1 GHz), and anyway the Ctrl+C message appears within seconds. In spite of this, VirtualBox still starts its own VBoxSVC.
comment:46 by , 14 years ago
Replying to raffaellod:
In spite of this, VirtualBox still starts its own VBoxSVC.
Just to be safe, I quit every VirtualBox instance, waited for any auto-started VBoxSVC to terminate, waited 5 full minutes, ran VirtualBox again, and this still happens.
follow-up: 49 comment:47 by , 14 years ago
Strange. Are you sure that the same user who did valgrind VBoxSVC starts VirtualBox? Wild guess: Do you have VirtualBox installed while you are trying to start a self-compiled VBoxSVC?
comment:48 by , 14 years ago
In my case, yes I'm sure. I was just running the pre-compiled programs from the regular ubuntu package (virtualbox-4.0 + virtualbox-4.0-dbg for symbols). Running the exact same executable (/usr/lib/virtualbox/VBoxSVC) under gdb works but under valgrind does not work. That is, VBoxManage will connect to the process under gdb but not under valgrind.
follow-up: 50 comment:49 by , 14 years ago
Replying to frank:
Strange. Are you sure that the same user who did valgrind VBoxSVC starts VirtualBox?
Yes. All VirtualBox-related processes (including valgrind) are started by my username.
Wild guess: Do you have VirtualBox installed while you are trying to start a self-compiled VBoxSVC?
Nope. I use Gentoo’s package (ebuild), although strictly speaking I guess you can consider it self-compiled (as in “compiled on my computer, on my command”). But to address your question, no, all VirtualBox-related files are from the same installation.
I even tried deleting /tmp/.vbox-$USERNAME-ipc after shutting down any VBoxSVC, then starting valgrind: I still get exactly one /tmp/.vbox* directory, but an extra VBoxSVC (other than the one running in valgrind) and two VBoxXPCOMIPCD processes. And the VBoxSVC in valgrind is completely ignored by everything else.
comment:50 by , 14 years ago
Replying to raffaellod:
[…] then starting valgrind […]
I meant: starting valgrind, waiting a minute or so, and then starting VirtualBox.
Also, I just checked the timestamps of the /tmp/.vbox-$USERNAME-ipc/ipcd socket and the lock file in the same directory: they both get overwritten when the second VBoxSVC is started.
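A timestamp check like this can be scripted so that before and after states are easy to compare. A minimal Python sketch, assuming only that the IPC directory (here /tmp/.vbox-$USERNAME-ipc) contains ordinary directory entries; the function names are hypothetical:

```python
import os


def ipc_timestamps(directory):
    """Map each entry in the directory to its modification time (epoch seconds)."""
    return {name: os.lstat(os.path.join(directory, name)).st_mtime
            for name in sorted(os.listdir(directory))}


def changed_entries(before, after):
    """Names whose mtime differs between two snapshots, or that newly appeared."""
    return sorted(name for name, mtime in after.items()
                  if before.get(name) != mtime)
```

Taking one snapshot after starting VBoxSVC under valgrind and another after launching VirtualBox, then passing both to `changed_entries`, would list the entries that the second VBoxSVC overwrote.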
follow-up: 54 comment:51 by , 14 years ago
Are you guys both using the Gentoo package? I'm starting to think that this package is somehow screwed up ...
comment:52 by , 14 years ago
Nope, same here when using the installed package. Actually starting VBoxSVC with valgrind works only if you compile VBox yourself as non-hardened as I described in comment 31 and start VBoxSVC and VirtualBox from there. Sorry for not mentioning that earlier. Until now I was not aware of that valgrind problem myself.
comment:53 by , 14 years ago
Was the snapshot corruption I mentioned in comment 2011-06-11 03:23:23 relevant (and reproducible by you)?
comment:54 by , 14 years ago
Replying to frank:
Are you guys both using the Gentoo package? I'm starting to think that this package is somehow screwed up ...
Nope, jimav is on Ubuntu, as he stated several comments ago.
I, on the other hand, was able to recompile with --disable-hardening, just by changing the call to ./configure in the package’s ebuild, then performing a regular install. This still didn’t fix the issue with VBoxSVC, so I just removed the SetUID bit from /usr/lib/virtualbox/VirtualBox, and sudo’ed both valgrind and VirtualBox itself. Yes, this is the only way I could get it to work. Don’t ask me, I have no idea why, I just tried with/without sudo, and with/without SetUID on VirtualBox.
So, anyway, I reproduced the segfault once more, this time in valgrind. I succeeded on the first attempt, using the latest pattern I described here.
by , 14 years ago
Attachment: | valgrind-raffaellod-segfault.log added |
---|
valgrind VBoxSVC session with segfault and backtrace
comment:55 by , 14 years ago
A few words on the valgrind log: the first error reported, «Syscall param semctl(arg) points to uninitialised byte(s)», is generated when VirtualBox is started, and happens every time, even when VBoxSVC doesn’t segfault.
The error «Invalid read of size 4», on the other hand, is generated when I restore the snapshot, that is, when the segfault is triggered.
So, it looks like nothing happens before the segfault. Since this doesn’t seem much more helpful than gdb’s output, I’m now expecting new instructions to provide you with more info. My computer is pretty decently powered, so don’t be afraid to ask for some heavier valgrind diagnostic option.
follow-up: 57 comment:56 by , 14 years ago
jimav, I could reproduce the error you describe in comment 2011-06-11 03:23:23 but without any snapshot corruption. Deleting the first snapshot is just not possible. I think this is a current limitation of VirtualBox.
raffaellod, thanks for the additional debugging attempt. So at least there was no prior screwup; the STL operation just fails and causes a crash. But I fear this is still not sufficient. And I cannot reproduce this problem (I tried your steps from comment 36 several times). What would perhaps help would be a core dump of VBoxSVC from the 4.0.8 standard package from our website. Preferably on some standard Linux distribution (Ubuntu, Fedora, openSUSE) but I could even try with Gentoo (if you would use the .run package).
I would really like to save you from all this effort but unfortunately I cannot reproduce the problem :-(
comment:57 by , 14 years ago
Replying to frank:
What would perhaps help would be a core dump of VBoxSVC from the 4.0.8 standard package from our website. Preferably on some standard Linux distribution (Ubuntu, Fedora, openSUSE) but I could even try with Gentoo (if you would use the .run package).
Maybe you should stop considering Gentoo as some weird (“non-standard”) distribution… I just spent two hours trying to prove your point, and instead obtained the same segfault using the binaries from the .run package. That result is not really surprising, considering that the Gentoo ebuild applies no source code patches (it’s all configuration and toolchain compatibility mumbo-jumbo).
I’m not yet considering setting up a different Linux distribution, since I don’t have a spare computer available right now, so that would require playing on my daily-use computer.
What I did was:
- uninstall the VirtualBox userland package (emerge -C app-emulation/virtualbox), or just move /usr/lib/virtualbox (which contains all the binaries) somewhere else
- launch VirtualBox-4.0.8-71778-Linux_x86.run --noexec --target /usr/lib/virtualbox
- cd /usr/lib/virtualbox
- tar -xjf VirtualBox.tar.bz2
- create, in the same folder, this small launcher script (VBox-tmpenv), to make sure that the correct binaries will be picked up:
    #!/bin/sh
    cd /usr/lib/virtualbox
    export PATH=${PWD}:${PATH}
    export LD_LIBRARY_PATH=${PWD}:${LD_LIBRARY_PATH}
    export VBOX_APP_HOME=${PWD}
    [ ${#} -gt 0 ] && exec "${@}"
- launch VBoxSVC with: sudo ./VBox-tmpenv valgrind ./VBoxSVC
- in another terminal, launch VirtualBox with: sudo ./VBox-tmpenv ./VirtualBox
- then, the “usual” segfault steps:
- create snapshot #1
- start the guest
- do some network and disk activity (I copied a 1.5 GiB file from the LAN to the guest’s non-immutable disk)
- shut down the guest regularly
- create snapshot #2
- start the guest, but abort it early (e.g. soon after the Windows boot logo is displayed), without restoring to a snapshot
- restore snapshot #2
- segfault
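For anyone who would rather script the “usual” steps above than click through the GUI, here is a rough Python sketch that only builds the VBoxManage command lines. The VM name and the snapshot names (snap1, snap2) are placeholders, the acpipowerbutton and poweroff calls are only approximations of “shut down regularly” and “abort early”, and the waits and the file-copy step inside the guest are left to the caller. It has not been verified to trigger the segfault.

```python
from typing import List


def repro_commands(vm: str, vboxmanage: str = "VBoxManage") -> List[List[str]]:
    """Return VBoxManage invocations approximating the GUI steps, in order.

    The snapshot names "snap1" and "snap2" are placeholders chosen for
    this sketch; nothing here is executed.
    """
    return [
        [vboxmanage, "snapshot", vm, "take", "snap1"],      # create snapshot #1
        [vboxmanage, "startvm", vm],                        # start the guest
        # (generate network and disk activity inside the guest here)
        [vboxmanage, "controlvm", vm, "acpipowerbutton"],   # regular shutdown
        [vboxmanage, "snapshot", vm, "take", "snap2"],      # create snapshot #2
        [vboxmanage, "startvm", vm],                        # start again ...
        [vboxmanage, "controlvm", vm, "poweroff"],          # ... and abort early
        [vboxmanage, "snapshot", vm, "restore", "snap2"],   # restore snapshot #2
    ]


if __name__ == "__main__":
    for cmd in repro_commands("TestVM"):
        print(" ".join(cmd))
```

Each entry can be passed to subprocess.run once suitable pauses have been added between the steps; the guest needs time to boot and to shut down before the next command makes sense.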
Now, where do I find the debug symbols for the .run file? I tried, by analogy with Ubuntu “Maverick” package » Ubuntu “Maverick” symbols, to download http://www.virtualbox.org/download/testcase/VirtualBox-dbg_4.0.8-71778-Linux_x86.run, but I got a 404 for all the variations thereof I tried, so I quit trying to guess. Can you please point me to the debug symbols for the “generic Linux” i386 package?
And then, how do I get a core dump?
P.S.: steps 6 and 7 include my (unsafe) fix for the inability to use valgrind on VBoxSVC without using sudo. This requires having both VirtualBox and the guest set up for the root account (/root/.VirtualBox/VirtualBox.xml, and so on).
P.P.S.: originally I was using /tmp/vbox as a temporary VirtualBox installation folder instead of /usr/lib/virtualbox, but for security reasons I was not allowed to load the .r0 binary from there (error message when starting a guest), so after some research I found out that the whole path of the .r0 has to be 0755/root:root. You guys of course know this, but I’m writing it here for whoever else might need to try what I did today.
follow-up: 59 comment:58 by , 14 years ago
I have this problem about once a week on a SuSE 11.4 64 bit host running VB 4.0.8 and GNOME desktop.
1) Start a test VM.
2) Do some testing.
3) Shutdown the VM.
4) Restore the VM from a previous snapshot.
5) VM becomes unavailable.
I do run with 7 different workspaces with the VB manager open in one workspace, and the various VMs open in other workspaces.
This happens on different VMs, and it happens even when only running a single VM.
This is very frustrating.
In addition, I would like to suggest that there should be a standalone tool that can be used to edit/repair the VirtualBox.xml --- ".vdi nnn not found, would you like to replace it with .vdi mmm?". This file is very critical, and apparently very fragile. Backups do not help since the file was never properly updated.
comment:59 by , 14 years ago
Replying to DHughes:
This is very frustrating.
In addition, I would like to suggest that there should be a standalone tool that can be used to edit/repair the VirtualBox.xml --- ".vdi nnn not found, would you like to replace it with .vdi mmm?". This file is very critical, and apparently very fragile. Backups do not help since the file was never properly updated.
I totally agree with you. The -prev files are useless (what's the point of copying a corrupted file?). I also have this problem, and 4.0.10 does not fix it. Always a pleasure taking hours to recreate the VirtualBox.xml file manually.
Win XP SP3 32 bits. I use the same method to reproduce the problem: stop the machine or crash vbox (easy with D3D). restore the previous snapshot => machine config is dead.
comment:60 by , 14 years ago
Running 4.0.10 on XP from official package, with various Linux / Windows guests.
Have experienced this problem a couple of times after upgrading from 3.x a few weeks ago. I agree it's very frustrating and wastes a lot of time. Also my previous method for fixing the problem didn't work this time, indicating it's not always an easy fix (for the user).
Given that this issue seems to recur in different forms, why not sidestep whatever is causing the segfault and the config and snapshot corruption, and go straight to handling a corrupted configuration by giving users some useful options to resolve it? Even if you fix this specific bug, another one will probably pop up later that causes a similar corruption. I'm pretty sure we'd all prefer access to some usable configuration over the generic error message, especially if a lot of snapshots are already saved.
comment:61 by , 13 years ago
I have experienced this problem with both Windows and Ubuntu hosts and various guest OS, since I upgraded VBox to version 4. I was also running version 3 before. Could it be that most people who face this problem have upgraded from version 3?
comment:62 by , 13 years ago
FWIW, I also upgraded from version 3 to 4. Frank, if you still can't repro the problem and are not using a VM created with v3, it might be worth trying that. Just an idea.
comment:64 by , 13 years ago
Here is a list of versions that I've used. Although I can't swear I directly upgraded between every version on the list, it would have at least been some combination of the below...
- VirtualBox-2.2.2-46594-Win.exe
- VirtualBox-2.2.4-47978-Win.exe
- VirtualBox-3.0.0-49315-Win.exe
- VirtualBox-3.0.8-53140-Win.exe
- VirtualBox-3.0.12-54655-Win.exe
- VirtualBox-3.1.6-59338-Win.exe
- VirtualBox-3.2.10-66523-Win.exe
- VirtualBox-4.0.10-72479-Win.exe
comment:65 by , 13 years ago
As far as re-creating the issue goes, this was a win2k3 guest used as a test environment, with multiple branched snapshots. There was a backlog of jobs, so I wanted to re-configure with more memory so I could run more jobs concurrently without swapping etc. (using 4 cores). Increasing the memory, however, caused the guest to freeze during boot-up. I went back to a working configuration and took a snapshot to be on the safe side so I could experiment. Typically the guest freeze caused the VirtualBox GUI to freeze as well; after terminating the guest manually, the GUI would become responsive again. Obviously this leaves the guest in an aborted state. After numerous changes I wanted to go back to the snapshot so I could start again, and this was where the problem occurred, i.e. restoring the most recent snapshot over an aborted guest. The trick with the guest seemed to be to turn off PAE/NX but leave Nested Paging on; this seemed to give the most reliable results (just in case anyone is interested).
comment:66 by , 13 years ago
Here are some details about what worked for me as a fix.
I had a similar issue with an Ubuntu 10.04.1 guest. This only had a single drive, and I found that copying the next most recent snapshot file over the damaged one got it working (i.e. when sorted in date order the damaged one is at the top and the one I copied over it was the next down). I have since looked inside the snapshot file and can see that the UUID is also in the snapshot, so I'm not sure why this worked, unless for some reason VB decided this was acceptable in this instance.
This did not however work for my w2k3 guest, perhaps because it has 2 drives or perhaps because the logic used to check the snapshots is different for some reason. In this instance I modified the XML configuration file for the guest and manually substituted in the UUIDs of the most recent snapshot instead of the damaged ones (one snapshot file definitely appeared truncated). I'm not entirely sure how the snapshots work, but it does appear that only certain files can be utilised in this manner (presumably if it was any other snapshot bar the most recent one something needs to be reapplied anyway?), partly as they are incremental and perhaps there are some stub files that are used to contain the current state separate from the snapshot (I think that's what was damaged and what I managed to pull back by editing the configuration).
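The manual UUID substitution described above could be sketched roughly as follows in Python. This is a plain text replacement with a backup copy, not a VirtualBox-aware repair: the function name and the ".bak" suffix are made up for this sketch, and deciding which UUID is the damaged one and which is the right replacement is still up to whoever inspects the files.

```python
import shutil
from pathlib import Path


def substitute_uuid(config_path: str, damaged_uuid: str, replacement_uuid: str) -> int:
    """Replace every occurrence of damaged_uuid in the file.

    Returns the number of occurrences replaced. The original file is
    copied to "<name>.bak" (a suffix chosen for this sketch) before
    anything is written.
    """
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(damaged_uuid)
    if count == 0:
        return 0  # nothing to do, leave the file untouched
    # Keep an untouched copy so the edit can be undone by hand.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(text.replace(damaged_uuid, replacement_uuid), encoding="utf-8")
    return count
```

VirtualBox should not be running while the file is edited, since it may rewrite the configuration from its own in-memory state.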
comment:67 by , 13 years ago
Enhancement: Change snapshot restore logic?
It appears that the configuration file is being changed before the snapshot files are written successfully (or the restore is just overwriting the current files as opposed to creating new ones). There must surely be a way to minimise this risk, for example by creating some sort of snapshot history/transaction log or similar that can be scanned in. The obvious suggestion would be: can all the changes be made in such a way that the configuration is only updated, as a single positive step, at the end of the restoration process?
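To illustrate what a “positive update at the end” could look like, here is a minimal Python sketch of the common write-to-temporary-file-then-rename pattern. It says nothing about how VirtualBox actually writes its settings; the function name is hypothetical, and the idea is only that the old configuration stays intact until the complete new one is on disk.

```python
import os
import tempfile


def commit_config(config_path: str, new_contents: str) -> None:
    """Write new_contents to a temp file, then swap it into place.

    If anything fails before the final rename, the existing file at
    config_path is left exactly as it was.
    """
    directory = os.path.dirname(os.path.abspath(config_path))
    # The temp file must live in the same directory so the rename
    # does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, config_path)  # single final step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In a restore operation, a call like this would come last, after all the new disk images have been created successfully.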
comment:68 by , 13 years ago
Enhancement: Recall any available snapshot to restore a broken configuration?
Would it be possible to offer the choice to restore a snapshot over a broken configuration and either ignore or dump the broken/missing files or configuration elements? Along the same lines, I think that having a snapshot browser in the GUI similar to the media manager (or as part of it) with some options to repair / inspect etc. etc. would be very welcome?
As a minimum would it be possible to get some prompt to restore the most recent snapshot as this seems to be what failed?
comment:69 by , 13 years ago
My worry in manually editing the configuration files is that I would totally break my configuration and leave it in a state that could not be fixed without restoring the entire guest, which is many GBs. Ideally I'd like to rely on the snapshot feature for day-to-day backups of non-production machines. Where applicable the licensing information for the guest / apps is also stored in the snapshot, so losing the entire snapshot chain can be a problem. Actually, even having the GUI link to some documentation/URL, or the command line spit out a URL of somewhere to look for further information, would be of use.
comment:70 by , 13 years ago
I can confirm the corruption and VBoxSVC segfault on restore operations with VirtualBox 4.1 OSE.
Host: Gentoo x86_64
Guest: Ubuntu 11.04 x86 <- corruption on restore
CentOS 6 x86_64 <- this is running
comment:71 by , 13 years ago
I found that the location of the disk information has changed. Prior to VB 4 it seems to have been split between the .xml files saved with the guest and the .xml files in the home directory of the id running VB. With VB 4 all of the disk information seems to be in the guest files. Upgrading to VB 4 does not appear to 'upgrade' the .xml files. I deleted the guest definitions that were created in earlier versions of VB and recreated them under VB 4.1.0 (a bit of a pain, and it did involve some swearing). It did not resolve the corruption problem, but it did reduce it somewhat, and manually editing the .xml files to try and fix them is much easier since all of the disk information is in one place. It is also easier to restore corrupted VMs from individual guest backups because there is less interaction between the guest and host definition files.
I would still like a standalone repair tool for the .xml files.
comment:72 by , 13 years ago
I've experienced this issue again, this time just by switching between snapshots. There were no crashes of the GUI or the guest involved. It just went straight to inaccessible after trying to restore one of the snapshots.
The sequence of events was something like this...
Create several snapshots
Recall a previous snapshot --> Probably created with 3.x
Boot the guest
Shutdown
Recall the most recent snapshot --> GUI showed inaccessible for the guest after doing this
comment:73 by , 13 years ago
As a further comment, after repairing (or whatever) the xml config I restored the most recent snapshot again. I'm not actually 100% certain which 2 files I actually put into the configuration file but it did seem happy with them. After doing this I did an A->B comparison between the 4 vdi files i.e. the 2 pairs. In both instances the files were identical except for the section at 188 bytes, 32 bytes long of which the first 16 bytes correspond to the UUID given in the filename of the vdi. I'm not sure if the next 16 bytes are a checksum or some sort of parent UUID, but it does seem likely that that is where the error is.
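A comparison like the one described above can be sketched in Python as follows. The offset is deliberately a parameter: the comment gives “188 bytes” without saying whether that is decimal or hexadecimal, and it has not been checked here against any VDI format description. The byte order of the stored UUID is likewise an assumption, so the sketch prints both common readings of the 16 raw bytes for comparison with the UUID in the file name.

```python
import uuid
from typing import Tuple


def read_uuid_pair(vdi_path: str, offset: int) -> Tuple[bytes, bytes]:
    """Return the two 16-byte halves of the 32-byte block at offset."""
    with open(vdi_path, "rb") as f:
        f.seek(offset)
        block = f.read(32)
    if len(block) != 32:
        raise ValueError("file too short for a 32-byte block at that offset")
    return block[:16], block[16:]


def uuid_candidates(raw: bytes) -> Tuple[str, str]:
    """Both common textual readings of 16 raw bytes.

    Which one matches the on-disk layout is an assumption to be
    checked against the UUID in the file name.
    """
    return str(uuid.UUID(bytes=raw)), str(uuid.UUID(bytes_le=raw))


if __name__ == "__main__":
    import sys
    # usage: script.py <file.vdi> <offset>, offset as e.g. 188 or 0x188
    first, second = read_uuid_pair(sys.argv[1], int(sys.argv[2], 0))
    print("first 16 bytes: ", uuid_candidates(first))
    print("second 16 bytes:", uuid_candidates(second))
```

Running it on both files of a pair and comparing the output shows at a glance which of the two 16-byte values differs.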
comment:74 by , 13 years ago
themadmonk, the question was not if the GUI crashed but if the VBoxSVC daemon crashed.
comment:75 by , 13 years ago
We think we found and fixed a bug which could be related to all these problems. The upcoming version 4.1.4 will contain this fix, however, this is a test build which contains the fix. So if anybody is interested to test, feedback is appreciated.
comment:76 by , 13 years ago
This release seems to fix the problem for me. After snapshot restore, snapshot machine is now OK even after machine and vbox manager crash. Thanks.
comment:77 by , 13 years ago
This fix is contained in VBox 4.1.4. Please confirm it works now. Thanks!
comment:78 by , 13 years ago
I have not had this problem since upgrading to 4.1.4. It was not a consistent error in the past, but I have done quite a few restores under 4.1.4 and would have expected at least one occurrence.
However, I did notice a new behavior: If I restore a snapshot and then stop and start VBoxManager, the status of the restored machines always changes to "Current State (changed)" even though there was no activity on those VMs after the restore.
I also scanned all of my .vbox files and found a number that had a link to the CD drive in the file even though the GUI showed it as nothing attached, and I found several where, early in the disk image chain, there was a section with two UUID entries one of which pointed to a non-existent file. My point is that even if 4.1.4 prevents future corruption, it does not fix corruption that has already happened. So, again, I would like to request a stand-alone config file utility, or, at least, the ability to force the restore of an earlier snapshot.
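The kind of scan described above can be approximated with a short Python sketch that lists HardDisk entries whose image file does not exist. It assumes that the entries are XML elements named HardDisk carrying uuid and location attributes, and that relative locations are relative to the directory of the .vbox file; both assumptions should be checked against a real file before relying on the output. It only reports problems and changes nothing.

```python
import os
import xml.etree.ElementTree as ET
from typing import List, Tuple


def missing_hard_disks(vbox_path: str) -> List[Tuple[str, str]]:
    """Return (uuid, location) for HardDisk entries whose file is missing.

    Element and attribute names (HardDisk, uuid, location) are
    assumptions of this sketch, not taken from a format reference.
    """
    base_dir = os.path.dirname(os.path.abspath(vbox_path))
    missing = []
    for elem in ET.parse(vbox_path).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag != "HardDisk":
            continue
        location = elem.get("location")
        if not location:
            continue
        # Assumed: relative locations are relative to the .vbox file.
        full = location if os.path.isabs(location) else os.path.join(base_dir, location)
        if not os.path.exists(full):
            missing.append((elem.get("uuid", ""), location))
    return missing


if __name__ == "__main__":
    import sys
    for disk_uuid, location in missing_hard_disks(sys.argv[1]):
        print(f"{disk_uuid}: {location} not found")
```

Because iter() walks the whole tree, nested HardDisk elements further down a disk image chain are checked as well.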
by , 13 years ago
Attachment: | vbox_check added |
---|
small rexx utility to check HardDisk entries in .vbox files
comment:80 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I want to close this ticket. Please open a new ticket if you still observe similar problems.
I have the same problem. PowerOff + RestoreCurrent (or restore <current snapshot>) has this effect. If I manually edit VirtualBox.xml to set the new GUID value, the VM becomes accessible again.