Opened 16 years ago
Closed 15 years ago
#2524 closed defect (fixed)
data corruption under heavy I/O load on host
Reported by: | Georg Moritz | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 2.0.2 |
Keywords: | Cc: | ||
Guest type: | other | Host type: | other |
Description
Heavy I/O load on the host leads to IDE DMA timeouts, device resets, incorrect read/write operations, and ultimately to data corruption in the VBox guest.
vb10:~# tail -n 0 -f /var/log/messages
Oct 27 18:28:35 vb10 kernel: [ 1559.569320] hdb: dma_timer_expiry: dma status == 0x61
Oct 27 18:28:45 vb10 kernel: [ 1569.569594] hdb: DMA timeout error
Oct 27 18:28:45 vb10 kernel: [ 1569.570105] hdb: dma timeout error: status=0x48 { DriveReady DataRequest }
Oct 27 18:28:45 vb10 kernel: [ 1569.570127] ide: failed opcode was: unknown
Oct 27 18:28:45 vb10 kernel: [ 1569.570766] hda: DMA disabled
Oct 27 18:28:45 vb10 kernel: [ 1569.572039] hdb: DMA disabled
Oct 27 18:29:20 vb10 kernel: [ 1599.568186] ide0: reset timed-out, status=0x90
Oct 27 18:29:20 vb10 kernel: [ 1604.572374] hda: status timeout: status=0x90 { Busy }
Oct 27 18:29:20 vb10 kernel: [ 1604.572396] ide: failed opcode was: unknown
Oct 27 18:29:20 vb10 kernel: [ 1604.580286] Clocksource tsc unstable (delta = 4687228551 ns)
Oct 27 18:30:00 vb10 kernel: quest: I/O error, dev hdb, sector 506799
Oct 27 18:30:01 vb10 kernel: [ 1635.421648] __ratelimit: 5022 messages suppressed
Oct 27 18:30:01 vb10 kernel: [ 1635.421648] lost page write due to I/O error on hdb1
Oct 27 18:31:02 vb10 kernel: nd_request: I/O error, dev hdb, sector 551111
Oct 27 18:31:09 vb10 kernel: [ 1714.280038] __ratelimit: 5014 messages suppressed
Oct 27 18:31:09 vb10 kernel: [ 1714.280038] lost page write due to I/O error on hdb1
Oct 27 18:31:20 vb10 kernel: [ 1725.024393] lost page write due to I/O error on hdb1
Oct 27 18:31:26 vb10 kernel: [ 1730.728966] lost page write due to I/O error on hdb1
Oct 27 18:31:29 vb10 kernel: [ 1733.614585] lost page write due to I/O error on hdb1
Oct 27 18:31:29 vb10 kernel: [ 1733.905566] lost page write due to I/O error on hdb1
Oct 27 18:31:31 vb10 kernel: [ 1736.305170] __ratelimit: 10 messages suppressed
Oct 27 18:31:31 vb10 kernel: [ 1736.305403] lost page write due to I/O error on hdb1
Oct 27 18:31:36 vb10 kernel: [ 1741.416768] __ratelimit: 23 messages suppressed
Oct 27 18:31:36 vb10 kernel: [ 1741.417144] lost page write due to I/O error on hdb1
Oct 27 18:31:41 vb10 kernel: [ 1746.255712] __ratelimit: 21 messages suppressed
Oct 27 18:31:41 vb10 kernel: [ 1746.255917] lost page write due to I/O error on hdb1
Oct 27 18:31:46 vb10 kernel: [ 1751.279075] __ratelimit: 22 messages suppressed
Oct 27 18:31:46 vb10 kernel: [ 1751.279288] lost page write due to I/O error on hdb1
Oct 27 18:31:51 vb10 kernel: [ 1756.235377] __ratelimit: 20 messages suppressed
Oct 27 18:31:51 vb10 kernel: [ 1756.235377] lost page write due to I/O error on hdb1
Oct 27 18:31:56 vb10 kernel: [ 1761.304274] __ratelimit: 20 messages suppressed
Oct 27 18:31:56 vb10 kernel: [ 1761.304455] lost page write due to I/O error on hdb1
Oct 27 18:32:01 vb10 kernel: 009488] end_request: I/O error, dev hdb, sector 580247
Oct 27 18:32:03 vb10 kernel: r, dev hdb, sector 601439
Oct 27 18:32:03 vb10 kernel: [ 1767.301521] __journal_remove_journal_head: freeing b_frozen_data
^C
vb10:~# ls -l /data
ls: cannot access /data/lost+found: Input/output error
total 260968
-rw-r----- 1 root root 266964992 2008-10-27 18:30 foo.img
d????????? ? ?    ?            ?                ? lost+found
vb10:~#
It looks like iowait conditions on the host don't lead to a blocking of the virtual PCI bus, which causes the guest's disk driver to run into timeouts as per the PCI spec.
Setup:
Host: 8 CPUs, 32 GB RAM, 2.8 TB RAID 5 divided into 32 logical volumes of 83 GB each, running debian lenny
VBoxes: 768 MB RAM, 1 GB ext3 root fs (hda), 11 GB ext3 data fs (hdb), running debian lenny
I have been testing throughput and stability by running iozone simultaneously in varying numbers of VBoxes. The corruption rate was 0 out of 4 VMs when 4 ran simultaneously (0/4), 5 out of 8 (5/8), and 16 out of 16 (16/16).
Change History (13)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
We are aware of this issue and are working on it, but this will take some time because we have to change how I/O works at the moment. Currently we use the host's kernel cache to speed up reading/writing data from/to the image. If the kernel cache is full, the data is written to the disk, which can block any other I/O operation for a long time, especially if the cache is quite big. We could simply disable the cache, but that would cause bad I/O performance, so we have to implement our own caching. Can you please try whether the workaround mentioned in chapter 11.1.2 of the manual fixes the issue? You may need to try different values before the timeouts disappear while still getting decent I/O performance.
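For reference, the workaround referred to above sets a flush interval on the virtual IDE device via extra data. A rough sketch of the command (the VM name "MyVM", the LUN number, and the interval value are placeholders to adapt to your setup):

```shell
# Flush the image file after roughly every N bytes written by the guest.
# "MyVM" is a placeholder VM name; LUN#1 would correspond to the primary
# slave (hdb in the guest). The 4 MB interval is only an example value to
# experiment with, per the manual's advice to try different values.
VBoxManage setextradata "MyVM" \
  "VBoxInternal/Devices/piix3ide/0/LUN#1/Config/FlushInterval" 4194304
```

Smaller intervals flush more often (fewer timeouts, lower throughput); larger intervals trade the other way.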
comment:3 by , 16 years ago
Thanks for the feedback and the pointer to your docs.
I experimented with lower FlushInterval settings, and the values I tried did reduce the frequency of the error messages. But ultimately what I'm trying to do takes far too long with lower values, so I ended up doing it on another system. I did not confirm whether any setting stopped the errors entirely (I would have needed to wait for days for the process to finish).
follow-up: 6 comment:5 by , 15 years ago
I'm having this error, but slightly different; maybe it's the same? I'm using 3.0.10 on Windows (XP SP2) with a Debian guest. I lost a database to this possible bug / failing HD :-(
Could it be my HD? I don't see any events in the Windows Event Viewer.
Dec 11 06:41:35 vbox kernel: [108262.205567] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.205567] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.205567] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hda: DMA disabled
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: DMA disabled
Dec 11 06:41:35 vbox kernel: [108262.300669] ide0: reset: success
Dec 11 06:41:35 vbox kernel: [108262.348672] hda: task_out_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:36 vbox kernel: [108262.348672] hda: task_out_intr: error=0x10 { SectorIdNotFound }, LBAsect=17011429, sector=17011429
Dec 11 06:41:36 vbox kernel: [108262.348672] ide: failed opcode was: unknown
Dec 11 06:41:36 vbox kernel: [108262.348672] hdb: task_out_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:36 vbox kernel: [108262.348672] hdb: task_out_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:36 vbox kernel: [108262.348672] ide: failed opcode was: unknown
comment:6 by , 15 years ago
Replying to cokegen:
I'm having this error, but slightly different; maybe it's the same? I'm using 3.0.10 on Windows (XP SP2) with a Debian guest. I lost a database to this possible bug / failing HD :-(
Could it be my HD? I don't see any events in the Windows Event Viewer.
From the symptoms it looks like an unrelated issue (corrupted hard disk image or something), so please open a new ticket for your problem, and attach VBox.log from a VirtualBox run where those errors showed up.
Ah, just saw you already did. Thanks.
comment:7 by , 15 years ago
Still seeing this in VirtualBox 3.1, trying to sign on to this ticket for updates. The workaround noted in the manual is subject to tweaking, and I'm still working at that. Is there a recommended filesystem for hosting VirtualBox VMs? ZFS on OpenSolaris? ext3/4 on Linux?
comment:8 by , 15 years ago
Yes, these file systems should be fine, although I'm not sure about ext4, because that file system isn't as stable as ext3 yet.
comment:9 by , 15 years ago
I'm having the same problem with 3.1.4 on an Ubuntu 9.10 host with a Debian 4 guest. I already use ext3 in my Debian guest.
Is there a workaround available yet?
comment:10 by , 15 years ago
*aaah* I have now read the ticket description more attentively... It says: "if the host disk is under heavy load, the VBox disk has the described DMA etc. problems." Okay, in that case I have to be more precise:
Actually, I have two machines:
- VBox machine: powerful CPU, enough RAM
- Storage-Server with lot of disks and RAID etc...
Both machines are directly connected (cross-link) with a 1 Gbit network. The virtual hard disks are located on the storage server. The VBox server mounts a dedicated folder (containing the virtual hard disks) via NFS over this 1 Gbit cross-link. This is actually very fast: I can transfer >80 MB/sec without any problem.
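For illustration, a mount like the one described might look as follows. The hostname, export path, and mount point are hypothetical, and the options are merely common choices for VM image storage, not values taken from this ticket:

```shell
# Mount the storage server's export that holds the virtual hard disks.
# "storage", "/export/vdi" and "/var/vbox" are hypothetical names.
# "hard" makes the client retry indefinitely when the server stalls,
# rather than returning I/O errors to the application (here: VirtualBox).
mount -t nfs -o hard,intr,rsize=32768,wsize=32768 storage:/export/vdi /var/vbox
```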
But VBox shows the problem described above if the guest system puts heavy load on the virtual hard disk. I think the problem is the same, just reached by a different route...
I would love to see some progress on this bug-ticket ... ;-)
comment:11 by , 15 years ago
Can we put forth donations for someone to work on this? I really don't want to uproot from vbox, but this is occasionally causing some pretty huge issues, and I really don't have the skills necessary to help much. :(
comment:12 by , 15 years ago
This was actually fixed with 3.2. Make sure to disable the host I/O cache in the storage controller settings, and don't store the image on ext4, because there seems to be a bug in ext4 that causes data corruption.
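Starting with 3.2, the host I/O cache can be switched off per storage controller, either in the GUI's storage settings or via VBoxManage. A sketch, where the VM and controller names are placeholders:

```shell
# Disable the host I/O cache for the given storage controller.
# "MyVM" and "IDE Controller" are placeholder names; use the names
# reported by "VBoxManage showvminfo MyVM" for your own setup.
VBoxManage storagectl "MyVM" --name "IDE Controller" --hostiocache off
```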
comment:13 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I can confirm this. I'm seeing the same "dma_timer_expiry" and "DMA disabled" sequence, until eventually the VM becomes unresponsive and has to be rebooted.
I started seeing this during a batch load of a large PostgreSQL database, i.e. under high I/O load, which I wouldn't consider an uncommon situation. I can't successfully load the data in one go, so I will try to split it up. The bug seems critical to me indeed :)