Opened 10 years ago
Last modified 10 years ago
#14033 new defect
VM Losing time guest running network backups
Reported by: | Timothe Litt | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 4.3.26 |
Keywords: | Cc: | ||
Guest type: | all | Host type: | all |
Description
One of my VMs is seeing major loss of time when it runs backups.
It does not have this issue at any other time.
I did not see this behavior until I upgraded from the last pre-packaged version of VirtualBox for Fedora 17 to built-from-sources 4.3.(the one before 26).
The host:
Linux hagrid.sb.litts.net 2.6.43.8-1.fc15.x86_64 #1 SMP Mon Jun 4 20:33:44 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
The guest:
Linux overkill.sb.litts.net 2.6.22.14-72.fc6 #1 SMP Wed Nov 21 15:12:59 EST 2007 i686 i686 i386 GNU/Linux
VirtualBox:
4.3.26_OSE r98988 built from sources
I do not see this on other VMs, but this one has the most data. The other VMs also run Fedora, but not this exact version.
Disk being backed-up:
/dev/hda3 415G 79G 316G 20% /
Resulting backup (on an NFS-mounted NAS):
-rw-rw---- 1 root root 58G Apr 8 05:18 /mnt/backup/overkill/Wed.tar.z
The host and the NAS both have Gb ethernet connections to the same (switched) physical LAN.
The virtual adapter is "Bridged", Intel PRO/1000 MT Desktop (8254OEM).
The virtual disk is a normal VDI 400GB (170.94 actual) dynamic disk on an (emulated) IDE controller. It is the only disk configured on this VM (except for an empty CDROM).
The VM runs headless.
The tar command:
tar --totals --exclude var/www/servers/PhotoGallery/FileCache --anchored --wildcards --wildcards-match-slash --exclude proc --exclude sys --exclude dev/pts --exclude dev/shm --exclude var/lib/nfs/rpc_pipefs --exclude proc/fs/nfsd --exclude proc/sys/fs/binfmt_misc --exclude mnt/castaway --exclude mnt/hagrid --exclude mnt/CindyTV --exclude mnt/nas148001 --exclude mnt/backup --exclude etc/lvm --exclude etc/mtab --exclude dev --exclude proc --exclude run --exclude sys --exclude mnt --exclude tmp --exclude var/cache -czpf /mnt/backup/overkill/Wed.tar.z bin boot dev etc home lib lost+found media misc mnt net opt proc root sbin selinux srv sys tftpboot tmp usr var
The backup runs for several hours. In today's case, from about 0100 to about 0515. Accumulated drift is about 40 minutes:
date Wed Apr 8 04:43:05 EDT 2015 Stop/restart NTP (to get correct time from network) date Wed Apr 8 05:23:13 EDT 2015
The VM runs NTP, but the time drift is too large for NTP to compensate.
NTP ends up locking to the local time-of-year clock, which is also slow.
VBox.log (attached) contains lines showing the drift accumulating of the form:
124:38:08.239525 TM: Giving up catch-up attempt at a 60 001 255 770 ns lag; new total: 60 001 255 770 ns 124:44:06.724982 TM: Giving up catch-up attempt at a 60 001 273 374 ns lag; new total: 120 002 529 144 ns
Attachments (3)
Change History (9)
by , 10 years ago
comment:1 by , 10 years ago
Your guest runs with 1000 HZ timer frequency and this is surely one reason for the timing problems. There might be a way to force the guest kernel to use 100 HZ. With RHEL5 kernels there is a divider=10 switch but I don't know if there is a way for old Fedora kernels.
comment:2 by , 10 years ago
Thanks for the hint. divider=10 is ignored with this version of Fedora.
I tracked down the kernel sources for the guest and rebuilt for 100Hz. During the kernel build, time slip was extreme - so it seems to be compute or disk related rather than network IO. (Tar compression is also compute intensive.) I had to run ntpdate manually every few minutes to keep time even roughly in sync. Errors were on the order of 1 min/min.
Booting the new kernel VBox.log has:
180:35:41.965175 PIT: mode=2 count=0x2e9c (11932) - 99.99 Hz (ch=0)
So the new kernel has reduced the timer frequency as intended.
After waiting for ntp to stabilize, I tried another backup.
After 30 minutes of backup (6.2G transferred), the guest is 15 seconds behind wallclock time. NTP has given up. VBox.log has:
182:17:27.164916 TM: Giving up catch-up attempt at a 60 001 449 423 ns lag; new total: 8 411 549 965 330 ns
(Just the one catch-up line since the reboot.)
I stopped the backup.
follow-up: 4 comment:3 by , 10 years ago
Could you attach another VBox.log from your VM session when your guest kernel runs with 100 Hz?
by , 10 years ago
Attachment: | VBox.2.log added |
---|
comment:4 by , 10 years ago
Replying to frank:
Could you attach another VBox.log from your VM session when your guest kernel runs with 100 Hz?
Done. Note that I didn't restart the emulation, so you'll see the previous run, then the reboot to the 100 Hz kernel and the aborted backup.
comment:5 by , 10 years ago
The 100 Hz kernel seems to have improved things, but not enough.
Backup ran overnight. At the start, NTP was active and synchronized. Wall-clock start at 01:00, system time end 05:45. At 0620, system time was 228 seconds (3.8 min) slow. So the wall-clock time would be about 4 hours, 38 min. That works out to an error of 13,669 PPM (1.4%).
Of course, NTP had given up as the drift exceeds its ability to compensate.
Since the error at 1 KHz was ~ 40 min and the error at 100 Hz ~4 min, it does seem that the error is proportional to the timer frequency.
I didn't mention it before, but the host also runs NTP, and remains locked to the same servers. So VirtualBox has a stable time reference on the host.
I will attach the new VBox.log - but the only new lines are some more TM: Giving up catch-up attempt
messages.
Thoughts on what else I can try would be appreciated.
# date Fri Apr 10 06:13:11 EDT 2015 # ntpq -pnw |head -n6 remote refid st t when poll reach delay offset jitter ============================================================================== 192.168.148.43 .GPPS. 1 u 59 1024 377 1.754 227739. 17290.6 192.168.208.148 146.51.136.107 2 u 695 1024 377 1.069 227737. 27477.0 192.168.148.136 192.168.148.43 2 u 842 1024 377 0.401 281790. 46613.6 192.168.148.10 192.168.148.43 2 u 874 1024 377 0.514 227739. 29683.8 # service ntpd stop Shutting down ntpd: [ OK ] # ntpdate 192.168.148.43 10 Apr 06:17:56 ntpdate[23274]: step time server 2001:4830:11a2:1::1 offset 228.108723 sec # date Fri Apr 10 06:18:02 EDT 2015 # grep 'Backup\[' /var/log/messages Apr 10 01:11:05 overkill Backup[5907]: Backup starting for overkill Apr 10 05:45:42 overkill Backup[5907]: Backup of / successful, stored on nas148010.sb.litts.net:/volume1/backups/overkill
comment:6 by , 10 years ago
Most likely not the reason but it's always preferred to use the VBox Guest Additions (GA) time service instead of NTP. The GA time service synchronizes the guest with the host and can also handle bigger time differences (if the difference is bigger than 20 minutes, the time is just set) and it can also handle VM savestate (on resume, the new time is automatically set). There might be also a conflict between NTP and the VBox GA time service. So disabling NTP might be worth a try though I don't expect much improvement.
One reason for the timing problems could also be the lack of VMX preemption timers in your host CPU. But still, I wouldn't expect that major impact.
It might be also worth a try to use the Beta 5.0 to check if there is any improvement as VirtualBox 5.0 will contain a couple of timing-related fixes.
VBox.log