#14033 new defect

VM Losing time guest running network backups

Reported by:	Timothe Litt	Owned by:
Component:	other	Version:	VirtualBox 4.3.26
Keywords:		Cc:
Guest type:	all	Host type:	all

Description

One of my VMs is seeing major loss of time when it runs backups.

It does not have this issue at any other time.

I did not see this behavior until I upgraded from the last pre-packaged version of VirtualBox for Fedora 17 to built-from-sources 4.3.(the one before 26).

The host:

Linux hagrid.sb.litts.net 2.6.43.8-1.fc15.x86_64 #1 SMP Mon Jun 4 20:33:44 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The guest:

Linux overkill.sb.litts.net 2.6.22.14-72.fc6 #1 SMP Wed Nov 21 15:12:59 EST 2007 i686 i686 i386 GNU/Linux

VirtualBox: 4.3.26_OSE r98988 built from sources

I do not see this on other VMs, but this one has the most data. The other VMs also run Fedora, but not this exact version.

Disk being backed-up:

/dev/hda3 415G 79G 316G 20% /

Resulting backup (on an NFS-mounted NAS):

-rw-rw---- 1 root root 58G Apr 8 05:18 /mnt/backup/overkill/Wed.tar.z

The host and the NAS both have Gb ethernet connections to the same (switched) physical LAN.

The virtual adapter is "Bridged", Intel PRO/1000 MT Desktop (8254OEM).

The virtual disk is a normal VDI 400GB (170.94 actual) dynamic disk on an (emulated) IDE controller. It is the only disk configured on this VM (except for an empty CDROM).

The VM runs headless.

The tar command:

tar --totals --exclude var/www/servers/PhotoGallery/FileCache --anchored --wildcards --wildcards-match-slash --exclude proc --exclude sys --exclude dev/pts --exclude dev/shm --exclude var/lib/nfs/rpc_pipefs --exclude proc/fs/nfsd --exclude proc/sys/fs/binfmt_misc --exclude mnt/castaway --exclude mnt/hagrid --exclude mnt/CindyTV --exclude mnt/nas148001 --exclude mnt/backup --exclude etc/lvm --exclude etc/mtab --exclude dev --exclude proc --exclude run --exclude sys --exclude mnt --exclude tmp --exclude var/cache -czpf /mnt/backup/overkill/Wed.tar.z bin boot dev etc home lib lost+found media misc mnt net opt proc root sbin selinux srv sys tftpboot tmp usr var

The backup runs for several hours. In today's case, from about 0100 to about 0515. Accumulated drift is about 40 minutes:

date
Wed Apr  8 04:43:05 EDT 2015
 Stop/restart NTP (to get correct time from network)
date
Wed Apr  8 05:23:13 EDT 2015

The VM runs NTP, but the time drift is too large for NTP to compensate.

NTP ends up locking to the local time-of-year clock, which is also slow.

VBox.log (attached) contains lines showing the drift accumulating of the form:

124:38:08.239525 TM: Giving up catch-up attempt at a 60 001 255 770 ns lag; new total: 60 001 255 770 ns

124:44:06.724982 TM: Giving up catch-up attempt at a 60 001 273 374 ns lag; new total: 120 002 529 144 ns

Attachments (3)

VBox.log (63.2 KB ) - added by Timothe Litt 10 years ago.: VBox.log
VBox.2.log (83.8 KB ) - added by Timothe Litt 10 years ago.
VBox.3.log (84.2 KB ) - added by Timothe Litt 10 years ago.: Latest snapshot of VBox.log

Download all attachments as: .zip

Change History (9)

by Timothe Litt, 10 years ago

Attachment:	VBox.log added

VBox.log

comment:1 by Frank Mehnert, 10 years ago

Your guest runs with 1000 HZ timer frequency and this is surely one reason for the timing problems. There might be a way to force the guest kernel to use 100 HZ. With RHEL5 kernels there is a divider=10 switch but I don't know if there is a way for old Fedora kernels.

comment:2 by Timothe Litt, 10 years ago

Thanks for the hint. divider=10 is ignored with this version of Fedora.

I tracked down the kernel sources for the guest and rebuilt for 100Hz. During the kernel build, time slip was extreme - so it seems to be compute or disk related rather than network IO. (Tar compression is also compute intensive.) I had to run ntpdate manually every few minutes to keep time even roughly in sync. Errors were on the order of 1 min/min.

Booting the new kernel VBox.log has:

180:35:41.965175 PIT: mode=2 count=0x2e9c (11932) - 99.99 Hz (ch=0)

So the new kernel has reduced the timer frequency as intended.

After waiting for ntp to stabilize, I tried another backup.

After 30 minutes of backup (6.2G transferred), the guest is 15 seconds behind wallclock time. NTP has given up. VBox.log has:

182:17:27.164916 TM: Giving up catch-up attempt at a 60 001 449 423 ns lag; new total: 8 411 549 965 330 ns

(Just the one catch-up line since the reboot.)

I stopped the backup.

follow-up: 4 comment:3 by Frank Mehnert, 10 years ago

Could you attach another VBox.log from your VM session when your guest kernel runs with 100 Hz?

by Timothe Litt, 10 years ago

Attachment:	VBox.2.log added

in reply to: 3 comment:4 by Timothe Litt, 10 years ago

Replying to frank:

Could you attach another VBox.log from your VM session when your guest kernel runs with 100 Hz?

Done. Note that I didn't restart the emulation, so you'll see the previous run, then the reboot to the 100 Hz kernel and the aborted backup.

comment:5 by Timothe Litt, 10 years ago

The 100 Hz kernel seems to have improved things, but not enough.

Backup ran overnight. At the start, NTP was active and synchronized. Wall-clock start at 01:00, system time end 05:45. At 0620, system time was 228 seconds (3.8 min) slow. So the wall-clock time would be about 4 hours, 38 min. That works out to an error of 13,669 PPM (1.4%).

Of course, NTP had given up as the drift exceeds its ability to compensate.

Since the error at 1 KHz was ~ 40 min and the error at 100 Hz ~4 min, it does seem that the error is proportional to the timer frequency.

I didn't mention it before, but the host also runs NTP, and remains locked to the same servers. So VirtualBox has a stable time reference on the host.

I will attach the new VBox.log - but the only new lines are some more TM: Giving up catch-up attempt messages.

Thoughts on what else I can try would be appreciated.

# date
Fri Apr 10 06:13:11 EDT 2015
# ntpq -pnw |head -n6
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.148.43  .GPPS.           1 u   59 1024  377    1.754  227739. 17290.6
 192.168.208.148 146.51.136.107   2 u  695 1024  377    1.069  227737. 27477.0
 192.168.148.136 192.168.148.43   2 u  842 1024  377    0.401  281790. 46613.6
 192.168.148.10  192.168.148.43   2 u  874 1024  377    0.514  227739. 29683.8
# service ntpd stop
Shutting down ntpd:                                        [  OK  ]
# ntpdate 192.168.148.43
10 Apr 06:17:56 ntpdate[23274]: step time server 2001:4830:11a2:1::1 offset 228.108723 sec
# date
Fri Apr 10 06:18:02 EDT 2015
# grep 'Backup\[' /var/log/messages
Apr 10 01:11:05 overkill Backup[5907]: Backup starting for overkill
Apr 10 05:45:42 overkill Backup[5907]: Backup of / successful, stored on nas148010.sb.litts.net:/volume1/backups/overkill

by Timothe Litt, 10 years ago

Attachment:	VBox.3.log added

Latest snapshot of VBox.log

comment:6 by Frank Mehnert, 10 years ago

Most likely not the reason but it's always preferred to use the VBox Guest Additions (GA) time service instead of NTP. The GA time service synchronizes the guest with the host and can also handle bigger time differences (if the difference is bigger than 20 minutes, the time is just set) and it can also handle VM savestate (on resume, the new time is automatically set). There might be also a conflict between NTP and the VBox GA time service. So disabling NTP might be worth a try though I don't expect much improvement.

One reason for the timing problems could also be the lack of VMX preemption timers in your host CPU. But still, I wouldn't expect that major impact.

It might be also worth a try to use the Beta 5.0 to check if there is any improvement as VirtualBox 5.0 will contain a couple of timing-related fixes.

Note: See TracTickets for help on using tickets.

Download in other formats: