Opened 14 years ago
Closed 13 years ago
#8631 closed defect (fixed)
Large page allocation times out - causes guest hang
| Reported by: | Jakob Østergaard Hegelund | Owned by: | |
| --- | --- | --- | --- |
| Component: | VMM | Version: | VirtualBox 4.0.4 |
| Keywords: | large pages | Cc: | |
| Guest type: | Linux | Host type: | Solaris |
Description
On a host currently running 11 guests (a mix of Linux and Windows), one of the Debian 5 64-bit Linux guests quickly freezes up as soon as it starts running its batch jobs (CPU-intensive and NFS I/O-intensive).
The last message in the log file before the freeze is:
00:00:51.020 PGMR3PhysAllocateLargePage: allocating large pages takes too long (last attempt 3960 ms; nr of timeouts 2); DISABLE
The host system processors are only utilized some 11-20% while all this is happening. Also, vmstat reports a 7.5GB free list.
Attaching to the VM console never results in an updated console window - I get the RDP window, but its contents never update.
Worse yet, when I attempt to power off the VM using VBoxManage controlvm poweroff, the poweroff hangs before reaching 30%:
$ VBoxManage controlvm sparrow.rd.evalesco.com poweroff
0%...10%...20%...
After several minutes, I try to kill the VBoxHeadless process that is hanging. This causes VBoxManage showvminfo to hang for the VM, but the process does not go away.
A kill -9 is not effective either. The process lives until I reboot the system.
Attachments (1)
Change History (8)
by , 14 years ago
comment:1 by , 14 years ago

These are two separate issues. First is the large page allocation failure, and the next is the poweroff hang leading to unkillable VM processes.

I'm 99% sure the VM poweroff hang issue is the one we just internally fixed and will be part of the next VirtualBox release. FTR, it's a bug in the kernel-side event-semaphore code that was fixed.

As for the large-page issue, are you limiting your ZFS arc-cache? Could you post the output of "kstat zfs"?
comment:2 by , 14 years ago
Sure:
module: zfs    instance: 0    name: arcstats    class: misc
    c  1073109888
    c_max  1073741824
    c_min  1073109888
    crtime  170.747016437
    data_size  499514368
    deleted  3778
    demand_data_hits  232539
    demand_data_misses  3417
    demand_metadata_hits  550561
    demand_metadata_misses  2721
    evict_l2_cached  0
    evict_l2_eligible  3102720
    evict_l2_ineligible  313006080
    evict_skip  0
    hash_chain_max  3
    hash_chains  1371
    hash_collisions  37633
    hash_elements  38570
    hash_elements_max  38571
    hdr_size  7510848
    hits  791813
    l2_abort_lowmem  0
    l2_cksum_bad  0
    l2_evict_lock_retry  0
    l2_evict_reading  0
    l2_feeds  0
    l2_free_on_write  0
    l2_hdr_size  0
    l2_hits  0
    l2_io_error  0
    l2_misses  0
    l2_read_bytes  0
    l2_rw_clash  0
    l2_size  0
    l2_write_bytes  0
    l2_writes_done  0
    l2_writes_error  0
    l2_writes_hdr_miss  0
    l2_writes_sent  0
    memory_throttle_count  0
    mfu_ghost_hits  2136
    mfu_hits  591228
    misses  40713
    mru_ghost_hits  59
    mru_hits  192945
    mutex_miss  0
    other_size  7910240
    p  520093696
    prefetch_data_hits  1101
    prefetch_data_misses  734
    prefetch_metadata_hits  7612
    prefetch_metadata_misses  33841
    recycle_miss  554
    size  514935456
    snaptime  538993.564543555

module: zfs    instance: 0    name: vdev_cache_stats    class: misc
    crtime  170.747033753
    delegations  79240
    hits  243639
    misses  67903
    snaptime  538993.565089241

module: zfs    instance: 0    name: zfetchstats    class: misc
    bogus_streams  0
    colinear_hits  24
    colinear_misses  259227
    crtime  170.745427598
    hits  1307794
    misses  259251
    reclaim_failures  241396
    reclaim_successes  17831
    snaptime  538993.565153127
    streams_noresets  5876
    streams_resets  2
    stride_hits  1301918
    stride_misses  84
I meant to limit the ZFS ARC - I have this in /etc/system:
* Limit ZFS ARC to 1GiB because we really just want to use our
* memory for VMs and do not use local disk...
set zfs:zfs_arc_max = 1073741824
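Judging by the c_max value in the kstat output above, the cap did take effect. For reference, it can also be read back directly (a minimal check; the statistic path is taken from the arcstats output above):

$ kstat -p zfs:0:arcstats:c_max
zfs:0:arcstats:c_max    1073741824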
comment:3 by , 14 years ago
The arc-cache looks good. How much memory are the 11 VMs using in total (RAM + vRAM of all VMs)? The log you posted indicates 2 GB. Are all VMs identical?
Is there anything interesting in the syslog (/var/adm/messages)?
comment:4 by , 14 years ago
dmesg reported this some days ago, but that was long before the last failure (I had another VM fail the same way yesterday):
Mar 25 08:57:13 turkey vboxdrv: [ID 456520 kern.notice] NOTICE: vbi_internal_alloc() failure for 2097152 bytes
Mar 25 10:31:59 turkey vboxdrv: [ID 456520 kern.notice] NOTICE: vbi_internal_alloc() failure for 2097152 bytes
/var/adm/messages is empty.
No, the VMs are not identical. Some are Linux, some are Windows; most are dual-processor and most have around 2-4G of memory.
If I grep out RAM and VRAM (in MB) for the VMs on the system (rough command at the end of this comment), I get:
 RAM  VRAM
 512    12
3072    18
2048    18
3072    18
3072    18
4096    18
1024    12
1024    12
4096    12   (Large page alloc crash yesterday)
2048    12   (Original large page crash)
2048    18
 512    18
The total is some 26G if I am not much mistaken. Is that simply too much?
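For reference, the listing above was pulled with roughly the following (a sketch rather than the exact command I ran; it assumes VM names without spaces and VBoxManage on the PATH):

for vm in $(VBoxManage list vms | awk -F'"' '{print $2}'); do
    VBoxManage showvminfo "$vm" --machinereadable | egrep '^(memory|vram)='
done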
comment:5 by , 14 years ago
I decreased the memory allocations for many of the VMs (and added another VM). Now I have a total memory reserved to VMs of just below 21GB.
The system seems to be perfectly stable this way. So far so good.
But I feel that I am flying blind. It seems that with 32GB of host memory, utilizing 26GB will get me in trouble but 21GB will not. There seems to be no "early warning"; when I use too much, a VM will crash. This is not very reassuring.
Is there any way that I can get an indication of impending memory shortage, or an idea of the memory remaining? It seems that just using "vmstat" is not the answer, as 26GB of utilization got me in trouble (with the ZFS ARC limited to 1GB). (If VM memory could be paged to disk, the onset of page traffic would be an indicator, but I assume there are significant problems with that.)
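For what it's worth, the only raw numbers I know how to watch are vmstat's free column and the kernel's free-page counter (a sketch; the exact kstat path is my assumption):

$ kstat -p unix:0:system_pages:freemem
(the value is in pages; multiply by the output of pagesize(1) to get bytes)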
As for the bug report, let us close it for now. If the hang is fixed, this is more of a host utilization issue than an actual bug (unless you consider it a bug that 7GB of free memory is not enough to keep the system afloat).
comment:6 by , 14 years ago
I did some allocation fixes now, but they are relevant for Solaris 11, not Solaris 10 (which your log indicates is what you're using), so I'm not really sure what might be failing here. For now you could try disabling large pages for all your VMs and see if that makes a difference (VBox 4.0.x turns on large pages by default). You can disable them using:
VBoxManage modifyvm <vmname> --largepages off
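If it is easier, the same switch can be applied to every registered VM in one go; a rough sketch (each VM has to be powered off before modifyvm will take effect, and VM names with spaces would need extra care):

for vm in $(VBoxManage list vms | awk -F'"' '{print $2}'); do
    VBoxManage modifyvm "$vm" --largepages off
done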
comment:7 by , 13 years ago
| Resolution: | → fixed |
| --- | --- |
| Status: | new → closed |
This should be fixed in 4.1.x, but 4.0.x will probably still need the above-mentioned workaround.
Please reopen if bug still persists in 4.1.2.