Opened 9 years ago
Closed 7 years ago
#14779 closed defect (obsolete)
Kernel panics with VirtualBox 5.0.8, possible network problem
Reported by: | Thomas Dreibholz | Owned by: | |
---|---|---|---|
Component: | network | Version: | VirtualBox 5.0.8 |
Keywords: | | Cc: | |
Guest type: | Linux | Host type: | Linux |
Description
I get regular kernel panics with VirtualBox 5.0.8 under Ubuntu Server 14.04 LTS (64-bit) when running an Ubuntu Server 12.04 LTS VM (64-bit). The issue seems to be a problem with IPv6 TCP offloading in vboxnetflt. See the attached picture of a stack trace. Both the host and the VM use IPv6 on 3 network interfaces. All 3 interfaces are bridged to Ethernet ports.
The problem is reproducible on my system, i.e. it occurs after the system has been running for a few minutes. I did not observe the issue when running VirtualBox 4.3 on Ubuntu Server 12.04 LTS; the issue appeared after upgrading the system to Ubuntu Server 14.04 LTS and VirtualBox 5.0.8.
Attachments (9)
Change History (43)
by , 9 years ago
Attachment: | VBox-Bug1.png added |
---|
follow-up: 3 comment:2 by , 9 years ago
follow-up: 4 comment:3 by , 9 years ago
Replying to Thomas Dreibholz:
Replying to vushakov:
Please, can you provide real dmesg as text.
I attached a dmesg output.
I mean can you provide the dmesg from the crash as text, not as a screenshot.
comment:4 by , 9 years ago
Replying to vushakov:
Replying to Thomas Dreibholz:
Replying to vushakov:
Please, can you provide real dmesg as text.
I attached a dmesg output.
I mean can you provide the dmesg from the crash as text, not as a screenshot.
No, unfortunately, the machine is remote. All I can get is a screenshot of the HP iLO Java applet showing the console screen. It is graphics output.
comment:5 by , 9 years ago
I see gre_gso_segment in both stack traces. Can you provide more details about your GRE setup? Do you have a lot of GRE traffic, or does the crash happen on the first GRE packet? On the first large GRE packet? A packet capture might be handy.
follow-up: 7 comment:6 by , 9 years ago
I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see https://www.nntb.no/nornet-core/ and https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.
comment:7 by , 9 years ago
Replying to Thomas Dreibholz:
I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see https://www.nntb.no/nornet-core/ and https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.
Note that the GRE tunnels are only inside the VM. The host machine has no GRE tunnels configured.
comment:8 by , 9 years ago
So it looks like you are hitting BUG_ON(len); in skb checksum code. I wonder if this happens on the first packet for which GRO kicks in.
follow-up: 12 comment:9 by , 9 years ago
I now also observe the kernel panics on 4 other systems with the same Ubuntu Server 14.04 LTS/VirtualBox 5.0.8 combination. All 5 affected systems have in common that the primary GRE tunnel (transporting a lot of management traffic) also transports IPv6 traffic, together with IPv4 traffic, over IPv4. I have 8 more systems of the same Ubuntu/VirtualBox versions, but these systems do not use IPv6 on their primary GRE tunnel. These systems seem to be stable. So, I assume there is some issue with IPv6 over GRE.
I will perform some further tests. A problem with sending the first IPv6 packet over a tunnel seems highly unlikely, though: the machines run for at least several minutes, and sometimes hours, before the kernel panic happens.
comment:10 by , 9 years ago
I have now installed kdump and can generate kernel dumps of the crashes. I already attached one of the resulting dmesg files (dmesg.201511040959). This may be interesting:
... (many more of the following messages) ...
[ 419.959822] VBoxNetFlt: Failed to segment a packet (-93).
[ 420.265421] VBoxNetFlt: Failed to segment a packet (-93).
[ 420.875506] VBoxNetFlt: Failed to segment a packet (-93).
[ 421.478203] VBoxNetFlt: Failed to segment a packet (-93).
[ 422.029902] VBoxNetFlt: Failed to segment a packet (-93).
[ 473.713682] VBoxNetFlt: Failed to segment a packet (-93).
[ 473.959018] VBoxNetFlt: Failed to segment a packet (-93).
[ 474.096466] VBoxNetFlt: Failed to segment a packet (-93).
[ 474.309785] VBoxNetFlt: Failed to segment a packet (-93).
[ 474.334235] VBoxNetFlt: Failed to segment a packet (-93).
[ 474.414036] ------------[ cut here ]------------
[ 474.414118] kernel BUG at /build/linux-lts-vivid-Nr0FoT/linux-lts-vivid-3.19.0/net/core/skbuff.c:2135!
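For readers who want to summarize such a log, here is a minimal sketch (not part of the original report) that counts the VBoxNetFlt segmentation failures and finds the first "kernel BUG" line. The function name `summarize` and the one-entry errno table are illustrative assumptions; on Linux, errno 93 is EPROTONOSUPPORT, so -93 suggests the segmentation call rejected the packet's protocol.

```python
import re

# Linux errno number seen in the log; the name comes from asm-generic/errno.h.
# Only the one value needed here is listed (assumption: a Linux host).
ERRNO_NAMES = {93: "EPROTONOSUPPORT"}

LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")
SEGMENT_RE = re.compile(r"VBoxNetFlt: Failed to segment a packet \((-?\d+)\)\.")


def summarize(dmesg_text):
    """Count segmentation failures and locate the first 'kernel BUG' line."""
    failures = []      # (timestamp, errno) per failed segmentation
    bug_time = None
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, message = float(m.group(1)), m.group(2)
        seg = SEGMENT_RE.search(message)
        if seg:
            failures.append((stamp, -int(seg.group(1))))
        elif message.startswith("kernel BUG at") and bug_time is None:
            bug_time = stamp
    return failures, bug_time


# Last few lines of the log quoted above.
SAMPLE = """\
[  474.309785] VBoxNetFlt: Failed to segment a packet (-93).
[  474.334235] VBoxNetFlt: Failed to segment a packet (-93).
[  474.414036] ------------[ cut here ]------------
[  474.414118] kernel BUG at /build/linux-lts-vivid-Nr0FoT/linux-lts-vivid-3.19.0/net/core/skbuff.c:2135!
"""

failures, bug_time = summarize(SAMPLE)
last_stamp, last_errno = failures[-1]
print(len(failures), ERRNO_NAMES.get(last_errno, "unknown"), bug_time)
```

Running the same function over a full dmesg file would show how long the failure messages precede the BUG, which may help correlate the two.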
comment:11 by , 9 years ago
I have now also stored the full output of some kernel dumps under https://www.nntb.no/~nornetpp/temp/crash.tar.gz . The file size is about 750 MiB.
comment:12 by , 9 years ago
Replying to Thomas Dreibholz:
All 5 affected systems have in common that the primary GRE tunnel (transporting a lot of management traffic) also transports IPv6 traffic, together with IPv4 traffic, over IPv4.
Please, can you provide example GRE setup instructions for this?
by , 9 years ago
Attachment: | tunnel4.txt added |
---|
GRE tunnel configuration of the VM on the crashed system
comment:13 by , 9 years ago
I attached IPv4, IPv6 and GRE tunnel configurations.
Probably most interesting is the main tunnel gre1-1-1:
nornetpp@tromsoe:~$ ip tunnel show gre1-1-1
gre1-1-1: gre/ip remote 158.39.4.2 local 129.242.157.228 ttl 255 key 16843777
nornetpp@tromsoe:~$ ip -4 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472 qdisc noqueue state UNKNOWN group default
    inet 192.168.43.150 peer 192.168.43.151/32 scope global gre1-1-1
       valid_lft forever preferred_lft forever
nornetpp@tromsoe:~$ ip -6 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472
    inet6 2001:700:4100:ff:ffff:401:101:1/112 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::4444:1:4:1:1/64 scope link
       valid_lft forever preferred_lft forever
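As a side note on the numbers in this output, the following sketch (not from the original comment) shows how an MTU of 1472 is consistent with a keyed GRE-over-IPv4 tunnel, assuming a 1500-byte underlying link, which the ticket does not state. The helper names are hypothetical. It also renders the numeric key in the dotted-quad form that iproute2 accepts as an alternative notation.

```python
IPV4_HEADER = 20   # outer IPv4 header without options
GRE_BASE = 4       # mandatory GRE header (flags + protocol type)
GRE_KEY = 4        # optional key field, present because a key is configured


def gre_tunnel_mtu(link_mtu, with_key=True):
    """Payload MTU left for a GRE-over-IPv4 tunnel on a link of the given MTU."""
    overhead = IPV4_HEADER + GRE_BASE + (GRE_KEY if with_key else 0)
    return link_mtu - overhead


def key_as_dotted_quad(key):
    """Render a 32-bit GRE key as a.b.c.d, most significant byte first."""
    return ".".join(str((key >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(gre_tunnel_mtu(1500))          # assumed 1500-byte Ethernet underlay
print(key_as_dotted_quad(16843777))  # key value from the tunnel output above
```

Both IPv4 and IPv6 payloads share this tunnel MTU, since the encapsulation overhead depends only on the outer headers.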
comment:14 by , 9 years ago
I prepared some more kernel crash dumps collected from 4 machines as packages:
comment:15 by , 9 years ago
I added a patch to VirtualBox that seems to prevent the issue by turning the offloading features in VBoxNetFlt-linux.c off. So far, I have not observed further crashes after installing a patched VirtualBox on my machines.
comment:16 by , 9 years ago
So far, I have not observed any more kernel panics after applying my patch. The patch simply comments out these settings:
# define VBOXNETFLT_WITH_GSO 1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST 1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE 1
# define VBOXNETFLT_WITH_GSO_RECV 1
# define VBOXNETFLT_WITH_GRO 1
That is, at least one of these settings appears to cause the crashes.
comment:17 by , 9 years ago
If necessary, I could test some specific combinations of the options and/or debug code on one of my machines.
comment:18 by , 9 years ago
Thanks for the update. Yes, the crash is triggered by us calling skb_gso_segment (comment:5) and we probably do some wrong modifications to the skb before that. I haven't yet had a chance to look closer into this. I'll try to get to it this week. Sorry, other stuff needs attention too...
by , 9 years ago
Attachment: | VBoxNetFlt-linux.diff added |
---|
comment:19 by , 9 years ago
Actually, on a hunch... please, can you try the above patch (after re-enabling the GSO code your patch disables)? Not tested, but it looks like a cut-n-paste typo which might result in BUG_ON(len) later because of wrong header size.
comment:20 by , 9 years ago
I just installed a version with your patch on one of my machines. I will report what happens ...
follow-up: 23 comment:21 by , 9 years ago
Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...
Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.
comment:22 by , 9 years ago
The new kernel crash dumps are here: https://www.nntb.no/~nornetpp/temp/skjennungen2.simula.nornet.tar.xz .
comment:23 by , 9 years ago
Replying to Thomas Dreibholz:
Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...
Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.
Thanks for trying it, it's not necessary to try 5.0.10. I also don't think I need any more crash dumps for now.
comment:24 by , 9 years ago
I already tried 5.0.10 -> no change, i.e. the kernel panics happen as before.
comment:25 by , 9 years ago
I varied my work-around patch by keeping
# define VBOXNETFLT_WITH_GSO 1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST 1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE 1
# define VBOXNETFLT_WITH_GSO_RECV 1
and just commenting out one setting:
# define VBOXNETFLT_WITH_GRO 1
So far, the kernel has run through the whole weekend without a panic. VBOXNETFLT_WITH_GRO seems to cause the problem.
comment:26 by , 9 years ago
Yes, I'm also looking at that code right now. The skjennungen2 crash you posted in comment:22 crashes on an skb with
len = 4294904698, mac_len = 65430, network_header = 0, mac_header = 106,
where mac_len, which is -106, must be coming from VBoxNetFlt-linux.c:1109 inside VBOXNETFLT_WITH_GRO ifdef.
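The arithmetic behind "mac_len is -106" can be checked with a small sketch (added for illustration, not taken from the VirtualBox or kernel sources). It assumes mac_len is a 16-bit unsigned field holding network_header - mac_header, and that len is printed as an unsigned 32-bit value; `as_signed` is a hypothetical helper.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned integer of the given width as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


# Values quoted from the crash dump in this comment.
mac_header = 106
network_header = 0

# Assumption: mac_len is a 16-bit field set to network_header - mac_header,
# so a negative difference wraps around modulo 2**16.
mac_len_raw = (network_header - mac_header) & 0xFFFF
print(mac_len_raw, as_signed(65430, 16))

# Assumption: len is an unsigned 32-bit value, so the huge number in the
# dump corresponds to a negative length after an underflow.
print(as_signed(4294904698, 32))
```

Under these assumptions the wrapped difference reproduces the dumped mac_len exactly, and the dumped len also decodes to a negative number, which fits a length counter that was decremented past zero.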
follow-up: 28 comment:27 by , 9 years ago
Unfortunately, # define VBOXNETFLT_WITH_GRO is not the problem. It crashed again.
I am currently trying to comment out just:
# define VBOXNETFLT_WITH_GSO_XMIT_HOST 1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE 1
# define VBOXNETFLT_WITH_GSO_RECV 1
comment:28 by , 9 years ago
This also results in the crashes.
I am currently trying to comment out all VBOXNETFLT_WITH_GSO* settings, just leaving VBOXNETFLT_WITH_GRO.
follow-up: 30 comment:29 by , 9 years ago
Is there any news on locating the bug? If a possible fix is available, I could test it.
follow-up: 32 comment:30 by , 9 years ago
Replying to Thomas Dreibholz:
Is there any news on locating the bug? If a possible fix is available, I could test it.
Unfortunately, I have been unable to reproduce the problem locally so far.
comment:31 by , 9 years ago
I replaced VirtualBox with KVM to see whether this would solve the problem. However, when using KVM directly, the same problem still appears. I therefore filed a kernel bug report as well: https://bugzilla.kernel.org/show_bug.cgi?id=109071 .
comment:32 by , 9 years ago
Replying to vushakov:
Replying to Thomas Dreibholz:
Is there any news on locating the bug? If a possible fix is available, I could test it.
Unfortunately, I have been unable to reproduce the problem locally so far.
If it helps, I could build and install a custom kernel, e.g. with some printk() calls. It may also be possible to provide you with access to a test setup machine.
comment:33 by , 9 years ago
Thank you for the update.
If you see the problem with KVM as well, then it's most likely a kernel bug with GRO/GSO of GRE. The VBox code basically does nothing much beyond skb_copy() and skb_gso_segment() on the passed skb, so I was starting to suspect as much. I'd rather wait for kernel folks to do their investigation. Unfortunately, we don't have enough resources to duplicate their effort, so we appreciate your offer, but won't take you up on it just yet.
comment:34 by , 7 years ago
Resolution: | → obsolete |
---|---|
Status: | new → closed |
Screenshot of the stack trace