From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: samuel@sholland.org
Received: from krantz.zx2c4.com (localhost [127.0.0.1])
    by krantz.zx2c4.com (ZX2C4 Mail Server) with ESMTP id f3965dbc
    for ; Thu, 16 Feb 2017 18:38:26 +0000 (UTC)
Received: from out3-smtp.messagingengine.com (out3-smtp.messagingengine.com [66.111.4.27])
    by krantz.zx2c4.com (ZX2C4 Mail Server) with ESMTP id 6382fff0
    for ; Thu, 16 Feb 2017 18:38:26 +0000 (UTC)
Received: from compute5.internal (compute5.nyi.internal [10.202.2.45])
    by mailout.nyi.internal (Postfix) with ESMTP id A38422069A
    for ; Thu, 16 Feb 2017 13:38:40 -0500 (EST)
Received: from [10.7.33.183] (unknown [161.130.188.44])
    by mail.messagingengine.com (Postfix) with ESMTPA id 4B4257E0D2
    for ; Thu, 16 Feb 2017 13:38:40 -0500 (EST)
To: wireguard@lists.zx2c4.com
From: Samuel Holland
Subject: Instability during large transfers
Message-ID: <7a3a5f6d-1eb5-ef32-097e-e24c2f9d2805@sholland.org>
Date: Thu, 16 Feb 2017 12:38:38 -0600
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
List-Id: Development discussion of WireGuard
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

Hello,

Since I started using wireguard in August 2016, my main firewall has been resetting every few weeks to a month. Since the switch from my previous openvpn setup to wireguard coincided with some hardware changes, I haven't been confident about the source of the crashes.

However, the last three crashes have been directly connected to a large file transfer, meaning sustained, even if not large, bandwidth use. (For comparison, this firewall can handle >400Mbps VPN traffic over a gigabit WAN link.) The first was a `zfs send` job over SSH, around 15Mbps, where the panic happened after around 18 hours. The second was a video transfer over SSH, around 10Mbps, and the panic happened after about 30 minutes. The most recent was another `zfs send` over SSH, this time 40-50Mbps. The transfer started at 2017-02-15 20:23:39.
At 2017-02-16 01:32:13, the firewall reset, and it came back up at 01:32:47. Through the magic of wireguard, the SSH connection stayed intact. The transfer continued, and my sshd logs continued showing key rotations until 04:43:55. This coincides with the following warning from the firewall at 04:44:50:

[11547.285960] ------------[ cut here ]------------
[11547.285976] WARNING: CPU: 1 PID: 0 at kernel/workqueue.c:1440 __queue_work+0x1e0/0x450
[11547.285980] Modules linked in: cfg80211 rfkill wireguard(O) bonding mei_txe mei
[11547.285995] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 4.7.10-hardened #1
[11547.285999] Hardware name: To be filled by O.E.M. To be filled by O.E.M./D180S/D190S/D290S Series, BIOS DB1F1P05_x64 12/23/2014
[11547.286003] 0000000000000000 ffffffff814b4d1f 0000000000000007 0000000000000000
[11547.286010] 0000000000000000 ffffffff810d8059 ffffe8ffffc07900 0000000000000004
[11547.286016] 0000000000000000 ffff880077a9b600 ffffe8ffffc07a38 00000000000105b0
[11547.286023] Call Trace:
[11547.286027] [] ? dump_stack+0x47/0x68
[11547.286040] [] ? __warn+0xb9/0xe0
[11547.286046] [] ? __queue_work+0x1e0/0x450
[11547.286052] [] ? queue_work_on+0x10/0x20
[11547.286059] [] ? padata_do_parallel+0xe8/0x110
[11547.286071] [] ? packet_consume_data+0x4b5/0x760 [wireguard]
[11547.286080] [] ? packet_send_queue+0x330/0x600 [wireguard]
[11547.286087] [] ? vlan_dev_hard_start_xmit+0x89/0x100
[11547.286093] [] ? fib_validate_source+0x101/0x470
[11547.286102] [] ? packet_receive+0x440/0x470 [wireguard]
[11547.286110] [] ? packet_receive+0x456/0x470 [wireguard]
[11547.286116] [] ? udp_queue_rcv_skb+0x1fb/0x470
[11547.286122] [] ? __udp4_lib_rcv+0x427/0x9b0
[11547.286129] [] ? ip_local_deliver_finish+0x63/0x1b0
[11547.286135] [] ? ip_local_deliver+0x56/0xd0
[11547.286140] [] ? ip_rcv_finish+0x390/0x390
[11547.286146] [] ? ip_rcv+0x244/0x370
[11547.286151] [] ? inet_del_offload+0x30/0x30
[11547.286158] [] ? __netif_receive_skb_core+0x909/0xa20
[11547.286163] [] ? br_handle_vlan+0xe3/0x180
[11547.286168] [] ? br_handle_vlan+0xe3/0x180
[11547.286174] [] ? netif_receive_skb_internal+0x1a/0x80
[11547.286181] [] ? br_pass_frame_up+0x8e/0x130
[11547.286187] [] ? find_busiest_group+0xe5/0x950
[11547.286192] [] ? br_allowed_ingress+0x26d/0x390
[11547.286198] [] ? br_handle_frame_finish+0x20f/0x4f0
[11547.286205] [] ? br_handle_frame+0x13f/0x2d0
[11547.286210] [] ? __netif_receive_skb_core+0x2fd/0xa20
[11547.286216] [] ? udp_gro_receive+0x4a/0x110
[11547.286222] [] ? inet_gro_receive+0x1b9/0x240
[11547.286227] [] ? netif_receive_skb_internal+0x1a/0x80
[11547.286232] [] ? napi_gro_receive+0xb6/0x100
[11547.286240] [] ? igb_poll+0x699/0xde0
[11547.286247] [] ? rcu_report_qs_rnp+0x11d/0x170
[11547.286252] [] ? net_rx_action+0x219/0x350
[11547.286258] [] ? __do_softirq+0xea/0x280
[11547.286263] [] ? irq_exit+0x8c/0x90
[11547.286270] [] ? smp_apic_timer_interrupt+0x39/0x50
[11547.286277] [] ? apic_timer_interrupt+0x8b/0x90
[11547.286280] [] ? poll_idle+0xc/0x60
[11547.286291] [] ? cpuidle_enter_state+0xf1/0x2b0
[11547.286298] [] ? cpu_startup_entry+0x259/0x2e0
[11547.286304] [] ? start_secondary+0x1ad/0x270
[11547.286308] ---[ end trace 0439ccab88b5fb0e ]---

At that point, the VPN stopped responding, the SSH connection to the server behind the firewall was broken, and the zfs send job failed. When I woke up this morning, none of the firewall's wireguard peers were able to ping or SSH into it or any machine behind it. However, they all reported their last handshake being less than 25s ago, even though my laptop had been suspended all night (so it was able to connect fine from a new IP, but not transfer any data).

All machines involved are running the same kernel (Linux 4.7.10-hardened from Gentoo's sys-kernel/hardened-sources package), and they all have wireguard version 0.0.20170214 loaded. They had all been rebooted since upgrading to that wireguard version, so within the last two days.
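The times quoted above can be cross-checked with a little arithmetic. The following is only a minimal sketch: it assumes that the 04:44:50 wall-clock time and the [11547.285960] kernel timestamp refer to the same log line, and that the kernel timestamp counts seconds from roughly when the kernel started. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Wall-clock times quoted in the report.
transfer_start = datetime.strptime("2017-02-15 20:23:39", FMT)
firewall_reset = datetime.strptime("2017-02-16 01:32:13", FMT)
firewall_back_up = datetime.strptime("2017-02-16 01:32:47", FMT)
last_key_rotation = datetime.strptime("2017-02-16 04:43:55", FMT)
warning_logged = datetime.strptime("2017-02-16 04:44:50", FMT)

# Kernel timestamp on the "cut here" line, in seconds.
# Assumption: this counts from approximately the moment the kernel started.
warning_uptime = timedelta(seconds=11547.285960)

# Work backwards from the warning to estimate when the kernel started.
estimated_kernel_start = warning_logged - warning_uptime

print("transfer ran before the reset for:", firewall_reset - transfer_start)
print("reset until firewall was back up: ", firewall_back_up - firewall_reset)
print("estimated kernel start:           ", estimated_kernel_start)
print("last key rotation until warning:  ", warning_logged - last_key_rotation)
```

Under those assumptions, the transfer ran for a little over five hours before the reset, the estimated kernel start falls between the reset and the time the firewall came back up, and the warning was logged just under a minute after the last key rotation, so the timeline is at least self-consistent.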
I will enable netconsole on the firewall now that I can hopefully reproduce the panic, but since the networking setup there is rather complicated (vlans on top of bridges) I'm not sure if I will get all of the panic messages. I have seen no wireguard-related warnings or panics on any other machine, but this firewall machine has by far the most traffic and the weakest CPU (http://ark.intel.com/products/78866).

Is there any more information I can provide to help resolve this?

Thanks,
Samuel