Debian 10 server goes offline after 24h running

#1

I’m using 2 testing servers with identical configuration for a project, with Debian 10 (Scaleway default edition).
No change was made on the network interface configuration.
After 3 days, server #2 I was working was offline, not even SSH. So the only way to get back was to reboot it from web Scaleway admin. I didn’t take time to investigate.
Today, I was back working on server #1, and it shown the exact same issue. As it was disconnected or fully crashed. So I investigate what about the root cause. Server was rebooted today July 5th on 11:14 UTC.

I can estimate the loss of connectivity with the ufw logging, as it logs a packet drop every one or two minutes. The policy is very restrictive, so many packets from the web are dropped.
Jul 3 22:06:28
That means that shortly after this date, the server was offline.

Seeking for the logs for some events around this date. I found this in /var/daemon.log :

Jul  3 22:06:32 scw-machine-test01 ntpd[567]: Deleting interface #3 ens2, 10.69.X.X#123, interface stats: received=410, sent=410, dropped=0, active_time=86393 secs
Jul  3 22:06:32 scw-machine-test01 ntpd[567]: 10.5.X.X local addr 10.69.X.X -> <null>
Jul  3 22:06:32 scw-machine-test01 ntpd[567]: 10.5.X.X local addr 10.69.X.X -> <null>

Here’s a theory :
After 24 hours of running the server, ntpd is performing some update tasks, flushing interfaces.
The network interface is deleted and not recovered.

At installation, here some noticeable changes made on network configurations :

  • bind9-host was installed with apt-get
  • The /etc/sysctl.conf config file was set to :
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1

We are using this Debian configuration on many other VPS providers without any issue. But there’s seems to be an incompatibility with the Debian Scaleway flavor.

These external resources are related and can help for specialists :


Any ideas, suggestions ?

#2

Have you open ticket. I think they will procceed faster if you open the ticket on the console. Last time I encounter this and they force me to backup the vps, turn it offline, delete it and restore it so it will go to the other HW (if it’s failed). or such

#3

I have dug down this issue.

NTP deleting the interface is not the root cause. This is just ntpd taking into account the interface disappearance. The root cause is not this.
“The message is a reaction to interfaces being removed within the underlying networking stack. The NTP service does not itself removed networking interfaces.”

So I investigate into DHCP. Because the issue arises after 86400s of running. Looks like a DHCP lease.
Here’s the /var/log files about “dhcp” :

/var/log/cloud-init.log:1857:2020-07-02 22:06:28,083 - dhcp.py[DEBUG]: Performing a dhcp discovery on ens2
/var/log/cloud-init.log:1858:2020-07-02 22:06:28,083 - util.py[DEBUG]: Copying /sbin/dhclient to /var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient
/var/log/cloud-init.log:1860:2020-07-02 22:06:28,102 - util.py[DEBUG]: Running command ['/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient', '-1', '-v', '-lf', '/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhcp.leases', '-pf', '/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient.pid', 'ens2', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True)
/var/log/cloud-init.log:1861:2020-07-02 22:06:28,228 - util.py[DEBUG]: Waiting up to 5 seconds for the following files: ['/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient.pid', '/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhcp.leases']
/var/log/cloud-init.log:1862:2020-07-02 22:06:28,238 - util.py[DEBUG]: All files appeared after 0.01 seconds: ['/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient.pid', '/var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhcp.leases']
/var/log/cloud-init.log:1863:2020-07-02 22:06:28,239 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient.pid (quiet=False)
/var/log/cloud-init.log:1864:2020-07-02 22:06:28,239 - util.py[DEBUG]: Read 4 bytes from /var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhclient.pid
/var/log/cloud-init.log:1865:2020-07-02 22:06:28,239 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhcp.leases (quiet=False)
/var/log/cloud-init.log:1866:2020-07-02 22:06:28,240 - util.py[DEBUG]: Read 523 bytes from /var/tmp/cloud-init/cloud-init-dhcp-83d0d68b/dhcp.leases
/var/log/cloud-init.log:1867:2020-07-02 22:06:28,240 - dhcp.py[DEBUG]: Received dhcp lease on ens2 for 10.69.X.X/255.255.255.254
/var/log/cloud-init.log:1951:2020-07-02 22:06:29,829 - dhcp.py[DEBUG]: Performing a dhcp discovery on ens2
/var/log/cloud-init.log:1952:2020-07-02 22:06:29,830 - util.py[DEBUG]: Copying /sbin/dhclient to /var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient
/var/log/cloud-init.log:1954:2020-07-02 22:06:29,836 - util.py[DEBUG]: Running command ['/var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient', '-1', '-v', '-lf', '/var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhcp.leases', '-pf', '/var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient.pid', 'ens2', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True)
/var/log/cloud-init.log:1955:2020-07-02 22:06:29,903 - util.py[DEBUG]: All files appeared after 0 seconds: ['/var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient.pid', '/var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhcp.leases']
/var/log/cloud-init.log:1956:2020-07-02 22:06:29,903 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient.pid (quiet=False)
/var/log/cloud-init.log:1957:2020-07-02 22:06:29,904 - util.py[DEBUG]: Read 4 bytes from /var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhclient.pid
/var/log/cloud-init.log:1958:2020-07-02 22:06:29,904 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhcp.leases (quiet=False)
/var/log/cloud-init.log:1959:2020-07-02 22:06:29,904 - util.py[DEBUG]: Read 523 bytes from /var/tmp/cloud-init/cloud-init-dhcp-9ic6m8m8/dhcp.leases
/var/log/cloud-init.log:1960:2020-07-02 22:06:29,905 - dhcp.py[DEBUG]: Received dhcp lease on ens2 for 10.69.X.X/255.255.255.254
/var/log/cloud-init.log:2069:2020-07-02 22:06:32,415 - dhcp.py[DEBUG]: Performing a dhcp discovery on ens2
/var/log/cloud-init.log:2070:2020-07-02 22:06:32,417 - util.py[DEBUG]: Copying /sbin/dhclient to /var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient
/var/log/cloud-init.log:2072:2020-07-02 22:06:32,425 - util.py[DEBUG]: Running command ['/var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient', '-1', '-v', '-lf', '/var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhcp.leases', '-pf', '/var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient.pid', 'ens2', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True)
/var/log/cloud-init.log:2073:2020-07-02 22:06:32,521 - util.py[DEBUG]: All files appeared after 0 seconds: ['/var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient.pid', '/var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhcp.leases']
/var/log/cloud-init.log:2074:2020-07-02 22:06:32,521 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient.pid (quiet=False)
/var/log/cloud-init.log:2075:2020-07-02 22:06:32,522 - util.py[DEBUG]: Read 4 bytes from /var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhclient.pid
/var/log/cloud-init.log:2076:2020-07-02 22:06:32,522 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhcp.leases (quiet=False)
/var/log/cloud-init.log:2077:2020-07-02 22:06:32,522 - util.py[DEBUG]: Read 521 bytes from /var/tmp/cloud-init/cloud-init-dhcp-fgmmgyn7/dhcp.leases
/var/log/cloud-init.log:2078:2020-07-02 22:06:32,523 - dhcp.py[DEBUG]: Received dhcp lease on ens2 for 10.69.X.X/255.255.255.254

/var/log/daemon.log.1:1067:Jul  2 22:06:39 scw-machine-test01 dhclient[387]: For info, please visit https://www.isc.org/software/dhcp/
/var/log/daemon.log.1:1080:Jul  2 22:06:39 scw-machine-test01 dhclient[402]: For info, please visit https://www.isc.org/software/dhcp/
/var/log/daemon.log.1:1098:Jul  2 22:06:39 scw-machine-test01 ifup[413]: For info, please visit https://www.isc.org/software/dhcp/
/var/log/daemon.log.1:1101:Jul  2 22:06:39 scw-machine-test01 dhclient[435]: For info, please visit https://www.isc.org/software/dhcp/
/var/log/daemon.log.1:1126:Jul  2 22:06:39 scw-machine-test01 dhclient[510]: For info, please visit https://www.isc.org/software/dhcp/
/var/log/daemon.log.1:1185:Jul  2 22:06:39 scw-machine-test01 ntpd[549]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /run/ntp.conf.dhcp -u 107:113

/var/log/daemon.log.1:1064:Jul  2 22:06:39 scw-machine-test01 dhclient[387]: Internet Systems Consortium DHCP Client 4.4.1
/var/log/daemon.log.1:1077:Jul  2 22:06:39 scw-machine-test01 dhclient[402]: Internet Systems Consortium DHCP Client 4.4.1
/var/log/daemon.log.1:1094:Jul  2 22:06:39 scw-machine-test01 dhclient[435]: Internet Systems Consortium DHCP Client 4.4.1
/var/log/daemon.log.1:1095:Jul  2 22:06:39 scw-machine-test01 ifup[413]: Internet Systems Consortium DHCP Client 4.4.1
/var/log/daemon.log.1:1123:Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Internet Systems Consortium DHCP Client 4.4.1

/var/log/daemon.log.1:1077:Jul  2 22:06:39 scw-machine-test01 dhclient[402]: bound to 10.69.X.X -- renewal in 34254 seconds.

The following ufw script is applied on server comissionning :

ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp
ufw limit 22/tcp
ufw allow 443/tcp
ufw allow 80/tcp
ufw deny 68
ufw deny 5100
ufw allow 53
ufw -f enable

The tricky thing is there is no error on 3rd July 22:06.
So I seek for errors whatever the date, and found a relevant one :

Jul  2 22:06:39 scw-machine-test01 systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Internet Systems Consortium DHCP Client 4.4.1
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Copyright 2004-2018 Internet Systems Consortium.
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: All rights reserved.
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: For info, please visit https://www.isc.org/software/dhcp/
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: 
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Listening on LPF/ens2/de:1c:xx:xx:xx:xx
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Sending on   LPF/ens2/de:1c:xx:xx:xx:xx
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: Sending on   Socket/fallback
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: DHCPDISCOVER on ens2 to 255.255.255.255 port 67 interval 3
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: DHCPOFFER of 10.69.X.X+1 from 10.69.XX.XX
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: DHCPREQUEST for 10.69.X.X+1 on ens2 to 255.255.255.255 port 67
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: DHCPACK of 10.69.X.X+1 from 10.69.XX.XX
Jul  2 22:06:39 scw-machine-test01 dhclient[510]: bound to 10.69.X.X+1 -- renewal in 34311 seconds.
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: Cloud-init v. 18.3-24-gf6249277-0ubuntu2~scaleway1 running 'init' at Thu, 02 Jul 2020 22:06:32 +0000. Up 9.14 seconds.
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +--------+------+-------------+-----------------+--------+-------------------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: | Device |  Up  |   Address   |       Mask      | Scope  |     Hw-Address    |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +--------+------+-------------+-----------------+--------+-------------------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: |  ens2  | True | 10.69.X.X+1 | 255.255.255.254 | global | de:1c:xx:xx:xx:xx |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: |   lo   | True |  127.0.0.1  |    255.0.0.0    |  host  |         .         |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +--------+------+-------------+-----------------+--------+-------------------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: ++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: | Route | Destination |   Gateway   |     Genmask     | Interface | Flags |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: |   0   |   0.0.0.0   | 10.69.XX.XX |     0.0.0.0     |    ens2   |   UG  |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: |   1   | 10.69.XX.XX |   0.0.0.0   | 255.255.255.254 |    ens2   |   U   |
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: 2020-07-02 22:06:33,959 - DataSourceScaleway.py[WARNING]: Error while trying to get vendor-data data: HTTPConnectionPool(host='169.254.42.42', port=80): Max retries exceeded with url: /vendor_data/cloud-init (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xxxxx>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]: /usr/lib/python3/dist-packages/cloudinit/config/cc_rsyslog.py:205: FutureWarning: Possible nested set at position 23
Jul  2 22:06:39 scw-machine-test01 cloud-init[495]:   r'^(?P<proto>[@]{0,2})'

So it might be this is not about DHCP, but an Scaleway API which is blocked. But port 80 is reachable for remote reading. Or this is the Scaleway cloud API which can’t allow using the given DHCP IP address?

Error while trying to get vendor-data data: HTTPConnectionPool(host='169.254.42.42', port=80): Max retries exceeded with url: /vendor_data/cloud-init (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xxxxx>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))

I’m quite lost about this issue, mainly because the cloud-init brings an added layer for the server management.

#4

I found this in the syslog at startup.

Jul  6 13:02:42 scw-authentik-test02 dhclient[436]: bound to 10.64.X.X-- renewal in 42025 seconds.
Jul  6 13:02:42 scw-authentik-test02 ifup[414]: bound to 10.64.X.X -- renewal in 42025 seconds.
Jul  6 13:02:42 scw-authentik-test02 ifup[414]: RTNETLINK answers: Permission denied
Jul  6 13:02:42 scw-authentik-test02 ifup[414]: ifup: failed to bring up ens2
Jul  6 13:02:42 scw-authentik-test02 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Jul  6 13:02:42 scw-authentik-test02 systemd[1]: networking.service: Failed with result 'exit-code'.
Jul  6 13:02:42 scw-authentik-test02 systemd[1]: Failed to start Raise network interfaces.
Jul  6 13:02:42 scw-authentik-test02 systemd[1]: Reached target Network.

It happens mostly when the filter “forbidden port 68” is enabled

I’m starting to wonder whether the issue is more related to the custom sysctl.conf, especially :

net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1

An other point. Maybe ufw is not configured properly when the interface gets its initial configuration, then it fails after the lease expiration after 24h, because of this blocking port 68.

#5

Opening port 68 did not resolve the issue.

I now think this is related to the custom sysctl, configured for security. Here’s the custom file used in the servers. Of course, no issue like this one were reported from using it in others VPS vendor. Maybe the NAT system or the internal network or routers at Scaleway make this is not compatible.

The conf file comments state this can bring some issues :

net.ipv6.conf.all.forwarding=1
-> disables Stateless Address Autoconfiguration based on Router Advertisements
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
-> Some network environments require that these settings are disabled

So this is either one “disallow” rule which needs to be disactivated, or the IPv6 needs to be enabled.

About the firewall, does the machine conf requires to enabled any “hosted” service?

And I’m still puzzled by this error when the machine boot, right after getting its DHCP configuration. It happened also when it booted yesterday (and then went offline today).

ifup[414]: RTNETLINK answers: Permission denied
ifup[414]: ifup: failed to bring up ens2

Today, I rebooted the machine with the default sysctl conf file. We’ll see how it goes.

#6

The VPS did not switch off its network interface after 24 hours with the sysctl.conf back to the original (blank, no option enabled). That means that at least one of theses lines are not compatible with the Scaleway network, and cause this annoying issue :

net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1 

I can’t go deeper in the tests, for example testing line by line, as every single test is requiring 24 hours to perform.

#7

I also had these connectivity problems, and after a long time investigating and asking for a solution for support, the solution was very simple.

Exit the Scaleway and go somewhere else …

Problem solved!!