Server Stability Issues
Posted: Mon Feb 27, 2017 6:36 pm
Good evening,
I felt I needed to explain a bit about what's going on with CLOK, as it's been down for most of the day, and I find that unacceptable.
- I received an email from our hosting provider at 10:10AM informing me that there was a critical kernel vulnerability affecting my server, and that they had installed the kernel update but needed me to reboot the server to apply it.
- From work, around 10:12AM, I used my tablet to remote into the server, gracefully shut down the MUD engine, and reboot the server. I immediately reconnected, made sure everything came back up, and had the server, flash policy listener (for the HMUD web-based client), and Discord bot all running around 10:40AM.
- The new version of the CentOS kernel included a bug that caused the network interface controller (NIC) to randomly enter power saving mode. At around 10:49AM, several players were disconnected from the server and were unable to access the MUD or website.
- This continued to be an issue for the majority of the day, while I was unaware and busy at work. I had some downtime around 1:35PM and checked Discord, only to see talk about the server being down. I also received a call from Rias moments later telling me the same.
- Upon investigation, I was able to connect at 1:50PM, noting that the server's official uptime was 9 minutes. I restarted the MUD engine at that time. I opened a ticket with the hosting provider and informed them of the unexpected downtime, and then proceeded with work.
- While I was working, they sent me a reply stating that a technician had connected to the server and "gracefully" rebooted it (without properly shutting down the MUD engine) for the aforementioned security update, not bothering to check that I had already done it. They also informed me of the NIC issue, and that they had applied a fix but I would need to restart the server for it to take effect.
- After I got back from dinner with my girlfriend, around 7:47PM, I checked Discord and saw that the MUD was down again. I logged in and checked the uptime to see that the server had been up for about 3 and a half hours, meaning that it was rebooted again sometime around 4:30PM this afternoon, but the MUD engine had not been started back up (it has to be run manually).
- I currently have an open ticket with the hosting provider to investigate the second unexpected shutdown, though I'm betting it's going to be another case of them rebooting it "for" me, to fix the aforementioned NIC issue. Right now, everything is back up, and I'll continue to monitor it as closely as my time will allow tonight to ensure it doesn't go down again.
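
For anyone curious how I'm working out when the reboots happened: it's just the time I checked minus the uptime the server reported. Here's a minimal sketch of that arithmetic in Python. The function name is made up for illustration, and the 3.5 hour figure is my rounded reading of the uptime, so treat the second result as approximate.

```python
from datetime import datetime, timedelta


def estimate_boot_time(checked_at: datetime, uptime: timedelta) -> datetime:
    """Return roughly when the server last booted (hypothetical helper)."""
    return checked_at - uptime


# Values taken from the timeline above (Feb 27, 2017).
first_check = datetime(2017, 2, 27, 13, 50)   # 1:50PM, uptime was 9 minutes
second_check = datetime(2017, 2, 27, 19, 47)  # 7:47PM, uptime was about 3.5 hours

first_boot = estimate_boot_time(first_check, timedelta(minutes=9))
second_boot = estimate_boot_time(second_check, timedelta(hours=3, minutes=30))

print(f"First unexpected reboot:  about {first_boot:%H:%M}")
print(f"Second unexpected reboot: about {second_boot:%H:%M}")
```

That puts the first unexpected reboot at about 13:41 and the second at about 16:17. Since the second uptime was only a rough reading, that's consistent with a reboot sometime in the late afternoon.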