As of about 17 hours ago (~3 am PDT), about 20% of game servers fail to start up properly after StartGame is called.
It's odd, because I haven't touched anything on the live version since July 19th. I would appreciate any insight on what might be going wrong.
On the lobby server side, everything looks fine.
- No errors in logs
- In the Event History, the gamelobby_started event looks fine. (Attached example for entityId 34795355B5D79DF0)
On the client side, the client gets stuck on the Match Found screen (which pops up when the client receives connection info, but before it is connected to the game server).
- The client receives a "GameServerReady" signal, which contains the server hostname and port from the StartGame call. These are the correct hostname and port. Other arguments all look correct, as well.
- The client then tries to connect to the hostname:port
- The client never succeeds in connecting. My reconnect logic kicks in, but every retry fails as well.
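
To narrow down the client-side symptom, it can help to classify how the connection attempt fails. The following is a minimal sketch, assuming the game server listens on TCP (if it uses UDP this probe does not apply); `probe` is a hypothetical helper, not part of any SDK. A "refused" result would suggest the host is reachable but nothing is bound to the port, while a "timeout" would suggest the host is unreachable or packets are being dropped.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a single TCP connection attempt to host:port."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Anything else (DNS failure, network unreachable, ...).
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Running `probe(hostname, port)` against the hostname and port from a bugged "GameServerReady" signal, and against a working one, would show whether the two cases differ at the network level.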
On the game server side, the bugged server seems to fail to start up correctly.
- My server writes fairly verbose logs to its output files; the 'bugged' servers, however, produce only empty 22-byte zip files as output.
- This happens regardless of whether I terminate the server after 1 minute or after 30 minutes.
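
A note on the 22-byte figure: 22 bytes is exactly the size of a ZIP end-of-central-directory record with no comment, so a 22-byte zip is a valid archive with zero entries. That suggests the bugged servers wrote no output files at all, rather than writing logs that were later truncated. A minimal sketch to confirm this (Python used purely for illustration; `empty_zip_size` and `is_empty_zip` are hypothetical helper names):

```python
import io
import zipfile


def empty_zip_size() -> int:
    """Return the size in bytes of a ZIP archive containing no entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w"):
        pass  # write nothing; closing emits only the end-of-central-directory record
    return len(buf.getvalue())


def is_empty_zip(data: bytes) -> bool:
    """True if `data` is a readable ZIP archive with zero entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return len(zf.namelist()) == 0
```

Passing the bytes of one of the downloaded output archives to `is_empty_zip` would confirm whether it really contains no files.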
Some observations:
- As mentioned above, I haven't touched anything on the live version in over a week. The only changes since then are some hotfixes I attempted in the past couple of hours while trying to fix this issue.
- To reiterate, about 80% of the game servers spin up correctly. The issue only affects about 20% of them.
- The issue doesn't seem any better/worse no matter how long the lobby server has been running (whether newly started up vs. running for 8 hours)
- The gamelobby_started event looks identical (other than different playerIds and some player game data) between working and bugged game servers
- Our StartGame() calls are pretty spaced out (usually about once every 20 seconds), so it isn't a matter of StartGame() being called too frequently.
- Players who are able to get in-game don't experience any server issues.
- The failures don't seem to come in bursts. I thought at first that maybe a few StartGame() calls would fail in a row, then succeed for a while, then fail again, but the failures seem randomly distributed as far as I can tell.
- The failsafe timer does not work. There is a very simple "one hour failsafe timer" right in the main() method, which is supposed to shut down the server after 1 hour. It does not fire on bugged servers, for some reason.
- The issue doesn't seem restricted to any particular port(s); at least 9000, 9001, 9002 have been used by bugged servers, but also by correct servers. The same goes for IPs.
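
For reference, the failsafe described above has roughly the following shape. This is only a sketch under assumptions: it is written in Python for illustration (the real server's language is not stated here), and `arm_failsafe` and the startup marker are hypothetical names. The point is that if a timer armed as the first thing in main() never fires and no startup marker is ever written, one plausible reading is that main() is never reached on the bugged servers.

```python
import os
import sys
import threading
import time


def arm_failsafe(seconds: float, on_expire=None) -> threading.Timer:
    """Start a daemon timer that runs `on_expire` after `seconds`.

    By default the callback force-exits the process.
    """
    if on_expire is None:
        # os._exit skips cleanup handlers, so a hung main thread cannot block it.
        def on_expire() -> None:
            os._exit(1)
    timer = threading.Timer(seconds, on_expire)
    timer.daemon = True  # the timer alone never keeps the process alive
    timer.start()
    return timer


def main() -> None:
    # Very first statement: a flushed marker proving main() was reached.
    sys.stderr.write(f"startup: reached main() at {time.time():.0f}\n")
    sys.stderr.flush()
    # Arm the one-hour failsafe before any other initialization.
    arm_failsafe(60 * 60)
    # ... the rest of server startup would follow here (omitted) ...
```

The timer is armed before any other startup work so that a hang later in initialization cannot prevent it from running.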
Edit 1: Fixed formatting
Edit 2: Added info about ports
Edit 3: Added info about IPs