Brent Batas (Lisk) asked

Urgent: Game servers not starting up correctly

As of about 17 hours ago (~3 am PDT), about 20% of game servers fail to start up properly after StartGame is called.

It's odd, because I haven't touched anything on the live version since July 19th. I would appreciate any insight on what might be going wrong.

On the lobby server side, everything looks fine.

  • No errors in logs
  • In the Event History, the gamelobby_started event looks fine. (Attached example for entityId 34795355B5D79DF0)

On the client side, the client gets stuck on the Match Found screen (which pops up when they receive connection info, but before they are connected to the game server)

  • The client receives a "GameServerReady" signal, which contains the server hostname and port from the StartGame call. These are the correct hostname and port. Other arguments all look correct, as well.
  • The client then tries to connect to the hostname:port
  • The client never succeeds in connecting, and my reconnect logic tries to kick in, but fails again and again

On the game server side, the bugged server seems to fail to start up correctly.

  • I log fairly verbose output to the server's output files; however, the 'bugged' servers produce only empty 22-byte zip files as output.
  • This happens whether I terminate the server after 1 minute or after 30 minutes.
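
One way to narrow down where startup stalls is to write an unbuffered marker line before anything else runs. The following is a minimal sketch in C#, assuming the server process can write to a local log path. `StartupLog` and the file name in the usage note are hypothetical names, not part of the game or of PlayFab.

```csharp
using System;
using System.IO;

static class StartupLog
{
    // Opens (or appends to) a log file and immediately writes a marker line.
    // AutoFlush pushes every line to the file as soon as it is written, so
    // the text should survive even if the process hangs right afterwards.
    public static StreamWriter Open(string path)
    {
        var writer = new StreamWriter(path, true); // true = append
        writer.AutoFlush = true;
        writer.WriteLine(DateTime.UtcNow.ToString("o") + " process started");
        return writer;
    }
}
```

Calling `StartupLog.Open("startup.log")` as the first statement of the entry point, then writing one line before and after each initialization step, would show the last step reached. If even the marker line is missing, the stall is probably happening before this code runs or in the output collection itself.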

Some observations I've had

  • As mentioned above, I haven't touched anything on the live version in over a week. The only changes since then are some hotfixes in the past couple of hours, made in an attempt to fix this issue.
  • To reiterate, about 80% of the game servers spin up correctly. The issue affects only about 20% of them.
  • The issue doesn't seem any better/worse no matter how long the lobby server has been running (whether newly started up vs. running for 8 hours)
  • The gamelobby_started event looks identical (other than different playerIds and some player game data) between working and bugged game servers
  • Our StartGame() calls are fairly spaced out (usually about once every 20 seconds), so the issue isn't that it's being called too frequently.
  • Players who are able to get in-game don't experience any server issues.
  • The failures don't seem to come in bursts. I thought at first that a few StartGame() calls might fail in a row, then succeed for a while, then fail again, but the failures seem randomly distributed as far as I can tell.
  • Failsafe timer does not work - There is a very simple "one hour failsafe timer" right in the main() method which is supposed to shut down the server after 1 hour. This is not working for bugged servers, for some reason.
  • The issue doesn't seem restricted to any particular port(s); at least 9000, 9001, 9002 have been used by bugged servers, but also by correct servers. The same goes for IPs.
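
A failsafe timer can fail to fire if it is created after the call that hangs, or if it depends on a thread that is blocked. Whether that is what happens here is an assumption. The following minimal sketch arms a watchdog on its own background thread; `Failsafe` is a hypothetical name.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Failsafe
{
    // Starts a watchdog on a dedicated background thread. Call this as the
    // first statement of the entry point so it is armed before any file or
    // network I/O has a chance to hang.
    public static Thread Arm(TimeSpan limit, Action onExpired = null)
    {
        var thread = new Thread(() =>
        {
            Thread.Sleep(limit);
            if (onExpired != null)
            {
                onExpired();
            }
            else
            {
                // Default: kill the process outright instead of requesting a
                // clean shutdown, since a hung thread may block a clean exit.
                Process.GetCurrentProcess().Kill();
            }
        });
        thread.IsBackground = true; // does not keep a healthy process alive
        thread.Name = "failsafe-watchdog";
        thread.Start();
        return thread;
    }
}
```

Usage would be `Failsafe.Arm(TimeSpan.FromHours(1));` at the very top of the entry point. The optional callback exists mainly so the watchdog can be exercised with a short limit during testing.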

Edit 1: Fixed formatting

Edit 2: Added info about ports

Edit 3: Added info about ips

Custom Game Servers

brendan answered

We don't tend to push any updates over the weekend, apart from bugfixes. For your hosted server behavior to be different, I would have to assume that either the AMI changed (which it didn't), the server model changed (it hasn't - you're still set to c4.large), the game build zip changed (which you said it hasn't), the build configuration changed (ditto), or there's a more fundamental issue affecting the EC2 region in question.

While we can't debug your custom game servers, we can look into the things we have data for, like the API calls to our service. Looking in our tracking, the only correlation I see around 3am for your title is that your matchmaking server got busier - it pushed past 5 StartGame calls per minute at that time, and stayed there throughout much of today. I'm not seeing anything suspicious in the graphs for your title though, and I'm not seeing any monitoring alerts.

What happens when you run the build yourself, on a Win2K12 R2 machine? What, if any, error messages are you getting on the client side? Have you done a Wireshark capture of the attempts to talk to the servers? What do they show?

1 comment

cool-daniel commented

I don't know if it helps, but recently there are always about 10-15 ghost games that boot up around the same time, and after that there are no more lobbies.
I killed about 10 instances about an hour ago, and when I checked just now there were again 13 instances that had started around the same time (a few seconds or minutes apart). Only those 13 empty instances had been running for about an hour; the rest seemed normal. So they don't seem to be created in a loop or at random, but as soon as we kill them, a limited number come back and stay until they are killed again.

List of those instances:

  • D8E7E84FF0104F50
  • 9F086BEC07518AF0
  • 1312475F013D9A6E
  • FA9E3204830C992
  • 36CC5EE288A93623
  • 43C2C74854638BD0
  • D2C9E0B2699F38D0
  • F9AF681A905D02B1
  • DBE591EA7F4EA1D7
  • 20DE2BA665424EEE
  • 685EB1E0AD9F519B
  • C86DC1BA7363FC30
  • 4D5993A87577405A

Brent Batas (Lisk) answered

Update: After many hours spent debugging a memory dump file, I am still not certain of the issue, but I applied a workaround that seems to work. According to the memory dump, it seems like the code was hanging, but not erroring, on a File.ReadAllBytes() call.

I added a try/catch block around that line and moved it earlier in our initialization sequence, and it seems to work.

I'm still not sure what caused the issue or why this workaround seems to fix it.
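
Since the dump points to a hang rather than an exception, a try/catch alone may never fire. One option is to put a time limit on the read so that a stall turns into an exception that can be logged. A minimal sketch, assuming .NET 4.5 or later for `Task.Run`; `GuardedFile` is a hypothetical name:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class GuardedFile
{
    // Runs File.ReadAllBytes on a worker thread and waits up to `timeout`.
    // A stalled read surfaces as a TimeoutException. A failed read surfaces
    // as an AggregateException wrapping the original error (thrown by Wait).
    public static byte[] ReadAllBytesWithTimeout(string path, TimeSpan timeout)
    {
        Task<byte[]> read = Task.Run(() => File.ReadAllBytes(path));
        if (!read.Wait(timeout))
        {
            throw new TimeoutException(
                "Reading '" + path + "' did not finish within " + timeout);
        }
        return read.Result;
    }
}
```

Note that this only detects the hang. The stalled read keeps running on its worker thread, so the caller still has to decide whether to retry or shut the instance down.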

2 comments

brendan commented

A try/catch would only trigger if there's an exception. Are you logging that, to see exactly what it says? It sounds like you're having some sort of contention issue with file access, due to all the processes trying to grab the data at the same time. ReadAllBytes is, I believe, supposed to be safe in this context - it should open the file in a using block with Read sharing. But you may have found a corner case where many processes trying to read a file at once causes an issue. Can you find out what the exception is? Also, how large is the file in question?
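
To make the sharing mode explicit and capture any exception text, the read could be written out by hand. A minimal sketch, not a confirmed fix; `SharedRead`, the retry count, and the `log` callback are hypothetical:

```csharp
using System;
using System.IO;
using System.Threading;

static class SharedRead
{
    // Opens the file read-only with FileShare.Read, so other processes may
    // read it at the same time, and retries on IOException (which includes
    // sharing violations and FileNotFoundException). Every failure is passed
    // to `log` so the exact exception text ends up in the output.
    public static byte[] ReadShared(string path, int attempts, TimeSpan delay, Action<string> log)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
                using (var buffer = new MemoryStream())
                {
                    stream.CopyTo(buffer);
                    return buffer.ToArray();
                }
            }
            catch (IOException ex)
            {
                log("Read attempt " + attempt + " of " + attempts + " failed: " + ex);
                if (attempt >= attempts) throw; // give up, keep the original error
                Thread.Sleep(delay);
            }
        }
    }
}
```

A call such as `SharedRead.ReadShared(path, 3, TimeSpan.FromMilliseconds(200), Console.WriteLine)` would log each failed attempt before rethrowing the last one.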

Brent Batas (Lisk) commented (in reply to brendan)

It turns out the issue is happening again. No output files are produced, so I can't see any exceptions. The file in question is 530 KB in size.

I can try to enclose the ReadAllBytes calls in using blocks.

Edit: Just re-read what you were saying. Yeah, it's pretty weird that it's running into problems then... no idea what to try now.

Brent Batas (Lisk) answered

Update:

Still running into problems, although it's a bit rarer.

I moved around some logic to ensure ReadAllBytes is only called once per instance. Given we only have 7 instances per server, and each session lasts about 25 minutes, it seems like we aren't calling ReadAllBytes *that* often.

Not sure what to try next.

4 comments

brendan commented

I've separately sent you a memory dump we got from one of the instances. Can you have a look at that, to see if it sheds any light on this?

Brent Batas (Lisk) commented (in reply to brendan)

Can you send me an updated one from today? I pushed a big patch today that I hoped would fix it (but it didn't). The old one suggested the File.Read call was problematic, but I can't confirm whether the new issue has the same cause.

Brent Batas (Lisk) commented (in reply to own comment)

Friendly Bump :) Example instance:

24798CC59B2258A
