question

Alex asked

GSDK | Issue with OnGSDKHealthCheck

Hello,

We have implemented OnGSDKHealthCheck in our Unreal game server and we are using multiplayer servers for our game. We return the server with the lowest player count upon Requesting or Getting a game server.

Today one of our servers went to 100% CPU and the game server stopped logging anything, so it was completely dead. However, OnGSDKHealthCheck didn't do what I expected, which was to detect the dead server and kill the container. I guess OnGSDKHealthCheck either didn't return anything or timed out when your system called it.

So we ended up routing players to a dead server. From PlayFab's side, the server was still active with connected players.

How does OnGSDKHealthCheck work? What can we do to avoid this issue next time?

Thanks.

Tags: apis, sdks, Custom Game Servers, multiplayer

Alex answered

I see that GSDK pings VMAgent here: https://github.com/PlayFab/gsdk/commit/627f9edf817431664391c39437d1a01f2ab1940a.
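
If the heartbeat really does originate inside the game server process, one thing that might help is a watchdog: the game loop records a timestamp on every iteration, and the health callback reports unhealthy once that timestamp is too old. The following is only a minimal sketch of that pattern in Python, not GSDK code. `LoopWatchdog`, `tick`, `is_healthy` and the 10-second threshold are hypothetical names and values, and it assumes the health callback runs on a thread other than the stalled game loop.

```python
import time


class LoopWatchdog:
    """Tracks when the main game loop last made progress (hypothetical helper)."""

    def __init__(self, max_stall_seconds=10.0, clock=time.monotonic):
        self._max_stall = max_stall_seconds
        self._clock = clock
        self._last_tick = clock()

    def tick(self):
        # Call once per game-loop iteration to record progress.
        self._last_tick = self._clock()

    def is_healthy(self):
        # Intended to be returned from the health-check callback:
        # unhealthy if the loop has not ticked within the allowed window.
        return (self._clock() - self._last_tick) <= self._max_stall
```

In an Unreal server the equivalent would presumably be a timestamp updated in the tick function and read inside the OnGSDKHealthCheck handler, but I have not verified which thread that handler runs on.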

I implemented a solution in our server selection logic to avoid this issue: I ping the selected active server, and if it's not reachable, I shut down the server with the API and look for the next available server.
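
For anyone who wants to do the same, here is a rough sketch of that selection logic in Python. It is not our production code: the server dictionaries, `is_reachable`, `pick_server` and the `shutdown` callback are hypothetical, and the TCP probe assumes the game server exposes a TCP port you can connect to (a UDP-only server would need its own ping message). The `shutdown` callback is where the call to the server shutdown API would go.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(servers, probe=is_reachable, shutdown=lambda server: None):
    """Pick the reachable server with the fewest players.

    Unreachable candidates are passed to `shutdown` (hypothetical hook for
    the shutdown API call) and skipped. Returns None if none respond.
    """
    for server in sorted(servers, key=lambda s: s["players"]):
        if probe(server["host"], server["port"]):
            return server
        shutdown(server)
    return None
```

Passing `probe` and `shutdown` as parameters keeps the selection logic testable without a live server or API key.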

Dimitris Gkanatsios answered

If the process goes to 100% CPU, it will unfortunately make everything on the VM difficult (including our game server orchestrator, VmAgent). Did you manage to log into the VM and grab logs to understand what caused the 100% CPU?

Alex answered

Hi Dimitris,

We retrieved the logs from the PlayFab archived servers, but I have not found any clue so far. About the VmAgent: it runs on the VM, right? I ask because what went to 100% CPU was one of the two Docker containers we have per VM.

So the VM had one game server working correctly on one core, and the other server stuck at 100% CPU and not working. I would expect the agent on the VM to still be able to run the OnGSDKHealthCheck against that container, since it is still doing so for the second server?

Dimitris Gkanatsios answered

Hey Alex, yeah, VmAgent is our game server orchestrator running on the VM. If CPU usage is 100%, VmAgent will probably be impacted as well. The OnGSDKHealthCheck won't help here.

If this happens again, any chance you can RDP/SSH to the VM and grab a dump of the server process? This might help in your bug investigation.

Alex answered

Dimitris, what I mean is that the VM as a whole is not at 100% CPU usage. Only one of the two Docker containers running on the VM (the container game servers) is at 100% CPU. The other game server is running perfectly on the same VM at around 10-20% CPU, so the VmAgent is not affected.

Does the GSDK connect and send information to the VmAgent? Or is the VmAgent the one that pings the game server?
