Toni Palm asked

Azure Functions timeout in cold start

Hi,

When an Azure Function is cold starting, it can take more than ten seconds to respond, and the PlayFab ExecuteFunction call times out. Is there anything useful we can do about this?

CloudScript
Seth Du commented:

Do you mean the startup time after you deploy the function app to Azure? That time cannot be skipped. May I ask whether you have a specific scenario that needs the function to be live immediately?

Toni Palm commented:

The case I was more worried about was actually a problem with my queue handling: the messages were not handled properly and ended up being retried, so the system bombarded the app with them. That forced the HTTP-triggered functions onto new instances, which took a long time to start. I have fixed this on my end and am considering moving the queue handlers to a separate app or something similar.

But as you said, the problem remains with deployments and, I assume (correct me if I'm wrong), also under heavy load when Azure needs to spin up new instances.

There's no specific case where the app needs to respond immediately, but we obviously don't want players to experience unnecessary timeouts.
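The retry storm described above usually comes from a handler that lets every failure bubble up, so the same message is redelivered again and again. Azure Functions' Storage queue trigger already caps this with `maxDequeueCount` in host.json (messages that exceed it are moved to a `-poison` queue), but the handler itself can also separate failures worth retrying from ones that are not. The following is a minimal plain-Python sketch of that idea, not Functions binding code: `Message`, `TransientError`, `PermanentError`, and `QueueProcessor` are hypothetical names, and the returned delay stands in for whatever visibility timeout your queue client lets you set.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class TransientError(Exception):
    """Hypothetical: a failure that may succeed later (throttling, timeouts)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (bad payload, missing player)."""


@dataclass
class Message:
    body: dict
    dequeue_count: int = 0  # number of delivery attempts so far


@dataclass
class QueueProcessor:
    handler: Callable[[dict], None]
    max_attempts: int = 5      # same idea as maxDequeueCount
    base_delay_s: float = 2.0  # first retry delay; doubles on each attempt
    poison: List[Message] = field(default_factory=list)

    def process(self, msg: Message) -> Optional[float]:
        """Handle one delivery of msg.

        Returns None when the message is finished (handled or poisoned),
        or the delay in seconds before it should become visible again.
        Any other exception type propagates to the caller unchanged.
        """
        msg.dequeue_count += 1
        try:
            self.handler(msg.body)
            return None
        except PermanentError:
            # Retrying cannot help, so park the message instead of redelivering it.
            self.poison.append(msg)
            return None
        except TransientError:
            if msg.dequeue_count >= self.max_attempts:
                self.poison.append(msg)
                return None
            # Exponential backoff spreads retries out instead of flooding the app.
            return self.base_delay_s * 2 ** (msg.dequeue_count - 1)
```

Failing fast on permanent errors and backing off on transient ones keeps a single bad message from generating the burst of traffic that forces Azure to scale out.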

brendan answered

In live titles, this should be rare: you'll generally only run into it during development, or if you have a sudden, very large increase in traffic. Azure Functions have to cold start when there hasn't been any activity for some period of time (in general about 20 minutes, though it can be shorter), or when there's a greater influx of traffic than their scaling can easily keep up with. For any game with a non-trivial player base, unless Functions are only rarely called in the game, that means idle cold starts aren't really likely until you get to the long tail for the game. And at the point where the player base is so small that calls are that infrequent, you'll likely be looking to sunset the game in any case.

That said, it's important to handle any error conditions you get back, so you will need to make sure the game can handle a timeout response. One important thing to consider is that if the HTTP request reached Functions, the code in question will run; all the timeout means is that the connection from our servers closed. So part of your logic should be to check that a retry doesn't perform the same operation twice.
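Because the function may already have run by the time the timeout reaches the client, a retry needs to be safe to repeat. A common way to get that is an idempotency key: the client creates one ID per logical operation and reuses it on every retry, and the function stores the result under that ID and replays it instead of redoing the work. The sketch below assumes you pass such an ID as a hypothetical argument of your own function (it is not a PlayFab field); `IdempotentExecutor` and `call_with_retries` are illustrative names, and the in-memory dict stands in for a durable store.

```python
import uuid
from typing import Callable, Dict


class IdempotentExecutor:
    """Function side: remembers results by request ID so that a retry after a
    client-side timeout replays the stored result instead of redoing the work."""

    def __init__(self) -> None:
        # Stand-in for a durable store; a real one needs an atomic
        # insert-if-absent so two concurrent retries cannot both run.
        self._results: Dict[str, dict] = {}

    def execute(self, request_id: str, operation: Callable[[], dict]) -> dict:
        if request_id in self._results:
            return self._results[request_id]
        result = operation()
        self._results[request_id] = result
        return result


def call_with_retries(send: Callable[[str], dict], attempts: int = 3) -> dict:
    """Client side: one request ID per logical operation, reused on every retry."""
    request_id = str(uuid.uuid4())
    last_error: Exception = TimeoutError("no attempts made")
    for _ in range(attempts):
        try:
            return send(request_id)
        except TimeoutError as err:
            # The function may have run anyway; retrying is safe because of the ID.
            last_error = err
    raise last_error
```

Generating the ID once, outside the retry loop, is the important detail: a fresh ID per attempt would defeat the deduplication.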

Also, since you're currently using your own Azure account for Functions, you can turn on the recently added option to keep at least one VM running at all times. That would prevent the cold-start-due-to-no-traffic case.

kamyker answered

Try a Linux function app instead; in my tests it had much lower cold starts than Windows.

Cold starts and the overhead of Azure are why I tried something else and spent a few days optimizing the .NET runtime for OpenWhisk. Startup dropped by almost 1 s, and my results so far are:
Empty function cold start: ~350 ms
Bigger one with the PlayFab SDK (6 KB): ~570 ms

I'll finish it soon: https://github.com/kamyker/openwhisk-dotnet-csharp/

Toni Palm commented:

Thanks, I ran a quick test, but there are some other problems with Linux: 1) you can't create a Linux app in the same resource group/region as a Windows app (so I tested in another region); 2) you can't stream the Linux app's logs to VS Code.

So it might make sense to use Linux for production, but for development the log streaming is pretty useful, so I'll stick to Windows there, unless there's some other nice way to view the Linux logs in real time. I couldn't find anything myself.

Toni Palm commented:

We switched to Linux (without your optimizations), and that alone doesn't seem to be enough; we still get those timeouts sometimes. The next thing I'm going to try is probably the "always on" plan for dev.

kamyker commented:

I'm using a different provider than Azure and running a function every 5 minutes: only 30% of those runs fail, despite being on the free plan (5 s limit) and making 4 GET requests to Steam and Azure Cosmos DB.

Additionally, you don't have to rely on PlayFab. Simply send requests directly to your function and authenticate the player there. You could even cache that and generate a key valid for 30 minutes to speed up future requests.
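The 30-minute key suggested above could be an HMAC-signed token: after the first request is authenticated against PlayFab (for example by validating the player's session ticket with the server API's AuthenticateSessionTicket call, not shown here), the function issues a token that later requests present instead. This is a minimal standard-library sketch under that assumption; `SECRET`, `issue_token`, and `verify_token` are hypothetical, and a real deployment would load the secret from app settings and would likely use an established token format such as JWT.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"hypothetical-secret-load-from-app-settings"  # assumption: never hard-code in production
TTL_SECONDS = 30 * 60  # the 30-minute validity suggested above


def _sign(payload: bytes) -> bytes:
    digest = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest)


def issue_token(player_id: str, now: Optional[float] = None) -> str:
    """Call only after the player has been authenticated against PlayFab."""
    now = time.time() if now is None else now
    payload = json.dumps({"pid": player_id, "exp": int(now) + TTL_SECONDS}).encode()
    return (base64.urlsafe_b64encode(payload) + b"." + _sign(payload)).decode()


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the player ID if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body_b64, sig = token.encode().split(b".", 1)
        payload = base64.urlsafe_b64decode(body_b64)
    except ValueError:  # also covers binascii.Error from malformed base64
        return None
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig, _sign(payload)):
        return None
    claims = json.loads(payload)
    if now >= claims["exp"]:
        return None
    return claims["pid"]
```

Verifying the token is a local hash computation, so later requests skip the round trip to PlayFab; the trade-off is that a token stays valid for its full lifetime even if the player's PlayFab session is revoked.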

Toni Palm answered

Another nightmare popped up in my mind. @Brendan, you told me specifically: "...for timeouts, I’d recommend moving to Functions as soon as you can. That should pretty much eliminate any issues due to long-running calls."

All our timeout problems in the past have been due to the PlayFab API timing out or taking a really long time to execute. So do we now have the problem that a PlayFab API call made from Azure might take a long time, or even time out, and then the Azure Function execution limit kicks in and the call times out anyway?

brendan commented:

You should run them as queued functions if you think they could run long. In that case, we simply wait asynchronously for them to complete.

Toni Palm commented:

The PlayFab API randomly takes a long time to respond, so are you saying we should somehow run these calls through queues? It is not our code that is slow to execute but the PlayFab API. Also, in most cases, if not all, the client wants a response.
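When the client needs an answer but the upstream call sometimes stalls, one option is to give the call a time budget shorter than the caller's timeout and, when the budget runs out, hand the work to a queued function and return a "pending" status the client can check later. The sketch below assumes `call` wraps the slow PlayFab request and `enqueue` posts a job to a queue; both are hypothetical callables, and `call_with_budget` is an illustrative name rather than part of any SDK.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_budget(
    call: Callable[[], Awaitable[dict]],
    enqueue: Callable[[], Awaitable[str]],
    budget_s: float,
) -> dict:
    """Return the result if `call` finishes within budget_s; otherwise queue
    the work and return a pending job ID the client can check later."""
    try:
        result = await asyncio.wait_for(call(), timeout=budget_s)
        return {"status": "done", "result": result}
    except asyncio.TimeoutError:
        # wait_for has cancelled the local coroutine, but the upstream service
        # may already have applied the request, so the queued job must be idempotent.
        job_id = await enqueue()
        return {"status": "pending", "job_id": job_id}
```

The budget is a tuning assumption: it has to leave room for cold start and network latency inside whatever timeout the caller enforces, and the client then needs a way to pick up the queued result.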

Toni Palm answered

I'm pretty sure this will be an issue with live titles as well unless the load is pretty constant, at least on the Consumption plan. According to Microsoft's documentation, it should not be an issue with the Premium or App Service (dedicated instance) plans. The dedicated plan has other issues, as it doesn't autoscale, AFAIK.

So in our setup we are now planning to run DEV on an App Service plan (no need to scale) and production on the Premium plan. The Consumption plan is pretty much useless with this PlayFab timeout.

brendan commented:

Any title with a reasonable user base, and which uses Cloud Script as part of the core game loop (so, it runs at least once a minute for any given player, on average), should only see cold starts when there are sudden increases to the number of connected players. In that case, Azure may not have enough VMs allocated for the sudden increase.

If you're in soft-launch or beta, and so have a very small pool of users, it would be reasonable to assume you'd run into this more often.
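The once-a-minute argument can be made concrete under a simplifying assumption (not from the thread): if each of N concurrently online players calls Cloud Script as an independent Poisson process with rate λ, the combined call stream is Poisson with rate Nλ, so the chance that a whole idle window of length T passes with no call is:

```latex
P(\text{no call within } T) = e^{-N \lambda T}
```

With λ = 1 call/min and T = 20 min (the rough idle window mentioned earlier), even N = 1 gives e^{-20} ≈ 2 × 10^{-9}. Under this model, idle cold starts come almost entirely from periods when no players are online at all, which is the small-pool situation described above; the model says nothing about cold starts caused by sudden scale-out.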
