Original post

Yeah, absolutely. Our newest product is called Nova, and it’s our cloud-native, scalable ADC. An important part of that is that we run many ADCs centrally, so it’s a control plane/data plane model; we collect a lot of data from the data plane to display on the control plane… But we also had a lot of learnings from our traditional product, which is a standalone ADC.

But what’s interesting is that we’ve tried to tackle it in a very different way. We collect mostly the same data – how many of each type of HTTP reply code are you getting? How many requests are you getting? How many TCP connections? How many TCP connection failures? How many timeouts are there? What’s the reply time? And when you look at the response times, there’s a lot of information there. What was the TCP connect time to the server – is there a network issue? What was the HTTP reply time from the server – is there a back-end issue? What was the response time to the client? How long until we closed that session with the client – is there a front-side network issue?
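To make that response-time breakdown concrete, here’s a minimal Python sketch of how per-request timings could be attributed to one leg of the chain. The class and field names (RequestTiming, tcp_connect_ms, and so on) are hypothetical illustrations, not Nova’s actual telemetry schema.

```python
from dataclasses import dataclass

# Hypothetical per-request timing record; field names are illustrative only.
@dataclass
class RequestTiming:
    tcp_connect_ms: float    # time to open the TCP connection to the back-end server
    backend_reply_ms: float  # time until the back end returned the HTTP response
    client_send_ms: float    # time to deliver the response to the client
    session_close_ms: float  # time until the client session was closed

def likely_bottleneck(t: RequestTiming) -> str:
    """Attribute the largest share of latency to one leg of the chain."""
    legs = {
        "back-end network (TCP connect)": t.tcp_connect_ms,
        "back-end application (HTTP reply)": t.backend_reply_ms,
        "front-side network (delivery and close)": t.client_send_ms + t.session_close_ms,
    }
    return max(legs, key=legs.get)
```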

There are all of these metrics, but what we’ve tried to do – and time will tell if our approach is interesting enough, or right enough – is not put in any hardcoded values for any of those, but rather do anomaly detection and predictive profiling of what we expect the data to look like. Part of the reason is that our system autoscales – it will pre-scale, so it needs to do a lot of prediction off of those numbers. So we’ve wound up with a system where we collect a huge amount of telemetry and set no hard lines for what should be alerted on; we just alert if something changes too much… And so far that’s going well, but I think it’s a little bit odd for some people, because they want to say, “Well, I expect my website to respond in 200 milliseconds, so if it’s ever more than 250, please tell me.” And instead, we’re saying, “Well, if it always responds in 200, then we will tell you if it’s 250. But if it doesn’t, then we won’t.”
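As an illustration of the “no hard lines” idea, here’s a small Python sketch of baseline-driven anomaly detection using a rolling mean and standard deviation. The BaselineDetector class, the window size, and the sigma threshold are assumptions for the example, not the predictive profiling the product actually uses.

```python
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    """Flag a metric only when it drifts from its own learned baseline.

    A simplified stand-in for the profiling described above:
    no hardcoded thresholds, just "has this changed too much?".
    """
    def __init__(self, window: int = 500, sigmas: float = 4.0):
        self.samples = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 30:  # need some history before judging
            mu, sd = mean(self.samples), stdev(self.samples)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                anomalous = True
        self.samples.append(value)
        return anomalous

# A site that always answers in ~200 ms will trip on 250 ms;
# a site whose latency normally varies widely will not.
latency_detector = BaselineDetector()
```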

So all of that is the traditional stuff you’d expect – throughput, request rates, response codes… Because you can pick up a problem long before it becomes serious by saying, “Oh, I normally generate 0.1% errors, and now I’m generating 0.5% errors.” You might not notice that yourself, but it means something has changed – it could mean something is about to get a lot worse, it could mean there’s a security issue, it could mean any of those things.
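Here’s a tiny sketch of that error-rate drift check. The error_rate_shift helper and the “grew by more than N times the baseline ratio” rule are hypothetical simplifications, not the product’s actual logic.

```python
def error_rate_shift(baseline_errors: int, baseline_total: int,
                     current_errors: int, current_total: int,
                     factor: float = 3.0) -> bool:
    """Return True if the current error ratio has grown by more than
    `factor` times the baseline ratio (e.g. 0.1% -> 0.5%)."""
    if baseline_total == 0 or current_total == 0:
        return False
    baseline_rate = baseline_errors / baseline_total
    current_rate = current_errors / current_total
    return baseline_rate > 0 and current_rate > factor * baseline_rate

# 0.1% errors normally, now 0.5%: flagged even though both look "small".
assert error_rate_shift(10, 10_000, 50, 10_000)
```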

But by the same token, we will also check for variances between two things. For example, if the average user sends far more GET requests than POST requests, but one user is sending far more POSTs than GETs – is that a security issue? Are they trying to brute-force a password? Is something weird going on? Is a specific user getting way more 404 errors than everyone else? Why is that? It’s probably some script, or something. So telemetry is often a comparison of two values – “What is this value versus that value?” – as opposed to just a single value. That’s a lot of what we focus on.
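To show what that value-versus-value comparison might look like, here’s a hedged Python sketch: a hypothetical outlier_clients helper that flags clients whose ratio of one metric to another is far above the population’s. The function name, the input shape, and the fixed factor are all illustrative assumptions.

```python
def outlier_clients(per_client: dict[str, dict[str, int]],
                    metric_a: str, metric_b: str,
                    factor: float = 5.0) -> list[str]:
    """Find clients whose metric_a:metric_b ratio is far above the population's.

    Example: metric_a="POST", metric_b="GET" finds clients posting far more
    than they fetch, relative to everyone else; metric_a="404",
    metric_b="requests" finds clients hitting an unusual share of 404s.
    """
    total_a = sum(c.get(metric_a, 0) for c in per_client.values())
    total_b = sum(c.get(metric_b, 0) for c in per_client.values())
    if total_b == 0:
        return []
    population_ratio = total_a / total_b

    flagged = []
    for client, counts in per_client.items():
        b = counts.get(metric_b, 0)
        ratio = counts.get(metric_a, 0) / b if b else float("inf")
        if ratio > factor * max(population_ratio, 1e-9):
            flagged.append(client)
    return flagged

# A client sending mostly POSTs while the population mostly sends GETs
# would show up here, as would one generating far more 404s than its peers.
```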

[00:48:09.09] The client connects to us, we connect to the web servers, and we send the data back. That’s our model… So everything in that communication chain is telemetry we care a lot about, because it could mean there’s a problem with the web servers, it could mean there’s latency or other issues affecting the user, or it could mean a security issue… That’s the type of stuff we need to track, both for scaling up and down and for alerting the user to problems with their service.