Conversation
> * This might also be addressed by inverting the metric
>
> A delay percentile is expressed as a delay, so it shares the same failings, and the same potential mitigation [huh? it's hard for me to see how one would invert a P99 metric in a useful way]. But a percentile carries additional baggage as well:
I forgot to mention this in my email, but I agree that inverting the metric is probably not the direction.
Delay in milliseconds is perfectly fine, I think, and users know that for latency, lower is better. Inverting it could make it more confusing.
That is a great comment! I agree. Anyone with a 3rd grade education understands that for some metrics (like cost, for example), lower is better. Let's strike any concern about latency being a 'smaller is better' metric. The other concern (units are really small) is a true concern and could be addressed more directly.
I'll certainly remove the stuff about inverting the metric, and I'll merge the discussion about "higher is better" into the bullet about the units seeming too small to matter.
bbriscoe left a comment:
I'll attempt a final draft Fri morning UK time. I've responded to most of your review comments. Pls push back on any if you think it appropriate.
> ## Introduction
>
> A real-time application invariably discards any packet that arrives after the jitter buffer's play-out deadline. This applies for any real-time application, whether streamed video, interactive media or online gaming. [Not all games use jitter buffers] For this broad class of applications a median delay metric is a distraction---waiting for the median delay before play-out would discard half the packets. To characterize the delay experienced by an application it would be more useful to quote the delay of a high percentile of packets. But which percentile? The 99th? The 99.99th? The 98th? The answer is application- and implementation-dependent, because it depends on how much discard can effectively be concealed (1%, 0.01% or 2% in the above examples, assuming no other losses). [This text might be sufficient, but jitter buffers are usually adaptive, and my understanding is that they aim to maximize QoE. At any particular operating point they convert variable delay into a fixed delay and residual loss. Since QoE typically depends on both of those resulting attributes, I don't think that they generally ignore the former.] Nonetheless, it would be useful to settle on a single industry-wide percentile to characterize delay, even if it isn't perfect in every case.
> [Not all games use jitter buffers]
As I said, I didn't want to mention implementation specifics like jitter buffers, but I was trying to accommodate the parenthesis about jitter buffers that you had added. I'm going to remove the mention of a jitter buffer - these sentences are getting too complicated for the start of a paper.
> [...Since QoE typically depends on both {delay and loss}, I don't think that they generally ignore the former.]
No, of course QoE depends on delay and loss, but that's not the point. The point is that the implementation has to ensure a certain percentage of packets arrive within the deadline, otherwise the application doesn't function at all. It does not pick a level of delay out of the air that must be achieved no matter what. Consider both extremes:
- imagine the chosen level of delay persistently led to 70% loss - it would have to increase the (temporarily) fixed level of delay until loss was below a usable level.
- If on the other hand the chosen level of delay persistently led to 0.00001% loss, the (temporarily) fixed level of delay could be adapted down.
IOW, discard has to be lower than a specific level otherwise the application doesn't function; then delay just has to be as low as possible having satisfied the loss constraint.
FWIW I think you are either misunderstanding my comment or we have a very different view as to how this works, because what I wrote is exactly the point. I don't agree with your statement that an implementation has to ensure a certain percentage of packets arrive within the deadline. In this context, I think QoE is a function of both factors, e.g. QoE = Qmax - k1*latency - k2*log(loss) (this is a simplification of the Cole & Rosenbluth formula for VoIP).
The network provides a certain delay distribution (likely isn't stationary, but let's assume it is).
The job of the adaptive jitter buffer is to tune to the operating point that produces the fixed latency and residual loss that maximizes the QoE function.
Consider the case where the network provides a P95 latency of 50ms and a P96 latency of 200ms. In that case, the optimal point might be 50ms and 5% loss.
Or, if the network provides P99 latency of 60ms and P99.999 latency of 62ms, the optimal point might be 62ms and 0.00001% loss.
So, even for a specific application, you can't simply say that the metric should match the residual loss target of its jitter buffer, because I don't think there is a set loss target. But, I still think that something like a P99 metric is useful.
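To make this comment's argument concrete, the adaptive jitter buffer's job can be sketched as below: sweep candidate fixed play-out deadlines over an observed delay distribution and pick the one that maximizes the simplified QoE model quoted above (QoE = Qmax - k1*latency - k2*log(loss)). The constants qmax, k1, k2, the clamp floor and the synthetic two-component delay distribution are illustrative assumptions, not calibrated values or the full Cole & Rosenbluth formula.

```python
import numpy as np

def best_playout_deadline(delays_ms, qmax=100.0, k1=0.1, k2=5.0, floor=1e-6):
    """Pick the fixed play-out deadline maximizing the simplified model
    QoE = qmax - k1*latency_ms - k2*log10(loss). log10(loss) is negative
    for loss < 1, so lower residual loss raises QoE; loss is clamped at
    `floor` to keep the log finite. Constants are illustrative only."""
    d = np.sort(np.asarray(delays_ms, dtype=float))
    n = len(d)
    # If the deadline sits at the i-th smallest delay, the residual
    # late-discard fraction is the share of packets arriving later.
    loss = (n - 1 - np.arange(n)) / n
    qoe = qmax - k1 * d - k2 * np.log10(np.maximum(loss, floor))
    i = int(np.argmax(qoe))
    return d[i], qoe[i]

# Synthetic distribution: bulk around 20 ms plus a 1% heavy tail.
rng = np.random.default_rng(1)
delays = np.concatenate([rng.gamma(2.0, 10.0, 9_900),
                         rng.gamma(2.0, 100.0, 100)])
deadline, qoe = best_playout_deadline(delays)
print(f"deadline {deadline:.1f} ms, "
      f"residual loss {np.mean(delays > deadline):.4%}, QoE {qoe:.1f}")
```

Depending on the shape of the tail fed in, the optimum lands at a few percent residual loss or at near-zero loss, which illustrates the point that there is no single built-in loss target.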
Yes, in theory there is an optimization. But surely the slope of the delay CDF is so steep that it's hardly a 2-variable optimization. For instance, with 2nd gen AQMs (PIE, FQ-CoDel) the slope of the delay-percentile CDF is 5-10 ms / decade of loss, and with 3rd gen (L4S) it's <1ms / decade. So surely, pragmatically, it's solely about increasing the de-jitter buffer enough to reach a discard percentage that is low enough to be easily concealable... whatever delay that takes.
I don't know whether you're only going on the comment stream, but I had also changed the text to give a nod to a trade-off with delay. I didn't really want to because it muddies the rationale for choosing a particular percentile. But I think it's OK (and you'd already said "the text might be sufficient" earlier). I think we should now both understand each other's positions, so we can now just focus on the text - if it's not acceptable, pls say.
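The "slope of the delay-percentile CDF in ms per decade of loss" used in the argument above can be read off a delay trace as follows. The exponential tail below is synthetic, chosen purely to illustrate the calculation; it is not a measurement of PIE, FQ-CoDel or L4S.

```python
import numpy as np

def ms_per_decade_of_loss(delays_ms, p_lo=99.0, p_hi=99.9):
    """Slope of the delay-percentile curve, in ms per decade of
    discard probability. Between P99 and P99.9 the late-discard
    fraction falls from 1% to 0.1%, i.e. exactly one decade."""
    d_lo, d_hi = np.percentile(delays_ms, [p_lo, p_hi])
    decades = np.log10((100.0 - p_lo) / (100.0 - p_hi))
    return (d_hi - d_lo) / decades

# Synthetic trace: 5 ms base delay plus an exponential queueing tail.
rng = np.random.default_rng(3)
delays = 5.0 + rng.exponential(scale=2.0, size=1_000_000)
print(f"{ms_per_decade_of_loss(delays):.1f} ms per decade of loss")
```

For an exponential tail with scale s ms, the slope works out to s * ln(10), roughly 2.3s ms per decade, so the steeper the tail, the more the two-variable optimization collapses into simply satisfying the loss constraint.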
> ## Don't we need two metrics?
>
> In systems that aim for a certain delay, it has been common to quote mean delay and jitter. The distribution of delay is usually asymmetric, mostly clustered around the lower end, but with a long tail of higher delays. Many jitter metrics are insensitive to the shape of this tail, because they are dominated by the *average* variability in the bulk of the traffic around the mean. [FYI, in our SCTE paper we quoted 9 different definitions of jitter in common usage in the industry] However, it doesn't matter how little or how much variability there is in all the traffic that arrives before the play-out time. It only matters how much traffic arrives too late. The size of all the lower-than-average delay should not be allowed to counterbalance a long tail of above-average delay.
I've now read the beginning and middle of the SCTE paper. I think we can say "Most jitter metrics..." because all but one of those in the paper are insensitive to the shape of the tail.
I'd like to cite the SCTE paper here. Agree?
And I also intend to cite RFC 5481 (which I wasn't previously aware of) in a paragraph added to the introduction that gives some potential uses for the metric under discussion (based on the list in §3 of RFC 5481).
Yes, ok to cite the SCTE paper.
I guess I exaggerated when I said there were 9 different metrics :), but in any case there are 5 different ones.
Not sure I agree that most (or many) of the metrics "are dominated by the average variability in the bulk of the traffic around the mean." Two of them probably are (the IPDV metric and the std.dev(PDV) metric), but the other three (P99 PDV, max(PDV), RMS(PDV)) are not. My guess is that std.dev or IPDV are used more commonly than the others, so perhaps we could say that in predominant usage (or for "the most commonly used metrics"), your statement is true.
FYI, this is sort of a pet peeve of mine. We ran one sequence of packet latency measurements from an ns3 application scenario through all 5 of those jitter definitions, and got results that ranged from 0.06ms with one metric to 142.6ms with another. So, I would rather say something that equates to "jitter is largely meaningless, unless a precise definition is given".
Surely RMS(PDV) "measures the average variability in the bulk of traffic", much like std.dev(PDV). So 3 out of 5 measure average variability. Of the other 2, MLab NDT uses max(PDV), which is "obviously" unpredictable (I guess it gave your 142.6ms result). And Sam Knows uses P99, 'cos obviously Sam Knows best :)
So strictly, 'most' is not incorrect, and I wouldn't describe 3 as 'many'. I would rather not say "most commonly used", because SamKnows is pretty widespread.
Also I would rather not say anything about how disparate the jitter metrics are, 'cos that's off-topic, even if it is your pet peeve.
So what are we going to say? We must ensure that no-one reviewing or reading it feels it is insulting their approach. I think we have given a technical enough argument against focusing on variability in the cluster. So pls suggest any changes, otherwise I will leave it as is.
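To make the disagreement between jitter definitions concrete, here is a sketch that applies five definitions of the kinds discussed above to a single delay series. PDV is taken as delay minus minimum observed delay (RFC 5481 style) and IPDV as the difference between consecutive delays; these are plausible readings of the metric names, not the exact formulas from the SCTE paper.

```python
import numpy as np

def jitter_metrics(delays_ms):
    """Five 'jitter' definitions applied to one packet-delay series.
    PDV = delay minus minimum observed delay; IPDV = delay difference
    between consecutive packets. Assumed interpretations, not the
    SCTE paper's exact formulas."""
    d = np.asarray(delays_ms, dtype=float)
    pdv = d - d.min()
    ipdv = np.diff(d)
    return {
        "mean |IPDV|":  np.mean(np.abs(ipdv)),
        "std.dev(PDV)": np.std(pdv),
        "P99(PDV)":     np.percentile(pdv, 99),
        "max(PDV)":     pdv.max(),
        "RMS(PDV)":     np.sqrt(np.mean(pdv ** 2)),
    }

# Bulk of packets near 20 ms plus sparse 150 ms spikes (0.5% of
# packets): the five definitions disagree by an order of magnitude.
rng = np.random.default_rng(7)
d = rng.normal(20.0, 0.5, 10_000)
d[rng.choice(10_000, 50, replace=False)] += 150.0
for name, val in jitter_metrics(d).items():
    print(f"{name:>13}: {val:7.2f} ms")
```

On this synthetic series the IPDV-style and P99 figures sit in the low single-digit milliseconds while max(PDV) exceeds 100 ms, echoing the wide spread reported above for the ns3 scenario.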
> * The degree of late packet discard that can be efficiently concealed by real-time media coding (both that which is typical today and that which could be typical in future). [Again, I think there are two dimensions to this.]
> * The lag before results can be produced.
>
> For instance, to measure 99th percentile delay requires of the order of 1,000 packets minimum (an order of magnitude greater than 1/(1 - 0.99) = 100) [I think I've seen others use this rule-of-thumb as well, but I don't think it has any basis. The number of samples needed is highly dependent on the shape of the distribution]. In contrast 99.999th percentile requires 1,000,000 packets. At a packet rate of say 10k packet/s [many real-time applications send more like 30-100 pps, would it be better to use a number in that ballpark?], they would take respectively 100 ms or 100 s.
Rule of thumb: If one is solely measuring a single percentile by sorting the samples without binning, I think one only needs to choose a number of samples that will ensure there is a 'large' number of samples on the tail side of the percentile. One order of magnitude makes that 'large' number about 10, which I admit might not be enough. Tomorrow, I intend to take the data used for the CCDF in Olga's paper and see where the regularity of the plot starts to break down, which will give an idea of what 'large' needs to mean.
Whatever the case, I don't think the answer depends on the shape of the tail, does it?
pps question: The lowest rate you suggest, combined with the order-of-magnitude rule of thumb, would translate to a lag of about 30 s before outputting the first P99 metric. I suggest I use the 10 kpps example, then say how it also needs to work within a practical duration in the lower pps range you mention.
Agreed on the PPS question.
On the percentile estimation rule of thumb, I'll send you via Slack the slides summarizing my conclusions on that when I looked into it last fall.
Thanks. Read and understood. Is the text now acceptable (if I don't get round to doing anything more accurate before submitting)?
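The arithmetic behind the quoted rule of thumb (an order of magnitude more samples than 1/(1 - P), plus the resulting measurement lag at a given packet rate) can be sketched as below. The 50 pps figure is an assumed rate for a real-time flow in the 30-100 pps range mentioned in the review comment.

```python
# Back-of-envelope arithmetic for the rule of thumb quoted above:
# take 'margin' times more samples than 1/(1 - P), so roughly
# 'margin' samples land beyond the target percentile.

def min_samples(percentile, margin=10):
    """Rule-of-thumb sample count: margin * 1/(1 - p)."""
    p = percentile / 100.0
    return int(round(margin / (1.0 - p)))

def lag_seconds(percentile, pps, margin=10):
    """Time to gather that many packets at a given packet rate."""
    return min_samples(percentile, margin) / pps

for pct in (99.0, 99.999):
    for pps in (10_000, 50):   # a 10 kpps probe vs an assumed 50 pps RTC flow
        print(f"P{pct}: {min_samples(pct):>9,} pkts -> "
              f"{lag_seconds(pct, pps):,.1f} s at {pps} pps")
```

This reproduces the 100 ms and 100 s figures from the quoted text at 10 kpps, and gives a 20 s lag for a first P99 estimate at 50 pps, consistent with the roughly 30 s lag mentioned above for the lowest suggested rate.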
> ## The 'Benchmark Effect'
>
> As explained in the introduction, defining a delay metric is not just about choosing a percentile. The layer to measure and the background traffic pattern to use also have to be defined. As soon as these have been settled on, researchers, product engineers, etc. tend to optimize around this set of conditions---the so-called 'benchmark effect'. It is possible that harmonizing around one choice of percentile will lead to a benchmark effect. However, a percentile metric seems robust against such perverse incentives, because it seems hard to contrive performance results that fall off a cliff just beyond a certain percentile. [This is a little weak. An intermittent outage on a link that causes packets to queue up would cause delay spikes that might only show up in percentiles greater than P99. These outages would be ignored if P99 were the metric.] Nonetheless, even if there were a benchmark effect, it would be harmless if the percentile chosen for the benchmark realistically reflected the needs of most applications. {ToDo: better wording for last sentence.}
Whether low likelihood effects might be concealed by choosing P99 as the metric is surely a different question from whether people can contrive to conceal poor performance when they have to report against a benchmark test.
Nonetheless, I agree that this argument is weak. Any ideas on improving it? I could delete the section without harming the paper.
I guess the contrivance would involve ensuring that the glitch affected less than 1% of packets, with no concern for how badly those <1% were affected, e.g. having a timer that reboots the device every so often to avoid the effects of a memory leak showing up in the P99 score.
I wonder whether we need a strong argument that P99 is more robust against the benchmark effect than other metrics would be. It doesn't seem to me that it is any more or any less robust. The final point you make is the more important one. To date, the de facto single metric that gets quoted is average latency, which we argue is pretty worthless in predicting QoE. So the industry has been suffering terribly with the benchmark effect of using average latency for years (and people ignored bufferbloat, etc.).