additional comments and edits #2
@@ -4,15 +4,15 @@

## Introduction

- A real-time application invariably discards any packet that arrives after the jitter buffer's play-out deadline. This applies to any real-time application, whether streamed video, interactive media or online gaming. For this broad class of applications a median delay metric is a distraction---waiting for the median delay before play-out would discard half the packets. To characterize the delay experienced by an application it would be more useful to quote the delay of a high percentile of packets. But which percentile? The 99th? The 99.99th? The 98th? The answer is application- and implementation-dependent, because it depends on how much discard can effectively be concealed (1%, 0.01% or 2% in the above examples, assuming no other losses). Nonetheless, it would be useful to settle on a single industry-wide percentile to characterize delay, even if it isn't perfect in every case.
+ A real-time application invariably discards any packet that arrives after the jitter buffer's play-out deadline. This applies to any real-time application, whether streamed video, interactive media or online gaming. [Not all games use jitter buffers] For this broad class of applications a median delay metric is a distraction---waiting for the median delay before play-out would discard half the packets. To characterize the delay experienced by an application it would be more useful to quote the delay of a high percentile of packets. But which percentile? The 99th? The 99.99th? The 98th? The answer is application- and implementation-dependent, because it depends on how much discard can effectively be concealed (1%, 0.01% or 2% in the above examples, assuming no other losses). [This text might be sufficient, but jitter buffers are usually adaptive, and my understanding is that they aim to maximize QoE. At any particular operating point they convert variable delay into a fixed delay and residual loss. Since QoE typically depends on both of those resulting attributes, I don't think that they generally ignore the former.] Nonetheless, it would be useful to settle on a single industry-wide percentile to characterize delay, even if it isn't perfect in every case.

This brief discussion paper aims to start a debate on whether a percentile is the best single delay metric and, if so, which percentile the industry should converge on.

Note that the question addressed here is how to characterize a varying delay metric. That is orthogonal to the questions of what delay to measure and where to measure it. For instance, whether delay is measured in the application, at the transport layer or just at the bottleneck queue depends on the topic of interest. Similarly, for delay under load, the question of which background traffic pattern to use depends on the scenario of interest. This paper is solely about how to characterize delay variability most succinctly and usefully in *any* of these cases.
## Don't we need two metrics?

- In systems that aim for a certain delay, it has been common to quote mean delay and jitter. The distribution of delay is usually asymmetric, mostly clustered around the lower end, but with a long tail of higher delays. A traditional jitter metric is insensitive to the shape of this tail, because it is dominated by the *average* variability in the bulk of the traffic around the mean. However, it doesn't matter how little or how much variability there is in all the traffic that arrives before the play-out time. It only matters how much traffic arrives too late. The size of all the lower-than-average delay should not be allowed to counterbalance a long tail of above-average delay.
+ In systems that aim for a certain delay, it has been common to quote mean delay and jitter. The distribution of delay is usually asymmetric, mostly clustered around the lower end, but with a long tail of higher delays. Many jitter metrics are insensitive to the shape of this tail, because they are dominated by the *average* variability in the bulk of the traffic around the mean. [FYI, in our SCTE paper we quoted 9 different definitions of jitter in common usage in the industry] However, it doesn't matter how little or how much variability there is in all the traffic that arrives before the play-out time. It only matters how much traffic arrives too late. The size of all the lower-than-average delay should not be allowed to counterbalance a long tail of above-average delay.
Owner
I've now read the beginning and middle of the SCTE paper. I think we can say "Most jitter metrics..." because all but one of those in the paper are insensitive to the shape of the tail. I'd like to cite the SCTE paper here. Agree? And I also intend to cite RFC5481 (which I wasn't previously aware of) in a para added to the introduction that gives some potential uses for the metric under discussion (based on the list in §3 of RFC5481).
Collaborator (Author)
Yes, ok to cite the SCTE paper. I guess I exaggerated when I said there were 9 different metrics :), but in any case there are 5 different ones. Not sure I agree that most (or many) of the metrics "are dominated by the average variability in the bulk of the traffic around the mean." Two of them probably are (the IPDV metric and the std.dev(PDV) metric), but the other three (P99 PDV, max(PDV), RMS(PDV)) are not. My guess is that std.dev or IPDV are used more commonly than the others, so perhaps we could say that in predominant usage (or for "the most commonly used metrics"), your statement is true.

FYI, this is sort of a pet peeve of mine. We ran one sequence of packet latency measurements from an ns3 application scenario through all 5 of those jitter definitions, and got results that ranged from 0.06 ms with one metric to 142.6 ms with another. So, I would rather say something that equates to "jitter is largely meaningless, unless a precise definition is given".
Owner
Surely RMS(PDV) "measures the average variability in the bulk of traffic", much like std.dev(PDV). So 3 out of 5 measure average variability. Of the other 2, MLab NDT uses max(PDV), which is "obviously" unpredictable (I guess it gave your 142.6 ms result). And Sam Knows uses P99, 'cos obviously Sam Knows best :) So strictly, 'most' is not incorrect, and I wouldn't describe 3 as 'many'. I would rather not say "most commonly used", because SamKnows is pretty widespread. Also I would rather not say anything about how disparate the jitter metrics are, 'cos that's off-topic, even if it is your pet peeve. So what are we going to say? We must ensure that no-one reviewing or reading it feels it is insulting their approach. I think we have given a technical enough argument against focusing on variability in the cluster. So pls suggest any changes, otherwise I will leave it as is.
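As an aside, the disparity between jitter definitions debated in this thread is easy to reproduce. Below is a rough sketch with an invented synthetic delay trace (not the ns3 data mentioned above); the five definitions are our reading of the names used in the thread, with PDV taken as delay minus minimum delay in the spirit of RFC 5481:

```python
import math
import random

random.seed(42)
# Synthetic one-way delays (ms): a tight cluster near 20 ms plus a ~1% heavy tail.
delays = [random.gauss(20, 1) for _ in range(9900)]
delays += [20 + random.expovariate(1 / 50) for _ in range(100)]
random.shuffle(delays)  # IPDV depends on packet order, so interleave the tail

base = min(delays)
pdv = [d - base for d in delays]                         # PDV: delay minus minimum delay
ipdv = [abs(b - a) for a, b in zip(delays, delays[1:])]  # IPDV: consecutive differences

def pctl(xs, p):
    """Nearest-rank percentile of a list (no interpolation)."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

mean_pdv = sum(pdv) / len(pdv)
jitter = {
    "std.dev(PDV)": math.sqrt(sum((x - mean_pdv) ** 2 for x in pdv) / len(pdv)),
    "mean(IPDV)":   sum(ipdv) / len(ipdv),
    "P99(PDV)":     pctl(pdv, 99),
    "max(PDV)":     max(pdv),
    "RMS(PDV)":     math.sqrt(sum(x * x for x in pdv) / len(pdv)),
}
for name, value in jitter.items():
    print(f"{name:13s} {value:8.2f} ms")
```

Even on this toy trace, the first two metrics track the average variability of the cluster while max(PDV) is set entirely by the single worst tail packet, so the five numbers differ by orders of magnitude.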
The argument for a single percentile delay metric is strongest for real-time applications, including real-time media [[Bouch00](#Bouch00)], [[Yim11](#Yim11)] and online games. But a delay metric is also important for non-real-time applications, e.g. web and transactional traffic more generally (e.g. RPC). Here, average delay is indeed important. But still, the user's perception is dominated by the small proportion of longer delays [[Wilson11](#Wilson11)].
|
|
@@ -24,9 +24,9 @@ Arguments can be made for more than one delay metric to better characterize the

The factors that influence the choice of percentile are:

- * The degree of late packet discard that can be efficiently concealed by real-time media coding (both that which is typical today and that which could be typical in future).
+ * The degree of late packet discard that can be efficiently concealed by real-time media coding (both that which is typical today and that which could be typical in future). [Again, I think there are two dimensions to this.]
* The lag before results can be produced.
-   For instance, to measure 99th percentile delay requires of the order of 1,000 packets minimum (an order of magnitude greater than 1/(1 - 0.99) = 100). In contrast 99.999th percentile requires 1,000,000 packets. At a packet rate of say 10k packet/s, they would take respectively 100 ms or 100 s.
+   For instance, to measure 99th percentile delay requires of the order of 1,000 packets minimum (an order of magnitude greater than 1/(1 - 0.99) = 100) [I think I've seen others use this rule-of-thumb as well, but I don't think it has any basis. The number of samples needed is highly dependent on the shape of the distribution]. In contrast 99.999th percentile requires 1,000,000 packets. At a packet rate of say 10k packet/s [many real-time applications send more like 30-100 pps, would it be better to use a number in that ballpark?], they would take respectively 100 ms or 100 s.
|
Owner
Rule of thumb: If one is solely measuring a single percentile by sorting the samples without binning, I think one only needs to choose a number of samples that will ensure there is a 'large' number of samples on the tail side of the percentile. One order of magnitude makes that 'large' number about 10, which I admit might not be enough. Tomorrow, I intend to take the data used for the CCDF in Olga's paper and see where the regularity of the plot starts to break down, which will give an idea of what 'large' needs to mean.

pps question: The lowest rate you suggest with the order of mag rule of thumb would translate to a lag of 30 s before outputting the first P99 metric. I suggest I use the 10kpps example, then say how it also needs to work within a practical duration in the lower pps range you mention.
Collaborator (Author)
Agreed on the PPS question. On the percentile estimation rule of thumb, I'll send you via Slack the slides summarizing my conclusions on that when I looked into it last fall.
Owner
Thanks. Read and understood. Is the text now acceptable (if I don't get round to doing anything more accurate before submitting)?
* The latter would be impractical to display live, while the former makes it possible to display the 99th percentile nearly immediately after a flow has started (see for instance the open source GUI we provide to display the distribution of delays for comparison between congestion controllers and AQMs [[Bond16](#Bond16)], [[l4sdemo](#l4sdemo)]).
* Similarly, for research or product testing where a large matrix of tests has to be conducted, it would be far more practical if each test run took 100 ms, rather than 100 s.
* To calculate a high percentile requires a significant number of bins to hold the data. This can make a high percentile prohibitively expensive to maintain, e.g. on cost-reduced consumer-grade network equipment.
|
|
@@ -35,18 +35,18 @@ As a strawman, we propose **the 99th percentile (P99)** as a lowest common denom

## The 'Benchmark Effect'

- As explained in the introduction, defining a delay metric is not just about choosing a percentile. The layer to measure and the background traffic pattern to use also have to be defined. As soon as these have been settled on, researchers, product engineers, etc. tend to optimize around this set of conditions---the so-called 'benchmark effect'. It is possible that harmonizing around one choice of percentile will lead to a benchmark effect. However, a percentile metric seems robust against such perverse incentives, because it seems hard to contrive performance results that fall off a cliff just beyond a certain percentile. Nonetheless, even if there were a benchmark effect, it would be harmless if the percentile chosen for the benchmark realistically reflected the needs of most applications. {ToDo: better wording for last sentence.}
+ As explained in the introduction, defining a delay metric is not just about choosing a percentile. The layer to measure and the background traffic pattern to use also have to be defined. As soon as these have been settled on, researchers, product engineers, etc. tend to optimize around this set of conditions---the so-called 'benchmark effect'. It is possible that harmonizing around one choice of percentile will lead to a benchmark effect. However, a percentile metric seems robust against such perverse incentives, because it seems hard to contrive performance results that fall off a cliff just beyond a certain percentile. [This is a little weak. An intermittent outage on a link that causes packets to queue up would cause delay spikes that might only show up in percentiles greater than P99. These outages would be ignored if P99 were the metric.] Nonetheless, even if there were a benchmark effect, it would be harmless if the percentile chosen for the benchmark realistically reflected the needs of most applications. {ToDo: better wording for last sentence.}
|
Owner
Whether low likelihood effects might be concealed by choosing P99 as the metric is surely a different question from whether people can contrive to conceal poor performance when they have to report against a benchmark test. Nonetheless, I agree that this argument is weak. Any ideas on improving it? I could delete the section without harming the paper.
Collaborator (Author)
I guess the contrivance would involve ensuring that the glitch affected less than 1% of packets, with no concern for how badly those <1% were affected, e.g. having a timer that reboots the device every so often to avoid the effects of a memory leak showing up in the P99 score. I wonder whether we need a strong argument that P99 is more robust against the benchmark effect than other metrics would be. It doesn't seem to me that it is any more or any less robust.

The final point you make is the more important one. To date, the de facto single metric that gets quoted is average latency, which we argue is pretty worthless in predicting QoE. So the industry has been suffering terribly with the benchmark effect of using average latency for years (and people ignored bufferbloat, etc.).
## How to articulate a percentile to the public?

Delay is not an easy metric for public consumption, because it exhibits the following undesirable features:

* Larger is not better.
-   * It might be possible to invert the metric [[RPM21](#RPM21)], but rounds per minute carries an implication that it is only for repetitive tasks, which would limit the scope of the metric
+   * It might be possible to invert the metric [[RPM21](#RPM21)], but rounds per minute carries an implication that it is only for repetitive tasks for which average delay is important, which would limit the scope of the metric
* It is measured in time units (ms) that seem too small to matter, and which are not common currency for a lay person
  * This might also be addressed by inverting the metric
|
|
- A delay percentile is expressed as a delay, so it shares the same failings, and the same potential mitigations. But a percentile carries additional baggage as well:
+ A delay percentile is expressed as a delay, so it shares the same failings, and the same potential mitigation [huh? it's hard for me to see how one would invert a P99 metric in a useful way]. But a percentile carries additional baggage as well:
|
I forgot to mention this in my email, but I agree that inverting the metric is probably not the direction.
Collaborator (Author)
That is a great comment! I agree. Anyone with a 3rd grade education understands that for some metrics (like cost, for example), lower is better. Let's strike any concern about latency being a 'smaller is better' metric. The other concern (units are really small) is a true concern and could be addressed more directly.
Owner
I'll certainly remove the stuff about inverting the metric, and I'll merge the discussion about "higher is better" into the bullet about the units seeming too small to matter.
* It's not immediately obvious why the particular percentile has been chosen.
* It would need some indication that it was an industry-standard metric, perhaps IETF-P99.
|
|
[Not all games use jitter buffers]
As I said, I didn't want to mention implementation specifics like jitter buffers, but I was trying to accommodate the parenthesis about jitter buffers that you had added. I'm going to remove the mention of a jitter buffer - these sentences are getting too complicated for the start of a paper.
[...Since QoE typically depends on both {delay and loss}, I don't think that they generally ignore the former.]
No, of course QoE depends on delay and loss, but that's not the point. The point is that the implementation has to ensure a certain percentage of packets arrive within the deadline, otherwise the application doesn't function at all. It does not pick a level of delay out of the air that must be achieved whatever. Consider both extremes:
IOW, discard has to be lower than a specific level otherwise the application doesn't function; then delay just has to be as low as possible having satisfied the loss constraint.
FWIW I think you are either misunderstanding my comment or we have a very different view as to how this works, because what I wrote is exactly the point. I don't agree with your statement that an implementation has to ensure a certain percentage of packets arrive within the deadline. In this context, I think QoE is a function of both factors, e.g.
QoE = Qmax - k1*latency - k2*log(loss)

(This is a simplification of the Cole & Rosenbluth formula for VoIP.) The network provides a certain delay distribution (likely isn't stationary, but let's assume it is).
The job of the adaptive jitter buffer is to tune to the operating point that produces the fixed latency and residual loss that maximizes the QoE function.
Consider the case where the network provides a P95 latency of 50ms and a P96 latency of 200ms. In that case, the optimal point might be 50ms and 5% loss.
Or, if the network provides P99 latency of 60ms and P99.999 latency of 62ms, the optimal point might be 62ms and 0.00001% loss.
So, even for a specific application, you can't simply say that the metric should match the residual loss target of its jitter buffer, because I don't think there is a set loss target. But, I still think that something like a P99 metric is useful.
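The operating-point optimization described in this comment can be sketched numerically. The QoE model is the simplified form quoted above; the coefficients, the synthetic delay distribution and the loss floor are all invented for illustration, not taken from any real jitter buffer:

```python
import math
import random

random.seed(1)
# Synthetic one-way delays (ms): bulk near 40 ms, roughly 3% heavy tail.
delays = sorted(random.gauss(40, 3) if random.random() < 0.97
                else 40 + random.expovariate(1 / 60)
                for _ in range(5000))

def qoe(latency_ms, loss_frac, qmax=100.0, k1=0.1, k2=5.0):
    """QoE = Qmax - k1*latency - k2*log(loss), the simplified
    Cole & Rosenbluth-style form quoted above; coefficients illustrative."""
    loss = max(loss_frac, 1e-7)  # floor avoids log(0) at zero residual loss
    return qmax - k1 * latency_ms - k2 * math.log10(loss)

# Sweep candidate play-out deadlines; each one converts the delay
# distribution into a fixed latency plus a residual loss fraction.
best_score, best_deadline, best_loss = -math.inf, None, None
for i, deadline in enumerate(delays):
    residual_loss = 1 - (i + 1) / len(delays)  # fraction arriving too late
    score = qoe(deadline, residual_loss)
    if score > best_score:
        best_score, best_deadline, best_loss = score, deadline, residual_loss

print(f"optimal deadline ≈ {best_deadline:.1f} ms, "
      f"residual loss ≈ {best_loss:.4%}, QoE ≈ {best_score:.1f}")
```

Under this toy model the chosen operating point moves with the shape of the tail, which is the point made above: whether the buffer stops at, say, P95 or P99.999 depends on how steeply delay grows with percentile, not on a fixed loss target.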
Yes, in theory there is an optimization. But surely the slope of the delay CDF is so steep that it's hardly a 2-variable optimization. For instance, with 2nd gen AQMs (PIE, FQ-CoDel) the slope of the delay-percentile CDF is 5-10 ms / decade of loss, and with 3rd gen (L4S) it's <1ms / decade. So surely, pragmatically, it's solely about increasing the de-jitter buffer enough to reach a discard percentage that is low enough to be easily concealable... whatever delay that takes.
I don't know whether you're only going on the comment stream, but I had also changed the text to give a nod to a trade-off with delay. I didn't really want to because it muddies the rationale for choosing a particular percentile. But I think it's OK (and you'd already said "the text might be sufficient" earlier). I think we should now both understand each other's positions, so we can now just focus on the text - if it's not acceptable, pls say.