
Metric.Timer has NaN for Mean #80

Open
hhansell opened this issue Jul 4, 2017 · 10 comments

hhansell commented Jul 4, 2017

After leaving a process running for some time, I've come across the following issue.
Could someone confirm whether this is a Metrics.NET issue or an issue for the Metrics.NET.InfluxDB project?

Thanks

    public static readonly Timer ResponseTime = Metric.Timer("Response Time", Unit.Requests);

    2017-07-05 05:45:05,226 [22] Metrics: Unhandled exception in Metrics.NET Library
    Error while uploading 51 measurements (5.72 KiB) to InfluxDB over HTTP
    [http://localhost:8070/write?db=metrics&u=user&p=password&precision=s]
    [ResponseStatus: ProtocolError]
    [Response: {"error":"partial write: unable to parse '
    Response Time Active Sessions=0i,
    Total Time=4i,
    Count=1i,
    Mean Rate=1.9858004319199677E-05,
    1 Min Rate=2.96439387504748E-314,
    5 Min Rate=5.0414622397472423E-76,
    15 Min Rate=5.9199171854273278E-28,
    Last=4.0835409999999994,
    Min=4.0835409999999994,
    Mean=NaN,
    Max=4.0835409999999994,
    StdDev=0,
    Median=4.0835409999999994,
    Sample Size=1i,
    Percentile 75%=4.0835409999999994,
    Percentile 95%=4.0835409999999994,
    Percentile 98%=4.0835409999999994,
    Percentile 99%=4.0835409999999994,
    Percentile 99.9%=4.0835409999999994 1499197505': invalid number"}]


sunkor commented Jul 20, 2017

The issue arises in Metrics.NET, in the ExponentiallyDecayingReservoir.Update method: itemWeight goes to infinity, causing NaN for the aggregates. The startTime field is initialized in the constructor and reset in the ResetReservoir method.

If an instance of ExponentiallyDecayingReservoir is alive for 47319 seconds (with the default alpha of 0.015), Math.Exp (in the Update method) returns double.PositiveInfinity, causing Mean to become NaN.

    var itemWeight = Math.Exp(this.alpha * (timestamp - this.startTime.GetValue()));
    var sample = new WeightedSample(value, userValue, itemWeight);
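The overflow is easy to reproduce with plain double arithmetic (a Java sketch here; the class name is illustrative, and IEEE-754 doubles behave the same as .NET's Math.Exp): with the default alpha of 0.015, the exponent crosses ln(double.MaxValue) ≈ 709.7827 exactly when the elapsed time reaches 47319 seconds.

```java
public class ItemWeightOverflow {
    public static void main(String[] args) {
        double alpha = 0.015; // default alpha in ExponentiallyDecayingReservoir

        // One second earlier, the weight is still a finite (huge) number:
        // 0.015 * 47318 = 709.77, just below ln(Double.MAX_VALUE).
        System.out.println(Math.exp(alpha * 47318)); // finite, ~1.78E308

        // At 47319 s, 0.015 * 47319 = 709.785 exceeds the threshold,
        // so Math.exp overflows to positive infinity.
        System.out.println(Math.exp(alpha * 47319)); // Infinity
    }
}
```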

The patch is to reset the reservoir if we hit infinity:

    var itemWeight = Math.Exp(this.alpha * (timestamp - this.startTime.GetValue()));
    if (double.IsInfinity(itemWeight))
    {
        ResetReservoir();
        itemWeight = Math.Exp(this.alpha * (timestamp - this.startTime.GetValue()));
    }

Patch available at https://github.com/sunkor/Metrics.NET

sunkor added a commit to sunkor/Metrics.NET that referenced this issue Jul 21, 2017
hhansell (Author) commented:

@PaulParau could this be merged into trunk?

PaulParau (Member) commented:

@sunkor, I understand what you're saying; however, the ExponentiallyDecayingReservoir gets rescaled every hour (see the ExponentiallyDecayingReservoir.Rescale method). startTime gets reset in that method, which ensures that the difference in seconds (timestamp - this.startTime.GetValue()) never exceeds 3600.

Under which conditions have you observed double.IsInfinity(itemWeight) in your solution to be true?


sunkor commented Aug 15, 2017

Hi Paul

I have left the organization where the dev team uses Metrics.Net, so unfortunately have to rely on my memory on how the issue gets triggered.

If you initialize a Timer (which internally uses an ExponentiallyDecayingReservoir) and keep it alive for 47319 seconds (just over 13 hours), Math.Exp returns double.PositiveInfinity.

So if we initialize ResponseTime and use it in a long-running process (a service, perhaps) to record metrics, we will run into the issue:

    public static readonly Timer ResponseTime = Metric.Timer("Response Time", Unit.Requests);

PaulParau (Member) commented:

I cannot reproduce the issue. The ExponentiallyDecayingReservoir.Rescale method gets triggered, as expected, 1 hour after reservoir creation, and startTime gets reset. Thus, the difference (timestamp - this.startTime.GetValue()) never grows larger than 3600, and Math.Exp(this.alpha * (timestamp - this.startTime.GetValue())) never exceeds Math.Exp(0.015 * 3600) = Math.Exp(54) ≈ 2.83 × 10^23, which is well beneath double.MaxValue.
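As a quick sanity check on that bound (a Java sketch; the class name is illustrative, and IEEE-754 doubles behave the same in Java and .NET), the largest weight a rescaled reservoir can produce is indeed finite:

```java
public class RescaleBound {
    public static void main(String[] args) {
        double alpha = 0.015;
        // With rescaling every hour, the maximum elapsed time is 3600 s,
        // so the largest possible item weight is exp(0.015 * 3600) = exp(54).
        double maxWeight = Math.exp(alpha * 3600);
        System.out.println(maxWeight); // ~2.83E23, far below Double.MAX_VALUE (~1.8E308)
    }
}
```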

I believe the issue @hhansell is experiencing is caused by something other than this.


hhansell commented Nov 3, 2017

@PaulParau, after logging out some values, I see that the NaN ultimately comes from a list in which every item has a weight of zero. Perhaps the Mean function could handle this case and return 0? That won't fix the bigger issue further down, though.

            var list = new SortedList<double, WeightedSample>(100, ReverseOrderDoubleComparer.Instance);
            list[double.Parse("1.48219693752374E-323")] = new WeightedSample(1957668, null, 0.0);
            var snapshot = new WeightedSnapshot(list.Count, list.Values);
            var mean = snapshot.Mean;
            Debug.Assert(!double.IsNaN(mean)); // fails: mean is NaN when all weights are zero

Having logged out the values, here is an example using real data (from a logged version of the Rescale method in ExponentiallyDecayingReservoir) that has led to a weight of zero:

            var alpha = 0.015;
            var sampleWeight = double.Parse("6.30093530948943E-302");
            var startTime = 294463;
            var oldStartTime = 290906;
            double weight = sampleWeight * Math.Exp(-alpha * (startTime - oldStartTime));

Ultimately, this comes down to sampleWeight * scalingFactor being too small to represent as a double: 6.73314423785015E-24 * 6.30093530948943E-302 == 0.0.
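The underflow can be checked directly with the logged values above (a Java sketch; the class name is illustrative, and the double arithmetic is identical in .NET): the product falls below the smallest subnormal double (~4.9 × 10^-324) and rounds to exactly zero.

```java
public class WeightUnderflow {
    public static void main(String[] args) {
        double alpha = 0.015;
        double sampleWeight = 6.30093530948943E-302;
        // Scaling factor for the logged 3557 s rescale window:
        // exp(-0.015 * (294463 - 290906)) ≈ 6.73E-24
        double scalingFactor = Math.exp(-alpha * (294463 - 290906));
        // The product is ~4.2E-325, below the smallest representable
        // subnormal double (~4.9E-324), so it rounds to exactly 0.0.
        double weight = sampleWeight * scalingFactor;
        System.out.println(weight == 0.0); // true
    }
}
```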

I presume the purpose of ResetReservoir is to prevent the weights from decaying to this point; however, I cannot see where it is ultimately called.


GitAtSmsPortal commented Nov 17, 2017

We also had found this issue within our application when using timers.

We had a metric that posted once on application startup, and we would get the exact same error logged about 13hrs later.

We found that the issue resides in the constructor of the WeightedSnapshot class, on the following line (line 52):

    this.normWeights[i] = sample[i].Weight / sumWeight;

As time moves on with no updates to the metric (in our case, no calls to NewContext()), the weight of each sample tends towards 0, so after a period of time sumWeight becomes zero and we have a divide-by-zero issue, causing the mean to become NaN. The reason no exception is thrown for the divide-by-zero case is that Weight is a double rather than an int: with our sample's weight also being 0, the division yields NaN instead of throwing.

This is the fix we recommend:

    this.normWeights[i] = sumWeight == 0 ? 0 : sample[i].Weight / sumWeight;
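The behaviour behind this fix is easy to confirm (a Java sketch; the class name is illustrative, and IEEE-754 floating-point division works the same way in .NET): 0.0 / 0.0 silently produces NaN rather than throwing, and the proposed guard maps that case to 0 instead.

```java
public class ZeroWeightNaN {
    public static void main(String[] args) {
        double weight = 0.0;    // sample weight decayed to zero
        double sumWeight = 0.0; // so the sum of weights is zero too

        // Floating-point 0.0 / 0.0 yields NaN; no exception is thrown,
        // unlike integer division by zero.
        double norm = weight / sumWeight;
        System.out.println(Double.isNaN(norm)); // true

        // The proposed guard returns 0 when sumWeight is zero.
        double guarded = (sumWeight == 0) ? 0 : weight / sumWeight;
        System.out.println(guarded); // 0.0
    }
}
```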

sdiakovskyi-gd commented:

I can confirm the existence of this issue, and the proposed patch has fixed it for our application.

@PaulParau, can it be merged and a new version published to NuGet?

PaulParau (Member) commented:

This fix does indeed prevent NaN values for the weights, and ultimately for the mean value of a histogram snapshot.

But this only fixes the symptoms, not the cause: the reservoir will essentially 'lose' values after ~13 hours of inactivity. I believe this issue should be fixed at the Rescale level, to prevent weights from becoming zero in the first place, but I can't tell right now what the implications of such a change would be.

So I'm going to make the change and publish a new package. Thanks @GitAtSmsPortal for the detailed analysis!

PaulParau (Member) commented:

I've fixed the issue and pushed a new package (v0.5.6-pre) to NuGet. Thanks for all the help, everyone!
