Use tower trace layer #698
base: next
Conversation
crates/block-producer/src/server.rs (Outdated)
    )
    .on_failure(
        |_error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| todo!(),
    );
@Mirko-von-Leipzig I agree that this is probably the correct abstraction to be using here (as you mentioned in this comment).
Any suggestions on how to test this locally? TY
It's definitely a bit awkward to test. There is the text exporter which at least lets you debug/inspect the spans manually.
If we do want to unit/integration test these things, we'll probably have to write our own exporter - e.g. something to aggregate the data which we can then assert against. I'm unsure why this isn't already available tbh.
What I've been doing in the meantime is using https://www.honeycomb.io/ with a free account to test things out. There's some setup info in the wip guide here.
I essentially start a local node, configure otel to use the honeycomb endpoint and then generate traffic by running the miden-client integration tests against my local node.
This might be a pita though - let me know if you run into issues, or if you can think of something smarter :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here we go: https://docs.rs/opentelemetry_sdk/latest/opentelemetry_sdk/testing/trace/index.html
They do actually have test infra :)
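For reference, a minimal sketch of what using that test infra could look like (assuming the opentelemetry_sdk testing feature and the InMemorySpanExporter type from the linked module; illustrative only, not code from this PR):

use opentelemetry::trace::{Tracer, TracerProvider as _};
use opentelemetry_sdk::{testing::trace::InMemorySpanExporter, trace::TracerProvider};

#[test]
fn spans_are_captured_in_memory() {
    // Collect finished spans in memory instead of exporting them anywhere.
    let exporter = InMemorySpanExporter::default();
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter.clone())
        .build();

    let tracer = provider.tracer("test");
    tracer.in_span("example-span", |_cx| {
        // code under test would run here
    });

    let spans = exporter.get_finished_spans().unwrap();
    assert_eq!(spans.len(), 1);
    assert_eq!(spans[0].name, "example-span");
}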
Thanks, just getting back to this now. Will have a crack at integrating those trace test capabilities in the unit tests.
> unit tests

Or rather integration tests, which would involve allowing the node to be configured with different SpanExporters so that the test one can be used.
Yeah, I imagine this may take some more consideration, given that the exporters are (always?) global.
Might be good to simply explore the options and then discuss what is actually testable without much fuss.
Integration tests are a sore point at the moment - and likely will get worse before getting better, e.g. this thread. It's not quite trivial to just spin up a node at the moment, but depending on what you discover here maybe that's something we should aim at.
@Mirko-von-Leipzig This is how I think the testing would have to work:
- Update a few functions to take in an impl SpanExporter.
- Update node and faucet main.rs to construct either a test SpanExporter or an actual one based on flags or toml.
- Implement tests that call the main functionality and assert on the SpanData coming out of the test SpanExporter. These would probably be the existing integration tests but I haven't had a look yet.
But I don't think it is worthwhile. I have seen Rust projects do similar things, where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).
I think it would become practical to just test trace output changes on the dev deployment/cluster once the trace scaffolding/impl stabilizes for the node, rather than testing every change locally or in CI.
I have connected to honeycomb successfully and eye-balled the info! output coming from my latest changes, e.g.:

2025-02-21T06:22:59.610587Z INFO block-producer.rpc/SubmitProvenTransaction: miden_node_block_producer::server: crates/block-producer/src/server.rs:218: request: POST /block_producer.Api/SubmitProvenTransaction {"te": "trailers", "content-type": "application/grpc", "traceparent": "00-20616281d08ea7970f0701e71dd2ef80-453f6f8340af0947-01", "tracestate": "", "user-agent": "tonic/0.12.3"}
Still getting my head around exactly what we want to achieve and whether the approach I've added so far fulfils that. The example above only has grpc-related headers, so I think I'll need to swap

let trace_layer = TraceLayer::new_for_grpc()

to

let trace_layer = TraceLayer::new_for_http()
> But I don't think it is worthwhile. I have seen Rust projects do similar things where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).

I agree; it isn't worth it, especially not at our current project maturity level. I think I was hoping we could do something like:
#[test]
async fn block_builder_trace() {
    let store = mock_store();
    let exporter = TestSpanExporter::new()...;
    exporter.register_for_this_test_only();

    BlockBuilder::build_block(...).await;

    assert_eq!(exporter.spans, expected_spans);
}
But I suspect it's not trivially possible. Though maybe there is a way to have the global exporter registered as TestSpanExporter and somehow associate the spans we get from this test only.
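One hedged idea for the "this test only" part (a sketch, not code from this PR, assuming tracing-opentelemetry and the opentelemetry_sdk testing feature): scope a subscriber with its own in-memory exporter to the body of the test via tracing::subscriber::with_default. This only helps where the node's spans are created through tracing rather than the global OTel tracer directly:

use opentelemetry::trace::TracerProvider as _;
use opentelemetry_sdk::{testing::trace::InMemorySpanExporter, trace::TracerProvider};
use tracing_subscriber::layer::SubscriberExt;

/// Runs `f` with a test-local subscriber and returns the spans it produced.
fn capture_spans(f: impl FnOnce()) -> Vec<opentelemetry_sdk::export::trace::SpanData> {
    let exporter = InMemorySpanExporter::default();
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter.clone())
        .build();
    let subscriber = tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(provider.tracer("test")));

    // Only spans created inside `f` (on this thread) reach this exporter,
    // so parallel tests don't see each other's spans.
    tracing::subscriber::with_default(subscriber, f);

    exporter.get_finished_spans().unwrap_or_default()
}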
> The example above only has grpc related headers so I think I'll need to swap

Yeah. I think the important things to get in are the server/client IP addresses, which will (probably) only be available for http I imagine? Long term we'll also want to add interesting headers and/or CORS information maybe, but I'm unsure.
The http fn only differs from the grpc one w.r.t. the error type handled by the error callback:

/// Create a new [`TraceLayer`] using [`ServerErrorsAsFailures`] which supports classifying
/// regular HTTP responses based on the status code.
pub fn new_for_http() -> Self {
    Self {
        make_classifier: SharedClassifier::new(ServerErrorsAsFailures::default()),
        // vs: make_classifier: SharedClassifier::new(GrpcErrorsAsFailures::default()),
Think the best we can do is uri.host. We won't be able to see the source/client IP unless it's put into a header by a proxy or some other part of this stack.
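Purely as an illustration of that proxy case (hypothetical, not something the node does today): if a reverse proxy injected the client address via the conventional X-Forwarded-For header, it could be pulled out of the http::Request like this:

/// Returns the left-most entry of X-Forwarded-For, i.e. the original client,
/// if a proxy in front of the server populated the header.
fn client_ip_from_headers<B>(request: &http::Request<B>) -> Option<&str> {
    request
        .headers()
        .get("x-forwarded-for")
        .and_then(|value| value.to_str().ok())
        // the header may contain a comma-separated proxy chain
        .and_then(|value| value.split(',').next())
        .map(str::trim)
}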
I suppose there is still value in the tower_http::trace::TraceLayer because we can register callbacks here:
    .on_request(|request: &http::Request<_>, _span: &tracing::Span| {
        tracing::info!(
            "request: {} {} {} {:?}",
            request.method(),
            request.uri().host().unwrap_or("NOHOST"),
            request.uri().path(),
            request.headers()
        );
    })
    .on_response(
        |response: &http::Response<_>, latency: Duration, _span: &tracing::Span| {
            tracing::info!("response: {} {:?}", response.status(), latency);
        },
    )
    .on_failure(|error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| {
        tracing::error!("error: {}", error);
    });
Here is the diff in error type for the on_failure callback:

/// The failure class for [`ServerErrorsAsFailures`].
#[derive(Debug)]
pub enum ServerErrorsFailureClass {
    /// A response was classified as a failure with the corresponding status.
    StatusCode(StatusCode),
    /// A response was classified as an error with the corresponding error description.
    Error(String),
}
...
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct StatusCode(NonZeroU16);
...

vs

/// The failure class for [`GrpcErrorsAsFailures`].
#[derive(Debug)]
pub enum GrpcFailureClass {
    /// A gRPC response was classified as a failure with the corresponding status.
    Code(std::num::NonZeroI32),
    /// A gRPC response was classified as an error with the corresponding error description.
    Error(String),
}
So might as well go with the http one I guess.
I'll look at the unit test approach you mentioned above and see if it's possible.
            .count(),
        1,
    );
}
@Mirko-von-Leipzig I added this unit test to illustrate the otel trace test capability. It doesn't relate to the functionality added in this PR. This was just the easiest unit test I could find to impl. The block producer rpc stack doesn't have mocks etc atm.
LMK if you want to keep/rm this test or move it to another PR with other tests etc.
Let's leave it for a separate PR.
Is it possible to separate the spans per test - or do we need to run these kinds of tests sequentially to ensure we get the spans recorded that we expect? As in, running tests in parallel probably muddles the span exporter unless we somehow mark the spans with a test ID?
But this looks promising; we can figure out the specifics in the other PR/issue.
    })
    .on_failure(|error: ServerErrorsFailureClass, latency: Duration, _span: &Span| {
        error!("error: {} {:?}", error, latency);
    });
@Mirko-von-Leipzig I have checked this locally and on honeycomb. LMK any changes you would make here.
Right now you're emitting events, but I'd like to attach whatever we can as attributes on the span.
For context, the open-telemetry specification has a bunch of (experimental) suggestions:
- https://opentelemetry.io/docs/specs/semconv/rpc/rpc-spans/#server-attributes
- https://opentelemetry.io/docs/specs/semconv/rpc/grpc/
We have access to only some of them here I think.
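To make this concrete, a hedged sketch of what setting some of those attributes on the span (rather than emitting events) could look like as a make_span_with callback. The function name and exact field set are illustrative, and only a subset of the semconv attributes is derivable from the http::Request:

use tracing::field::Empty;

/// Illustrative make_span_with callback: record rpc semconv attributes on the
/// request span instead of logging them as separate events.
fn make_grpc_span<B>(request: &http::Request<B>) -> tracing::Span {
    // A gRPC path looks like "/<package>.<Service>/<Method>".
    let path = request.uri().path();
    let mut parts = path.trim_start_matches('/').splitn(2, '/');
    let service = parts.next().unwrap_or_default();
    let method = parts.next().unwrap_or_default();

    tracing::info_span!(
        "rpc.request",
        otel.kind = "server",
        rpc.system = "grpc",
        rpc.service = service,
        rpc.method = method,
        // to be filled in later, e.g. from on_response / on_failure
        rpc.grpc.status_code = Empty
    )
}

Something of this shape could then be handed to .make_span_with(...) as in the snippets above, assuming the body type works out.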
I just can't find a way to surface lower-level details like client addr to the relevant stacks here.
The tonic::Request struct has ::remote_addr() -> Option<SocketAddr>, but we can't get that type wired through into our trace stack here AFAICT, only http::Request. And I can't find a way to get headers to contain the remote IP address via http::Request in any way.
Relevant links for reference:
- hyperium/tonic#430
- https://github.com/tower-rs/tower-http/blob/main/examples/tonic-key-value-store/src/main.rs#L191
I also don't see where those server attributes you linked above can be accessed.
Should be able to get the socket addr via this when building the tonic server:
https://docs.rs/tonic/0.2.1/tonic/transport/server/trait.Connected.html
But that again relies on tonic::Request:
https://github.com/hyperium/tonic/blob/master/examples/src/uds/server.rs#L32
Is it possible for us to alter our code gen to use tonic::Request rather than http::Request? E.g. this part:

impl<T, B> tonic::codegen::Service<http::Request<B>> for ApiServer<T>
where
    T: Api,
    B: Body + std::marker::Send + 'static,
    B::Error: Into<StdError> + std::marker::Send + 'static,
{
Wow this was crazy hard to find: tower-rs/tower-http#428.
Only by chance did I go check in the discussions - no other google/search-kungfu found it.
I believe this is what we need? It may also mean that we have to just do our own layer to extract some of the info before handling the routes/request/responses.
Ah, well done! This looks to have worked, e.g.:

    .on_request(|request: &http::Request<_>, _span: &Span| {
        info!(
            "request: {} {} {} {} {:?}",
            request
                .extensions()
                .get::<tonic::transport::server::TcpConnectInfo>() // as per axum example above
                .unwrap()
                .remote_addr()
                .unwrap(),
I'll put together all the changes from your comments and put this in review next.
let trace_layer = TraceLayer::new_for_grpc()
    .make_span_with(miden_node_utils::tracing::grpc::block_producer_trace_fn)
We can consider moving this entire layer into the utils crate since it's likely going to be identical.
Yeah, the reason I haven't done that is because I don't think we can avoid returning a type that involves like 7 generics or so (because the fn that uses it doesn't take in a trait, iirc). Will have a look at doing it as cleanly as possible.
Potentially the way to do it:
https://stackoverflow.com/questions/71178212/how-to-configure-tower-http-tracelayer-in-a-separate-function
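If we go that route, the gist of the StackOverflow approach is to name the callback types (e.g. as plain fn pointers) so the fully-configured layer can be written as a return type. A rough sketch, assuming the request body type seen by the layer can be named (here a hypothetical alias to tonic::body::BoxBody) and that only make_span_with is customised:

use tower_http::{
    classify::{GrpcErrorsAsFailures, SharedClassifier},
    trace::TraceLayer,
};
use tracing::Span;

// Hypothetical body alias; use whatever body type the node's server stack actually sees.
type BoxBody = tonic::body::BoxBody;

type MakeSpanFn = fn(&http::Request<BoxBody>) -> Span;

/// Builds the shared gRPC trace layer; the explicit fn-pointer type is what
/// lets us spell out the return type instead of an unnameable closure type.
pub fn grpc_trace_layer(
    make_span: MakeSpanFn,
) -> TraceLayer<SharedClassifier<GrpcErrorsAsFailures>, MakeSpanFn> {
    TraceLayer::new_for_grpc().make_span_with(make_span)
}

A caller would then pass a compatible fn (e.g. the block_producer_trace_fn above), assuming its signature coerces to the fn-pointer type.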
Oh damn, I see. Little corner cases everywhere.
Thanks, yeah I think let's go with the stackoverflow answer - and just document on the function why we're doing it this way.
Relates to #681.
WIP