Releases: openzipkin/zipkin
Zipkin 1.15
Zipkin 1.15 completes the transition to support 128-bit trace IDs, notably considering high-resolution IDs when querying and grouping traces.
Regular zipkin usage is unimpacted, as this is all behind the scenes. However, the details below will be interesting to some, and are particularly of note during any transition from 64- to 128-bit trace IDs.
128-bit trace IDs
Zipkin supports 64 and 128-bit trace identifiers, typically serialized as 16 or 32 character hex strings. By default, spans reported to zipkin with the same trace ID will be considered in the same trace. For example, 463ac35c9f6413ad48485a3953bb6124 is a 128-bit trace ID, while 48485a3953bb6124 is a 64-bit one.
Note: Span (or parent) IDs within a trace are 64-bit regardless of the length or value of their trace ID.
Migrating from 64 to 128-bit trace IDs
Unless you only issue 128-bit traces once all applications support them, the process of updating applications from 64 to 128-bit trace IDs results in a mixed state. This mixed state is mitigated by the setting STRICT_TRACE_ID=false, explained below. Once a migration is complete, remove the setting STRICT_TRACE_ID=false or set it to true.
Here are a few trace IDs to help illustrate what happens under this setting.
- Trace ID A: 463ac35c9f6413ad48485a3953bb6124
- Trace ID B: 48485a3953bb6124
- Trace ID C: 463ac35c9f6413adf1a48a8cff464e0e
- Trace ID D: 463ac35c9f6413ad
In a 64-bit environment, trace IDs will look like B or D above. When an
application upgrades to 128-bit instrumentation and decides to create a
128-bit trace, its trace IDs will look like A or C above.
Applications that aren't yet 128-bit capable typically retain only the
right-most 16 characters of the trace ID. When this happens, the same
trace could be reported as trace ID A or trace ID B.
By default, Zipkin will think these are different trace IDs, as they are
different strings. During a transition from 64- to 128-bit trace IDs, spans
would appear split across two IDs. For example, a trace might start as trace
ID A, but the next hop might truncate it to trace ID B. This would render
the system unusable for applications performing upgrades.
One way to address this problem is to not use 128-bit trace IDs until all applications support them. This prevents a mixed scenario at the cost of coordination. Another way is to set STRICT_TRACE_ID=false.
When STRICT_TRACE_ID=false, only the right-most 16 characters of a 32-character trace ID are considered when grouping or retrieving traces. This setting should only be applied when transitioning from 64 to 128-bit trace IDs and removed once the transition is complete.
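To illustrate the effect (this is a sketch, not the server's implementation), grouping by the right-most 16 characters makes trace IDs A and B above resolve to the same trace:
// Sketch only: models how STRICT_TRACE_ID=false groups traces; this method is hypothetical.
static String groupingKey(String hexTraceId) {
  // keep only the right-most 16 hex characters (the low 64 bits)
  return hexTraceId.length() <= 16 ? hexTraceId : hexTraceId.substring(hexTraceId.length() - 16);
}
// groupingKey("463ac35c9f6413ad48485a3953bb6124") -> "48485a3953bb6124" (trace ID A)
// groupingKey("48485a3953bb6124")                 -> "48485a3953bb6124" (trace ID B)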
See openzipkin/b3-propagation#6 for the status
of known open source libraries on 128-bit trace identifiers.
Cassandra
There's no impact to the cassandra (Cassandra 2.x) schema. The experimental cassandra3 schema has changed and needs to be recreated.
Elasticsearch
When STRICT_TRACE_ID=false, the indexing template will be less efficient as it tokenizes trace IDs. Don't set STRICT_TRACE_ID=false unless you really need to.
MySQL
There are no schema changes since the last version, but you'll likely want to add indexes in consideration of 128-bit trace IDs.
ALTER TABLE zipkin_spans ADD INDEX(`trace_id_high`, `trace_id`, `id`);
ALTER TABLE zipkin_spans ADD INDEX(`trace_id_high`, `trace_id`);
ALTER TABLE zipkin_annotations ADD INDEX(`trace_id_high`, `trace_id`, `span_id`);
ALTER TABLE zipkin_annotations ADD INDEX(`trace_id_high`, `trace_id`);
Java Api
The STRICT_TRACE_ID variable above corresponds to zipkin.storage.StorageComponent.Builder.strictTraceId. Those using storage components directly will want to set this to false under similar circumstances to those described above.
We've added methods to SpanStore in support of high-resolution gets. Traces with 64-bit ids are retrieved by simply passing 0 as traceIdHigh.
@Nullable
List<Span> getTrace(long traceIdHigh, long traceIdLow);
@Nullable
List<Span> getRawTrace(long traceIdHigh, long traceIdLow);
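For example, a caller holding a hex trace ID can split it into high and low halves before calling getTrace. This is a minimal sketch; the hexToTraceId helper is hypothetical, not part of the library, and uses Java 8's Long.parseUnsignedLong (older JDKs can use BigInteger):
// Hypothetical helper: splits a 16- or 32-character lower-hex trace ID into
// the (traceIdHigh, traceIdLow) pair expected by the new SpanStore methods.
static long[] hexToTraceId(String hex) {
  long high = hex.length() == 32 ? Long.parseUnsignedLong(hex.substring(0, 16), 16) : 0L;
  long low = Long.parseUnsignedLong(hex.length() == 32 ? hex.substring(16) : hex, 16);
  return new long[] {high, low};
}

long[] id = hexToTraceId("463ac35c9f6413ad48485a3953bb6124");
List<Span> trace = spanStore.getTrace(id[0], id[1]); // for 64-bit IDs, traceIdHigh is 0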
Zipkin 1.14
Zipkin 1.14 introduces support for 128-bit trace identifiers.
Most zipkin sites store traces for a limited amount of time (like 2 days) and also trace a small percentage of operations (via sampling). For these reasons and also those of simplicity, 64-bit trace identifiers have been the norm since zipkin started over 4 years ago.
Starting with Zipkin 1.14, 128-bit trace identifiers are also supported. This can be useful in sites that have very large traffic volume, persist traces forever, or are re-using externally generated 128-bit IDs as trace IDs. You can also use 128-bit trace ids to interop with other 128-bit systems such as Google Stackdriver Trace. Note: span IDs within a trace are still 64-bit.
When 128-bit trace ids are propagated, they will be twice as long as before. For example, the X-B3-TraceId header will hold a 32-character value like 163ac35c9f6413ad48485a3953bb6124. Prior to Zipkin 1.14, we updated all major tracing libraries to silently truncate long trace ids to 64-bit. With the example noted, its 64-bit counterpart would be 48485a3953bb6124. For the foreseeable future, you will be able to look up a trace by either its 128-bit or 64-bit ID. This allows you to upgrade your instrumentation and environment in steps.
Should you want to use 128-bit tracing today, you'll need to update to latest Zipkin, and if using MySQL, issue the following DDL update:
ALTER TABLE zipkin_spans ADD `trace_id_high` BIGINT NOT NULL DEFAULT 0;
ALTER TABLE zipkin_annotations ADD `trace_id_high` BIGINT NOT NULL DEFAULT 0;
ALTER TABLE zipkin_spans
DROP INDEX trace_id,
ADD UNIQUE KEY(`trace_id_high`, `trace_id`, `id`) COMMENT 'ignore insert on duplicate';
ALTER TABLE zipkin_annotations
DROP INDEX trace_id,
ADD UNIQUE KEY(`trace_id_high`, `trace_id`, `span_id`, `a_key`, `a_timestamp`) COMMENT 'Ignore insert on duplicate';
Next, you'll need to use a library that supports generating 128-bit ids. The first two to support this are zipkin-go-opentracing v0.2 and Brave (java) v3.5. The supporting change in thrift is a new trace_id_high field.
If you have any further questions on this feature, reach out to us on gitter: https://gitter.im/openzipkin/zipkin
Zipkin 1.13
Zipkin 1.13 most notably refines our Elasticsearch code. It is now easier for us to tune as self-tracing is built-in.
For example, let's say I created a domain in Amazon's Elasticsearch service named 'zipkin'. As I'm doing testing, I'll run our Docker image and share my AWS credentials with it.
$ docker run -d -p 9411:9411 \
-e SELF_TRACING_ENABLED=true \
-e STORAGE_TYPE=elasticsearch -e ES_AWS_DOMAIN=zipkin \
-v $HOME/.aws:/root/.aws:ro \
openzipkin/zipkin
Once zipkin starts up, SELF_TRACING_ENABLED=true indicates that it should trace each api request. As I click in the UI, more traces appear under the service zipkin-server. Here's one which shows the overall latency of a request (from my laptop to amazon), for a zipkin trace search.
With tools like this, we can use Zipkin to improve zipkin.
The Elasticsearch experience was created by @anuraaga and extended to Amazon by @sethp-jive. The tracing functionality is thanks to our Brave OkHttp interceptor initially written by @tburch. Watch for more news as we head towards Elasticsearch 5 compatibility.
Zipkin 1.11
Zipkin 1.11 allows you to see instrumented clients in the dependency view. It also fixes a search collision problem.
Before, the dependency view (ex http://your_host:9411/dependency) presented a server-centric diagram. This worked well enough as traces usually start at the first server. Especially with new projects like zipkin-js, client-originated traces are becoming more common. For example, the trace could start in your web browser instead of on a server. Zipkin's dependency linker is now trained to look for client send annotations in the root span, and if present, add them to the far-left of the dependency graph. Thanks to @rogeralsing for reporting.
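Conceptually, the change means a client-originated root span can contribute a link from the client's service to the server's service. The sketch below is an illustration of that idea, not the actual dependency linker code, and the Link type is hypothetical:
// Illustration only: if the root span has a client send ("cs") annotation from one
// service and a server receive ("sr") from another, count a call between them so the
// client (ex. a web browser) appears at the far left of the graph.
static Link linkFromClientRoot(Span root) {
  String client = null, server = null;
  for (Annotation a : root.annotations) {
    if (a.endpoint == null) continue;
    if ("cs".equals(a.value)) client = a.endpoint.serviceName;
    else if ("sr".equals(a.value)) server = a.endpoint.serviceName;
  }
  if (client == null || server == null || client.equals(server)) return null;
  return new Link(client, server, 1); // ex. web-browser -> api
}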
We also fixed a search bug where a query like http.method=GET matched against any service in a trace as opposed to the service specified in the UI. This affected all storage types except cassandra and is now fixed.
Note: While seemingly simple, this smoked out a latent problem in our Elasticsearch indexing template. Please re-index at your earliest convenience, or drop the index and let Zipkin recreate it.
Zipkin 1.10
Zipkin 1.10 addresses a couple long-term problems relating to span timestamp and duration.
Firstly, we no longer attempt to support duration queries on the "cassandra" storage type. Cassandra 2.x doesn't support SASI indexing, and trying to work around that resulted in a feature most couldn't use. @michaelsembwever from The Last Pickle has a more sustainable solution in mind that uses Cassandra 3.8+. Please look for announcements on the experimental cassandra3 storage type.
Next is something that applies to all storage types. When trace instrumentation doesn't record Span.timestamp and duration, the Zipkin server tries to guess by looking at annotations. Previously, when we guessed wrong, the trace would render strangely. We now guess much more conservatively so as to avoid this; a sketch of the kind of derivation involved follows the list below.
Here's the impact:
- Span duration is no longer derived by collectors, as it is often wrong. Duration queries won't work unless traces reported to zipkin include duration.
- Span timestamp is derived only when needed, usually to support indexing.
- Span timestamp and duration are still backfilled at query time, as otherwise the UI wouldn't work.
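As a rough illustration of what "guessing conservatively" means (this is not the server's exact algorithm, just a sketch against the v1 Span/Annotation model), timestamp and duration are only derived when unambiguous client-side annotations exist:
// Sketch: derive timestamp/duration from "cs" (client send) and "cr" (client receive)
// annotations only when both are present; otherwise leave the span untouched.
static Span applyTimestampAndDuration(Span span) {
  if (span.timestamp != null && span.duration != null) return span; // already recorded
  Long cs = null, cr = null;
  for (Annotation a : span.annotations) {
    if ("cs".equals(a.value)) cs = a.timestamp;
    else if ("cr".equals(a.value)) cr = a.timestamp;
  }
  if (cs == null || cr == null) return span; // ambiguous: don't guess
  return span.toBuilder().timestamp(cs).duration(cr - cs).build();
}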
Note: The Span.timestamp and duration fields were added a year ago, but many tracers still don't record them. We hope our documentation on how to record timestamp and duration will help ease the task of updating them. If you use a tracer that doesn't yet record Span.timestamp and duration, please raise an issue or PR to the corresponding repository so that it is eventually fixed.
Zipkin 1.8
Zipkin 1.8 is a library change focused on encoding performance. If you are instrumenting apps and use Zipkin's Codec, you'll want to upgrade.
Span encoding has been completely rewritten in order to bring common-case overhead into the microsecond-or-less range.
Zipkin 1.7 Codec.writeSpan() vs libthrift (pace car)
CodecBenchmarks.writeClientSpan_json_zipkin avgt 15 17.131 ± 0.446 us/op
CodecBenchmarks.writeClientSpan_thrift_libthrift avgt 15 1.952 ± 0.043 us/op
CodecBenchmarks.writeClientSpan_thrift_zipkin avgt 15 0.996 ± 0.021 us/op
CodecBenchmarks.writeLocalSpan_json_zipkin avgt 15 10.124 ± 0.177 us/op
CodecBenchmarks.writeLocalSpan_thrift_libthrift avgt 15 1.168 ± 0.016 us/op
CodecBenchmarks.writeLocalSpan_thrift_zipkin avgt 15 0.593 ± 0.010 us/op
CodecBenchmarks.writeRpcSpan_json_zipkin avgt 15 43.495 ± 1.086 us/op
CodecBenchmarks.writeRpcSpan_thrift_libthrift avgt 15 4.878 ± 0.046 us/op
CodecBenchmarks.writeRpcSpan_thrift_zipkin avgt 15 2.666 ± 0.018 us/op
CodecBenchmarks.writeRpcV6Span_json_zipkin avgt 15 49.759 ± 0.867 us/op
CodecBenchmarks.writeRpcV6Span_thrift_libthrift avgt 15 5.390 ± 0.073 us/op
CodecBenchmarks.writeRpcV6Span_thrift_zipkin avgt 15 3.147 ± 0.026 us/op
Zipkin 1.8 Codec.writeSpan() vs libthrift (pace car)
CodecBenchmarks.writeClientSpan_json_zipkin avgt 15 1.445 ± 0.036 us/op
CodecBenchmarks.writeClientSpan_thrift_libthrift avgt 15 1.951 ± 0.014 us/op
CodecBenchmarks.writeClientSpan_thrift_zipkin avgt 15 0.433 ± 0.011 us/op
CodecBenchmarks.writeLocalSpan_json_zipkin avgt 15 0.813 ± 0.010 us/op
CodecBenchmarks.writeLocalSpan_thrift_libthrift avgt 15 1.191 ± 0.016 us/op
CodecBenchmarks.writeLocalSpan_thrift_zipkin avgt 15 0.268 ± 0.004 us/op
CodecBenchmarks.writeRpcSpan_json_zipkin avgt 15 3.606 ± 0.068 us/op
CodecBenchmarks.writeRpcSpan_thrift_libthrift avgt 15 5.134 ± 0.081 us/op
CodecBenchmarks.writeRpcSpan_thrift_zipkin avgt 15 1.384 ± 0.078 us/op
CodecBenchmarks.writeRpcV6Span_json_zipkin avgt 15 3.912 ± 0.115 us/op
CodecBenchmarks.writeRpcV6Span_thrift_libthrift avgt 15 5.488 ± 0.098 us/op
CodecBenchmarks.writeRpcV6Span_thrift_zipkin avgt 15 1.323 ± 0.014 us/op
Why encoding speed matters
Applications that report to Zipkin typically record timing information and metadata on the calling thread. After the operation completes, this is encoded into a Span and scheduled to go out of process, usually via http or Kafka. When the encoding overhead is measurable, it can confuse timing information, particularly when operations are in single-digit or less milliseconds.
For example, if a local operation takes 400us and your encoding overhead is 40us, there will be a 10% gap between the end of one span and the start of the next. This will notably skew the duration of the parent, particularly if there are a lot of spans like this. When encoding overhead is in single-digit microseconds or less, this problem is far less noticeable.
Zipkin 1.7
Zipkin 1.7 has a lot to offer, thanks to users for telling us what they'd like.
@dragontree101 wanted to be able to know which version of zipkin his server was running. @shakuzen landed the /info endpoint, which prints out something like this:
{
  "zipkin": {
    "version": "1.7.0"
  }
}
@mikewrighton wants to run zipkin-ui from a different host than zipkin-server. @hyleung spiked a new variable you can use to control cross-origin policy. For example, you can export ZIPKIN_QUERY_ALLOWED_ORIGINS=http://foo.bar.com, if you are the lucky owner of foo.bar.com!
@dan-tr uses Zipkin with Elasticsearch, but found our microsecond timestamps didn't work out-of-the-box with Kibana. He suggested we add a field timestamp_millis, and we did, because it was a smart idea!
@ivansenic works on an APM called inspectIT. He rightly noted there's still a ton of Java 6 VMs out there that need to be traceable by Java agents. Now, zipkin.jar is an agent-friendly, 152k jar full of Java 6 bytecode (still with no dependencies!).
We're occasionally asked where javadocs are published. Thanks to @abesto's automation expertise, historical javadocs can now be found at http://zipkin.io/zipkin/
Finally, we're looking for incremental and compatible ways to improve zipkin's model, particularly for asynchronous activity (like tracing Kafka). If you are interested in steering us, please comment on..
Thanks for keeping with us,
OpenZipkin
Zipkin 1.6
Zipkin 1.6 server has been updated to use Spring Boot 1.4.
We've also corrected default values around the UI, which should lead to better search performance. Most notably, startTs defaults to 1 hour back instead of 7 days back. #1212
- Note: You can reset the lookback value to whatever you like. For example, you might set JAVA_OPTS="-Dzipkin.ui.default-lookback=86400000" for 1 day. Settings like this are documented in the README.
Zipkin 1.5
Zipkin 1.5 is all about the dependency view in the UI.
Many of you may have seen the dependency tab, but never any data in it. This would be the case if you were running Cassandra or Elasticsearch.
What you should have seen is a diagram showing the relative amount of calls between services, something like this (except with your services present!):
Zipkin 1.5 includes support to populate the data under this screen for all storage options (mysql, cassandra and elasticsearch).
The job that produces this data is called zipkin-dependencies. Zipkin Dependencies aggregates links between services into a daily bucket. This means you should run it daily, like a batch job (even though underneath it is Spark). In fact, our Docker image includes a cron setup to do that for you!
For example, here's a run against a small cassandra DB using spark standalone (default):
$ STORAGE_TYPE=cassandra CASSANDRA_CONTACT_POINTS=192.168.99.100 java -jar zipkin-dependencies.jar
Running Dependencies job for 2016-07-23: 1469232000000000 ≤ Span.timestamp 1469318399999999
11:05:09.653 [main] WARN o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11:05:09.706 [main] WARN org.apache.spark.util.Utils - Your hostname, acole resolves to a loopback address: 127.0.0.1; using 192.168.1.10 instead (on interface en0)
11:05:09.706 [main] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
11:05:11.078 [main] WARN com.datastax.driver.core.NettyUtil - Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
Saved with day=2016-07-23
Dependencies: [{"parent":"brave-resteasy-example","child":"brave-resteasy-example","callCount":1}, {"parent":"zipkin-server","child":"cassandra","callCount":14}]
Upgrading
If you are using cassandra or elasticsearch, you should upgrade to zipkin 1.5, but there's no schema-related change required.
If you are using mysql, you'll need to add a new table for this to work. Here's a copy/paste of the DDL for your convenience.
CREATE TABLE IF NOT EXISTS zipkin_dependencies (
`day` DATE NOT NULL,
`parent` VARCHAR(255) NOT NULL,
`child` VARCHAR(255) NOT NULL,
`call_count` BIGINT
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
ALTER TABLE zipkin_dependencies ADD UNIQUE KEY(`day`, `parent`, `child`);
Credits
The spark job was originally written by @yurishkuro, based on a hadoop job written by @eirslett years ago. In other words, the job itself isn't new; rather, the accessibility of it is. Before, it only worked with cassandra and wasn't published to Maven Central or integrated with Docker. Now, it should be easy for anyone to include this functionality in their deployment.
Zipkin 1.4
Zipkin 1.4 most notably includes the ability to store and show IPv6 addresses associated with services.
Endpoint.ipv6
Zipkin span data can now include an ipv6 address of an Endpoint, binary encoded in thrift or text-encoded in json. If using MySQL, you need to add a column to store this. No action is needed in Cassandra or Elasticsearch. See #1178
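For Java instrumentation, recording the address on an Endpoint looks roughly like the sketch below. The builder method names here are assumptions based on the feature description in #1178, so check the library's javadoc:
import java.net.Inet6Address;
import java.net.InetAddress;
import zipkin.Endpoint;

// Rough sketch: attach the raw 16-byte IPv6 address to an Endpoint.
// Builder method names are assumptions; consult the zipkin javadoc.
static Endpoint webEndpoint() throws Exception {
  InetAddress address = InetAddress.getByName("2001:db8::c001");
  Endpoint.Builder builder = Endpoint.builder().serviceName("web");
  if (address instanceof Inet6Address) {
    builder.ipv6(address.getAddress()); // binary-encoded in thrift, text-encoded in json
  }
  return builder.build();
}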
Operational Improvements
- Adds SCRIBE_ENABLED: set to false to disable scribe
- Adds SELF_TRACING_SAMPLE_RATE: set to a low value like 0.001 to safely self-trace production