Add tracing support for internals and JSON-RPC #1557

Merged
atoulme merged 12 commits into hyperledger:master from the tracing_with_otel branch on Jan 4, 2021

Conversation

@atoulme (Contributor) commented Nov 12, 2020

Signed-off-by: Antoine Toulme [email protected]

PR description

Adds tracing support for internals of Besu and the JSON-RPC HTTP service.

This was tested by soaking a Besu node for 48 hours against mainnet, with no sync issues observed.

Changelog

@atoulme force-pushed the tracing_with_otel branch 3 times, most recently from 9303e60 to be459ee on November 13, 2020 00:25
@atoulme requested a review from @AbdelStark on November 17, 2020 21:04
@AbdelStark (Contributor)

Could you please do a full sync against Goerli and post the sync time here? I just want to make sure this PR does not introduce performance issues.

@atoulme (Contributor Author) commented Nov 20, 2020

Sync time was 11 hours 32 minutes.
(screenshot attached: Screen Shot 2020-11-20 at 2:09:41 PM)

@shemnon (Contributor) commented Nov 21, 2020

Our 20.10.0-RC2 full sync was 12h9m. That was prior to the native libs improvements, so I'll run another full sync against 20.10.1.

@shemnon (Contributor) commented Nov 21, 2020

A 20.10.1 full sync clocked in at 11h6m, so the overhead of this change is under 4%. (20.10.1 had some significant crypto upgrades that particularly affect Goerli.) I understand this is "fully loaded", with OTLP spans being fully reported. What's the speed with only the original Prometheus metrics?
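(For reference, the arithmetic behind "under 4%": 11h32m vs. 11h6m is 26 extra minutes on a 666-minute baseline, and 26/666 ≈ 3.9%.)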

@atoulme (Contributor Author) commented Nov 21, 2020

There are a number of things happening on that instance, as it also runs Splunk, the OpenTelemetry Collector, and ethlogger.
Ethlogger in particular constantly polls Besu for all blocks, all transactions, and all transaction receipts.
I haven't tried to sync without tracing on.

@shemnon (Contributor) left a comment

A full sync with only Prometheus turned on and no OpenTelemetry items resulted in a sync time of 11h7m, only one minute longer than our 20.10.1 sync, well within the noise level of the network.

  • change io.hyperledger to org.hyperledger
  • remove usage of the shaded jar

@@ -169,6 +198,7 @@ private JsonRpcHttpService(
this.authenticationService = authenticationService;
this.livenessService = livenessService;
this.readinessService = readinessService;
this.tracer = OpenTelemetry.getGlobalTracer("io.hyperledger.besu.jsonrpc", "1.0.0");
Contributor
org.hyperledger instead of io.hyperledger?

@@ -279,6 +310,32 @@ private Router buildRouter() {
return router;
}

private void createSpan(final RoutingContext routingContext) {
Contributor
Is OpenTelemetry "always on"? Should there be a way for the end user to "opt out" like we do for other metrics collection?

Contributor Author
I'm asking around for the best way to do this.

Contributor
If it were just the one createSpan call it would be easy, but it looks like every request may have to add a span, so plumbing the metrics configs through seems a bit much... unless performance suffers. We have a benchmark where RPCs are the bottleneck, so if this is a problem I'd expect those benchmarks to fail. If you are willing to risk a call for a revert should performance take a hit, it can stay as-is.

Contributor Author
So I asked, and so far: if we don't use a metrics backend, the spans are still created, but in the words of the OpenTelemetry devs they then "fall to the floor", since no export is enabled.
If we want to avoid even the object allocation of creating span objects, we can add a check in the Besu code that disables creating them.

Contributor
Object allocation is my worry, but I think that if it matters it will only show up under load, which we have a test set up to detect (though only for committed code).

So if that test shows a regression, the worst-case scenario for a commit is a rollback and then a fix.

Contributor Author
I am willing to try this out and see how performance is impacted.
Initially I had a span collecting trie performance, but I reverted it as it was too verbose; I believe the AbstractBlockProcessor might be too verbose as well.
We'll see; it would be great to understand better whether there are specific use cases (block imports, mining) where tracing helps in particular.
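As an aside, here is a minimal sketch of what a per-request span with an opt-out guard could look like in a Vert.x handler. The JsonRpcSpanHandler class, the tracingEnabled flag, and the instrumentation name are hypothetical illustrations, not the code this PR adds, and the imports assume a recent (1.x) OpenTelemetry Java API rather than the 0.x API visible in the diff above.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.vertx.core.Handler;
import io.vertx.ext.web.RoutingContext;

class JsonRpcSpanHandler implements Handler<RoutingContext> {
  // Hypothetical flag: when tracing is disabled, no Span object is ever allocated,
  // which addresses the object-allocation concern discussed above.
  private final boolean tracingEnabled;
  private final Tracer tracer =
      GlobalOpenTelemetry.getTracer("org.hyperledger.besu.jsonrpc");

  JsonRpcSpanHandler(final boolean tracingEnabled) {
    this.tracingEnabled = tracingEnabled;
  }

  @Override
  public void handle(final RoutingContext routingContext) {
    if (!tracingEnabled) {
      routingContext.next();
      return;
    }
    // One server span per JSON-RPC HTTP request, named after the method and URI.
    final Span span =
        tracer
            .spanBuilder(routingContext.request().method().name() + " " + routingContext.request().uri())
            .setSpanKind(SpanKind.SERVER)
            .startSpan();
    // End the span once the response body has been written out.
    routingContext.addBodyEndHandler(ignored -> span.end());
    routingContext.next();
  }
}

With a guard like this, a Prometheus-only configuration would skip span allocation entirely, while a tracing-enabled run would create one server span per JSON-RPC request.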

@shemnon (Contributor) commented Dec 8, 2020

Any progress on the requirement for non-shaded jars?

@atoulme (Contributor Author) commented Dec 8, 2020

Still working on it, slowly. I did a search to make sure I had all the libraries lined up, but since it tends to go wrong at runtime, I wanted to add an integration test to test this out.

Signed-off-by: Antoine Toulme <[email protected]>
Signed-off-by: Antoine Toulme <[email protected]>
@shemnon (Contributor) commented Dec 22, 2020

Our next release is 13 Jan. Reviews will be infrequent until the new year. Ping me on 1 Jan if this isn't approved.

@atoulme (Contributor Author) commented Dec 22, 2020

@shemnon I appreciate you chiming in. Happy holidays to you, your family and the contributors to the Besu project!
We can look at this together next year.

// OpenTelemetry Wiring for windows scripts
def windowsRegex = $/"%JAVA_EXE%" %DEFAULT_JVM_OPTS%/$
def windowsReplacement = $/if Defined TRACING (TRACING_AGENT="-javaagent:" "%APP_HOME%\agent\/$ + agentFileName + '")\r\n"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %TRACING_AGENT%'
createScriptTask.windowsScript.text = createScriptTask.windowsScript.text.replace(windowsRegex, windowsReplacement)
Contributor
What changed so that we no longer need the java agent for OTEL? I'm not seeing anything code-wise so I'm concerned this is an accidental deletion.

Contributor Author
The agent inspects the code and injects changes at runtime to create traces for well-known libraries such as SQL clients or HTTP. We now do this manually, in a much more targeted fashion, so we no longer need automatic instrumentation, which could also create extra traces that don't provide much value.
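To make the contrast concrete, here is a rough sketch of targeted manual instrumentation; the BlockImportTracing class, method names, and attribute keys are hypothetical and not code from this PR. Instead of the javaagent rewriting bytecode for every well-known library, a span is opened by hand only around the one operation we care about.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;

final class BlockImportTracing {
  private static final Tracer TRACER =
      GlobalOpenTelemetry.getTracer("org.hyperledger.besu.example");

  // One hand-written span around a single, deliberately chosen operation,
  // rather than agent-injected spans for every HTTP call or SQL query.
  static void runTraced(final String spanName, final long blockNumber, final Runnable operation) {
    final Span span = TRACER.spanBuilder(spanName).startSpan();
    span.setAttribute("block.number", blockNumber);
    try {
      operation.run();
    } catch (RuntimeException e) {
      span.recordException(e);
      span.setStatus(StatusCode.ERROR);
      throw e;
    } finally {
      span.end();
    }
  }
}

A caller wraps only the step of interest, e.g. BlockImportTracing.runTraced("importBlock", blockNumber, () -> doImport());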

Signed-off-by: Antoine Toulme <[email protected]>
@atoulme merged commit cd66968 into hyperledger:master on Jan 4, 2021
eum602 pushed a commit to lacchain/besu that referenced this pull request Nov 3, 2023
* Add tracing support for internals and JSON-RPC

Signed-off-by: Antoine Toulme <[email protected]>

* Remove rocksdb tracing as it slows down execution too much

Signed-off-by: Antoine Toulme <[email protected]>

* Add B3 headers extraction on JSON-RPC requests

Signed-off-by: Antoine Toulme <[email protected]>

* Remove traces around trie tree as they slow down syncing significantly

Signed-off-by: Antoine Toulme <[email protected]>

* Add tracing to fast sync pipeline

Signed-off-by: Antoine Toulme <[email protected]>

* Add tracing for all pipelines

Signed-off-by: Antoine Toulme <[email protected]>

* Address code review

Signed-off-by: Antoine Toulme <[email protected]>

* Add acceptance tests and break out the shaded dependency

Signed-off-by: Antoine Toulme <[email protected]>

* Fix tracer id

Signed-off-by: Antoine Toulme <[email protected]>

* Revert changes to trie

Signed-off-by: Antoine Toulme <[email protected]>

* Upgrade otel to latest, remove old tracing

Signed-off-by: Antoine Toulme <[email protected]>

* Code review comments

Signed-off-by: Antoine Toulme <[email protected]>
Labels: none yet
Projects: none yet
3 participants