
client loses connection to gateway peer and cannot recover without a restart when endorsement policy is changed #332

Closed
davidkel opened this issue Nov 26, 2021 · 5 comments

@davidkel
Contributor

davidkel commented Nov 26, 2021

First I will note that the client could probably recover, but it would have to be coded to detect the scenario and create a new client and gateway to recover (a rough sketch of such a recovery wrapper follows the log excerpt below). The problem I see reports two errors:

{"component":"CLIENT","timestamp":"2021-11-25T17:05:58.440Z","txnId":"f4bcbe05d1c9257ebfbaa7492d3a853935ab69789c2d2feb661b413662d10a76","stage":"Endorsing","message":"addUpdateAssets({\"arguments\":[\"1\",\"2000\"]})"}
{"component":"CLIENT","timestamp":"2021-11-25T17:06:28.454Z","txnId":"f4bcbe05d1c9257ebfbaa7492d3a853935ab69789c2d2feb661b413662d10a76","stage":"Failed","message":"4 DEADLINE_EXCEEDED: Deadline exceeded"}
{"component":"CLIENT","timestamp":"2021-11-25T17:05:58.443Z","txnId":"bcd6cc53c262a76f23015a2003af3e72785079f44be01ba28840c6adaa31bc6a","stage":"Endorsing","message":"updateChaosAsset({\"arguments\":[\"cd1\",\"99\"]})"}
{"component":"CLIENT","timestamp":"2021-11-25T17:06:28.454Z","txnId":"bcd6cc53c262a76f23015a2003af3e72785079f44be01ba28840c6adaa31bc6a","stage":"Failed","message":"14 UNAVAILABLE: Stream refused by server"}

But either way, no further submitted transaction succeeds until the client is restarted.
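For illustration, here is a rough sketch of the kind of recovery wrapper this would need, written against the Go gateway client (the test client above is Node, but the shape would be similar). This is a sketch under assumptions, not part of the gateway API: submitWithRecovery is a hypothetical helper, and the rebuild callback is assumed to re-dial the gRPC connection and reconnect the gateway.

```go
// Minimal sketch only, not tested: retry a submit once after rebuilding the
// client and gateway when the underlying gRPC channel appears to be broken.
// The rebuild callback (hypothetical) is assumed to re-dial the gRPC
// connection and call client.Connect again; it is not defined here.
package recovery

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/hyperledger/fabric-gateway/pkg/client"
)

func submitWithRecovery(
	gw *client.Gateway,
	rebuild func() (*client.Gateway, error),
	channelName, chaincodeName, txnName string,
	args ...string,
) (*client.Gateway, []byte, error) {
	contract := gw.GetNetwork(channelName).GetContract(chaincodeName)
	result, err := contract.SubmitTransaction(txnName, args...)
	if err == nil {
		return gw, result, nil
	}

	// status.Code returns codes.Unknown if the error does not carry a gRPC
	// status, in which case the error is returned unchanged.
	switch status.Code(err) {
	case codes.DeadlineExceeded, codes.Unavailable:
		gw.Close()
		freshGw, rebuildErr := rebuild()
		if rebuildErr != nil {
			return gw, nil, rebuildErr
		}
		contract = freshGw.GetNetwork(channelName).GetContract(chaincodeName)
		result, err = contract.SubmitTransaction(txnName, args...)
		return freshGw, result, err
	default:
		return gw, nil, err
	}
}
```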

Unfortunately, client-side gRPC logging doesn't tell me much beyond what the error message already says. The peer logs don't provide anything helpful, and if I turn on too much logging I struggle to recreate the problem.

I recreated the scenario without changing the endorsement policy by having a long-running transaction create MVCC read conflicts, so it looks related to multiple long-running transactions with large write sets failing; in my case each transaction writes to 2000 keys, but with small amounts of data.

A recent change to Fabric, hyperledger/fabric#3012, changed the output produced when proposal responses don't match by including a base64 version of the protobuf bytes in the error, which subsequently ends up in the logs. With this change reverted, I have not been able to hit the problem. It looks like the change was intended for when the peer is used as a CLI, but it also takes effect when the peer is running as a server.

If I don't use long-running transactions with a large number of keys then the problem doesn't occur, so it is specific to the contrived application being run to test the scenario and is unlikely to be a common problem. I will leave it up for discussion whether the recent change to output base64 when proposals don't match is of value, and whether it should be output when the peer is running in server mode.

@denyeart
Contributor

@andrew-coleman what is your opinion?

@andrew-coleman
Member

@denyeart
@davidkel and I looked at this today. It's not at all obvious why this recent change would cause the gRPC connection to go bad, other than that sending potentially very large error messages back to the client might be triggering the problem (this was only tested with a Node client - perhaps an issue with node-grpc?).

Since this is a highly contrived error scenario test, we don't believe it should block the 2.4 release.

For the next release, we propose the following enhancements:

  • Revert the error message so that it no longer contains the base64-encoded response when invoked from the gateway.
  • Instead, when the responses don't match, traverse the offending read/write sets to find the states that differ, and report those back to the client in a structured message that clearly shows the keys/values/versions that differ (a rough sketch of such a comparison follows below).
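For illustration, a minimal sketch of what that comparison might look like, assuming the read/write sets have already been extracted from the proposal responses and unmarshalled into kvrwset.KVRWSet messages. The WriteDiff type and diffWrites function are illustrative only, not existing Fabric APIs, and reads and deletes are ignored for brevity.

```go
// Illustrative only: one way to report which keys differ between two
// endorsers' write sets, rather than dumping base64-encoded payloads.
package rwsetdiff

import (
	"bytes"

	"github.com/hyperledger/fabric-protos-go/ledger/rwset/kvrwset"
)

// WriteDiff (hypothetical) describes a single key whose written value
// differs between two proposal responses.
type WriteDiff struct {
	Key    string
	ValueA []byte
	ValueB []byte
}

// diffWrites returns the keys whose written values differ, plus keys
// written by only one of the two endorsers.
func diffWrites(a, b *kvrwset.KVRWSet) []WriteDiff {
	writesB := make(map[string][]byte, len(b.Writes))
	for _, w := range b.Writes {
		writesB[w.Key] = w.Value
	}

	var diffs []WriteDiff
	seen := make(map[string]bool, len(a.Writes))
	for _, w := range a.Writes {
		seen[w.Key] = true
		valB, ok := writesB[w.Key]
		if !ok {
			diffs = append(diffs, WriteDiff{Key: w.Key, ValueA: w.Value})
			continue
		}
		if !bytes.Equal(w.Value, valB) {
			diffs = append(diffs, WriteDiff{Key: w.Key, ValueA: w.Value, ValueB: valB})
		}
	}
	for _, w := range b.Writes {
		if !seen[w.Key] {
			diffs = append(diffs, WriteDiff{Key: w.Key, ValueB: w.Value})
		}
	}
	return diffs
}
```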

@bestbeforetoday
Member

Mismatched proposal responses are a server-side issue related to smart contracts, so I think the key thing is to ensure that mismatched responses are logged at the server end, where they can be investigated by whoever has operational responsibility for the Fabric nodes. There's certainly no harm in sending some specific information about the reason for the endorsement failure to the client so the issue can be diagnosed correctly at that end, but I don't think it's strictly necessary to send proposal response content back to the client. In some respects this might even be considered bad security practice, since it gives an attacker additional information about the internal behaviour of the network.

@gqqnbig

gqqnbig commented Dec 9, 2021

My change hyperledger/fabric#3012 is for the CLI, and I wasn't aware of the gateway. I agree with the two enhancements proposed by @andrew-coleman.

andrew-coleman added a commit to andrew-coleman/fabric that referenced this issue Dec 10, 2021
The gateway has been reusing the protoutil.CreateTx function to build the transaction envelope from the set of endorsements.  This function changed recently to add extra information into the error message if the proposal responses don’t all match.  This change was identified as causing a problem for the gateway (hyperledger/fabric-gateway#332).
This commit reverts that change, and instead writes the base64 response payloads to the log.  It also refactors the functions slightly to remove duplicate unmarshalling of protos when invoked from the gateway.

Signed-off-by: andrew-coleman <[email protected]>
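For illustration only, here is a minimal sketch of the general shape of the change described in the commit above: keep the error returned to the caller short and write the base64-encoded payloads to the peer's log instead. It uses the standard library rather than Fabric's flogging package, and checkPayloadsMatch is a hypothetical stand-in, not the actual function touched by the commit.

```go
// Sketch only: on a payload mismatch, log the offending payloads
// server-side (base64-encoded) and return a concise error to the caller.
package endorsecheck

import (
	"bytes"
	"encoding/base64"
	"errors"
	"log"
)

// checkPayloadsMatch (hypothetical) compares the ProposalResponsePayload
// bytes from each endorsement.
func checkPayloadsMatch(payloads [][]byte) error {
	for i := 1; i < len(payloads); i++ {
		if !bytes.Equal(payloads[0], payloads[i]) {
			log.Printf("ProposalResponsePayloads do not match (base64): %s vs %s",
				base64.StdEncoding.EncodeToString(payloads[0]),
				base64.StdEncoding.EncodeToString(payloads[i]))
			return errors.New("ProposalResponsePayloads do not match")
		}
	}
	return nil
}
```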
@davidkel
Contributor Author

With the latest images:

 Version: 2.4.0
 Commit SHA: aaaec4f
 Go version: go1.16.7
 OS/Arch: linux/amd64
 Chaincode:
  Base Docker Label: org.hyperledger.fabric
  Docker Namespace: hyperledger

I can't recreate this problem, and I still see the mismatches being reported and logged.
Closing.
