
Fixes #5001 #5078

Closed
wants to merge 9 commits into from

Conversation


@dviejokfs dviejokfs commented Dec 17, 2024

Type of change

  • Improvement (improvement to code, performance, etc)

Description

This pull request introduces several improvements in the HandleGetStateByRange function within the core/chaincode/handler.go file and the GetStateRangeScanIteratorWithPagination function within the core/ledger/kvledger/txmgmt/statedb/stateleveldb/stateleveldb.go file. The changes include:

  • Adding panic recovery in the HandleGetStateByRange function to log errors and clean up any query contexts that may have been created.
  • Validating start and end keys to ensure they are not both empty.
  • Adding validation for the page size to ensure it is greater than zero.
  • Cleaning up query contexts if errors occur during the range query or if the range query iterator is nil.
  • Wrapping the iterator creation in a recover block in GetStateRangeScanIteratorWithPagination to handle potential panics and return meaningful error messages.
  • Ensuring that a nil iterator is not returned from the database.

Additional details

These changes help improve the robustness of the code by handling potential panics and validating inputs, which can help prevent unexpected errors and improve the overall stability of the system.

Related issues

#5001

@dviejokfs dviejokfs requested a review from a team as a code owner December 17, 2024 11:56
This commit introduces the createResponse function in handler.go, which constructs a QueryResponse for range queries. It handles pagination and includes error handling for payload marshaling, ensuring proper cleanup of the query context in case of errors.

Signed-off-by: David VIEJO <[email protected]>
Signed-off-by: David VIEJO <[email protected]>
- var seed [32]byte
- _, _ = crand.Read(seed[:])
- r = rand.New(rand.NewChaCha8(seed))
+ r = rand.New(rand.NewSource(time.Now().UnixNano()))
Contributor

Why would we seed this with a deterministic and non-random seed?
What was there before was better, in my opinion.

Author

ok, rolling back

return n
}
}

Contributor

We shouldn't try just 100 attempts, but more, like 1,000,000.

And if we still fail, we should loop until we find a nonce that is not in use, using crypto rand.

Author

For the second part, couldn't that result in slow nonce generation if it happens?

Contributor

After looking at the code myself, I think we can just call util.RandomUInt64 until the result no longer exists in the outgoing nonces.

@yacovm
Contributor

yacovm commented Dec 17, 2024

I don't understand how the gossip nonce generation is related to the panic; can you explain?

Comment on lines 12 to 11
- "math/rand/v2"
+ "math/rand"
Contributor

Why did you change the import from v2 back to the previous library?

Author

sorry, didn't mean to, changed back

Signed-off-by: David VIEJO <[email protected]>
- Corrected comment capitalization for the HandleRegister function in handler.go.
- Updated the import path for the math/rand package to math/rand/v2 in misc.go.

These changes improve code readability and maintain consistency in import statements.

Signed-off-by: David VIEJO <[email protected]>
This commit introduces the import of the crypto/rand package in misc.go, which may be utilized for secure random number generation. This addition complements the existing math/rand/v2 import and prepares the code for future enhancements requiring cryptographic randomness.

Signed-off-by: David VIEJO <[email protected]>
This commit corrects the function name from `Intn` to `IntN` in the RandomInt function, ensuring consistency with the updated import path for the math/rand package. This change enhances code clarity and correctness.

Signed-off-by: David VIEJO <[email protected]>
Comment on lines 193 to 194
rMutex.Lock()
defer rMutex.Unlock()
Contributor

Why is the lock/unlock call only here?

Author

This is the panic that happens when there's a lot of concurrency:

panic: runtime error: index out of range [-1]

goroutine 3305 [running]:
math/rand.(*rngSource).Uint64(0xc014386101?)
/usr/local/go/src/math/rand/rng.go:249 +0x7b
math/rand.(*Rand).Uint64(0xc0180cf980?)
/usr/local/go/src/math/rand/rand.go:104 +0x22
github.com/hyperledger/fabric/gossip/util.RandomUInt64(...)
/gossip/util/misc.go:187
github.com/hyperledger/fabric/gossip/gossip/algo.(*PullEngine).newNONCE(0xc006953000)
/gossip/gossip/algo/pull.go:324 +0x25
github.com/hyperledger/fabric/gossip/gossip/algo.(*PullEngine).initiatePull(0xc006953000)
/gossip/gossip/algo/pull.go:179 +0x11b
github.com/hyperledger/fabric/gossip/gossip/algo.NewPullEngineWithFilter.func1()
/gossip/gossip/algo/pull.go:126 +0x27
created by github.com/hyperledger/fabric/gossip/gossip/algo.NewPullEngineWithFilter in goroutine 1
/gossip/gossip/algo/pull.go:120 +0x2d5

I suppose the problem is that this function is being called concurrently from many goroutines, so adding a lock ensures there are no concurrency issues.

Contributor

  1. Above, r is used in two more places without the lock. Why is the lock not used there?

  2. Yes, this is a bug in the math/rand package. But after the release of v3.0.0 the package was changed in 9b28d51, and this change is not in the release yet. Also, your stacktrace is not using v2 code.

  3. We need proof that you actually fixed the bug, not just that it suddenly stopped after changing the package to v2. You need a test to prove that your fix is correct.

/usr/local/go/src/math/rand/rng.go:249 +0x7b
math/rand.(*Rand).Uint64(0xc0180cf980?)
/usr/local/go/src/math/rand/rand.go:104 +0x22


Contributor

This is the panic that happens when there's a lot of concurrency:

panic: runtime error: index out of range [-1]

goroutine 3305 [running]:
math/rand.(*rngSource).Uint64(0xc014386101?)
/usr/local/go/src/math/rand/rng.go:249 +0x7b
math/rand.(*Rand).Uint64(0xc0180cf980?)
/usr/local/go/src/math/rand/rand.go:104 +0x22
github.com/hyperledger/fabric/gossip/util.RandomUInt64(...)
/gossip/util/misc.go:187
github.com/hyperledger/fabric/gossip/gossip/algo.(*PullEngine).newNONCE(0xc006953000)
/gossip/gossip/algo/pull.go:324 +0x25
github.com/hyperledger/fabric/gossip/gossip/algo.(*PullEngine).initiatePull(0xc006953000)
/gossip/gossip/algo/pull.go:179 +0x11b
github.com/hyperledger/fabric/gossip/gossip/algo.NewPullEngineWithFilter.func1()
/gossip/gossip/algo/pull.go:126 +0x27
created by github.com/hyperledger/fabric/gossip/gossip/algo.NewPullEngineWithFilter in goroutine 1
/gossip/gossip/algo/pull.go:120 +0x2d5

I suppose the problem is that this function is being called concurrently from many goroutines, so adding a lock ensures there are no concurrency issues.

If there is a panic stemming from gossip, then I fail to see how the chaincode is related to the problem.

We shouldn't fix two different problems in the same PR.

Can you move all the non gossip related code changes to a new PR?

Author

Yeah, but you see the problem, right? Gossip is invoking RandomUInt64, and it is panicking.

Author

You are right, these are two different problems with two different panic errors.

Contributor

The panic comes from different code than the code you're correcting.

Author

So @pfi79, compiling the code on main should fix the gossip issue, right?

Contributor

So @pfi79, compiling the code on main should fix the gossip issue, right?

It's possible. After updating the package to v2 I stopped catching the panic error in tests.

@@ -319,10 +321,28 @@ func (engine *PullEngine) OnRes(items []string, nonce uint64) {
}

func (engine *PullEngine) newNONCE() uint64 {
n := uint64(0)
engine.lock.Lock()
Contributor

We don't need to lock at all. We only create nonces from here, and we already lock there.

Just loop until engine.outgoingNONCES.Exists(n) is false; no need for attempt counting or anything.

@dviejokfs
Author

@yacovm @pfi79
I have removed the changes for the gossip issue, this PR only contains the issue about the leveldb

@pfi79
Contributor

pfi79 commented Dec 17, 2024

@yacovm @pfi79 I have removed the changes for the gossip issue, this PR only contains the issue about the leveldb

Please update the PR description.

There is not enough evidence that your change fixes the error when working with leveldb.

And please look at the unit tests: after your changes they fail specifically on leveldb.

@dviejokfs
Author

dviejokfs commented Dec 17, 2024

@yacovm @pfi79 I have removed the changes for the gossip issue, this PR only contains the issue about the leveldb

Please update the PR description.

There is not enough evidence that your change fixes the error when working with leveldb.

And please look at the unit tests: after your changes they fail specifically on leveldb.

I just updated the description; I'm going to check the unit tests.

Signed-off-by: David VIEJO <[email protected]>
@yacovm
Contributor

yacovm commented Dec 17, 2024

Since you removed all gossip code from the PR, can't you just force-push and get rid of the gossip code change commits? Squash all the changes together and force-push so it will be clearer what's going on when this is merged; otherwise we'll merge a bunch of commits that cancel each other out.

@yacovm
Contributor

yacovm commented Dec 17, 2024

And I will tag @denyeart here, since the only code change in this PR is related to things he has more context about than me and pfi79.

@denyeart
Contributor

Thanks for the update @dviejokfs, although the PR has become confusing: 9 commits, a gossip title and a bunch of gossip comments but no gossip code, and 6 bullets in the description related to chaincode and ledger, with no tests. It is unclear how everything relates. I suggest a reset: close this PR and open separate PRs for each individual improvement that can stand by itself. Each individual PR should have the smallest unit of code needed for the respective improvement, along with related unit test updates. The unit tests should demonstrate the problem being fixed (failing before the change, passing after the change). Then the smaller individual PRs can each be judged on their own merit, with clarity and focused discussion.

Note that to keep the commits clean we often use the commit amend approach. This way you don't have to go through the process of squashing and then fixing up the commit/PR description. See details at https://hyperledger-fabric.readthedocs.io/en/latest/github/github.html#updating-a-pull-request.

@dviejokfs changed the title from "Fixes #5001 and gossip panic when high concurrency" to "Fixes #5001" on Dec 18, 2024
@dviejokfs
Author

@denyeart
Related to the unit tests: these don't change. How can we reproduce something that only happens under heavy concurrency? The issue is in the leveldb library:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1c pc=0x12df6f8]

goroutine 9216936 [running]:
github.com/syndtr/goleveldb/leveldb.(*dbIter).Release(0xc00c41dd40)
/vendor/github.com/syndtr/goleveldb/leveldb/db_iter.go:352 +0x118
github.com/hyperledger/fabric/core/ledger/kvledger/txmgmt/statedb/stateleveldb.(*kvScanner).Close(0x0?)
/core/ledger/kvledger/txmgmt/statedb/stateleveldb/stateleveldb.go:338 +0x1c
github.com/hyperledger/fabric/core/ledger/kvledger/txmgmt/txmgr.(*resultsItr).Close(...)
/core/ledger/kvledger/txmgmt/txmgr/query_executor.go:495

@pfi79
Contributor

pfi79 commented Dec 18, 2024

Related to the unit tests: these don't change. How can we reproduce something that only happens under heavy concurrency? The issue is in the leveldb library:

panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x1c pc=0x12df6f8]

leveldb has been updated. Is the error still there or not? I don't know.
Your changes concern the previous version.

@denyeart
Contributor

denyeart commented Dec 18, 2024

@dviejokfs Agreed it may not be possible to cause the error in unit tests or integration tests, but it should at least be considered whether the problem can be induced dependably so that we know when it is resolved, even if that requires a manual test. In the extreme cases we've even sometimes temporarily updated Fabric code to make a problem more likely so that we can reproduce and ensure it is fixed. Do you have ideas around this?

@dviejokfs dviejokfs closed this Dec 20, 2024
@dviejokfs
Author

@denyeart @yacovm @pfi79 I just created a new PR containing a unit test and the minimal fix #5084
