lock public JWKS generation and re-check cache #27929
Conversation
Thanks for this contribution! Can you please check the contributing md file and add a changelog entry? I'll ask our engineers for a technical review as well.
Hey @bhowe34, this looks good to me - you wouldn't have happened to run any experiments to see the performance effects of the fix, would you?
Hi, I haven't been able to test this in the same conditions in which it occurred, but I'll see if I can make a script for local testing that reliably replicates the issue. It may be difficult to replicate locally since the storage latency will be very low, which would reduce the time it takes to generate the JWKS.
I'm not too worried - my sense is that this is a net positive tradeoff - wait for the key to be created instead of roiling in the cache, but it would be nice to have numbers if it's not too difficult to do.
This program seems to reliably reproduce the issue. You may need to run it 2 or 3 times to observe the behavior, or you can increase the number of forced key rotations. I tested with a dev server using these args. Without this change the issue seems more common and typically results in a total runtime > 50 seconds on my machine. With this change the issue seems less likely, but there appears to still be a good amount of lock contention in some cases; even then, the max runtime I observed was about 35 seconds. I don't have metrics or profiling locally, so I'm not sure where the other contention is happening. It could be related to locking for the in-memory dev server storage.

Code:

```go
package main
import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/hashicorp/go-cleanhttp"
	"github.com/hashicorp/vault/api"
)

func main() {
	clientConfig := api.DefaultConfig()
	client, err := api.NewClient(clientConfig)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*5)
	defer cancel()

	// create keys
	keys := map[string]map[string]interface{}{
		"k1": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "1m",
			"verification_ttl":   "10m",
		},
		"k2": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "3m",
			"verification_ttl":   "30m",
		},
		"k3": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "5m",
			"verification_ttl":   "50m",
		},
	}
	for name, data := range keys {
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/"+name, data)
		if err != nil {
			log.Fatalf("error creating key: %v", err)
		}
	}

	keyName := "k1"
	// create roles
	for i := 0; i < 5000; i++ {
		clientID := fmt.Sprintf("role-%d", i)
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/role/"+clientID, map[string]interface{}{
			"key":       keyName,
			"client_id": clientID,
			"ttl":       "1m", // can't be longer than key verification_ttl
		})
		if err != nil {
			log.Fatalf("error creating role: %v", err)
		}

		// switch up the keys a bit
		if i == 100 {
			keyName = "k2"
		} else if i == 300 {
			keyName = "k3"
		}
	}

	// prep JWKS requests
	startCh := make(chan struct{})
	wg := &sync.WaitGroup{}
	reqLogThreshhold := time.Second * 2

	// have a routine that rotates the keys a few times to force the issue
	wg.Add(1)
	go func() {
		<-startCh
		time.Sleep(time.Millisecond * 20)
		for i := 0; i < 2; i++ {
			_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/k1/rotate", nil)
			if err != nil {
				log.Printf("error rotating key: %v", err)
			} else {
				log.Printf("rotated key")
			}
			time.Sleep(time.Millisecond * 300)
		}
		wg.Done()
	}()

	for i := 0; i < 5000; i++ {
		wg.Add(1)
		reqClientConf := api.DefaultConfig()
		reqClientConf.HttpClient = cleanhttp.DefaultClient()
		reqClient, err := api.NewClient(reqClientConf)
		if err != nil {
			log.Fatalf("error creating req client: %v", err)
		}
		reqClient.SetToken("")
		go func() {
			<-startCh
			reqStartTime := time.Now()
			_, err := reqClient.Logical().ReadWithContext(ctx, "identity/oidc/.well-known/keys")
			if err != nil {
				log.Printf("error reading JWKS: %v", err)
			}
			wg.Done()
			reqDuration := time.Now().Sub(reqStartTime)
			if reqLogThreshhold <= reqDuration {
				log.Printf("long request: %s", reqDuration)
			}
		}()
	}

	startTime := time.Now()
	log.Printf("starting JWKS requests")
	close(startCh)
	wg.Wait()
	log.Printf("total request time: %s", time.Now().Sub(startTime))
}
```
Hi @bhowe34. I ran your load test and saw that the longest runtime is above 30s, compared to 50s with the old code. However, after pumping the number of concurrent requests from 5000 to 10000, I saw

There seem to be two root-cause layers to the problem here, and your PR definitely solves the first layer.

Layer 1. Multiple JWKS requests hit pathOIDCReadPublicKeys() at the same time and there are no cached "jwks" or "jwksResponse" values in the cache. The old code lets all the requests proceed to

Layer 2.

Unfortunately, we have higher priorities, so we won't be able to investigate the issue deeper. I added this task to the team's backlog to look further when we can. Thank you again for your contribution!
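To make the Layer 1 scenario concrete, here is a minimal Go sketch of the kind of cache stampede described above, under the assumption that every request which misses the cache rebuilds the JWKS itself. The names (readCache, rebuildJWKS, and so on) are hypothetical and do not correspond to Vault's internal identifiers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Tiny stand-in for the OIDC cache; all names here are hypothetical.
var (
	cacheMu    sync.RWMutex
	cachedJWKS []byte
	rebuilds   int64 // counts how many times the JWKS was rebuilt
)

func readCache() ([]byte, bool) {
	cacheMu.RLock()
	defer cacheMu.RUnlock()
	return cachedJWKS, cachedJWKS != nil
}

func writeCache(v []byte) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	cachedJWKS = v
}

// rebuildJWKS stands in for expiring old keys and regenerating the JWKS,
// which is CPU heavy and hits the storage backend.
func rebuildJWKS() []byte {
	atomic.AddInt64(&rebuilds, 1)
	time.Sleep(50 * time.Millisecond)
	return []byte(`{"keys":[]}`)
}

func main() {
	var wg sync.WaitGroup
	// Pre-fix behavior: every request that sees a cache miss rebuilds the JWKS.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, ok := readCache(); ok {
				return
			}
			writeCache(rebuildJWKS()) // every concurrent miss does the expensive work
		}()
	}
	wg.Wait()
	fmt.Println("rebuilds:", atomic.LoadInt64(&rebuilds)) // typically ~100, not 1
}
```

With many concurrent misses, the expensive rebuild runs roughly once per request instead of once overall, which is the load amplification the Layer 1 description points at.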
LGTM
We do see periodic rotations cause the goroutine spike/request backup occasionally but there is no pattern to it. This fix will at least prevent hammering the backend in that case but may not fix high request latencies when the keys are rotated.
Description
What does this PR do?
This adds a new lock that just guards the public JWKS generation to ensure multiple requests aren't generating it at the same time. It is only used by the public JWKS endpoint and the token introspection endpoint. After acquiring the lock it double-checks the cache to avoid expiring the keys again.
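For illustration, a minimal sketch of the lock-then-re-check pattern described above; getPublicJWKS, rebuildJWKS, and the cache helpers are hypothetical names, not Vault's actual identifiers or the exact code in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// All names are hypothetical; this is not Vault's actual code.
var (
	cacheMu      sync.RWMutex
	cachedJWKS   []byte
	generateLock sync.Mutex // serializes JWKS generation, analogous to the lock this PR adds
	rebuilds     int64
)

func readCache() ([]byte, bool) {
	cacheMu.RLock()
	defer cacheMu.RUnlock()
	return cachedJWKS, cachedJWKS != nil
}

func writeCache(v []byte) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	cachedJWKS = v
}

// rebuildJWKS stands in for expiring old keys and regenerating the JWKS.
func rebuildJWKS() []byte {
	atomic.AddInt64(&rebuilds, 1)
	time.Sleep(50 * time.Millisecond)
	return []byte(`{"keys":[]}`)
}

func getPublicJWKS() []byte {
	// Fast path: serve straight from the cache.
	if jwks, ok := readCache(); ok {
		return jwks
	}

	// Slow path: only one request generates at a time.
	generateLock.Lock()
	defer generateLock.Unlock()

	// Re-check the cache: another request may have populated it while we
	// waited for the lock, so we avoid re-expiring the keys a second time.
	if jwks, ok := readCache(); ok {
		return jwks
	}

	jwks := rebuildJWKS()
	writeCache(jwks)
	return jwks
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = getPublicJWKS()
		}()
	}
	wg.Wait()
	fmt.Println("rebuilds:", atomic.LoadInt64(&rebuilds)) // 1: the re-check prevents redundant rebuilds
}
```

The trade-off matches the review discussion: requests that miss the cache now wait on the generation lock rather than redoing the expensive work themselves, and the re-check after acquiring the lock keeps all but the first of them from expiring the keys again.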
Currently, if multiple concurrent JWKS requests hit Vault at the same time and there are no cached "jwks" or "jwksResponse" values in the cache, all the requests will proceed to this lock and unnecessarily (after the first request) re-expire the JWKS. This causes unnecessary load on the CPU and storage backend. Expiring the JWKS can also flush the cache, which results in more misses and expiration calls, making the problem worse.

We have observed this behavior multiple times in our Vault clusters, with the latest instance resulting in hundreds of thousands of backlogged requests for the JWKS (most of which were cancelled on the client side but were still being processed by Vault). This backlog of requests took over 8 hours to process with the CPU maxed out on a 96-core EC2 instance. Below is the goroutine dump, goroutine count, instance CPU usage, and storage requests we observed.