
lock public JWKS generation and re-check cache #27929

Merged
4 commits merged into hashicorp:main on Sep 9, 2024

Conversation

bhowe34
Contributor

@bhowe34 bhowe34 commented Jul 31, 2024

Description

What does this PR do?

This adds a new lock that just guards the public JWKS generation, to ensure multiple requests aren't generating it at the same time. It is only used by the public JWKS endpoint and the token introspection endpoint. After acquiring the lock, the code double-checks the cache to avoid expiring the keys again.

Currently, if multiple concurrent JWKS requests hit vault at the same time and there are no cached "jwks" or "jwksResponse" values in the cache, all the requests proceed to this lock and unnecessarily (after the first request) re-expire the JWKS. This causes unnecessary load on the CPU and the storage backend. Expiring the JWKS can also flush the cache, which results in more misses and expiration calls, making the problem worse.
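
For illustration, the lock-and-recheck behavior described above looks roughly like this (a minimal, self-contained sketch; jwksCache, genMu, and generateJWKS are made-up names rather than the actual Vault identifiers):

package main

import (
	"fmt"
	"sync"
)

type jwksCache struct {
	mu    sync.RWMutex // protects value
	genMu sync.Mutex   // new: serializes JWKS generation only
	value string       // cached JWKS document, "" when absent
}

func (c *jwksCache) load() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.value
}

func (c *jwksCache) store(v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value = v
}

// generateJWKS stands in for the expensive work of expiring keys and
// rebuilding the JWKS from storage.
func generateJWKS() string { return `{"keys":[]}` }

func (c *jwksCache) getJWKS() string {
	// fast path: the pre-existing cache lookup
	if v := c.load(); v != "" {
		return v
	}
	// serialize generation so concurrent cache misses don't all regenerate
	c.genMu.Lock()
	defer c.genMu.Unlock()
	// re-check the cache after acquiring the lock: another request may have
	// finished generating while we were waiting
	if v := c.load(); v != "" {
		return v
	}
	v := generateJWKS()
	c.store(v)
	return v
}

func main() {
	c := &jwksCache{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.getJWKS() // only one goroutine pays for generation
		}()
	}
	wg.Wait()
	fmt.Println(c.getJWKS())
}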

We have observed this behavior multiple times in our vault clusters, with the most recent occurrence resulting in hundreds of thousands of backlogged requests for the JWKS (most of which were cancelled on the client side but were still being processed by vault). This backlog of requests took over 8 hours to process with the CPU maxed out on a 96-core EC2 instance. Below are the goroutine dump, goroutine count, instance CPU usage, and storage requests we observed.

[Screenshots: goroutine dump, goroutine count, instance CPU usage, and storage backend requests]

@bhowe34 bhowe34 requested a review from a team as a code owner July 31, 2024 21:00
@heatherezell heatherezell added the identity/oidc, ecosystem, and bug labels on Aug 5, 2024
@heatherezell
Contributor

Thanks for this contribution! Can you please check the CONTRIBUTING.md file and add a changelog entry? I'll ask our engineers for a technical review as well.

@kpcraig
Contributor

kpcraig commented Aug 13, 2024

Hey @bhowe34, this looks good to me - you wouldn't happen to have run any experiments to see the performance effects of the fix, would you?

@bhowe34
Contributor Author

bhowe34 commented Aug 14, 2024

Hi, I haven't been able to test this under the same conditions in which it occurred, but I'll see if I can make a script for local testing that reliably replicates the issue. It may be difficult to replicate locally since the storage latency will be very low, which would reduce the time it takes to generate the JWKS.

@kpcraig
Contributor

kpcraig commented Aug 14, 2024

I'm not too worried - my sense is that this is a net positive tradeoff (waiting for the key to be created instead of roiling the cache), but it would be nice to have numbers if it's not too difficult to do.

@bhowe34
Contributor Author

bhowe34 commented Aug 15, 2024

This program seems to reliably reproduce the issue. You may need to run it 2 or 3 times to observe the behavior, or you can increase the number of forced key rotations. I tested with a dev server started with these args: server -dev -dev-root-token-id=root -log-level=debug. The debug logging shows when keys are rotated.

Without this change the issue seems more common and typically results in a total runtime of over 50 seconds on my machine. With this change the issue seems less likely, but there still appears to be a good amount of lock contention in some cases. Even then, the max runtime I observed was about 35 seconds. I don't have metrics or profiling locally, so I'm not sure where the other contention is happening. It could be related to locking for the in-memory dev server storage.

Code
package main
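
// Reproduction program: it creates OIDC keys and roles, then floods the server
// with concurrent public JWKS reads while rotating a key to force cache
// expiration. It assumes VAULT_ADDR and VAULT_TOKEN in the environment point
// at the dev server described above (for example, the root token "root").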

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/hashicorp/go-cleanhttp"
	"github.com/hashicorp/vault/api"
)

func main() {
	clientConfig := api.DefaultConfig()
	client, err := api.NewClient(clientConfig)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*5)
	defer cancel()

	// create keys
	keys := map[string]map[string]interface{}{
		"k1": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "1m",
			"verification_ttl":   "10m",
		},
		"k2": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "3m",
			"verification_ttl":   "30m",
		},
		"k3": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "5m",
			"verification_ttl":   "50m",
		},
	}

	for name, data := range keys {
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/"+name, data)
		if err != nil {
			log.Fatalf("error creating key: %v", err)
		}
	}

	keyName := "k1"
	// create roles
	for i := 0; i < 5000; i++ {
		clientID := fmt.Sprintf("role-%d", i)
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/role/"+clientID, map[string]interface{}{
			"key":       keyName,
			"client_id": clientID,
			"ttl":       "1m", // can't be longer than key verification_ttl
		})
		if err != nil {
			log.Fatalf("error creating role: %v", err)
		}

		// switch up the keys a bit
		if i == 100 {
			keyName = "k2"
		} else if i == 300 {
			keyName = "k3"
		}
	}

	// prep JWKS requests
	startCh := make(chan struct{})
	wg := &sync.WaitGroup{}
	reqLogThreshold := time.Second * 2

	// have a routine that rotates the keys a few times to force the issue
	wg.Add(1)
	go func() {
		<-startCh
		time.Sleep(time.Millisecond * 20)
		for i := 0; i < 2; i++ {
			_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/k1/rotate", nil)
			if err != nil {
				log.Printf("error rotating key: %v", err)
			} else {
				log.Printf("rotated key")
			}
			time.Sleep(time.Millisecond * 300)
		}
		wg.Done()
	}()

	for i := 0; i < 5000; i++ {
		wg.Add(1)

		reqClientConf := api.DefaultConfig()
		reqClientConf.HttpClient = cleanhttp.DefaultClient()
		reqClient, err := api.NewClient(reqClientConf)
		if err != nil {
			log.Fatalf("error creating req client: %v", err)
		}
		reqClient.SetToken("")

		go func() {
			<-startCh
			reqStartTime := time.Now()
			_, err := reqClient.Logical().ReadWithContext(ctx, "identity/oidc/.well-known/keys")
			if err != nil {
				log.Printf("error reading JWKS: %v", err)
			}
			wg.Done()

			reqDuration := time.Now().Sub(reqStartTime)
			if reqLogThreshold <= reqDuration {
				log.Printf("long request: %s", reqDuration)
			}
		}()
	}

	startTime := time.Now()
	log.Printf("starting JWKS requests")
	close(startCh)

	wg.Wait()
	log.Printf("total request time: %s", time.Now().Sub(startTime))
}

@thyton
Contributor

thyton commented Sep 4, 2024

Hi @bhowe34. I ran your load test and saw the longest runtime come in above 30s, compared to 50s with the old code. However, after bumping the number of concurrent requests from 5000 to 10000, I saw tcp 127.0.0.1:8200: i/o timeout errors, especially when the vault server logged that oidcPeriodicFunc() rotated the keys.

There seem to be two layers of root cause here, and your PR definitely addresses the first layer.

Layer 1. Multiple JWKS requests hit pathOIDCReadPublicKeys() at the same time when there are no cached "jwks" or "jwksResponse" values in the cache. The old code lets all of these requests proceed to expireOIDCPublicKeys(). They all rush to the start of expireOIDCPublicKeys() without waiting for the first request to finish and set jwks in the cache. This race condition explains the spike of goroutines and the contended mutex in expireOIDCPublicKeys().

Layer 2. expireOIDCPublicKeys() is also called in oidcPeriodicFunc(). From a brief look, I think that during the flood of concurrent requests, oidcPeriodicFunc() rotates the keys that are due for rotation, calls expireOIDCPublicKeys(), and performs the cache flush. I'm curious to know if you've seen this behavior or other behaviors on your side.
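
As a rough illustration of how the two layers interact under the new locking, here is a self-contained sketch (readJWKS, regenerate, and the periodic flusher are made-up stand-ins, not the actual Vault code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var (
	mu          sync.Mutex   // serializes regeneration (the new lock)
	cached      atomic.Value // cached JWKS, stored as a string
	generations atomic.Int64 // how many times the expensive path ran
)

// regenerate stands in for the expensive work of expiring keys and rebuilding
// the JWKS, which is slow because it talks to the storage backend.
func regenerate() string {
	generations.Add(1)
	time.Sleep(10 * time.Millisecond)
	return `{"keys":[]}`
}

func readJWKS() string {
	if v, ok := cached.Load().(string); ok && v != "" {
		return v
	}
	mu.Lock()
	defer mu.Unlock()
	if v, ok := cached.Load().(string); ok && v != "" {
		return v // another request regenerated while we waited on the lock
	}
	v := regenerate()
	cached.Store(v)
	return v
}

func main() {
	// the "Layer 2" actor: a periodic job that flushes the cache a few times,
	// the way a key rotation would
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(25 * time.Millisecond)
			cached.Store("")
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = readJWKS()
		}()
	}
	wg.Wait()
	// with the guarded re-check, regenerations stay close to the number of
	// cache flushes rather than the number of requests
	fmt.Println("regenerations:", generations.Load())
}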

Unfortunately, we have higher priorities, so we won't be able to investigate the issue more deeply right now. I've added a task to the team's backlog to look into it further when we can. Thank you again for your contribution!

Contributor

@thyton thyton left a comment


LGTM

@thyton thyton merged commit 5934294 into hashicorp:main Sep 9, 2024
66 of 67 checks passed
@bhowe34
Contributor Author

bhowe34 commented Sep 9, 2024

I'm curious to know if you've seen this behavior or other behaviors on your side.

We do occasionally see periodic rotations cause the goroutine spike/request backlog, but there is no pattern to it. This fix will at least prevent hammering the storage backend in that case, but it may not fix high request latencies when the keys are rotated.
