
lock public JWKS generation and re-check cache #27929

Merged
4 commits merged into hashicorp:main on Sep 9, 2024

Conversation

bhowe34
Contributor

@bhowe34 bhowe34 commented Jul 31, 2024

Description

What does this PR do?

This adds a new lock that just guards the public JWKS generation, to ensure multiple requests aren't generating it at the same time. It is only used by the public JWKS endpoint and the token introspection endpoint. After acquiring the lock, the code double-checks the cache to avoid expiring the keys again.

Currently, if multiple concurrent JWKS requests hit vault at the same time and there are no cached "jwks" or "jwksResponse" values in the cache, all the requests proceed to this lock and unnecessarily (after the first request) re-expire the JWKS. This causes unnecessary load on the CPU and the storage backend. Expiring the JWKS can also flush the cache, which results in more misses and expiration calls, making the problem worse.
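
For illustration, the lock-and-recheck behavior described above looks roughly like this (a minimal, self-contained sketch; jwksCache, genMu, and generateJWKS are made-up names rather than the actual Vault identifiers):

package main

import (
	"fmt"
	"sync"
)

type jwksCache struct {
	mu    sync.RWMutex // protects value
	genMu sync.Mutex   // new: serializes JWKS generation only
	value string       // cached JWKS document, "" when absent
}

func (c *jwksCache) load() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.value
}

func (c *jwksCache) store(v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value = v
}

// generateJWKS stands in for the expensive work of expiring keys and
// rebuilding the JWKS from storage.
func generateJWKS() string { return `{"keys":[]}` }

func (c *jwksCache) getJWKS() string {
	// fast path: the pre-existing cache lookup
	if v := c.load(); v != "" {
		return v
	}
	// serialize generation so concurrent cache misses don't all regenerate
	c.genMu.Lock()
	defer c.genMu.Unlock()
	// re-check the cache after acquiring the lock: another request may have
	// finished generating while we were waiting
	if v := c.load(); v != "" {
		return v
	}
	v := generateJWKS()
	c.store(v)
	return v
}

func main() {
	c := &jwksCache{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.getJWKS() // only one goroutine pays for generation
		}()
	}
	wg.Wait()
	fmt.Println(c.getJWKS())
}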

We have observed this behavior multiple times in our vault clusters, with the most recent occurrence resulting in hundreds of thousands of backlogged requests for the JWKS (most of which were cancelled on the client side but were still being processed by vault). This backlog of requests took over 8 hours to process with the CPU maxed out on a 96-core EC2 instance. Below are the goroutine dump, goroutine count, instance CPU usage, and storage requests we observed.

[Screenshots: goroutine dump, goroutine count, instance CPU usage, and storage backend requests]

@bhowe34 bhowe34 requested a review from a team as a code owner July 31, 2024 21:00
@heatherezell heatherezell added the identity/oidc, ecosystem, and bug labels on Aug 5, 2024
@heatherezell
Contributor

Thanks for this contribution! Can you please check the CONTRIBUTING.md file and add a changelog entry? I'll ask our engineers for a technical review as well.

@kpcraig
Contributor

kpcraig commented Aug 13, 2024

Hey @bhowe34, this looks good to me - you wouldn't happen to have run any experiments to see the performance effects of the fix, would you?

@bhowe34
Contributor Author

bhowe34 commented Aug 14, 2024

Hi, I haven't been able to test this under the same conditions in which it occurred, but I'll see if I can make a script for local testing that reliably replicates the issue. It may be difficult to replicate locally since the storage latency will be very low, which would reduce the time it takes to generate the JWKS.

@kpcraig
Contributor

kpcraig commented Aug 14, 2024

I'm not too worried - my sense is that this is a net positive tradeoff (waiting for the key to be created instead of roiling the cache), but it would be nice to have numbers if it's not too difficult to do.

@bhowe34
Contributor Author

bhowe34 commented Aug 15, 2024

This program seems to reliably reproduce the issue. You may need to run it 2 or 3 times to observe the behavior, or you can increase the number of forced key rotations. I tested with a dev server started with these args: server -dev -dev-root-token-id=root -log-level=debug. The debug logging shows when keys are rotated.

Without this change the issue seems more common and typically results in a total runtime of over 50 seconds on my machine. With this change the issue seems less likely, but there still appears to be a good amount of lock contention in some cases. Even then, the max runtime I observed was about 35 seconds. I don't have metrics or profiling locally, so I'm not sure where the other contention is happening. It could be related to locking for the in-memory dev server storage.

Code
package main
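
// Reproduction program: it creates OIDC keys and roles, then floods the server
// with concurrent public JWKS reads while rotating a key to force cache
// expiration. It assumes VAULT_ADDR and VAULT_TOKEN in the environment point
// at the dev server described above (for example, the root token "root").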

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/hashicorp/go-cleanhttp"
	"github.com/hashicorp/vault/api"
)

func main() {
	clientConfig := api.DefaultConfig()
	client, err := api.NewClient(clientConfig)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*5)
	defer cancel()

	// create keys
	keys := map[string]map[string]interface{}{
		"k1": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "1m",
			"verification_ttl":   "10m",
		},
		"k2": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "3m",
			"verification_ttl":   "30m",
		},
		"k3": {
			"algorithm":          "RS256",
			"allowed_client_ids": "*",
			"rotation_period":    "5m",
			"verification_ttl":   "50m",
		},
	}

	for name, data := range keys {
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/"+name, data)
		if err != nil {
			log.Fatalf("error creating key: %v", err)
		}
	}

	keyName := "k1"
	// create roles
	for i := 0; i < 5000; i++ {
		clientID := fmt.Sprintf("role-%d", i)
		_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/role/"+clientID, map[string]interface{}{
			"key":       keyName,
			"client_id": clientID,
			"ttl":       "1m", // can't be longer than key verification_ttl
		})
		if err != nil {
			log.Fatalf("error creating role: %v", err)
		}

		// switch up the keys a bit
		if i == 100 {
			keyName = "k2"
		} else if i == 300 {
			keyName = "k3"
		}
	}

	// prep JWKS requests
	startCh := make(chan struct{})
	wg := &sync.WaitGroup{}
	reqLogThreshold := time.Second * 2

	// have a routine that rotates the keys a few times to force the issue
	wg.Add(1)
	go func() {
		<-startCh
		time.Sleep(time.Millisecond * 20)
		for i := 0; i < 2; i++ {
			_, err := client.Logical().WriteWithContext(ctx, "identity/oidc/key/k1/rotate", nil)
			if err != nil {
				log.Printf("error rotating key: %v", err)
			} else {
				log.Printf("rotated key")
			}
			time.Sleep(time.Millisecond * 300)
		}
		wg.Done()
	}()

	for i := 0; i < 5000; i++ {
		wg.Add(1)

		reqClientConf := api.DefaultConfig()
		reqClientConf.HttpClient = cleanhttp.DefaultClient()
		reqClient, err := api.NewClient(reqClientConf)
		if err != nil {
			log.Fatalf("error creating req client: %v", err)
		}
		reqClient.SetToken("")

		go func() {
			<-startCh
			reqStartTime := time.Now()
			_, err := reqClient.Logical().ReadWithContext(ctx, "identity/oidc/.well-known/keys")
			if err != nil {
				log.Printf("error reading JWKS: %v", err)
			}
			wg.Done()

			reqDuration := time.Now().Sub(reqStartTime)
			if reqLogThreshold <= reqDuration {
				log.Printf("long request: %s", reqDuration)
			}
		}()
	}

	startTime := time.Now()
	log.Printf("starting JWKS requests")
	close(startCh)

	wg.Wait()
	log.Printf("total request time: %s", time.Now().Sub(startTime))
}

@thyton
Contributor

thyton commented Sep 4, 2024

Hi @bhowe34. I ran your load test and saw the longest runtime come in above 30s, compared to 50s with the old code. However, after bumping the number of concurrent requests from 5000 to 10000, I saw tcp 127.0.0.1:8200: i/o timeout errors, especially when the vault server logged that oidcPeriodicFunc() rotated the keys.

There seem to be two layers of root cause here, and your PR definitely addresses the first layer.

Layer 1. Multiple JWKS requests hit pathOIDCReadPublicKeys() at the same time when there are no cached "jwks" or "jwksResponse" values in the cache. The old code lets all of these requests proceed to expireOIDCPublicKeys(). They all rush to the start of expireOIDCPublicKeys() without waiting for the first request to finish and set jwks in the cache. This race condition explains the spike of goroutines and the contended mutex in expireOIDCPublicKeys().

Layer 2. expireOIDCPublicKeys() is also called in oidcPeriodicFunc(). From a brief look, I think that during the flood of concurrent requests, oidcPeriodicFunc() rotates the keys that are due for rotation, calls expireOIDCPublicKeys(), and performs the cache flush. I'm curious to know if you've seen this behavior or other behaviors on your side.
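
As a rough illustration of how the two layers interact under the new locking, here is a self-contained sketch (readJWKS, regenerate, and the periodic flusher are made-up stand-ins, not the actual Vault code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var (
	mu          sync.Mutex   // serializes regeneration (the new lock)
	cached      atomic.Value // cached JWKS, stored as a string
	generations atomic.Int64 // how many times the expensive path ran
)

// regenerate stands in for the expensive work of expiring keys and rebuilding
// the JWKS, which is slow because it talks to the storage backend.
func regenerate() string {
	generations.Add(1)
	time.Sleep(10 * time.Millisecond)
	return `{"keys":[]}`
}

func readJWKS() string {
	if v, ok := cached.Load().(string); ok && v != "" {
		return v
	}
	mu.Lock()
	defer mu.Unlock()
	if v, ok := cached.Load().(string); ok && v != "" {
		return v // another request regenerated while we waited on the lock
	}
	v := regenerate()
	cached.Store(v)
	return v
}

func main() {
	// the "Layer 2" actor: a periodic job that flushes the cache a few times,
	// the way a key rotation would
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(25 * time.Millisecond)
			cached.Store("")
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = readJWKS()
		}()
	}
	wg.Wait()
	// with the guarded re-check, regenerations stay close to the number of
	// cache flushes rather than the number of requests
	fmt.Println("regenerations:", generations.Load())
}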

Unfortunately, we have higher priorities, so we won't be able to investigate the issue more deeply right now. I've added a task to the team's backlog to look into it further when we can. Thank you again for your contribution!

Contributor

@thyton thyton left a comment


LGTM

@thyton thyton merged commit 5934294 into hashicorp:main Sep 9, 2024
66 of 67 checks passed
@bhowe34
Contributor Author

bhowe34 commented Sep 9, 2024

I'm curious to know if you've seen this behavior or other behaviors on your side.

We do occasionally see periodic rotations cause the goroutine spike/request backlog, but there is no pattern to it. This fix will at least prevent hammering the storage backend in that case, but it may not fix high request latencies when the keys are rotated.
