Constant growing consul db size (seems related to entries not deleted in path sys/expire/token/$hex1/$hex2) #11178

Closed
bom-d-van opened this issue Mar 23, 2021 · 5 comments

Comments

@bom-d-van

bom-d-van commented Mar 23, 2021

Hi,

We are seeing the Consul snapshot size of our Vault storage backend grow constantly. The growth started around the time we began using the AWS secrets engine, so we suspect that the sys/expire/token/ expiration clean-up logic does not handle entries created by non-orphan batch tokens.

  1. In our Vault + Consul installation, there are over 4.7 million entries in Consul under a single path: vault/sys/expire/token/h6600xxx/xxx
  2. Reading the raw value of some of these entries shows what looks like an AWS secrets lease ID.
  3. However, when we try to look up that lease ID, our Vault instance returns: {"errors":["invalid lease"]}.
  4. Moreover, there are not 4.7 million leases: fewer than 96k AWS leases exist under the path vault/sys/expire/id/aws.
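
The mismatch described in points 1 to 4 can be checked mechanically. Below is a minimal sketch, assuming the raw value of each sys/expire/token entry is the lease ID it points at (which is what the entries we inspected look like) and that both key listings have already been exported from the storage backend. The function name `find_stale_token_entries` and the sample data are hypothetical.

```python
def find_stale_token_entries(token_index, live_lease_ids):
    """Return token-index keys whose stored lease ID has no live lease.

    token_index: dict mapping a sys/expire/token/... key to the lease ID
                 stored as its raw value (assumed layout).
    live_lease_ids: iterable of lease IDs still present under sys/expire/id.
    """
    live = set(live_lease_ids)
    return sorted(key for key, lease_id in token_index.items()
                  if lease_id not in live)


# Hypothetical sample data, not taken from a real cluster.
token_index = {
    "sys/expire/token/hTOKEN/hLEASE1": "aws/sts/example-role/lease-1",
    "sys/expire/token/hTOKEN/hLEASE2": "aws/sts/example-role/lease-2",
    "sys/expire/token/hTOKEN/hLEASE3": "aws/sts/example-role/lease-3",
}
live_lease_ids = ["aws/sts/example-role/lease-2"]

stale = find_stale_token_entries(token_index, live_lease_ids)
print(len(stale), "stale entries out of", len(token_index))
```

An entry is reported as stale when its lease ID is missing from the sys/expire/id listing, which matches the "invalid lease" lookups in point 3.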

To Reproduce

  1. Create a new aws secret engine: https://www.vaultproject.io/docs/secrets/aws#setup
  2. Create a new aws role: https://www.vaultproject.io/api-docs/secret/aws#create-update-role
  3. Create a non-orphan batch token using the root token: https://www.vaultproject.io/api-docs/auth/token#create-token
  4. Use the batch token to create a new sts credential: https://www.vaultproject.io/api-docs/secret/aws#generate-credentials
  5. After the lease returned by the previous call expires (or is revoked manually), the lease entry in sys/expire/id is deleted, but the corresponding entry in sys/expire/token is not.
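
The reproduction steps can be written down as an ordered list of HTTP API calls. This is only a sketch that builds the request plan without sending anything. The mount, role, and policy names, the role ARN, and the credentials are hypothetical placeholders; the 72h TTL is the value we use for our batch tokens. The explicit policy step is an assumption added so that the batch token does not need the root policy.

```python
def reproduction_plan(mount="aws", role="example-role", policy="example-aws-sts"):
    """Return the reproduction steps as (token, method, path, body) tuples.

    "root" means the call is made with the root token, "batch" with the
    batch token returned by the token-create call.
    """
    policy_hcl = f'path "{mount}/sts/{role}" {{ capabilities = ["read"] }}'
    return [
        # Step 1: enable and configure the AWS secrets engine.
        ("root", "POST", f"/v1/sys/mounts/{mount}", {"type": "aws"}),
        ("root", "POST", f"/v1/{mount}/config/root",
         {"access_key": "<access key>", "secret_key": "<secret key>",
          "region": "us-east-1"}),
        # Step 2: create a role that hands out STS credentials.
        ("root", "POST", f"/v1/{mount}/roles/{role}",
         {"credential_type": "assumed_role",
          "role_arns": ["arn:aws:iam::123456789012:role/example"]}),
        # Assumed helper step: a policy the batch token can use.
        ("root", "PUT", f"/v1/sys/policies/acl/{policy}", {"policy": policy_hcl}),
        # Step 3: non-orphan batch token (no "no_parent" in the body).
        ("root", "POST", "/v1/auth/token/create",
         {"type": "batch", "ttl": "72h", "policies": [policy]}),
        # Step 4: generate STS credentials with the batch token.
        ("batch", "GET", f"/v1/{mount}/sts/{role}", None),
        # Step 5: revoke the lease manually instead of waiting for expiry.
        ("root", "PUT", "/v1/sys/leases/revoke",
         {"lease_id": "<lease_id returned by step 4>"}),
    ]


for token, method, path, body in reproduction_plan():
    print(token, method, path)
```

After the last call, the sys/expire/token entry for the lease should still be visible through sys/raw if the bug is present.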

Expected behavior

vault/sys/expire/token/h6600xxx/xxx entries created with a batch token should also be cleaned up.

Environment:

  • Vault Server Version: 1.6.1 (6d2db3f)
  • Vault CLI Version (retrieve with vault version): various; some clients use the Vault CLI (1.6.1), others use the Go library github.com/hashicorp/vault (v0.11.1)
  • Server Operating System/Architecture: CentOS 3.10.0

Vault server configuration file(s):

max_lease_ttl = "219600h"

cluster_name = "production"

plugin_directory = "/etc/vault/plugin"

api_addr = "https://xxx:8200"

cluster_addr = "https://xxx:8201"

raw_storage_endpoint = true

listener "tcp" {
  address = ":8200"
  cluster_address = ":8201"
  tls_cert_file = "xxx"
  tls_key_file = "xxx"
}

telemetry {
  statsite_address = "127.0.0.1:8125"
  disable_hostname = true
}

Additional context

After reading the source code, we suspect that the following logic in vault/expiration.go might be the root cause.

When Vault creates new leased secrets, it creates both sys/expire/id and sys/expire/token entries for API calls made with non-orphan batch tokens as well as with service tokens.

But when deleting a lease, it only deletes the sys/expire/token entry for API calls made with service tokens.
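
To make the suspected asymmetry concrete, here is a toy in-memory model of the two indexes. It is not Vault's code: the class and method names are hypothetical, and it assumes (as the storage paths in this issue suggest) that the token index for a non-orphan batch token is filed under its parent token, while an orphan batch token gets no token index at all.

```python
class ToyExpirationIndex:
    """Toy model of sys/expire/id and sys/expire/token bookkeeping."""

    def __init__(self):
        self.by_id = {}     # models sys/expire/id/<lease_id>
        self.by_token = {}  # models sys/expire/token/<token>/<lease_id>

    def register(self, lease_id, token, token_type, parent=None):
        # Assumption: batch leases are indexed under the parent token,
        # service leases under the token itself.
        index_token = parent if token_type == "batch" else token
        self.by_id[lease_id] = (index_token, token_type)
        if index_token is not None:
            self.by_token[(index_token, lease_id)] = lease_id

    def revoke(self, lease_id):
        index_token, token_type = self.by_id.pop(lease_id)
        # Suspected bug: the token index is only removed for service tokens.
        if token_type == "service":
            self.by_token.pop((index_token, lease_id), None)


idx = ToyExpirationIndex()
idx.register("aws/sts/example-role/lease-1", token="batch-1",
             token_type="batch", parent="root-token")
idx.register("aws/sts/example-role/lease-2", token="service-1",
             token_type="service", parent="root-token")
idx.revoke("aws/sts/example-role/lease-1")
idx.revoke("aws/sts/example-role/lease-2")

print("leases left:", len(idx.by_id))
print("token index entries left:", sorted(idx.by_token))
```

In this model both leases disappear from the ID index, but the entry created through the batch token stays behind under the parent token, which is the pattern we see in Consul.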

@HridoyRoy
Contributor

Hi @bom-d-van ,
What is the max TTL of the corresponding role? Also, what are the TTLs set for the tokens, if they are being set?

Thanks!

@bom-d-van
Author

bom-d-van commented Apr 2, 2021

Hi @HridoyRoy, the TTLs are as follows:

The root token: never expires
The non-orphan batch token: 72h
Max TTL for AWS roles: 1h-12h (varies between roles)

@bom-d-van
Author

Some more information:

To mitigate the situation, we decided to migrate to orphan batch tokens. Since then, the Consul snapshot size has stabilized and the growth of entries in sys/expire/token/$hex1/$hex2 has stopped completely.

It seems that there are two ways to avoid this bug/issue:

  • don't use a never-expiring/long-TTL root/parent token to generate batch tokens that create leased dynamic secrets
  • use orphan batch tokens when creating leased dynamic secrets

Potential ways to handle this issue:

  • fail and return an error when a batch token whose root/parent token never expires is used to create leased dynamic secrets
  • clean up the corresponding sys/expire/token/hex1/hex2 entries for regular batch tokens when the lease expires


@bom-d-van
Author

Hi, another follow-up. We decided to manually clean up the obsolete entries under vault/sys/expire/token/h6600xxx/xxx using the sys/raw API. After the clean-up, our Consul DB snapshot is more than 80% smaller. It took us ~4 days to delete 5.7 million of them.
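
For anyone in the same situation, the clean-up loop looked roughly like the sketch below. It is not our actual script: `send` is a hypothetical transport callable, the `should_delete` predicate is an assumption about how to pick obsolete entries, and the fake in-memory store only stands in for a real server. It relies on the sys/raw endpoint supporting LIST and DELETE (which requires raw_storage_endpoint = true) and on raw paths being relative to Vault's storage root.

```python
def cleanup_prefix(send, prefix, should_delete):
    """Delete matching keys directly under prefix via sys/raw.

    send(method, path) performs one API call and returns parsed JSON or None.
    prefix must end with "/". Returns the number of deleted keys.
    """
    listing = send("LIST", f"/v1/sys/raw/{prefix}")
    deleted = 0
    for key in listing["data"]["keys"]:
        if key.endswith("/"):  # nested folder, not handled in this sketch
            continue
        if should_delete(key):
            send("DELETE", f"/v1/sys/raw/{prefix}{key}")
            deleted += 1
    return deleted


# Fake in-memory backend so the sketch runs without a Vault server.
prefix = "sys/expire/token/hTOKEN/"
store = {
    prefix + "hLEASE1": "aws/sts/example-role/lease-1",
    prefix + "hLEASE2": "aws/sts/example-role/lease-2",
    prefix + "hLEASE3": "aws/sts/example-role/lease-3",
}
live_leases = {"aws/sts/example-role/lease-2"}


def fake_send(method, path):
    raw_path = path[len("/v1/sys/raw/"):]
    if method == "LIST":
        keys = sorted(k[len(raw_path):] for k in store if k.startswith(raw_path))
        return {"data": {"keys": keys}}
    if method == "DELETE":
        store.pop(raw_path, None)
        return None
    raise ValueError(f"unsupported method: {method}")


deleted = cleanup_prefix(
    fake_send, prefix,
    should_delete=lambda key: store[prefix + key] not in live_leases,
)
print("deleted", deleted, "entries,", len(store), "left")
```

Deleting through sys/raw bypasses Vault's own bookkeeping, so the predicate should only select entries whose lease is already gone.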

We did not delete all of the entries created by the root token, though. We kept 8621 of them and then tried to revoke the root/parent token, hoping Vault would at least clean them up on revocation. It did not: the revoke API call timed out, so we then tried a token tidy API call.

$ VAULT_TOKEN=xxx /usr/local/sbin/vault-local token revoke -self

$ curl --header "X-Vault-Token: yyy" https://`hostname`/v1/auth/token/tidy -X POST

Only entries in vault/sys/token/{id,parent} were deleted.

vault/sys/token/id/h6600exxx
vault/sys/token/parent/h6600exxx/xxx
vault/sys/token/parent/h6600exxx/xxx
vault/sys/token/parent/h6600exxx/xxx
vault/sys/token/parent/h6600exxx/xxx
vault/sys/token/parent/h6600exxx/xxx

None of the entries in vault/sys/expire/token/h6600exxx/xxx were cleaned up by either the revoke or the tidy API call.

@briankassouf
Contributor

Should be fixed by #11377. Please let us know if you're still seeing issues post-upgrade.
