set up GCS (HNS enabled) using blueprint for the ML workload #3690

Open · wants to merge 3 commits into base: develop
42 changes: 3 additions & 39 deletions examples/hypercompute_clusters/a3u-gke-gcs/README.md
@@ -44,48 +44,12 @@ Storage (GCS).
gcloud storage buckets update gs://${BUCKET_NAME} --versioning
```

3. **Create and Configure GCS Buckets:**

* Create separate GCS buckets for training data and checkpoint/restart data:

```bash
PROJECT_ID=<your-gcp-project>
REGION=<your-preferred-region>
TRAINING_BUCKET_NAME=<training-bucket-name>
CHECKPOINT_BUCKET_NAME=<checkpoint-bucket-name>
PROJECT_NUMBER=<your-project-number>

gcloud storage buckets create gs://${TRAINING_BUCKET_NAME} \
--location=${REGION} \
--uniform-bucket-level-access \
--enable-hierarchical-namespace

gcloud storage buckets create gs://${CHECKPOINT_BUCKET_NAME} \
--location=${REGION} \
--uniform-bucket-level-access \
--enable-hierarchical-namespace
```

* Grant workload identity service accounts (WI SAs) access to the buckets:

```bash

gcloud storage buckets add-iam-policy-binding gs://${TRAINING_BUCKET_NAME} \
--member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/default" \
--role roles/storage.objectUser

gcloud storage buckets add-iam-policy-binding gs://${CHECKPOINT_BUCKET_NAME} \
--member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/default" \
--role roles/storage.objectUser
```

4. **Customize Deployment Configuration:**
3. **Customize Deployment Configuration:**

Modify the `deployment.yaml` file to suit your needs. This will include
region/zone, nodepool sizes, reservation name, and checkpoint/training bucket
names.
region/zone, nodepool sizes, and reservation name.

5. **Deploy the Cluster:**
4. **Deploy the Cluster:**

Use the `gcluster` tool to deploy your GKE cluster with the desired configuration:

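For reference, the "Deploy the Cluster" step above uses the Cluster Toolkit `gcluster` CLI. A minimal sketch of that invocation, assuming the blueprint and deployment file names from this example directory (the exact command in the collapsed README context may differ):

```bash
# Deploy the customized blueprint; -d points at the deployment variables file.
# --auto-approve skips the interactive confirmation and is optional.
./gcluster deploy -d deployment.yaml a3u-gke-gcs.yaml --auto-approve
```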
39 changes: 25 additions & 14 deletions examples/hypercompute_clusters/a3u-gke-gcs/a3u-gke-gcs.yaml
@@ -41,9 +41,6 @@ vars:
# <project>/<reservation-name>/reservationBlocks/<reservation-block-name>
extended_reservation:

# The name of the GCS bucket used for training data
training_bucket_name:

# The following variables do not need to be modified
nccl_installer_path: $(ghpc_stage("./nccl-rdma-installer.yaml"))
mtu_size: 8896
@@ -118,6 +115,22 @@ deployment_groups:
- storage.objectAdmin
- artifactregistry.reader

- id: training_bucket
source: community/modules/file-system/cloud-storage-bucket
settings:
local_mount: /data
random_suffix: true
force_destroy: true
enable_hierarchical_namespace: true

- id: checkpoint_bucket
source: community/modules/file-system/cloud-storage-bucket
settings:
local_mount: /data
random_suffix: true
force_destroy: true
enable_hierarchical_namespace: true

- id: a3-ultragpu-cluster
source: modules/scheduler/gke-cluster
use: [gke-a3-ultra-net-0, workload_service_account]
@@ -209,13 +222,12 @@ deployment_groups:
apply_manifests:
- source: $(vars.nccl_installer_path)

# Create a remote mount of $(vars.training_bucket_name)
# using mount options optimized for reading training
# data.
# Create a remote mount of training_bucket using
# mount options optimized for reading training data.
- id: gcs-training
source: modules/file-system/pre-existing-network-storage
settings:
remote_mount: $(vars.training_bucket_name)
remote_mount: $(training_bucket.gcs_bucket_name)
local_mount: /training-data
fs_type: gcsfuse
mount_options: >-
@@ -227,13 +239,12 @@
file-cache:cache-file-for-range-read:true,
file-system:kernel-list-cache-ttl-secs:-1

# Create a remote mount of $(vars.checkpoint_bucket_name)
# using mount options optimized for writing and reading
# checkpoint data.
# Create a remote mount of checkpoint_bucket using mount
# options optimized for writing and reading checkpoint data.
- id: gcs-checkpointing
source: modules/file-system/pre-existing-network-storage
settings:
remote_mount: $(vars.checkpoint_bucket_name)
remote_mount: $(checkpoint_bucket.gcs_bucket_name)
local_mount: /checkpoint-data
fs_type: gcsfuse
mount_options: >-
@@ -250,15 +261,15 @@
source: modules/file-system/gke-persistent-volume
use: [gcs-training, a3-ultragpu-cluster]
settings:
gcs_bucket_name: $(vars.training_bucket_name)
gcs_bucket_name: $(training_bucket.gcs_bucket_name)
capacity_gb: 1000000

# Persistent Volume for checkpoint data
- id: checkpointing-pv
source: modules/file-system/gke-persistent-volume
use: [gcs-checkpointing, a3-ultragpu-cluster]
settings:
gcs_bucket_name: $(vars.checkpoint_bucket_name)
gcs_bucket_name: $(checkpoint_bucket.gcs_bucket_name)
capacity_gb: 1000000

# This is an example job that will install and run an `fio`
@@ -277,7 +288,7 @@
mount_path: /scratch-data
size_gb: 1000 # Use 1 out of 12 TB for local scratch

k8s_service_account_name: default
k8s_service_account_name: workload-identity-k8s-sa
image: ubuntu:latest

command:
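The `training_bucket` and `checkpoint_bucket` modules added in this blueprint replace the manual bucket-creation step removed from the README. Conceptually, each module provisions roughly what the removed `gcloud` command did; a hedged sketch for the training bucket (the bucket name shown is a placeholder, since the module derives the real name and appends a random suffix):

```bash
# Approximate equivalent of what the cloud-storage-bucket module provisions:
# an HNS-enabled bucket with uniform bucket-level access in the chosen region.
gcloud storage buckets create gs://<deployment-name>-training-<suffix> \
    --location=${REGION} \
    --uniform-bucket-level-access \
    --enable-hierarchical-namespace
```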
6 changes: 0 additions & 6 deletions examples/hypercompute_clusters/a3u-gke-gcs/deployment.yaml
@@ -43,9 +43,3 @@ vars:
# The name of the compute engine reservation of A3-Ultra nodes in the form of
# <project>/<reservation-name>/reservationBlocks/<reservation-block-name>
extended_reservation:

# The name of the GCS bucket used for training data
training_bucket_name:

# The name of the GCS bucket used for checkpoint/restart data.
checkpoint_bucket_name:
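For completeness, a minimal sketch of a filled-in `deployment.yaml` after this change. Only `extended_reservation` (and the removed bucket variables) appear in the visible diff context; every other field below is an assumption based on typical Cluster Toolkit deployment files and should be replaced with your own values:

```yaml
# Hypothetical example values; only extended_reservation appears in the diff
# above, the remaining fields are assumed.
vars:
  project_id: my-gcp-project
  deployment_name: a3u-gke-gcs
  region: europe-west1
  zone: europe-west1-b
  # <project>/<reservation-name>/reservationBlocks/<reservation-block-name>
  extended_reservation: my-project/my-reservation/reservationBlocks/my-block
```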