EC2 instance runners marked as orphan and deleted even when the job is running. #4391
Comments
Scale-down indeed runs in two cycles: first marking instances as orphan, then terminating them in a later cycle. But an instance should only be marked orphan if its runner is not registered in GitHub. We have seen similar problems when runners are heavily loaded and the GitHub agent does not connect back to the mothership, but we have only seen that under heavy load. The load problem is a general one and also exists on GitHub-managed runners. @Brend-Smits do you recognize this pattern?
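To make the two-cycle behaviour described above concrete, here is a minimal TypeScript sketch of the idea; the helper names (listGitHubRunnerNames, tagAsOrphan, terminateInstance) are hypothetical placeholders, not the module's actual functions.

```typescript
// Minimal sketch of the two-cycle scale-down behaviour: instances whose runner
// is not registered in GitHub are first tagged as orphan, and only terminated
// if they are still unregistered on a later cycle.
// Helper names below are illustrative placeholders, not the module's real code.

interface Ec2Runner {
  instanceId: string;
  orphan: boolean; // set when a previous cycle tagged the instance ghr:orphan
}

async function scaleDownCycle(
  instances: Ec2Runner[],
  listGitHubRunnerNames: () => Promise<string[]>,
  tagAsOrphan: (instanceId: string) => Promise<void>,
  terminateInstance: (instanceId: string) => Promise<void>,
): Promise<void> {
  const registered = new Set(await listGitHubRunnerNames());

  for (const instance of instances) {
    if (registered.has(instance.instanceId)) {
      // Runner is known to GitHub: never an orphan; busy/idle handling is separate.
      continue;
    }
    if (instance.orphan) {
      // Second cycle: still not registered, so the instance is terminated.
      await terminateInstance(instance.instanceId);
    } else {
      // First cycle: only mark it and give it one more interval to register.
      await tagAsOrphan(instance.instanceId);
    }
  }
}
```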
@npalm, yes, thanks for the clarification; that matches my understanding of how it works. Here we can see the EC2 instance in AWS, the run in GitHub, and the instances in the runner group.
Can you reproduce the problem with one of the examples in the repository?
Sorry, what do you mean?
Your problem is clear, however I do not see it on my test setup, nor on my production systems. I just tested the scenario you described on my test setup, based on the example in this repo (examples/default). I had two long-running jobs, but in my setup the scale-down does not mark the two runners as orphan as long as the jobs are running. Can you reproduce the problem you describe based on one of the examples in the repository, for example examples/default?
I had the same problem, but it was resolved by setting runner_boot_time_in_minutes to a longer value.
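For reference, the reason a longer runner_boot_time_in_minutes can help is that scale-down skips instances that are still inside their boot window, so a runner that is slow to register with GitHub is not yet eligible for orphan marking. Below is a minimal sketch of that check with hypothetical names; it is not the module's actual code.

```typescript
// Sketch of the boot-time grace period behind runner_boot_time_in_minutes.
// The function name and parameters are illustrative placeholders.
function isEligibleForOrphanCheck(
  launchTime: Date,
  bootTimeInMinutes: number,
  now: Date = new Date(),
): boolean {
  const ageMinutes = (now.getTime() - launchTime.getTime()) / 60_000;
  // While the instance is still within its boot window, assume the GitHub
  // agent simply has not registered yet and leave the instance untouched.
  return ageMinutes >= bootTimeInMinutes;
}

// Example: an instance launched 5 minutes ago with a 15-minute boot window
// is skipped this cycle, so it cannot be tagged ghr:orphan yet.
console.log(isEligibleForOrphanCheck(new Date(Date.now() - 5 * 60_000), 15)); // false
```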
Hello
I'm having some issues with AWS GHA runners: the EC2 instances are always marked as orphan when the scale-down script is executed, and when scale-down runs again they are terminated.
I'm using version 6.0.1 and deploying everything with the Terraform module.
Here is my test.
I execute a task that waits for 2 hours, sending a message to the console every minute.
The action is executed at 21:33.
This is the log of scale-up
2025-01-24 21:34:34 {"level":"INFO","message":"Created instance(s): i-0425d988f2c7fa6f4","sampling_rate":0,"service":"runners-scale-up","timestamp":"2025-01-25T00:34:29.744Z","xray_trace_id":"1-67943191-450de17da578343815965378","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"ec284d3c-9be4-5940-b336-2d32a8272852","function-name":"gha-ondemand-multi-linux-x64-dem-scale-up","runner":{"type":"Repo","owner":"fabfitfun/tv-api","namePrefix":"gha-ondemand-normal-","configuration":{"runnerType":"Repo","runnerOwner":"fabfitfun/tv-api","numberOfRunners":1,"ec2instanceCriteria":{"instanceTypes":["t3a.2xlarge"],"targetCapacityType":"on-demand","instanceAllocationStrategy":"lowest-price"},"environment":"gha-ondemand-multi-linux-x64-dem","launchTemplateName":"gha-ondemand-multi-linux-x64-dem-action-runner","subnets":["subnet-048333888bdbd82e0","subnet-053fb4a528b2cf900","subnet-0f069d8ae24648af2"],"tracingEnabled":false,"onDemandFailoverOnError":[]}},"github":{"event":"workflow_job","workflow_job_id":"36153145706"}}
Then this is the log of scale-down (the tagging pass)
2025-01-24 21:45:55 {"level":"DEBUG","message":"Found: '0' GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4': []","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"Tagging 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down","tags":[{"Key":"ghr:orphan","Value":"true"}]}
2025-01-24 21:45:55 {"level":"INFO","message":"Runner 'i-0425d988f2c7fa6f4' marked as orphan.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.388Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
And this is the second pass of scale-down
2025-01-24 22:00:53 {"level":"INFO","message":"Terminating orphan runner 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner 'i-0425d988f2c7fa6f4' will be terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner i-0425d988f2c7fa6f4 has been terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.723Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
What might be wrong? Why is the scale-down script not detecting that the instance is still active?
Any suggestion or comment would be much appreciated.
Fernando.
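One way to cross-check the tagging log above (which reports zero GitHub runners found for the instance) is to list the repository's self-hosted runners while the job is running and look for the instance id in a runner name. The sketch below is illustrative only: it assumes a token with access to the repo, and that the runner's registered name contains the EC2 instance id, which is how the scale-down log appears to match runners to instances.

```typescript
// Debugging sketch (not part of the module): list the repo's self-hosted
// runners via the GitHub API and check whether any of them belongs to the
// EC2 instance that scale-down is about to mark as orphan.
import { Octokit } from '@octokit/rest';

async function findRunnerForInstance(
  token: string,
  owner: string,
  repo: string,
  instanceId: string,
): Promise<void> {
  const octokit = new Octokit({ auth: token });

  // Paginate through all self-hosted runners registered on the repository.
  const runners = await octokit.paginate(octokit.rest.actions.listSelfHostedRunnersForRepo, {
    owner,
    repo,
    per_page: 100,
  });

  // Assumption: the runner name contains the EC2 instance id.
  const match = runners.find((runner) => runner.name.includes(instanceId));
  console.log(
    match
      ? `Found runner '${match.name}' (status: ${match.status}, busy: ${match.busy})`
      : `No registered runner found for instance ${instanceId}`,
  );
}

// Hypothetical usage; owner, repo, and token values are placeholders.
// findRunnerForInstance(process.env.GITHUB_TOKEN!, 'my-org', 'my-repo', 'i-0425d988f2c7fa6f4');
```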