
EC2 instance runners marked as orphan and deleted even when the job is running. #4391

ferlosvillas opened this issue Jan 28, 2025 · 6 comments


@ferlosvillas

Hello
I'm having issues with the AWS GHA runners: the EC2 instances are always marked as orphan when the scale-down Lambda runs, and when scale-down runs again they are terminated.
I'm using version 6.0.1 and deploying everything with the Terraform module.
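For reference, the deployment looks roughly like the following. This is a simplified sketch rather than my literal configuration: the module source is assumed to be the Terraform registry release, and the placeholder values are not the real ones.

```hcl
# Simplified sketch of the deployment; placeholder values are not the real ones.
module "github_runners" {
  source  = "philips-labs/github-runner/aws"  # assumed registry source
  version = "6.0.1"

  aws_region = "us-west-2"
  vpc_id     = "vpc-xxxxxxxx"                 # placeholder
  subnet_ids = ["subnet-048333888bdbd82e0"]   # one of the subnets from the scale-up log

  prefix = "gha-ondemand-multi-linux-x64-dem"

  github_app = {
    id             = "..."  # GitHub App id (placeholder)
    key_base64     = "..."  # base64-encoded private key (placeholder)
    webhook_secret = "..."  # webhook secret (placeholder)
  }

  instance_types                = ["t3a.2xlarge"]
  instance_target_capacity_type = "on-demand"
}
```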

Here is my test.
I run a job that waits for 2 hours, writing a message to the console every minute.
The workflow is started at 21:33.
This is the scale-up log:
2025-01-24 21:34:34 {"level":"INFO","message":"Created instance(s): i-0425d988f2c7fa6f4","sampling_rate":0,"service":"runners-scale-up","timestamp":"2025-01-25T00:34:29.744Z","xray_trace_id":"1-67943191-450de17da578343815965378","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"ec284d3c-9be4-5940-b336-2d32a8272852","function-name":"gha-ondemand-multi-linux-x64-dem-scale-up","runner":{"type":"Repo","owner":"fabfitfun/tv-api","namePrefix":"gha-ondemand-normal-","configuration":{"runnerType":"Repo","runnerOwner":"fabfitfun/tv-api","numberOfRunners":1,"ec2instanceCriteria":{"instanceTypes":["t3a.2xlarge"],"targetCapacityType":"on-demand","instanceAllocationStrategy":"lowest-price"},"environment":"gha-ondemand-multi-linux-x64-dem","launchTemplateName":"gha-ondemand-multi-linux-x64-dem-action-runner","subnets":["subnet-048333888bdbd82e0","subnet-053fb4a528b2cf900","subnet-0f069d8ae24648af2"],"tracingEnabled":false,"onDemandFailoverOnError":[]}},"github":{"event":"workflow_job","workflow_job_id":"36153145706"}}

Then this is the scale-down log (tagging pass):
2025-01-24 21:45:55 {"level":"DEBUG","message":"Found: '0' GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4': []","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"Tagging 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down","tags":[{"Key":"ghr:orphan","Value":"true"}]}
2025-01-24 21:45:55 {"level":"INFO","message":"Runner 'i-0425d988f2c7fa6f4' marked as orphan.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.388Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}

And this is the second pass of scale-down:
2025-01-24 22:00:53 {"level":"INFO","message":"Terminating orphan runner 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner 'i-0425d988f2c7fa6f4' will be terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner i-0425d988f2c7fa6f4 has been terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.723Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}

What might be wrong? Why is the scale-down script not detecting that the instance is still active?
Any suggestion or comment would be much appreciated.

Fernando.

@npalm
Member

npalm commented Jan 28, 2025

Scale-down indeed runs in two cycles: first marking instances, then in the next cycle terminating them. But an instance should only be marked orphan if the runner is not registered in GitHub. We have seen similar problems when runners are heavily loaded and the GitHub agent does not connect back to the mothership, but we have only seen this under heavy load. The load problem is a general one that also exists on GitHub-managed runners.
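For reference, the cadence of those two cycles and the minimum runner age are controlled by module variables, roughly like this (illustrative values, not a recommendation):

```hcl
module "github_runners" {
  # ... other configuration ...

  # How often the scale-down Lambda runs; one run marks candidates as orphan,
  # a later run terminates instances that are still marked (illustrative value).
  scale_down_schedule_expression = "cron(*/15 * * * ? *)"

  # Minimum time a runner should be running before scale-down will consider
  # terminating it when it is not busy (illustrative value).
  minimum_running_time_in_minutes = 10
}
```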

@Brend-Smits do you recognize this pattern?

@ferlosvillas
Author

ferlosvillas commented Jan 29, 2025

@npalm, yes, thanks for the clarification; that matches my understanding of how it works.
The problem is that the scale-down script does not detect the instance as active and tags it as orphan immediately. I reduced the scheduler interval to 5 minutes so I can reproduce the issue faster. Here are a couple of screenshots showing the status.

Here we can see the EC2 instance in AWS, the run in GitHub, and the instances in the runner group:
[screenshot]

This is the log in CloudWatch:
[screenshot]

@npalm
Member

npalm commented Jan 30, 2025

Can you reproduce the problem with one of the examples in the repository?

@ferlosvillas
Author

> Can you reproduce the problem with one of the examples in the repository?

Sorry, what do you mean?

@npalm
Member

npalm commented Feb 1, 2025

Your problem is clear; however, I do not see it on my test setup, nor on my production systems. I just tested the scenario you describe on my test setup, which is based on the example in this repo (examples/default). I had two long-running jobs, but in my setup scale-down does not mark the two runners as orphan as long as the jobs are running.

Can you reproduce the problem you describe based on the examples in the repository, for example the one in examples/default?

@AngryMane

I had the same problem, but it was resolved by setting runner_boot_time_in_minutes to a longer value.
After an EC2 instance registers its GitHub runner (runner-A) with the GitHub server, it may take some time before the GitHub server returns runner-A in runner REST queries.
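In module terms that is just a larger value for the variable, for example (illustrative value, pick one that fits your boot and registration time):

```hcl
module "github_runners" {
  # ... existing configuration ...

  # Give the runner more time to show up in the GitHub API after launch before
  # scale-down may treat the instance as orphan (illustrative value).
  runner_boot_time_in_minutes = 20
}
```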
