
EC2 instance runners marked as orphan and deleted even when the job is running. #4391

ferlosvillas opened this issue Jan 28, 2025 · 6 comments


@ferlosvillas

Hello
I'm having issues with the AWS GHA runners: the EC2 instances are always marked as orphan when the scale-down Lambda runs, and when scale-down runs again they are terminated.
I'm using version 6.0.1 and deploying everything with the Terraform module.
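For reference, the deployment looks roughly like the following. This is a simplified sketch rather than my literal configuration: the module source is assumed to be the Terraform registry release, and the placeholder values are not the real ones.

```hcl
# Simplified sketch of the deployment; placeholder values are not the real ones.
module "github_runners" {
  source  = "philips-labs/github-runner/aws"  # assumed registry source
  version = "6.0.1"

  aws_region = "us-west-2"
  vpc_id     = "vpc-xxxxxxxx"                 # placeholder
  subnet_ids = ["subnet-048333888bdbd82e0"]   # one of the subnets from the scale-up log

  prefix = "gha-ondemand-multi-linux-x64-dem"

  github_app = {
    id             = "..."  # GitHub App id (placeholder)
    key_base64     = "..."  # base64-encoded private key (placeholder)
    webhook_secret = "..."  # webhook secret (placeholder)
  }

  instance_types                = ["t3a.2xlarge"]
  instance_target_capacity_type = "on-demand"
}
```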

Here is my test.
I run a job that waits for 2 hours, writing a message to the console every minute.
The workflow is started at 21:33.
This is the scale-up log:
2025-01-24 21:34:34 {"level":"INFO","message":"Created instance(s): i-0425d988f2c7fa6f4","sampling_rate":0,"service":"runners-scale-up","timestamp":"2025-01-25T00:34:29.744Z","xray_trace_id":"1-67943191-450de17da578343815965378","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"ec284d3c-9be4-5940-b336-2d32a8272852","function-name":"gha-ondemand-multi-linux-x64-dem-scale-up","runner":{"type":"Repo","owner":"fabfitfun/tv-api","namePrefix":"gha-ondemand-normal-","configuration":{"runnerType":"Repo","runnerOwner":"fabfitfun/tv-api","numberOfRunners":1,"ec2instanceCriteria":{"instanceTypes":["t3a.2xlarge"],"targetCapacityType":"on-demand","instanceAllocationStrategy":"lowest-price"},"environment":"gha-ondemand-multi-linux-x64-dem","launchTemplateName":"gha-ondemand-multi-linux-x64-dem-action-runner","subnets":["subnet-048333888bdbd82e0","subnet-053fb4a528b2cf900","subnet-0f069d8ae24648af2"],"tracingEnabled":false,"onDemandFailoverOnError":[]}},"github":{"event":"workflow_job","workflow_job_id":"36153145706"}}

Then this is the scale-down log (tagging pass):
2025-01-24 21:45:55 {"level":"DEBUG","message":"Found: '0' GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"GitHub runners for AWS runner instance: 'i-0425d988f2c7fa6f4': []","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 21:45:55 {"level":"DEBUG","message":"Tagging 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.100Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down","tags":[{"Key":"ghr:orphan","Value":"true"}]}
2025-01-24 21:45:55 {"level":"INFO","message":"Runner 'i-0425d988f2c7fa6f4' marked as orphan.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T00:45:53.388Z","xray_trace_id":"1-6794343b-6e0365590f6c94c11e72aff7","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"523a23f1-54ab-407f-af3f-26d9e01619a8","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}

And this is the second pass of scale-down:
2025-01-24 22:00:53 {"level":"INFO","message":"Terminating orphan runner 'i-0425d988f2c7fa6f4'","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"scale-down","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner 'i-0425d988f2c7fa6f4' will be terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.394Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}
2025-01-24 22:00:53 {"level":"DEBUG","message":"Runner i-0425d988f2c7fa6f4 has been terminated.","sampling_rate":0,"service":"runners-scale-down","timestamp":"2025-01-25T01:00:50.723Z","xray_trace_id":"1-679437bf-375a9fde597887a5102c1220","region":"us-west-2","environment":"gha-ondemand-multi-linux-x64-dem","module":"runners","aws-request-id":"73d294c7-c575-4bde-812b-68a9f2c43ef0","function-name":"gha-ondemand-multi-linux-x64-dem-scale-down"}

What might be wrong? Why is the scale-down script not detecting that the instance is still active?
Any suggestion or comment would be much appreciated.

Fernando.

@npalm
Member

npalm commented Jan 28, 2025

Scale-down indeed runs in two cycles: first marking instances, then in the next cycle terminating them. But an instance should only be marked orphan if the runner is not registered in GitHub. We have seen similar problems when runners are heavily loaded and the GitHub agent does not connect back to the mothership, but we have only seen this under heavy load. The load problem is a general one that also exists on GitHub-managed runners.
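For reference, the cadence of those two cycles and the minimum runner age are controlled by module variables, roughly like this (illustrative values, not a recommendation):

```hcl
module "github_runners" {
  # ... other configuration ...

  # How often the scale-down Lambda runs; one run marks candidates as orphan,
  # a later run terminates instances that are still marked (illustrative value).
  scale_down_schedule_expression = "cron(*/15 * * * ? *)"

  # Minimum time a runner should be running before scale-down will consider
  # terminating it when it is not busy (illustrative value).
  minimum_running_time_in_minutes = 10
}
```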

@Brend-Smits do you recognize this pattern?

@ferlosvillas
Author

ferlosvillas commented Jan 29, 2025

@npalm, yes, thanks for the clarification; that matches my understanding of how it works.
The problem is that the scale-down script does not detect the instance as active and tags it as orphan immediately. I reduced the scheduler interval to 5 minutes so I can reproduce the issue faster. Here are a couple of screenshots showing the status.

Here we can see the EC2 instance in AWS, the run in GitHub, and the instances in the runner group:
[screenshot]

This is the log in CloudWatch:
[screenshot]

@npalm
Member

npalm commented Jan 30, 2025

Can you reproduce the problem with one of the examples in the repository?

@ferlosvillas
Author

> Can you reproduce the problem with one of the examples in the repository?

Sorry, what do you mean?

@npalm
Member

npalm commented Feb 1, 2025

Your problem is clear; however, I do not see it on my test setup, nor on my production systems. I just tested the scenario you describe on my test setup, which is based on the example in this repo (examples/default). I had two long-running jobs, but in my setup scale-down does not mark the two runners as orphan as long as the jobs are running.

Can you reproduce the problem you describe based on the examples in the repository, for example the one in examples/default?

@AngryMane

I had the same problem, but it was resolved by setting runner_boot_time_in_minutes to a longer value.
After an EC2 instance registers its GitHub runner (runner-A) with the GitHub server, it may take some time before the GitHub server returns runner-A in runner REST queries.
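In module terms that is just a larger value for the variable, for example (illustrative value, pick one that fits your boot and registration time):

```hcl
module "github_runners" {
  # ... existing configuration ...

  # Give the runner more time to show up in the GitHub API after launch before
  # scale-down may treat the instance as orphan (illustrative value).
  runner_boot_time_in_minutes = 20
}
```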
