Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adaptive longOp for du operation #1766

Merged
merged 1 commit into from
Oct 5, 2017

Conversation

sjenning
Copy link
Contributor

@sjenning sjenning commented Oct 5, 2017

When running pod density tests on the order of 500 pods, the du operation in the fsHandler can take longer than a second, even if everything about the filesystem is cached. This is a persistent condition that causes the message to continually print.

This PR makes longOp adaptive so if the long du duration is a persistent condition, the message won't continue in the logs forever.

https://bugzilla.redhat.com/show_bug.cgi?id=1498632

@dashpole @vishh @derekwaynecarr @ravisantoshgudimetla

@ravisantoshgudimetla
Copy link

ravisantoshgudimetla commented Oct 5, 2017

@sjenning Sometimes, I have noticed that du operation is taking more than 30 seconds, which means this line will be printed 300 times after which it should stop logging, which I believe should work fine. Any specific reason for not decreasing log level from 2 to higher value?

@@ -60,6 +59,8 @@ const DefaultPeriod = time.Minute

var _ FsHandler = &realFsHandler{}

var longOp = time.Second

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this variable to trackUsage fn?

@sjenning
Copy link
Contributor Author

sjenning commented Oct 5, 2017

@ravisantoshgudimetla at first I did just decrease the log level. But then someone running at the higher logging level would have the same complaint; that being continually seeing a message warning me about a persistent condition that I can't do anything about because someone decided what "too long" was.

@ravisantoshgudimetla
Copy link

at first I did just decrease the log level. But then someone running at the higher logging level would have the same complaint

Makes sense.

@@ -129,6 +129,10 @@ func (fh *realFsHandler) trackUsage() {
duration := time.Since(start)
if duration > longOp {
glog.V(2).Infof("du and find on following dirs took %v: %v", duration, []string{fh.rootfs, fh.extraDir})
// adapt longOp time so that message doesn't continue to print
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log something like the following:

"du and find on following dirs took %v: %v; will log in future if exceeds %v"

@derekwaynecarr
Copy link
Collaborator

LGTM

@derekwaynecarr
Copy link
Collaborator

/retest

@derekwaynecarr derekwaynecarr merged commit 3e659ec into google:master Oct 5, 2017
openshift-merge-robot added a commit to openshift/origin that referenced this pull request Oct 6, 2017
Automatic merge from submit-queue.

UPSTREAM: google/cadvisor: 1766: adaptive longOp for du operation

google/cadvisor#1766

xref https://bugzilla.redhat.com/show_bug.cgi?id=1498632

@derekwaynecarr @ravisantoshgudimetla 

hold pending #16615
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants