Jobs requeueing on new OS #2

Closed
heatherkellyucl opened this issue Sep 4, 2015 · 2 comments

Comments

@heatherkellyucl

Jobs which complete successfully or end on an error are going back into the queue to be rerun rather than being removed from it. This is a known issue and we are working on it.

heatherkellyucl self-assigned this Sep 4, 2015
@heatherkellyucl

This was fixed on Friday afternoon and jobs will complete properly now.

@LukeSudberyUCL

To summarise, here is what the problems were:

  • A new part of the epilog was introduced, designed to clear any locks on /dev/ipath that jobs left behind. This worked, but if there were no locks to clear in the first place, it raised an error.
  • Extra debugging and logging was added to find and record the above problem, and any future ones like it.
  • Once the original problem was resolved, the debugging (which retained the job's state as 'active' on nodes where it had failed) remained, and certain jobs kept reusing this data, reporting that they had failed when in fact they no longer should have.

So the first part was fixed on Friday, and the last step was resolved this morning (Monday).
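
For anyone writing similar cleanup steps, the first problem above (a cleanup that errors when there is nothing to clean) can be avoided by treating "nothing found" as success. The following is only a minimal sketch of that idea, not the actual epilog used on the cluster: the function name `clear_stale_locks` and the assumption that locks appear as files matching a glob pattern are both hypothetical, since this thread does not show how the /dev/ipath locks are actually held or released.

```python
import glob
import os


def clear_stale_locks(pattern):
    """Remove every file matching `pattern` and return how many were removed.

    Hypothetical helper: assumes locks are plain files matched by a glob.
    Finding no matches is treated as success rather than as an error.
    """
    removed = 0
    for path in glob.glob(pattern):
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            # The file vanished between listing and removal; nothing left to do.
            pass
    return removed
```

Calling the function on a directory with no matching files simply returns 0, so a wrapper script built on it can exit cleanly whether or not any locks were left behind.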
