The **Cephpocalypse** was an event in the fall of the 2018-2019 school year when the Ceph cluster that hosted our central network storage went completely offline. This incident demonstrated the capability of the Sysadmin team and prompted us to start thinking about ways to remove that single point of failure \(say, through a backup system\).
The purpose of this document is to record our mistakes and remedial actions so that future generations may learn from them.
## Background
After delays in obtaining approval for the new storage servers, we finally received the new G10 servers. In anticipation of eventually migrating our Ceph cluster onto them, we racked and prepared the new machines. One of those preparations was a needed upgrade to our production Ceph cluster, which went horribly wrong.
## Cause
On a Sunday in mid-September of 2018, the Storage Lead began the process of upgrading the component servers of our production Ceph cluster to the latest major release. We had been running jewel and needed to get to mimic. In quick succession, the Storage Lead upgraded these servers through two major releases. A later independent review suggested that this rapid two-release upgrade was to blame for the Cephpocalypse.
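For context, Ceph upgrades are generally done one major release at a time, with a health check between each step. A minimal sketch of what that per-release check looks like is below; the commands are standard Ceph CLI calls, but this is purely illustrative and not a record of what was actually run:

```bash
# Illustrative staged-upgrade check between major releases (not what was run).
ceph osd set noout          # keep OSDs from being marked out while daemons restart

# ...upgrade packages and restart mon/mgr/osd daemons on each host,
#    one major release at a time (e.g. jewel -> luminous -> mimic)...

ceph versions               # every daemon should report the same release (luminous and later)
ceph health detail          # wait for HEALTH_OK before starting the next release

ceph osd unset noout
```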
## Reactions
### From Sysadmins
When we received UptimeRobot notifications, we initially thought that the Storage Lead would be able to fix his mistake fairly quickly. When a fix did not materialize,
### From Other Students
We were featured [on tjToday](https://www.tjtoday.org/24197/showcase/the-system-to-saving-syslab/). For most students not in the SysLab,
### From Staff
Most TJ staff did not notice much disruption in our services since we were able to quickly restore our most public service, Ion.
## Remedial Actions
### Trying to fix Ceph
### Un-Cephing everything
### Moving things back to new Ceph
## What we learned
* It is important to keep off-site backups.
* It is nice to have multiple people know how Ceph operates.
* We lack contingency plans.
* Teamwork is important.
* We lack documentation.
## Results
After more than two excruciating weeks, the Storage Lead recovered data from the old cluster \(albeit partially corrupted\) and we began the process of moving everything back onto Ceph. That process continued over the following months.
`machines/other/agni.md`
# Agni
**Agni** is a box sitting on the floor in the CSL Machine Room between the Borg/NASA Rack and VM Rack 0. It is currently the lab's primary [NTP](../../technologies/networking/ntp.md) server.
## Technical Specifications
## History
Agni was received as a part of the 2008 Sun Academic Excellence Grant and was originally used as an emergency access workstation running Solaris 10 \(it did not rely on Kerberos or AFS\). Later, it became the Lab's backup server running OpenSolaris. It is now being used as our primary NTP server.
_THE_ single most powerful machine in the lab. Probably even outpaces the Cluster.
Snowy was acquired through a purchase order as part of Arya Kumar's 2019 Senior Research project. It went through a bunch of different names as the case kept getting delayed in the mail, eventually arriving 3 months late.
Snowy is currently sitting in the HPC rack on top of the HPC nodes. Its Ethernet cable comes from the Sun Rack, and its power cabling is a mess.
`procedures/account-provisioning.md`
Even though we have integrated authentication for accounts, user provisioning still needs to occur in every system independently.
## Unix accounts
We have a script called `create_user.sh` that provisions all necessary accounts. It takes the username, first name, and last name as arguments.
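A hypothetical invocation, using made-up example values for the username and name (check the script itself for the exact argument order and any extra prompts):

```bash
# Provision all accounts for a new user (example values only)
./create_user.sh jdoe John Doe
```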
`services/cluster/borg.md`

Borg is experimental, so documentation is limited for now.
Pertinent things to know:
* We have 40 \(`borg[01-40]`\), but can only run 12 \(`borg[01-12]`\)
* They are named after constellations or, in rare cases, `borgw[01-40]`
* They run the same OS \([CentOS](../../technologies/operating-systems/centos.md)\) and use the same [Ansible](../../technologies/tools/ansible.md) play as the rest of the cluster.
* We are currently transitioning to Ubuntu Server.
We are currently in the process of moving the CentOS nodes to Ubuntu Server in order to become standard with the rest of the Lab. This effort is in progress but requires a significant rewrite of many current cluster [Ansible](../../technologies/tools/ansible.md) plays.
`services/cluster/setup.md`

Do the standard steps to add new entries for a node \(or block of nodes\) to DHCP.
Also make sure the switches/routers/cables are set up right.
## Install Ubuntu
The preferred method is to use Netboot, but a regular Ubuntu install stick works as well. If installing from a USB stick, make sure your hostname matches the one specified in DNS/DHCP.
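As a quick sanity check after the install, something like the following should show a hostname that resolves correctly (the node name `borg01` is just an example; substitute the name you registered in DNS/DHCP):

```bash
# Make the hostname match the DNS/DHCP entry (example node name)
sudo hostnamectl set-hostname borg01

# Confirm the name and that it resolves to the address DHCP is handing out
hostnamectl status
getent hosts borg01
```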
`services/cluster/slurm-administration.md`
This page is intended to serve as a guide for Sysadmins who need to administrate the Slurm system running on the HPC cluster. If you're a regular user, this information probably won't be very interesting to you.
Here is a Slurm quickstart from their developers: [https://slurm.schedmd.com/quickstart.html](https://slurm.schedmd.com/quickstart.html)
## Accounts vs Users
The Slurm accounting system separates the ideas of Accounts and Users, which is slightly confusing at first. Once you look at it from the perspective of how Slurm functions at a higher level, though, these concepts make sense.
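For instance, a minimal sketch of tying a User to an Account with `sacctmgr` (the account and user names below are made-up examples, not our actual accounting setup):

```bash
# Create an Account (a grouping such as a class or research group)
sacctmgr add account parallel_students Description="Parallel Computing class"

# Create a User and associate it with that Account
sacctmgr add user jdoe Account=parallel_students

# Inspect the resulting associations
sacctmgr show associations
```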
`services/cluster/slurm.md`
### Creating Programs to Run on the HPC Cluster
The HPC Cluster is made up of 64-bit CentOS Linux and Ubuntu Server systems. While you can run any old Linux program on the Cluster, to take advantage of its parallel processing capability, it's _highly_ recommended to make use of a parallel programming interface. If you're taking or have taken Parallel Computing, you will know how to write and compile a program that uses MPI. If not, [http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml](http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml) is a good introduction to MPI in C. See below for instructions on running an MPI program on the cluster.
When compiling your program, it's best to connect to `infosphere` \(the login node explained in the section above\), so that your code is compiled in an environment similar to the one where it will run. The login node should have all the necessary tools to do so, such as gcc, g++, and mpicc/mpicxx.
**Important note: You won't be able to run mpicc or other special compilation tools until you load the appropriate programs into your environment. For MPI, the command to do so is** `module load mpi`**.** The reason for this is that different compiler systems can conflict with each other, and the module system gives you the flexibility to use whatever compiler you want by loading the appropriate modules.
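Putting that together, a minimal compile session on the login node might look like this (`hello_mpi.c` is a made-up file name):

```bash
# Load the MPI toolchain, then compile with the MPI wrapper compiler
module load mpi
mpicc -o hello_mpi hello_mpi.c
```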
Salloc allocates resources for a generic job and, by default, creates a shell with the allocated resources.
This is the simplest method, and is probably what you want to start out with. All you have to do is run `srun -n (processes) (path_to_program)`, where `(processes)` is the number of instances of your program that you want to run, and `(path_to_program)` is, you guessed it, the path to the program you want to run. If your program is an MPI program, you should not use `srun`, and instead use the `salloc` method described above.
If your command is successful, you should see `srun: jobid (x) submitted`. You can check on the status of your job by running `sacct`. Any output from your program will be printed to the console. For more resource options, run `man srun` or see the official Slurm documentation.
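For example, to run four copies of a made-up non-MPI program and then check on the job:

```bash
# Run 4 instances of a regular (non-MPI) program, then check job status
srun -n 4 ./myprog
sacct
```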