
Commit dd5db24

theo-o authored and gitbook-bot committed Oct 20, 2019
GitBook: [master] 14 pages modified
1 parent 5d3a2dc commit dd5db24

14 files changed: +56 -69 lines changed

‎SUMMARY.md

+1 -3
@@ -55,8 +55,8 @@
 * [Academic Services](services/academic-services/README.md)
 * [Tin](services/academic-services/tin.md)
 * [Othello](services/academic-services/othello/README.md)
-* [Setup](services/academic-services/othello/setup.md)
 * [Administration](services/academic-services/othello/administration.md)
+* [Setup](services/academic-services/othello/setup.md)
 * [Technologies](technologies/README.md)
 * [Web](technologies/web/README.md)
 * [Nginx](technologies/web/nginx.md)
@@ -214,8 +214,6 @@
 * [Procedures](procedures/README.md)
 * [Data Recovery](procedures/data-recovery.md)
 * [Account Provisioning](procedures/account-provisioning.md)
-* [Username Changes](procedures/username-changes.md)
-* [Virtual Machine Creation](procedures/vm-creation.md)
 * [tjSTAR](procedures/tjstar/README.md)
 * [Tech Support](procedures/tjstar/tech-support.md)
 * [Onboarding](procedures/onboarding/README.md)

+24 -8
@@ -1,34 +1,50 @@
 # 2018 Cephpocalypse
 
-The **Cephpocalypse** was an event occurring in the fall of the 2018-2019 school year, when the Ceph cluster which hosted our central storage drives went completely offline. This incident demonstrated the capability of the Sysadmin team and prompted us to start thinking about ways to remove that one point of failure \(say, through a backup system\).
+The **Cephpocalypse** was an event in the fall of the 2018-2019 school year when the Ceph cluster that hosted our central network storage went completely offline. This incident demonstrated the capability of the Sysadmin team and prompted us to start thinking about ways to remove that single point of failure \(say, through a backup system\).
 
 The purpose of this document is to record our mistakes and remedial actions so that future generations may learn from them.
 
-## Conditions
+## Background
 
-After delays in obtaining approval for the new storage servers, we received
+After delays in obtaining approval for the new storage servers, we finally received the new G10 servers. In anticipation of eventually transferring our Ceph cluster to these new servers, we mounted and prepared them. One of those preparations was a needed upgrade to our production Ceph cluster, and that upgrade went horribly wrong.
 
 ## Cause
 
-On Sunday, September , the Storage Lead \(SL\) began the process of proceeding with major release upgrades to the component servers of our production Ceph cluster.
+On a Sunday in mid-September of 2018, the Storage Lead began upgrading the component servers of our production Ceph cluster to the latest major release. We had been running Jewel and needed to get to Mimic. The Storage Lead, in quick succession, upgraded these servers by two major releases. A later independent review suggested that this rapid two-release upgrade was to blame for the Cephpocalypse.
 
-## Reaction
+## Reactions
 
 ### From Sysadmins
 
-### From other Students
+When we received UptimeRobot notifications, we initially thought that the Storage Lead would be able to fix his mistake fairly quickly. When a fix did not materialize,
 
-### From Administration
+### From other students
+
+We were featured [on tjToday](https://www.tjtoday.org/24197/showcase/the-system-to-saving-syslab/). For most students not in the SysLab,
+
+### From Staff
+
+Most TJ staff did not notice much disruption in our services since we were able to quickly restore our most public service, Ion.
 
 ## Remedial Actions
 
 ### Trying to fix Ceph
 
-### Un-cephing everything
+### Un-Cephing everything
 
 ### Moving things back to new Ceph
 
 ## What we learned
 
+* It is important to keep off-site backups.
+* It is nice to have multiple people who know how Ceph operates.
+* We lack contingency plans.
+* Teamwork is important.
+* We lack documentation.
+
 ## Results
 
+After more than two excruciating weeks, the Storage Lead recovered data from the old cluster \(albeit partially corrupted\) and we began the process of moving everything back onto Ceph. This process continued over the following months.
+
+
+

‎machines/other/agni.md

+2 -2
@@ -1,6 +1,6 @@
 # Agni
 
-**Agni** is a box sitting on the floor in the CSL Machine Room. It is currently the lab's primary [NTP](../../technologies/networking/ntp.md) server.
+**Agni** is a box sitting on the floor in the CSL Machine Room between the Borg/NASA Rack and VM Rack 0. It is currently the lab's primary [NTP](../../technologies/networking/ntp.md) server.
 
 ## Technical Specifications
 
@@ -16,5 +16,5 @@
 
 ## History
 
-Agni was received as a part of the 2008 Sun Academic Excellence Grant and was originally used as an emergency access workstation running Solaris 10 \(it did not rely on Kerberos or AFS\). Later it became the Lab's backup server running OpenSolaris.
+Agni was received as a part of the 2008 Sun Academic Excellence Grant and was originally used as an emergency access workstation running Solaris 10 \(it did not rely on Kerberos or AFS\). Later, it became the Lab's backup server running OpenSolaris. It is now being used as our primary NTP server.

‎machines/other/snowy.md

+2 -2
@@ -6,7 +6,7 @@ _THE_ single most powerful machine in the lab. Probably even outpaces the [Clust
 
 | Field | Value |
 | :--- | :--- |
-| **Server Type** | It's a desktop, not a server dummy |
+| **Server Type** | It's a desktop \(a documentation writer's worst nightmare\) |
 | **CPU** | AMD Ryzen Threadripper 1950X 16-Core Processor |
 | **RAM** | 64 GB |
 | **GPU** | 3x NVIDIA RTX 2080 |
@@ -17,7 +17,7 @@ _THE_ single most powerful machine in the lab. Probably even outpaces the [Clust
 
 Snowy was a purchase order as part of Arya Kumar's 2019 Senior Research project. He went through a bunch of different names as the case kept getting delayed in the mail, eventually arriving 3 months late.
 
-Snowy currently lives in Room 202, but we have plans to move it permanently into the Server Room where it can be part of the Cluster.
+Snowy is currently sitting in the HPC rack on top of the HPC nodes. Its Ethernet cable comes from the Sun Rack, and the power situation is a mess.
 
 ## Trivia

‎machines/switches/README.md

+2
@@ -1,4 +1,6 @@
 # Switches
 
+Switches form the backbone of our network. We highlight four of them in this documentation.
 
+More information on switches and their role can be found in [our runbooks](../../general/documentation/runbooks.md).

‎machines/ups.md

+2
@@ -2,3 +2,5 @@
 
 We have a few Uninterruptible Power Supplies that keep our machines powered while the backup generator turns on.
 
+More documentation can be found in [our runbooks](../general/documentation/runbooks.md).
+

‎machines/vlans.md

+1 -1
@@ -55,7 +55,7 @@ The following is a list of VLANs and the corresponding address spaces for reference
 
 * Cluster
 * IPv4: 198.38.20.0/25 \(198.38.20.0-198.38.20.127\)
-* IPv4 Gateway: 198.38.20.126 (gateway-2000.csl.tjhsst.edu)
+* IPv4 Gateway: 198.38.20.126 \(gateway-2000.csl.tjhsst.edu\)
 * IPv6: 2001:468:cc0:2000::/64
 * IPv6 Gateway: 2001:468:cc0:2000::/64

‎procedures/account-provisioning.md

-4
@@ -2,10 +2,6 @@
 
 Even though we have integrated authentication for accounts, user provisioning still needs to occur in every system independently.
 
-## Windows/Active Directory
-
-The Windows IT staff takes care of this. Sysadmins \(starting with the graduating class of 2006\) are moved into a separate ou \(organizational unit\) before this occurs and will have their accounts preserved, but passwords are still subject to expire annually.
-
 ## Unix accounts
 
 We have a script called `create_user.sh` that provisions all necessary accounts. It takes the username, first name, and last name as arguments.
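
For reference, a minimal usage sketch of that script (the username and names below are made-up examples; the argument order follows the description above):

```
# Hypothetical example: provision all accounts for a new user
./create_user.sh jdoe John Doe
```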

‎procedures/username-changes.md

-26
This file was deleted.

‎procedures/vm-creation.md

-13
This file was deleted.

‎services/cluster/borg.md

+4 -4
@@ -4,10 +4,10 @@ Borg is experimental, so documentation is limited for now.
 
 Pertinent things to know:
 
-* We have 40 \(`borgw2[01-40]`\), but can only run 12 \(`borgw2[01-12]`\)
-* They are named after constellations
+* We have 40 \(`borg[01-40]`\), but can only run 12 \(`borg[01-12]`\)
+* They are named after constellations or, in rare cases, `borgw[01-40]`
 * They run the same OS \([CentOS](../../technologies/operating-systems/centos.md)\) and use the same [Ansible](../../technologies/tools/ansible.md) play as the rest of the cluster.
-* To use them from infosphere, add the `-p compute2` flag to your `srun` or `salloc` play.
+* We are currently transitioning to Ubuntu Server.
 
-There is debate on whether or not we should move the Borg cluster to Ubuntu Server to be standard with the rest of the lab. The author of this documentation feels that's a good idea, but would require a substantially large rewrite of many current cluster [Ansible](../../technologies/tools/ansible.md) plays.
+We are currently moving the CentOS nodes to Ubuntu Server in order to become standard with the rest of the Lab. This effort is in progress but requires a significant rewrite of many of the current cluster [Ansible](../../technologies/tools/ansible.md) plays.

‎services/cluster/setup.md

+3 -3
@@ -12,13 +12,13 @@ Do the standard steps to add new entries for a node \(or block of nodes\) to DHC
 
 Also make sure the switches/routers/cables are set up right.
 
-## Install CentOS
+## Install Ubuntu
 
-The preferred method is to use Netboot, but a regular CentOS install stick works as well. If installing from an USB stick make sure your hostname matches the one specified in DNS/DHCP.
+The preferred method is to use Netboot, but a regular Ubuntu install stick works as well. If installing from a USB stick, make sure your hostname matches the one specified in DNS/DHCP.
 
 ## Install SSH Server
 
-`yum install openssh openssh-server`. 'Nuf said.
+`apt install openssh-server`. 'Nuf said.
 
 ## Run the Ansible play
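
As a rough sketch of the hostname and SSH steps above on a fresh Ubuntu node (the node name is a made-up example, and the `hostnamectl`/`systemctl` lines are assumptions rather than anything this page prescribes):

```
# Assumption: set the hostname to match the DNS/DHCP entry created earlier
sudo hostnamectl set-hostname example-node.csl.tjhsst.edu

# Install the SSH server; enabling it at boot is an assumption
sudo apt install openssh-server
sudo systemctl enable --now ssh
```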

‎services/cluster/slurm-administration.md

+12
@@ -2,6 +2,8 @@
 
 This page is intended to serve as a guide for Sysadmins who need to administrate the Slurm system running on the HPC cluster. If you're a regular user, this information probably won't be very interesting to you.
 
+Here is a Slurm quickstart from their developers: [https://slurm.schedmd.com/quickstart.html](https://slurm.schedmd.com/quickstart.html)
+
 ## Accounts vs Users
 
 The Slurm accounting system separates the ideas of Accounts and Users, which is slightly confusing at first. When you look at it from the higher-level functioning of Slurm though, these concepts make sense.
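
A quick, hedged illustration of how those two concepts surface in the accounting tools (standard `sacctmgr` queries; the output will depend on our own database):

```
# List the Accounts defined in the Slurm accounting database
sacctmgr show accounts

# Show which Users are associated with which Accounts
sacctmgr show associations format=Account,User
```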
@@ -41,3 +43,13 @@ cp -r /etc/skel /cluster/(username)
 chown -R (username) /cluster/(username)
 ```
 
+## Partitions/Nodes
+
+Slurm has a system of partitions that help segment work.
+
+In our setup, we have two partitions, `compute` and `gpu`.
+
+
+
+
+
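
To see those partitions and the nodes behind them, a couple of standard Slurm commands (the partition name passed to `scontrol` is taken from the text above):

```
# Summarize partitions, node counts, and node states
sinfo

# Dump the full configuration of a single partition
scontrol show partition compute
```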

‎services/cluster/slurm.md

+3 -3
@@ -49,9 +49,9 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 
 ### Creating Programs to Run on the HPC Cluster
 
-The HPC Cluster is comprised of 64-bit CentOS Linux systems. While you can run any old Linux program on the Cluster, to take advantage of the parallel processing capability that the Cluster has, it's _highly_ recommended to make use of a parallel programming interface. If you're taking or have taken Parallel Computing, you will know how to write and compile a program which uses MPI. If you aren't, [http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml](http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml) is a good introduction to MPI in C. See below for instructions on running an MPI program on the cluster.
+The HPC Cluster is composed of 64-bit CentOS Linux or Ubuntu Server systems. While you can run any old Linux program on the Cluster, to take advantage of its parallel processing capability, it's _highly_ recommended to make use of a parallel programming interface. If you're taking or have taken Parallel Computing, you will know how to write and compile a program that uses MPI. If you aren't, [http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml](http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml) is a good introduction to MPI in C. See below for instructions on running an MPI program on the cluster.
 
-When compiling your program, it's best to connect to infosphere \(the login node explained in the section above\), so that your code is compiled in a similar environment to where it will be run. The login node should have all the necessary tools to do so, such as gcc, g++, and mpicc/mpixx.
+When compiling your program, it's best to connect to `infosphere` \(the login node explained in the section above\), so that your code is compiled in a similar environment to where it will be run. The login node should have all the necessary tools to do so, such as gcc, g++, and mpicc/mpicxx.
 
 **Important note: You won't be able to run mpicc or other special compilation tools until you load the appropriate programs into your environment. For MPI, the command to do so is** `module load mpi`**.** The reason for this is different compiler systems can conflict with each other, and the module system gives you the flexibility to use whatever compiler you want by loading the appropriate modules.
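
A short sketch of that compile workflow on the login node (the source file name is a placeholder):

```
# Load the MPI toolchain into your environment first
module load mpi

# Compile an MPI program; hello_mpi.c is a placeholder file name
mpicc -o hello_mpi hello_mpi.c
```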

@@ -67,7 +67,7 @@ Salloc allocates resources for a generic job and, by default, creates a shell wi
 
 This is the simplest method, and is probably what you want to start out with. All you have to do is run `srun -n (processes) (path_to_program)`, where `(processes)` is the number of instances of your program that you want to run, and `(path_to_program)` is, you guessed it, the path to the program you want to run. If your program is an MPI program, you should not use `srun`, and instead use the `salloc` method described above.
 
-If your command is successful, you should see "srun: jobid \(x\) submitted". You can check on the status of your job by running `sacct`. You will receive any output of your program to the console. For more resource options, run `man srun` or use the official Slurm documentation.
+If your command is successful, you should see `srun: jobid (x) submitted`. You can check on the status of your job by running `sacct`. You will receive any output of your program to the console. For more resource options, run `man srun` or use the official Slurm documentation.
 
 #### `sbatch`
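
A minimal `srun` sketch of the above (the process count and program path are placeholders):

```
# Run 4 copies of a non-MPI program, then check on the job
srun -n 4 ./my_program
sacct
```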
