The **Cephpocalypse** was an event in the fall of the 2018-2019 school year when the Ceph cluster that hosted our central network storage went completely offline. This incident demonstrated the capability of the Sysadmin team and prompted us to start thinking about ways to remove that single point of failure \(say, through a backup system\).
The purpose of this document is to record our mistakes and remedial actions so that future generations may learn from them.
## Background
After delays in obtaining approval for the new storage servers, we finally received the new G10 servers. In anticipation of eventually migrating our Ceph cluster onto them, we racked and prepared the new machines. One of those preparations was a needed upgrade to our production Ceph cluster, which went horribly wrong.
## Cause
On a Sunday in mid-September of 2018, the Storage Lead began the process of upgrading the component servers of our production Ceph cluster to the latest major release. We had been running jewel and needed to get to mimic. In quick succession, the Storage Lead upgraded these servers through two major releases. A later independent review suggested that this rapid two-release upgrade was to blame for the Cephpocalypse.
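For context, Ceph upgrades are generally done one major release at a time, with a health check between each step. A minimal sketch of what that per-release check looks like is below; the commands are standard Ceph CLI calls, but this is purely illustrative and not a record of what was actually run:

```bash
# Illustrative staged-upgrade check between major releases (not what was run).
ceph osd set noout          # keep OSDs from being marked out while daemons restart

# ...upgrade packages and restart mon/mgr/osd daemons on each host,
#    one major release at a time (e.g. jewel -> luminous -> mimic)...

ceph versions               # every daemon should report the same release (luminous and later)
ceph health detail          # wait for HEALTH_OK before starting the next release

ceph osd unset noout
```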
## Reactions
### From Sysadmins
When we received UptimeRobot notifications, we initially thought that the Storage Lead would be able to fix his mistake fairly quickly. When a fix did not materialize,
### From Other Students
We were featured [on tjToday](https://www.tjtoday.org/24197/showcase/the-system-to-saving-syslab/). For most students not in the SysLab,
### From Staff
Most TJ staff did not notice much disruption in our services since we were able to quickly restore our most public service, Ion.
## Remedial Actions
### Trying to fix Ceph
### Un-Cephing everything
### Moving things back to new Ceph
## What we learned
* It is important to keep off-site backups.
* It is nice to have multiple people know how Ceph operates.
* We lack contingency plans.
* Teamwork is important.
* We lack documentation.
## Results
After more than two excruciating weeks, the Storage Lead recovered data from the old cluster \(albeit partially corrupted\) and we began the process of moving everything back onto Ceph. That process continued over the following months.
`machines/other/agni.md`
# Agni
**Agni** is a box sitting on the floor in the CSL Machine Room between the Borg/NASA Rack and VM Rack 0. It is currently the lab's primary [NTP](../../technologies/networking/ntp.md) server.
## Technical Specifications
## History
Agni was received as a part of the 2008 Sun Academic Excellence Grant and was originally used as an emergency access workstation running Solaris 10 \(it did not rely on Kerberos or AFS\). Later, it became the Lab's backup server running OpenSolaris. It is now being used as our primary NTP server.
_THE_ single most powerful machine in the lab. Probably even outpaces the Cluster.
Snowy was acquired through a purchase order as part of Arya Kumar's 2019 Senior Research project. It went through a bunch of different names as the case kept getting delayed in the mail, eventually arriving 3 months late.
Snowy is currently sitting in the HPC rack on top of the HPC nodes. Its Ethernet cable comes from the Sun Rack, and its power cabling is a mess.
`procedures/account-provisioning.md`
Even though we have integrated authentication for accounts, user provisioning still needs to occur in every system independently.
## Unix accounts
We have a script called `create_user.sh` that provisions all necessary accounts. It takes the username, first name, and last name as arguments.
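A hypothetical invocation, using made-up example values for the username and name (check the script itself for the exact argument order and any extra prompts):

```bash
# Provision all accounts for a new user (example values only)
./create_user.sh jdoe John Doe
```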
`services/cluster/borg.md`

Borg is experimental, so documentation is limited for now.
Pertinent things to know:
* We have 40 \(`borg[01-40]`\), but can only run 12 \(`borg[01-12]`\)
* They are named after constellations or, in rare cases, `borgw[01-40]`
* They run the same OS \([CentOS](../../technologies/operating-systems/centos.md)\) and use the same [Ansible](../../technologies/tools/ansible.md) play as the rest of the cluster.
* We are currently transitioning to Ubuntu Server.
We are currently in the process of moving the CentOS nodes to Ubuntu Server in order to become standard with the rest of the Lab. This effort is in progress but requires a significant rewrite of many current cluster [Ansible](../../technologies/tools/ansible.md) plays.
`services/cluster/setup.md`

Do the standard steps to add new entries for a node \(or block of nodes\) to DHCP.
Also make sure the switches/routers/cables are set up right.
## Install Ubuntu
The preferred method is to use Netboot, but a regular Ubuntu install stick works as well. If installing from a USB stick, make sure your hostname matches the one specified in DNS/DHCP.
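As a quick sanity check after the install, something like the following should show a hostname that resolves correctly (the node name `borg01` is just an example; substitute the name you registered in DNS/DHCP):

```bash
# Make the hostname match the DNS/DHCP entry (example node name)
sudo hostnamectl set-hostname borg01

# Confirm the name and that it resolves to the address DHCP is handing out
hostnamectl status
getent hosts borg01
```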
`services/cluster/slurm-administration.md`
This page is intended to serve as a guide for Sysadmins who need to administrate the Slurm system running on the HPC cluster. If you're a regular user, this information probably won't be very interesting to you.
Here is a Slurm quickstart from their developers: [https://slurm.schedmd.com/quickstart.html](https://slurm.schedmd.com/quickstart.html)
## Accounts vs Users
The Slurm accounting system separates the ideas of Accounts and Users, which is slightly confusing at first. Once you look at it from the perspective of how Slurm functions at a higher level, though, these concepts make sense.
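For instance, a minimal sketch of tying a User to an Account with `sacctmgr` (the account and user names below are made-up examples, not our actual accounting setup):

```bash
# Create an Account (a grouping such as a class or research group)
sacctmgr add account parallel_students Description="Parallel Computing class"

# Create a User and associate it with that Account
sacctmgr add user jdoe Account=parallel_students

# Inspect the resulting associations
sacctmgr show associations
```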
`services/cluster/slurm.md`
### Creating Programs to Run on the HPC Cluster
The HPC Cluster is made up of 64-bit CentOS Linux and Ubuntu Server systems. While you can run any old Linux program on the Cluster, to take advantage of its parallel processing capability, it's _highly_ recommended to make use of a parallel programming interface. If you're taking or have taken Parallel Computing, you will know how to write and compile a program that uses MPI. If not, [http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml](http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml) is a good introduction to MPI in C. See below for instructions on running an MPI program on the cluster.
When compiling your program, it's best to connect to `infosphere` \(the login node explained in the section above\), so that your code is compiled in an environment similar to the one where it will run. The login node should have all the necessary tools to do so, such as gcc, g++, and mpicc/mpicxx.
**Important note: You won't be able to run mpicc or other special compilation tools until you load the appropriate programs into your environment. For MPI, the command to do so is** `module load mpi`**.** The reason for this is that different compiler systems can conflict with each other, and the module system gives you the flexibility to use whatever compiler you want by loading the appropriate modules.
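Putting that together, a minimal compile session on the login node might look like this (`hello_mpi.c` is a made-up file name):

```bash
# Load the MPI toolchain, then compile with the MPI wrapper compiler
module load mpi
mpicc -o hello_mpi hello_mpi.c
```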
Salloc allocates resources for a generic job and, by default, creates a shell with the allocated resources.
This is the simplest method, and is probably what you want to start out with. All you have to do is run `srun -n (processes) (path_to_program)`, where `(processes)` is the number of instances of your program that you want to run, and `(path_to_program)` is, you guessed it, the path to the program you want to run. If your program is an MPI program, you should not use `srun`, and instead use the `salloc` method described above.
If your command is successful, you should see `srun: jobid (x) submitted`. You can check on the status of your job by running `sacct`. Any output from your program will be printed to the console. For more resource options, run `man srun` or see the official Slurm documentation.
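For example, to run four copies of a made-up non-MPI program and then check on the job:

```bash
# Run 4 instances of a regular (non-MPI) program, then check job status
srun -n 4 ./myprog
sacct
```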