
Raspberry Pi provisioning updates (and a general status update) #841

Closed
rvagg wants to merge 3 commits from the rvagg/raspberry-pi branch

Conversation

rvagg
Member

@rvagg rvagg commented Aug 20, 2017

Got the Pi cluster all updated and working nicely again. Everything's online and seems to be working, although there have been some odd, spotty Jenkins or network problems. If they don't settle down there'll have to be more investigation and perhaps some more hardware replacements.

I reprovisioned around 10 of them in this round of fixes and in the process had to replace 4 SD cards (FYI, I can tell a card needs replacing when I write over it with a fresh image but the old image is still there when I boot it! Very strange behaviour, but a consistent way to know which cards are bad).
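
(For reference, a rough way to spot-check a suspect card is to write the image, flush buffers, then read back the same number of bytes and compare checksums; the device and image names below are placeholders, not the real ones.)

  # write a fresh image, sync it out, then read it back and compare digests
  sudo dd if=raspbian.img of=/dev/sdX bs=4M conv=fsync status=progress
  sudo blockdev --flushbufs /dev/sdX
  sudo head -c "$(stat -c%s raspbian.img)" /dev/sdX | sha256sum
  sha256sum raspbian.img   # a healthy card produces the same digest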

In this update I'm turning on idmapd for NFSv4 support from my new NAS. I've upgraded git to the patched, secure version and have tweaked the instructions to back off on overclocking. The Pi 1 B+'s are where I've had to replace SD cards the most, and I'm guessing the overclocking might be contributing to the fatigue.
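
(Roughly what the NFSv4 client bits look like on each Pi, in case anyone wants to replicate it; this is a sketch only, and the domain, hostname and paths are placeholders rather than the real values.)

  # /etc/idmapd.conf: shared ID-mapping domain so NFSv4 maps users consistently
  [General]
  Domain = pi.cluster.local

  # /etc/default/nfs-common: on this era of Raspbian/Debian, idmapd is started via nfs-common
  NEED_IDMAPD=yes

  # /etc/fstab: mount the NAS export over NFSv4
  nas.local:/export/iojs  /home/iojs/.nfs  nfs4  rw,hard,noatime  0 0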

I've still got 3 or 4 replacement SD cards sitting ready for next time. I haven't ordered more, but I still have ~$170 in the donation bucket to pay for a batch when I need to restock. (Also FYI, I dipped into that this week for a replacement power supply for one of the Odroid XU's; it's all in the financials spreadsheet that some of you have access to.) Both of the XU's are still offline; they both seem to have disk problems, but it's eMMC, so it's not as simple as reformatting an SD card (I have an SD adapter for it but it's still a bit tedious). The XU3 is online and back in the mix, but it's the only one carrying the "ubuntu1404-armv7" load, which I re-enabled in node-test-commit-arm and which is normally handled by the XU3 and the two XU's.

While I'm doing a status update: test-mininodes-ubuntu1604-arm64_odroid_c2-2 is offline. It's one of the 3 Odroid C2's hosted at miniNodes. We've had consistent problems with it and now it's locked up. I got David to have a look; he got onto the console, ran an fsck and found a ton of errors, which suggests it needs an SD card replacement. I've told him it's not a super high priority, but I imagine he'll get to it sometime this week. Jenkins can do everything we need once we get it back online again in a virgin state.
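
(If anyone wants to repeat the check from the console, it's roughly the line below; the device name is a guess at the C2's partition layout, not confirmed.)

  # read-only filesystem check (-f forces it, -n answers "no" to every repair prompt)
  sudo fsck -fn /dev/mmcblk0p2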

msft-serialport-win1 and test-digitalocean-ubuntu1604_docker-x64-1 are the only other machines offline at the moment. I don't think I've ever seen msft-serialport-win1 online, and I'm not sure who was working with it, so perhaps we should remove it. test-digitalocean-ubuntu1604_docker-x64-1 hasn't been online for a while, and we have test-digitalocean-ubuntu1604_docker_alpine34-x64-1 in the mix which seems to be doing all the work, so I'm thinking test-digitalocean-ubuntu1604_docker-x64-1 is unnecessary and can be removed. If anyone has background on any of these, let me know before I wade in and make a mess.

/cc @nodejs/build cause that was quite a status update!

gibfahn

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Aug 20, 2017

@nodejs/testing (the update above covers some build infra matters that may be interesting). FYI, I've been messing with my NAS today and it's been having some nasty disk troubles, which has impacted the NFS exports that all the Pi's are using. There are going to be a lot of failures on the Pi's today that can be written off to this. They should have settled down now (assuming my NAS is in proper order again, no guarantee) and be working like normal. If Pi failures persist, though, I may need to dig in and do some bulk restarting/remounting/re-somethinging; ping me if you're seeing anything that's not related to standard test failures from tomorrow onward.
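
(If it does come to bulk remounting, it would be something along the lines of the ad-hoc run below; the inventory group name and mount point are placeholders, not necessarily what's in our inventory.)

  # lazily unmount the stale NFS workspace on every Pi, then remount everything from fstab
  ansible raspberry_pi -i ansible/inventory.yml -b -m shell \
    -a 'umount -l /home/iojs/.nfs || true; mount -a'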

@jasnell
Member

jasnell commented Aug 20, 2017

@rvagg ... Thank you for the update. I've been wondering about the possibility of convincing nearForm to work on setting up a redundant Pi cluster, to help with reliability and performance and to reduce the general bus factor. I haven't looked around on the infra side lately, so forgive me if I've missed it, but do you have any write-up on how the Pi cluster is built, what components are necessary, and how things are configured/managed?

gibfahn

This comment was marked as off-topic.

gibfahn

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Aug 20, 2017

@jasnell no, I haven't done a write-up on the specifics, but I could provide them as needed.

If there's the staffing and the capacity there to actually manage a small cluster, then it may make sense for me to ship over a chunk of this cluster, maybe half (including power and a network switch), so we have proper redundancy. Unfortunately it's not a trivial commitment, and it needs someone to become a minor expert in monitoring and maintaining these things when they go awry. I could provide full instructions for how I do it, but it's not essential that it's set up exactly the same way.

Some things it'd need:

  • Space, obviously
  • A network with enough spare IPs; a private NAT would be ideal. I like to keep the cluster network isolated from my private network because a fairly wide group of people have access (build/test), and also because the code that gets tested isn't strictly safe (we ask collaborators to tick "certify safe" as an acknowledgement that a contributor could slip in malicious code that could mess with our cluster).
  • A way to provide external access to the machines. I have a computer here that acts as a jump host that can be used by the build/test team, so everyone's SSH configs make a connection there and then jump into the internal host by internal IP (see the SSH config sketch after this list). An alternative might be to provide specific ports on an externally exposed router that each go to one of the machines.
  • A host that can expose some NFS disk. This is probably going to be the trickiest bit. I have a NAS host that exposes an SSD to all of the Pi's that they share for their own workspace and a shared ccache directory (the NAS and SSD are mine, not technically part of the Foundation's hardware). The Pi's don't have enough storage on them to be reliable and the NFS disk is (I think) slightly faster than a local SD Card. Having a shared space is also helpful for ccache (not actually used much since we cross-compile for tests) and for providing shared access to a git repo that's used to download binaries from the cross-compiler.
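
(A sketch of what the jump-host arrangement looks like from a team member's side; the hostnames and patterns below are placeholders, not the real ones.)

  # ~/.ssh/config: everything behind the NAT is reached via the jump host
  Host pi-jump
    HostName jump.example.com
    User iojs

  Host pi-*
    User iojs
    ProxyJump pi-jump    # on older OpenSSH: ProxyCommand ssh -W %h:%p pi-jump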

In terms of people-power, it needs someone who can understand how it all works, monitor the cluster for failures, wipe and reimage SD cards, replace ones that look dodgy, and go through the manual steps to prepare hosts for running ansible against them (here). We'd probably need to make sure this person was part of our build/test group so they have enough access to do all of this themselves.
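
(For a rough idea of the day-to-day workflow once a card is reimaged and the manual prep is done, it boils down to a run like the one below; the playbook path and host name are illustrative rather than exact.)

  # run the worker playbook against just the freshly reimaged host
  ansible-playbook -i ansible/inventory.yml playbooks/jenkins/worker/create.yml \
    --limit 'freshly-reimaged-pi'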

joaocgreis

This comment was marked as off-topic.

@joaocgreis
Member

@rvagg is https://github.com/nodejs/build/blob/master/ansible/inventory.yml up to date? It's not great that we have to keep two ansible inventories, but the one under ansible/ should be kept up to date because it's used to generate the ssh config file (and already does that quite well - it's very straightforward to update with it). The one under setup/ should perhaps only have the machines that are still deployed using the old scripts.

@rvagg
Member Author

rvagg commented Aug 23, 2017

Good question! So, ansible/inventory.yml is up to date. I've been very strict with my use of it and have been pushing updates to it whenever I provision a new host (not using PRs; I hope nobody objects to pushes that change IP addresses). On the other hand, setup/ansible-inventory.yml is woefully out of date except for the Raspberry Pi's; I keep that part updated properly (not that it needs regular changing). I don't believe it's used for anything else, although I could imagine there are things not properly in ansible/ yet where it's still relevant, like the website.

At this stage I'm not quite sure what to do about it because if we clean it up then maybe we should be cleaning up all of the directories in setup/ that we're no longer using too? I'm deploying all other standard hosts from the new config now and it's working great, and I know the old setup/ ones are getting out of date (e.g. see the recent EPEL PR from me that went to ansible/, not setup/).

I suppose that I'm currently in the best position to make a first-pass call on what's redundant now since I've been doing a lot of the standard provisions lately. I'll put it on my list.

@gibfahn
Member

gibfahn commented Aug 23, 2017

At this stage I'm not quite sure what to do about it because if we clean it up then maybe we should be cleaning up all of the directories in setup/ that we're no longer using too?

Yes we definitely should, 💯 to that.

mhdawson

This comment was marked as off-topic.

@mcollina
Member

In terms of people-power, it needs someone who can understand how it all works, can monitor the cluster for failures, can wipe and reimage SD cards, replace ones that look dodgy, go through the manual steps to prepare hosts for running ansible against (here). We'd probably need to make sure this person was part of our build/test group so they have enough access to do this all themselves.

The major problem is people on the nearForm side, and providing a quick enough response time. It seems the Pi cluster needs more care than a bunch of normal servers in terms of maintenance, so that might be a blocker. We'll see what we can do.
(For everyone, I am remote and not located at the office either).

@jasnell
Member

jasnell commented Aug 23, 2017

Yep, this is the main reason I was asking about documenting the process more. In order to make a decision about whether it is possible for us to set up a redundant cluster we need to get an idea of the resource commitment required. We may need to put out a call to Node.js Foundation member organizations to sponsor the effort by offering resources.

@joaocgreis
Member

I don't think I've ever seen msft-serialport-win1 online and I'm not sure who was working with it so perhaps we should remove it.

Ref: serialport/node-serialport#879

@munyirik do you still want to bring this server back online or can we delete the configuration? (I can make a backup of the job config somewhere)

@munyirik

@joaocgreis you can delete it

@refack
Contributor

refack commented Aug 30, 2017

Strange configuration on ubuntu1604-arm64_odroid_c2
https://ci.nodejs.org/job/node-test-commit-arm/11807/nodes=ubuntu1604-arm64_odroid_c2/console
(tests run with -j 0)

Regular expression run condition: Expression=[linux-gnu], Label=[linux-gnu]
Run condition [Regular expression match] enabling perform for step [BuilderChain]
[ubuntu1604-arm64_odroid_c2] $ /bin/sh -xe /tmp/hudson6207292464805124188.sh
+ test true = true
+ FLAKY_TESTS_MODE=dontcare
+ echo FLAKY_TESTS_MODE=dontcare
FLAKY_TESTS_MODE=dontcare
+ TEST_CI_ARGS=
+ echo TEST_CI_ARGS:
TEST_CI_ARGS:
+ NODE_TEST_DIR=/home/iojs/node-tmp PYTHON=python FLAKY_TESTS=dontcare TEST_CI_ARGS= make run-ci -j 0
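
(Whatever make does with a zero job count, it's never what we want on a small board. A rough guard in the job script would look something like the lines below; the variable handling is illustrative, not the actual job definition.)

  # clamp the parallelism so an unset or bogus core count never reaches make
  JOBS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  [ "$JOBS" -ge 1 ] 2>/dev/null || JOBS=1
  NODE_TEST_DIR=/home/iojs/node-tmp PYTHON=python FLAKY_TESTS=dontcare make run-ci -j "$JOBS"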

@rvagg rvagg force-pushed the rvagg/raspberry-pi branch from 2ec6f4b to 084381b Compare November 4, 2017 09:52
@rvagg rvagg closed this Nov 4, 2017