Raspberry Pi provisioning updates (and a general status update) #841
Conversation
@nodejs/testing (update above on some build infra matters that may be interesting) FYI I've been messing with my NAS today and it's been having some nasty disk troubles, which has impacted the NFS exports that all the Pis are using. There are going to be a lot of failures today on the Pis that can be written off to this. They should have settled down now (assuming my NAS is in proper order again, no guarantee) and be working like normal. If Pi failures persist, though, I may need to dig in and do some bulk restarting/remounting/re-somethinging; ping me if you're seeing anything that's not related to standard test failures from tomorrow onward.
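If the stale mounts do need a bulk kick, "remounting" here means roughly the following. This is a hedged sketch only; the mount point is a placeholder, not the real export path the Pis use.
# Check what NFS mounts a Pi currently has, force a lazy unmount of a stale
# one, then re-mount everything listed in /etc/fstab (paths are illustrative).
mount -t nfs,nfs4
sudo umount -f -l /mnt/nas-scratch
sudo mount -a
df -h /mnt/nas-scratch   # confirm the export is reachable again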
@rvagg ... Thank you for the update. I've been wondering about the possibility of convincing nearForm to work on setting up a redundant Pi cluster to help with reliability, performance, and reducing the general bus factor. I haven't looked around on the infra side lately, so forgive me if I've missed it, but do you have any write-up on how the Pi cluster is built, what components are necessary, and how things are configured/managed?
@jasnell no, I haven't done a writeup on the specifics but could provide specifics as needed. If there's staffing and required capacity there to actually manage a small cluster then it may make sense for me to ship over a chunk of this cluster, maybe half (including power and a network switch), so we have proper redundancy. Unfortunately it's not a trivial commitment and needs someone to become a minor expert in monitoring and maintaining these things when they go awry. I could provide full instructions for how I do it, but it's not essential that it's set up exactly the same way. Some things it'd need:
In terms of people-power, it needs someone who can understand how it all works, can monitor the cluster for failures, can wipe and reimage SD cards, replace ones that look dodgy, and go through the manual steps to prepare hosts for running ansible against (here). We'd probably need to make sure this person was part of our build/test group so they have enough access to do this all themselves.
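For a rough sense of what "running ansible against" the prepared hosts looks like, the shape is roughly as below. The inventory path matches the repo, but the playbook name and host pattern are placeholders rather than the exact ones used.
# Hedged sketch: confirm the Pis answer over SSH, then run the worker playbook
# against them (playbook path and host pattern are placeholders).
ansible -i ansible/inventory.yml 'test-requireio-*' -m ping
ansible-playbook -i ansible/inventory.yml <worker-playbook>.yml --limit 'test-requireio-*'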
@rvagg is https://github.com/nodejs/build/blob/master/ansible/inventory.yml up to date? It's not great that we have to keep two ansible inventories, but the one under
Good question! So ansible/inventory.yml is up to date. I've been very strict with my use of it and have been pushing updates to it whenever I provision a new host (not using PRs, I hope nobody objects to pushes changing IP addresses). On the other hand, setup/ansible-inventory.yml is woefully out of date except for the Raspberry Pis; I keep that part updated properly (not that it needs regular changing). I don't believe it's used for anything else, although I could imagine that there are things not properly in ansible/ where it's relevant, like the website. I suppose that I'm currently in the best position to make a first-pass call on what's redundant now, since I've been doing a lot of the standard provisions lately. I'll put it on my list.
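As a first pass at spotting what's redundant, a quick textual comparison of the two inventories could help. This is a sketch that assumes host names appear literally in both files; the grep pattern is only a rough guess at the naming scheme.
# List host-ish names in each file and show the ones that only exist in the
# stale setup/ inventory.
grep -oE '[a-z]+-[a-z0-9_]+-[a-z0-9_-]+' ansible/inventory.yml | sort -u > new-hosts.txt
grep -oE '[a-z]+-[a-z0-9_]+-[a-z0-9_-]+' setup/ansible-inventory.yml | sort -u > old-hosts.txt
comm -13 new-hosts.txt old-hosts.txt   # names only present in the old inventory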
Yes we definitely should, 💯 to that. |
The major problem is people on the nearForm side, and providing a quick enough response time. It seems the Pi cluster needs more care than a bunch of normal servers in terms of maintenance, so it might be a blocker. We'll see what we can do.
Yep, this is the main reason I was asking about documenting the process more. In order to make a decision about whether it is possible for us to set up a redundant cluster, we need to get an idea of the resource commitment required. We may need to put out a call to Node.js Foundation member organizations to sponsor the effort by offering resources.
Ref: serialport/node-serialport#879 @munyirik do you still want to bring this server back online or can we delete the configuration? (I can make a backup of the job config somewhere)
@joaocgreis you can delete it
Strange configuration on Regular expression run condition: Expression=[linux-gnu], Label=[linux-gnu]
Run condition [Regular expression match] enabling perform for step [BuilderChain]
[ubuntu1604-arm64_odroid_c2] $ /bin/sh -xe /tmp/hudson6207292464805124188.sh
+ test true = true
+ FLAKY_TESTS_MODE=dontcare
+ echo FLAKY_TESTS_MODE=dontcare
FLAKY_TESTS_MODE=dontcare
+ TEST_CI_ARGS=
+ echo TEST_CI_ARGS:
TEST_CI_ARGS:
+ NODE_TEST_DIR=/home/iojs/node-tmp PYTHON=python FLAKY_TESTS=dontcare TEST_CI_ARGS= make run-ci -j 0
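For anyone wanting to reproduce that last step locally, it boils down to roughly the following, run from a node checkout; the temp directory and parallelism value here are placeholders, not what the job actually passes.
# Rough local equivalent of the CI step above (flaky tests are non-fatal).
NODE_TEST_DIR=/tmp/node-tmp PYTHON=python FLAKY_TESTS=dontcare TEST_CI_ARGS= make run-ci -j "$(nproc)"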
Got the Pi cluster all updated and working nicely again. Everything's online and seems to be working, although there have been some odd, spotty Jenkins or network problems. If they don't settle down then there'll have to be more investigation and perhaps some more hardware replacements.
I reprovisioned around 10 of them in this round of fixes and in the process had to replace 4 SD cards (FYI, I can tell a card needs replacing when I write a fresh image over it but the old image is still there when I boot it! Very strange behaviour, but a consistent way to know which cards are bad).
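For context, a typical write-and-verify pass looks roughly like this (not necessarily the exact procedure used here); the image name and device path are placeholders. A bad card either fails the compare or boots the old image anyway despite the write appearing to succeed.
# Write a fresh image and read it back; /dev/sdX and the image name are
# placeholders -- double-check the device before running dd.
sudo dd if=raspbian-lite.img of=/dev/sdX bs=4M conv=fsync status=progress
sync
sudo cmp -n "$(stat -c%s raspbian-lite.img)" raspbian-lite.img /dev/sdX && echo write verified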
In this update I'm turning on idmapd for NFSv4 support from my new NAS. I've upgraded git to the patched secure version and have tweaked the instructions to back off on overclocking. I've had to replace SD cards on the Pi 1 B+'s the most and I'm taking a guess that the overclocking might be contributing to the fatigue.
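For reference, the NFSv4/idmapd and overclocking changes are roughly of this shape on a Raspbian/Debian Pi; the domain, NAS name and mount point are placeholders and the exact configuration used here may differ.
# idmapd ships with nfs-common; NFSv4 wants the same Domain setting on the
# clients and the NAS (values below are placeholders).
sudo apt-get install -y nfs-common
sudo sed -i 's/^# *Domain = .*/Domain = local.lan/' /etc/idmapd.conf
sudo mount -t nfs4 nas.local.lan:/export/iojs /mnt/iojs
# Backing off overclocking is a matter of reducing or removing the overclock
# lines in /boot/config.txt and rebooting, e.g. arm_freq back to the stock 700
# on a Pi 1 B+ and dropping any over_voltage setting.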
I've still got 3 or 4 replacement SD cards sitting ready for next time. I haven't ordered more, but I still have ~$170 in the donation bucket to pay for a batch when I need to restock. (Also FYI, I dipped into that this week for a replacement power supply for one of the Odroid XU's; it's all in the financials spreadsheet that some of you have access to.) Both of the XU's are still offline; they both seem to have disk problems, but it's eMMC so not as simple as formatting an SD card (I have an SD adapter for it but it's still a bit tedious). The XU3 is online and back in the mix, but it's the only one carrying the "ubuntu1404-armv7" load, which I re-enabled in node-test-commit-arm and is normally handled by the XU3 and the two XU's.
While I'm doing a status update: test-mininodes-ubuntu1604-arm64_odroid_c2-2 is offline. It's one of the 3 Odroid C2's hosted at miniNodes. We've had consistent problems with it and now it's locked up. I got David to have a look; he got on to the console, ran an fsck and found a ton of errors, which suggests it needs an SD card replacement. I've told him it's not a super high priority, but I imagine he'll get to it sometime this week. Jenkins can do everything we need once we get it back online again in a virgin state.
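For the record, the console check is essentially the following (the device path is a placeholder and the filesystem must be unmounted first); a card that keeps turning up errors on repeat runs is usually on its way out.
# Force a filesystem check and optionally a read-only surface scan.
sudo fsck -f /dev/mmcblk0p2
sudo badblocks -sv /dev/mmcblk0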
msft-serialport-win1 and test-digitalocean-ubuntu1604_docker-x64-1 are the only other machines offline at the moment. I don't think I've ever seen msft-serialport-win1 online and I'm not sure who was working with it, so perhaps we should remove it. test-digitalocean-ubuntu1604_docker-x64-1 hasn't been online for a while and we have test-digitalocean-ubuntu1604_docker_alpine34-x64-1 in the mix, which seems to be doing all the work, so I'm thinking test-digitalocean-ubuntu1604_docker-x64-1 is unnecessary and can be removed. If anyone has background on any of these, let me know before I wade in and make a mess.
/cc @nodejs/build cause that was quite a status update!