
Raspberry Pi provisioning updates (and a general status update) #841

Closed
rvagg wants to merge 3 commits from the rvagg/raspberry-pi branch

Conversation

rvagg
Member

@rvagg rvagg commented Aug 20, 2017

Got the Pi cluster all updated and working nicely again. Everything's online and seems to be working, although there have been some odd, spotty Jenkins or network problems. If they don't settle down there'll have to be more investigation and perhaps some more hardware replacements.

I reprovisioned around 10 of them in this round of fixes and in the process had to replace 4 SD cards (FYI, I can tell a card needs replacing when I write over it with a fresh image but the old image is still there when I boot it! Very strange behaviour, but a consistent way to know which cards are bad).
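
(For reference, a rough way to spot-check a suspect card is to write the image, flush buffers, then read back the same number of bytes and compare checksums; the device and image names below are placeholders, not the real ones.)

  # write a fresh image, sync it out, then read it back and compare digests
  sudo dd if=raspbian.img of=/dev/sdX bs=4M conv=fsync status=progress
  sudo blockdev --flushbufs /dev/sdX
  sudo head -c "$(stat -c%s raspbian.img)" /dev/sdX | sha256sum
  sha256sum raspbian.img   # a healthy card produces the same digest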

In this update I'm turning on idmapd for NFSv4 support from my new NAS. I've upgraded git to the patched, secure version and have tweaked the instructions to back off on overclocking. The Pi 1 B+'s are where I've had to replace SD cards the most, and I'm guessing the overclocking might be contributing to the fatigue.
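
(Roughly what the NFSv4 client bits look like on each Pi, in case anyone wants to replicate it; this is a sketch only, and the domain, hostname and paths are placeholders rather than the real values.)

  # /etc/idmapd.conf: shared ID-mapping domain so NFSv4 maps users consistently
  [General]
  Domain = pi.cluster.local

  # /etc/default/nfs-common: on this era of Raspbian/Debian, idmapd is started via nfs-common
  NEED_IDMAPD=yes

  # /etc/fstab: mount the NAS export over NFSv4
  nas.local:/export/iojs  /home/iojs/.nfs  nfs4  rw,hard,noatime  0 0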

I've still got 3 or 4 replacement SD cards sitting ready for next time. I haven't ordered more, but I still have ~$170 in the donation bucket to pay for a batch when I need to restock. (Also FYI, I dipped into that this week for a replacement power supply for one of the Odroid XU's; it's all in the financials spreadsheet that some of you have access to.) Both of the XU's are still offline; they both seem to have disk problems, but it's eMMC, so it's not as simple as reformatting an SD card (I have an SD adapter for it but it's still a bit tedious). The XU3 is online and back in the mix, but it's the only one carrying the "ubuntu1404-armv7" load, which I re-enabled in node-test-commit-arm and which is normally handled by the XU3 and the two XU's.

While I'm doing a status update: test-mininodes-ubuntu1604-arm64_odroid_c2-2 is offline. It's one of the 3 Odroid C2's hosted at miniNodes. We've had consistent problems with it and now it's locked up. I got David to have a look; he got onto the console, ran an fsck and found a ton of errors, which suggests it needs an SD card replacement. I've told him it's not a super high priority, but I imagine he'll get to it sometime this week. Jenkins can do everything we need once we get it back online again in a virgin state.
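
(If anyone wants to repeat the check from the console, it's roughly the line below; the device name is a guess at the C2's partition layout, not confirmed.)

  # read-only filesystem check (-f forces it, -n answers "no" to every repair prompt)
  sudo fsck -fn /dev/mmcblk0p2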

msft-serialport-win1 and test-digitalocean-ubuntu1604_docker-x64-1 are the only other machines offline at the moment. I don't think I've ever seen msft-serialport-win1 online, and I'm not sure who was working with it, so perhaps we should remove it. test-digitalocean-ubuntu1604_docker-x64-1 hasn't been online for a while, and we have test-digitalocean-ubuntu1604_docker_alpine34-x64-1 in the mix which seems to be doing all the work, so I'm thinking test-digitalocean-ubuntu1604_docker-x64-1 is unnecessary and can be removed. If anyone has background on any of these, let me know before I wade in and make a mess.

/cc @nodejs/build cause that was quite a status update!

gibfahn

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Aug 20, 2017

@nodejs/testing (the update above covers some build infra matters that may be interesting). FYI, I've been messing with my NAS today and it's been having some nasty disk troubles, which has impacted the NFS exports that all the Pi's are using. There are going to be a lot of failures on the Pi's today that can be written off to this. They should have settled down now (assuming my NAS is in proper order again, no guarantee) and be working like normal. If Pi failures persist, though, I may need to dig in and do some bulk restarting/remounting/re-somethinging; ping me if you're seeing anything that's not related to standard test failures from tomorrow onward.
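
(If it does come to bulk remounting, it would be something along the lines of the ad-hoc run below; the inventory group name and mount point are placeholders, not necessarily what's in our inventory.)

  # lazily unmount the stale NFS workspace on every Pi, then remount everything from fstab
  ansible raspberry_pi -i ansible/inventory.yml -b -m shell \
    -a 'umount -l /home/iojs/.nfs || true; mount -a'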

@jasnell
Member

jasnell commented Aug 20, 2017

@rvagg ... Thank you for the update. I've been wondering about the possibility of convincing nearForm to work on setting up a redundant Pi cluster, to help with reliability and performance and to reduce the general bus factor. I haven't looked around on the infra side lately, so forgive me if I've missed it, but do you have any write-up on how the Pi cluster is built, what components are necessary, and how things are configured/managed?

gibfahn

This comment was marked as off-topic.

gibfahn

This comment was marked as off-topic.

@rvagg
Member Author

rvagg commented Aug 20, 2017

@jasnell no, I haven't done a write-up on the specifics, but I could provide them as needed.

If there's the staffing and the capacity there to actually manage a small cluster, then it may make sense for me to ship over a chunk of this cluster, maybe half (including power and a network switch), so we have proper redundancy. Unfortunately it's not a trivial commitment, and it needs someone to become a minor expert in monitoring and maintaining these things when they go awry. I could provide full instructions for how I do it, but it's not essential that it's set up exactly the same way.

Some things it'd need:

  • Space, obviously
  • A network with enough spare IPs; a private NAT would be ideal. I like to keep the cluster network isolated from my private network because a fairly wide group of people have access (build/test), and also because the code that gets tested isn't strictly safe (we ask collaborators to tick "certify safe" as an acknowledgement that a contributor could slip in malicious code that could mess with our cluster).
  • A way to provide external access to the machines. I have a computer here that acts as a jump host that can be used by the build/test team, so everyone's SSH configs make a connection there and then jump into the internal host by internal IP (see the SSH config sketch after this list). An alternative might be to provide specific ports on an externally exposed router that each go to one of the machines.
  • A host that can expose some NFS disk. This is probably going to be the trickiest bit. I have a NAS host that exposes an SSD to all of the Pi's that they share for their own workspace and a shared ccache directory (the NAS and SSD are mine, not technically part of the Foundation's hardware). The Pi's don't have enough storage on them to be reliable and the NFS disk is (I think) slightly faster than a local SD Card. Having a shared space is also helpful for ccache (not actually used much since we cross-compile for tests) and for providing shared access to a git repo that's used to download binaries from the cross-compiler.
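
(A sketch of what the jump-host arrangement looks like from a team member's side; the hostnames and patterns below are placeholders, not the real ones.)

  # ~/.ssh/config: everything behind the NAT is reached via the jump host
  Host pi-jump
    HostName jump.example.com
    User iojs

  Host pi-*
    User iojs
    ProxyJump pi-jump    # on older OpenSSH: ProxyCommand ssh -W %h:%p pi-jump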

In terms of people-power, it needs someone who can understand how it all works, monitor the cluster for failures, wipe and reimage SD cards, replace ones that look dodgy, and go through the manual steps to prepare hosts for running ansible against them (here). We'd probably need to make sure this person was part of our build/test group so they have enough access to do all of this themselves.
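
(For a rough idea of the day-to-day workflow once a card is reimaged and the manual prep is done, it boils down to a run like the one below; the playbook path and host name are illustrative rather than exact.)

  # run the worker playbook against just the freshly reimaged host
  ansible-playbook -i ansible/inventory.yml playbooks/jenkins/worker/create.yml \
    --limit 'freshly-reimaged-pi'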

joaocgreis

This comment was marked as off-topic.

@joaocgreis
Member

@rvagg is https://github.com/nodejs/build/blob/master/ansible/inventory.yml up to date? It's not great that we have to keep two ansible inventories, but the one under ansible/ should be kept up to date because it's used to generate the ssh config file (and already does that quite well - it's very straightforward to update with it). The one under setup/ should perhaps only have the machines that are still deployed using the old scripts.

@rvagg
Member Author

rvagg commented Aug 23, 2017

Good question! So, ansible/inventory.yml is up to date. I've been very strict with my use of it and have been pushing updates to it whenever I provision a new host (not using PRs; I hope nobody objects to pushes that change IP addresses). On the other hand, setup/ansible-inventory.yml is woefully out of date except for the Raspberry Pi's; I keep that part updated properly (not that it needs regular changing). I don't believe it's used for anything else, although I could imagine there are things not properly in ansible/ yet where it's still relevant, like the website.

At this stage I'm not quite sure what to do about it because if we clean it up then maybe we should be cleaning up all of the directories in setup/ that we're no longer using too? I'm deploying all other standard hosts from the new config now and it's working great, and I know the old setup/ ones are getting out of date (e.g. see the recent EPEL PR from me that went to ansible/, not setup/).

I suppose that I'm currently in the best position to make a first-pass call on what's redundant now since I've been doing a lot of the standard provisions lately. I'll put it on my list.

@gibfahn
Member

gibfahn commented Aug 23, 2017

At this stage I'm not quite sure what to do about it because if we clean it up then maybe we should be cleaning up all of the directories in setup/ that we're no longer using too?

Yes we definitely should, 💯 to that.

mhdawson

This comment was marked as off-topic.

@mcollina
Member

In terms of people-power, it needs someone who can understand how it all works, can monitor the cluster for failures, can wipe and reimage SD cards, replace ones that look dodgy, go through the manual steps to prepare hosts for running ansible against (here). We'd probably need to make sure this person was part of our build/test group so they have enough access to do this all themselves.

The major problem is people on the nearForm side, and providing a quick enough response time. It seems the Pi cluster needs more care than a bunch of normal servers in terms of maintenance, so that might be a blocker. We'll see what we can do.
(For everyone, I am remote and not located at the office either).

@jasnell
Member

jasnell commented Aug 23, 2017

Yep, this is the main reason I was asking about documenting the process more. In order to make a decision about whether it is possible for us to set up a redundant cluster we need to get an idea of the resource commitment required. We may need to put out a call to Node.js Foundation member organizations to sponsor the effort by offering resources.

@joaocgreis
Member

I don't think I've ever seen msft-serialport-win1 online and I'm not sure who was working with it so perhaps we should remove it.

Ref: serialport/node-serialport#879

@munyirik do you still want to bring this server back online or can we delete the configuration? (I can make a backup of the job config somewhere)

@munyirik

@joaocgreis you can delete it

@refack
Contributor

refack commented Aug 30, 2017

Strange configuration on ubuntu1604-arm64_odroid_c2
https://ci.nodejs.org/job/node-test-commit-arm/11807/nodes=ubuntu1604-arm64_odroid_c2/console
(tests run with -j 0)

Regular expression run condition: Expression=[linux-gnu], Label=[linux-gnu]
Run condition [Regular expression match] enabling perform for step [BuilderChain]
[ubuntu1604-arm64_odroid_c2] $ /bin/sh -xe /tmp/hudson6207292464805124188.sh
+ test true = true
+ FLAKY_TESTS_MODE=dontcare
+ echo FLAKY_TESTS_MODE=dontcare
FLAKY_TESTS_MODE=dontcare
+ TEST_CI_ARGS=
+ echo TEST_CI_ARGS:
TEST_CI_ARGS:
+ NODE_TEST_DIR=/home/iojs/node-tmp PYTHON=python FLAKY_TESTS=dontcare TEST_CI_ARGS= make run-ci -j 0
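
(Whatever make does with a zero job count, it's never what we want on a small board. A rough guard in the job script would look something like the lines below; the variable handling is illustrative, not the actual job definition.)

  # clamp the parallelism so an unset or bogus core count never reaches make
  JOBS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  [ "$JOBS" -ge 1 ] 2>/dev/null || JOBS=1
  NODE_TEST_DIR=/home/iojs/node-tmp PYTHON=python FLAKY_TESTS=dontcare make run-ci -j "$JOBS"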

@rvagg rvagg force-pushed the rvagg/raspberry-pi branch from 2ec6f4b to 084381b Compare November 4, 2017 09:52
@rvagg rvagg closed this Nov 4, 2017