Node down and seems unresponsive

Hi,

I’ve been running my node since April with no issue. Hardware is:

  • Raspberry Pi 4, 4 GB RAM
  • FLIRC case → added recently
  • 1 TB SSD Verbatim Vi550 S3
  • Official power supply

But it has been down for some weeks now. The web interface stays stuck on the “starting umbrel” page. Sometimes it displays the Red Umbrella of Death with a “system service failed” message. The LED on the SSD keeps blinking without interruption. I’m not totally sure, but the problem seems to have appeared after repeated power outages in December caused by snowfall.

When SSH’ing into the Pi, htop shows a load average over 20.

I tried running the debug command to get the logs, but nothing happens; the logs never show up. I also tried following the steps listed here: Red Umbrella of Death after Power Outage
and the commands to stop umbrel or docker don’t seem to respond either.

The command to update the OS version does not work either. Even an ls command run in the /umbrel folder hangs.
I plugged the SSD into an Ubuntu machine and ran e2fsck; it told me the file system was clean.

I reflashed the SD card with 0.4.8 and 0.4.9; the outcome is the same.

Any idea of what I could do?

Thx.

Could you run docker ps and paste the output here? Are there any docker container services running at all?
And just to verify, ~/umbrel/scripts/debug doesn’t do anything?
How about dmesg to check whether your SSD has a mounting issue, or df -hal | grep G | head -10 to check whether the SSD is full?

Thanks @Hakuna for your help.

df -hal | grep G | head -10 gives the output below

    /dev/root        59G  3.1G   53G   6% /
    devtmpfs        1.7G     0  1.7G   0% /dev
    tmpfs           1.9G     0  1.9G   0% /dev/shm
    tmpfs           1.9G   18M  1.9G   1% /run
    tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
    /dev/root        59G  3.1G   53G   6% /status-server
    /dev/sda1       938G  488G  403G  55% /mnt/data
    /dev/sda1       938G  488G  403G  55% /home/umbrel/umbrel
    /dev/sda1       938G  488G  403G  55% /var/lib/docker
    /dev/sda1       938G  488G  403G  55% /swap

dmesg gives several warnings/errors related to SSD device and timeouts for processes
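For anyone reproducing this check, a minimal sketch of narrowing the kernel log down to disk-related lines. The function name and the pattern list are assumptions chosen for illustration; they cover some common USB/SCSI/ext4 messages and are not exhaustive.

```shell
# Keep only kernel log lines that commonly accompany a failing USB/SATA disk.
# The pattern list is an illustrative assumption, not a complete set.
filter_storage_errors() {
  grep -Ei 'I/O error|uas_eh|reset (high-speed|SuperSpeed)|EXT4-fs error|blocked for more than'
}

# Usage on the node (reads the live kernel log):
#   sudo dmesg -T | filter_storage_errors | tail -n 40
```

If lines like these keep appearing for the SSD device, that points toward the drive, cable, or USB adapter rather than Umbrel itself.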

htop shows docker related processes:

running docker ps and waiting 10 mins or so


same with ~/umbrel/scripts/debug
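One possible explanation for commands that never return together with a very high load average is that processes are stuck in uninterruptible sleep (state D), typically waiting on disk I/O; Linux counts those tasks in the load average. A small sketch to list them, assuming the procps version of ps; the function name is hypothetical.

```shell
# Print "PID COMMAND" for processes in uninterruptible sleep (state starts
# with "D"). Expects the output of `ps -eo stat,pid,comm` on stdin.
blocked_procs() {
  awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# Usage on the node:
#   ps -eo stat,pid,comm | blocked_procs
```

If docker or ls show up in that list, they are most likely waiting on the disk, which would match the dmesg timeouts.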

Really tricky, maybe others here can jump in.
Assuming it’s either

  • something with the SSD. Can you try connecting it to your laptop / desktop and check whether it’s working fine? Under Linux, you can run fsck (in read-only mode) or e2fsck to check for bad sectors or file system errors
  • something with the heat? Is the node particularly hot? Since you use a new case, maybe that’s causing your node to malfunction. You could check the temperature and whether the node is throttled with vcgencmd measure_temp && vcgencmd get_throttled
  • lastly, perhaps something with the USB connection. Do you have a chance to change the cable or the USB connector?
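The get_throttled value is a hex bit field, so it is easy to misread. A minimal sketch of decoding it, using the bit meanings from the Raspberry Pi documentation; the function name decode_throttled is made up for this example.

```shell
# Decode the bit field printed by `vcgencmd get_throttled`.
# Accepts either "throttled=0x50005" or just "0x50005".
decode_throttled() (
  raw="${1#throttled=}"          # strip the "throttled=" prefix if present
  value=$(( ${raw:-0} ))         # shell arithmetic understands the 0x prefix
  while read -r bit name; do
    if [ $(( value & (1 << bit) )) -ne 0 ]; then
      echo "$name"
    fi
  done <<'EOF'
0 under-voltage detected
1 ARM frequency capped
2 currently throttled
3 soft temperature limit active
16 under-voltage has occurred
17 ARM frequency capping has occurred
18 throttling has occurred
19 soft temperature limit has occurred
EOF
)

# Usage on the node:
#   decode_throttled "$(vcgencmd get_throttled)"
```

An output of throttled=0x0 prints nothing, meaning no under-voltage or thermal event has been recorded since boot.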

Sorry, I’m poking around in the dark a bit here too.

Yes, I’m also poking into the dark :smiley:

Okay, I think we can somewhat assume the hardware is fine. Let’s bring out the big guns and point at the software (Umbrel OS).
Could you first do the following:

  • make sure LND isn’t running (I’d assume so, since docker ps isn’t responding and I can’t see any docker services in htop under the user umbrel)
  • back up your channel.backup under ~/umbrel/backup and all your files / your LND channel states under ~/umbrel/lnd/data/graph/mainnet/
  • back up your lnd.conf under ~/umbrel/lnd
  • have your seed words at hand
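The backup steps above can be sketched roughly as follows. The function name and the destination path are hypothetical; the source paths are the ones listed above, and the copy should go somewhere off the SSD while LND is stopped.

```shell
# Copy the LND-related paths listed above to a destination off the SSD.
# $1 = umbrel directory (e.g. ~/umbrel), $2 = destination (hypothetical path).
backup_lnd_files() (
  src="$1"
  dest="$2"
  mkdir -p "$dest" || exit 1
  for item in backup lnd/lnd.conf lnd/data/graph/mainnet; do
    if [ -e "$src/$item" ]; then
      # Recreate the parent directory so the layout mirrors ~/umbrel.
      mkdir -p "$dest/$(dirname "$item")"
      cp -a "$src/$item" "$dest/$(dirname "$item")/" && echo "copied $item"
    else
      echo "missing $item" >&2
    fi
  done
)

# Usage (destination is an example only):
#   backup_lnd_files ~/umbrel /home/umbrel-backup
```

Paths that do not exist are reported on stderr and skipped, so the output shows at a glance what was actually saved.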

Once this is all done, let’s try to kill and reinstate the docker system. I think this is where it’s hanging.

Source ==> Umbrel Troubleshooting Guide

Some docker components fail to start

I can’t access umbrel.local in the browser, by name or by IP address. I SSH’d in and ran the debug script. The first suspect line is:

stat /var/lib/docker/overlay2/....... no such file or directory

How to fix this issue:

  • just in case, re-flash the mSD card with the latest version of UmbrelOS (exactly the steps you followed the first time you installed your node, using the instructions from getumbrel.com)
  • If that still doesn’t do anything, use this command (after connecting to your node over SSH):

:arrow_right: :arrow_right: sudo systemctl stop umbrel-startup.service && docker system prune --force --all && sudo systemctl start umbrel-startup.service

Restart your node

sudo reboot

Optionally, another command to clear the docker containers is:

:arrow_right: :arrow_right: sudo docker kill $(sudo docker ps -aq) && sudo docker rm $(sudo docker ps -aq)

then restart your node

Let’s see how that works

Hello @Hakuna and thank you again for your much appreciated help.

~/umbrel/backup and ~/umbrel/lnd/data/graph/mainnet/ are empty, maybe because I already reflashed the OS a few times before? As for ~/umbrel/lnd, I don’t remember, but I will confirm that.

EDIT:
~/umbrel/backup doesn’t exist.
~/umbrel/lnd/data/graph/mainnet/ does have some files
lnd.conf exists in ~/umbrel/lnd

sudo systemctl stop umbrel-startup.service && docker system prune --force --all && sudo systemctl start umbrel-startup.service and related commands involving systemctl or docker don’t respond, just like ~/umbrel/scripts/debug.

I tried running those commands with the SSD unplugged; they work, and around 1 GB of data was cleared. However, after shutting down the node, plugging the SSD back in and restarting the node, the result is the same.

Any idea? Is there a way to wipe the SSD without losing the sync and having to restore with the seed?

Hi, thanks for your help @Hakuna.
I finally tried to format the SSD, but I couldn’t complete it; it seems the drive was dead for good. I had to start over with a brand new SSD :expressionless:


If there were a Kubernetes version of Umbrel, it would probably be more resilient (and self-healing), similar to how https://aerokube.com/moon/latest/#install-kubernetes does it.