Yesterday, my homelab server suddenly became unresponsive. It started with a flurry of Discord notifications, the universal signal that something has gone seriously wrong.
I found all services offline. The logs pointed to a primary culprit: a Redis failure, specifically a Server Out of Memory error.
The core error was: `RedisClient::CommandError: MISCONF Errors writing to the AOF file: No space left on device`
My first thought was: why is AOF even enabled? I had turned it on for testing and forgotten. Meanwhile, my root partition was at 99% capacity, with just 270MB of its 24GB remaining.
Further investigation revealed where the "wasted" space was hiding:
- PM2 Logs (~3.8GB): The process manager was storing massive, unrotated text logs.
- Hidden Caches (~1.5GB): Accumulated `~/.cache`, `~/.npm`, and `~/.rvm` source files from multiple builds and deployments.
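Hunting down usage like this can be sketched with standard tools; the paths below are the usual suspects on a PM2-based setup, not taken from the original logs:

```shell
# How full is the root partition?
df -h /

# Which directories under $HOME are the heaviest? (-x: stay on one filesystem)
du -xh --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10

# Size the likely offenders directly: PM2 logs and hidden caches
du -sh ~/.pm2/logs ~/.cache ~/.npm ~/.rvm 2>/dev/null || true
```

Sorting `du` output with `sort -rh` puts the biggest directories first, which makes the culprit obvious at a glance.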
To get the system breathing again, I performed a quick "surgical" cleaning:
- PM2 Flush: Immediately cleared the massive log files using `pm2 flush`.
- Log Truncation: Emptied application logs using `truncate -s 0 log/*.log` (this clears the content without deleting the file handle).
- Cache Pruning: Deleted hidden build caches in `~/.npm` and `~/.cache`.
- Journal Vacuum: Cleared system logs with `journalctl --vacuum-size=500M`.
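The truncate step deserves a note: deleting an open log file with `rm` does not free its blocks until the writing process exits, while truncating to zero releases them immediately. A minimal demonstration, with `tail -f` standing in for a running app that holds the log open:

```shell
# Simulate a process holding a log file open, then reclaim its space
tmp=$(mktemp)
head -c 1M /dev/zero > "$tmp"     # fill the "log" with 1MB of data
tail -f "$tmp" >/dev/null & pid=$!  # a reader keeps the file handle open

truncate -s 0 "$tmp"              # blocks are freed immediately
ls -l "$tmp"                      # size is now 0; the holder's handle is still valid

kill "$pid"
rm -f "$tmp"
```

Had we used `rm` instead, `df` would have shown no change until the holding process restarted, which is exactly what you can't afford during an outage.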
Now I had enough space to spin all processes up again, but I still needed to recover Redis, since it had entered read-only mode to protect data integrity.
- Fixing the AOF Manifest: Because the disk filled during a Redis write, the `appendonly.aof.manifest` was corrupted. I fixed it using `sudo redis-check-aof --fix` on the manifest file inside `/var/lib/redis/appendonlydir/`.
- Clearing the MISCONF Lock: Even with free space, Redis remained in a "protected" state. I manually overrode this with `redis-cli config set stop-writes-on-bgsave-error no`.
- Service Restart: Reset the systemd failure counter with `systemctl reset-failed redis-server` and restarted the service.
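Since `stop-writes-on-bgsave-error no` disables a safety check, it is worth restoring it once the disk is healthy, and turning AOF off permanently if it was only ever enabled for testing. A sketch of the relevant `redis.conf` settings (the path assumes a Debian/Ubuntu package layout):

```conf
# /etc/redis/redis.conf — settings relevant to this failure mode

appendonly no                     # AOF was only enabled for testing; turn it off
stop-writes-on-bgsave-error yes   # restore the safety check once space is freed
dir /var/lib/redis                # working dir holding dump.rdb and appendonlydir/
```

Note that changes made with `redis-cli config set` are in-memory only; without `config rewrite` or an edit to the config file, the override disappears on the next restart, which in this case is actually the desired behavior.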
After that, I could successfully restart all services and get everything running again. None of the data in Redis was critical, so I didn't mind losing it.
Lessons learned
The failure was a classic case of neglecting "boring" infrastructure: log rotation and disk monitoring. To prevent a repeat performance, I've implemented the following:
- Log Management: Installed pm2-logrotate to cap PM2 logs at 10MB per file and limited journald to 500MB globally.
- Next Steps:
  - Expand the VM disk size (24GB is too tight for this stack).
  - Set up a cron job for weekly `apt autoremove` and cache clearing.
  - Implement an automated disk usage alert (likely via Grafana or a simple shell script to Discord).
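The shell-script-to-Discord option can be sketched in a few lines; `WEBHOOK_URL` and the 90% threshold are placeholders, not values from the original setup:

```shell
#!/usr/bin/env sh
# Disk usage alert (sketch): warn via a Discord webhook when the root
# partition crosses a threshold. WEBHOOK_URL is a hypothetical placeholder.

THRESHOLD=90
WEBHOOK_URL="${WEBHOOK_URL:-}"

# df -P guarantees one line per filesystem; field 5 is the Use% column
usage=$(df -P / | awk 'NR==2 { gsub("%", ""); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
  msg="Disk alert: / is at ${usage}% (threshold: ${THRESHOLD}%)"
  if [ -n "$WEBHOOK_URL" ]; then
    # Discord webhooks accept a JSON body with a "content" field
    curl -fsS -H "Content-Type: application/json" \
         -d "{\"content\": \"$msg\"}" "$WEBHOOK_URL"
  else
    echo "$msg"
  fi
fi
```

Dropped into `/etc/cron.hourly/` (or wired into a crontab entry), this covers the alerting item; the weekly `apt autoremove` can live in a similar one-liner under `/etc/cron.weekly/`.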
This article was originally published by DEV Community and written by Jancer Lima.