Yesterday, my homelab server suddenly became unresponsive. It started with a flurry of Discord notifications, the universal signal that something has gone seriously wrong.
I found all services offline. The logs pointed to a primary culprit: a Redis failure, specifically a Server Out of Memory error.
The core error was: `RedisClient::CommandError: MISCONF Errors writing to the AOF file: No space left on device`
My first thought was: why is AOF even enabled? I had turned it on for testing and forgotten. Meanwhile, my root partition was at 99% capacity, with just 270MB of its 24GB remaining.
Further investigation revealed where the "wasted" space was hiding:
- PM2 Logs (~3.8GB): The process manager was storing massive, unrotated text logs.
- Hidden Caches (~1.5GB): Accumulated `~/.cache`, `~/.npm`, and `~/.rvm` source files from multiple builds and deployments.
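Hunting down usage like this can be sketched with standard tools; the paths below are the usual suspects on a PM2-based setup, not taken from the original logs:

```shell
# How full is the root partition?
df -h /

# Which directories under $HOME are the heaviest? (-x: stay on one filesystem)
du -xh --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10

# Size the likely offenders directly: PM2 logs and hidden caches
du -sh ~/.pm2/logs ~/.cache ~/.npm ~/.rvm 2>/dev/null || true
```

Sorting `du` output with `sort -rh` puts the biggest directories first, which makes the culprit obvious at a glance.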
To get the system breathing again, I performed a quick "surgical" cleaning:
- PM2 Flush: Immediately cleared the massive log files using `pm2 flush`.
- Log Truncation: Emptied application logs using `truncate -s 0 log/*.log` (this clears the content without deleting the file handle).
- Cache Pruning: Deleted hidden build caches in `~/.npm` and `~/.cache`.
- Journal Vacuum: Cleared system logs with `journalctl --vacuum-size=500M`.
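The truncate step deserves a note: deleting an open log file with `rm` does not free its blocks until the writing process exits, while truncating to zero releases them immediately. A minimal demonstration, with `tail -f` standing in for a running app that holds the log open:

```shell
# Simulate a process holding a log file open, then reclaim its space
tmp=$(mktemp)
head -c 1M /dev/zero > "$tmp"     # fill the "log" with 1MB of data
tail -f "$tmp" >/dev/null & pid=$!  # a reader keeps the file handle open

truncate -s 0 "$tmp"              # blocks are freed immediately
ls -l "$tmp"                      # size is now 0; the holder's handle is still valid

kill "$pid"
rm -f "$tmp"
```

Had we used `rm` instead, `df` would have shown no change until the holding process restarted, which is exactly what you can't afford during an outage.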
Now I had enough space to spin all processes up again, but I still needed to recover Redis, since it had entered read-only mode to protect data integrity.
- Fixing the AOF Manifest: Because the disk filled during a Redis write, the `appendonly.aof.manifest` was corrupted. I fixed it using `sudo redis-check-aof --fix` on the manifest file inside `/var/lib/redis/appendonlydir/`.
- Clearing the MISCONF Lock: Even with free space, Redis remained in a "protected" state. I manually overrode this with `redis-cli config set stop-writes-on-bgsave-error no`.
- Service Restart: Reset the systemd failure counter with `systemctl reset-failed redis-server` and restarted the service.
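Since `stop-writes-on-bgsave-error no` disables a safety check, it is worth restoring it once the disk is healthy, and turning AOF off permanently if it was only ever enabled for testing. A sketch of the relevant `redis.conf` settings (the path assumes a Debian/Ubuntu package layout):

```conf
# /etc/redis/redis.conf — settings relevant to this failure mode

appendonly no                     # AOF was only enabled for testing; turn it off
stop-writes-on-bgsave-error yes   # restore the safety check once space is freed
dir /var/lib/redis                # working dir holding dump.rdb and appendonlydir/
```

Note that changes made with `redis-cli config set` are in-memory only; without `config rewrite` or an edit to the config file, the override disappears on the next restart, which in this case is actually the desired behavior.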
After that, I could successfully restart all services and get everything running again. None of the data in Redis was critical, so I didn't mind losing it.
Lessons learned
The failure was a classic case of neglecting "boring" infrastructure: log rotation and disk monitoring. To prevent a repeat performance, I've implemented the following:
- Log Management: Installed pm2-logrotate to cap PM2 logs at 10MB per file and limited journald to 500MB globally.
- Next Steps:
  - Expand the VM disk size (24GB is too tight for this stack).
  - Set up a cron job for weekly `apt autoremove` and cache clearing.
  - Implement an automated disk usage alert (likely via Grafana or a simple shell script to Discord).
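The shell-script-to-Discord option can be sketched in a few lines; `WEBHOOK_URL` and the 90% threshold are placeholders, not values from the original setup:

```shell
#!/usr/bin/env sh
# Disk usage alert (sketch): warn via a Discord webhook when the root
# partition crosses a threshold. WEBHOOK_URL is a hypothetical placeholder.

THRESHOLD=90
WEBHOOK_URL="${WEBHOOK_URL:-}"

# df -P guarantees one line per filesystem; field 5 is the Use% column
usage=$(df -P / | awk 'NR==2 { gsub("%", ""); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
  msg="Disk alert: / is at ${usage}% (threshold: ${THRESHOLD}%)"
  if [ -n "$WEBHOOK_URL" ]; then
    # Discord webhooks accept a JSON body with a "content" field
    curl -fsS -H "Content-Type: application/json" \
         -d "{\"content\": \"$msg\"}" "$WEBHOOK_URL"
  else
    echo "$msg"
  fi
fi
```

Dropped into `/etc/cron.hourly/` (or wired into a crontab entry), this covers the alerting item; the weekly `apt autoremove` can live in a similar one-liner under `/etc/cron.weekly/`.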
This article was originally published by DEV Community and written by Jancer Lima.