Edit: In case it wasn’t clear, I wasn’t seeking advice, and I’m more than familiar with all the preventative measures that exist. The post is called “What to do if you kill the wrong file”, not “check your backups”. There’s a plethora of information about the latter, even in this post, but virtually nothing on the former. The only edits made to the post were the addition of this clarification and the addition of “without a backup” to the title.
Yep… it happened to me. I killed a docker compose file with 550 lines of God-forsaken YAML less than a week before the project launch, and the most recent backup we had was nearly a month old and would have taken at least a day to get back up to speed. With a stroke of luck, I handled it about as well as I could have for thinking on my feet, and I’d like to share my experience and lessons learned for anyone else who may ever find themselves in these smelly shoes:
Disclaimer! I’m a young engineer still growing my expertise and experience. Some stuff in here may be bad advice or wrong, like my assertion that using dd to pull data off of an unmounted drive doesn’t risk data loss; I’m pretty damn sure of that, but I wouldn’t stake my life (or your data) on it. I’ll happily update this post as improvements are suggested.
IF YOU RM’D THE WRONG THING:
1. Stop all writes to that partition as quickly as possible.
This step has some optional improvements at the bottom.
Up to this point I’d been keeping a lazy backup of the deleted file on another partition. To preserve the disk as well as possible and avoid overwriting the blocks that held the lost file, I cd’d to the backup dir and ran docker compose down. There were a few stragglers, but docker stop $containerName worked fine for those.
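As a rough sketch of what that looked like: bring the stack down from the backup copy of the compose file, then stop whatever is still running by name. The function name, the backup path, and the file name docker-compose.yml are assumptions for illustration; adjust them to your setup.

```shell
# Sketch only: stop the stack using a compose file kept on ANOTHER partition,
# then stop any containers that are still running.
stop_stack_from_backup() {
  backup_dir="$1"   # hypothetical, e.g. /mnt/backup/mystack

  # -f points compose at the backup copy (file name is an assumption).
  docker compose -f "$backup_dir/docker-compose.yml" down

  # Anything still listed is a straggler; stop each one by name.
  docker ps --format '{{.Names}}' | while read -r name; do
    docker stop "$name"
  done
}

# Usage:
#   stop_stack_from_backup /mnt/backup/mystack
```

Note that the loop stops every running container on the host, not just the ones from this stack, which is what you want here since the goal is to stop all writes.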
2. Unmount the partition
The goal is to ensure nothing writes to this disk at all. Together with the fact that most data recovery tools require an unmounted disk, that makes this a critical step in preserving any hope of recovering your data. Get that disk off of the accessible filesystem.
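A minimal sketch of this step follows. The read-only remount fallback is my own addition, not something I did at the time, and it can also fail if a process still has files open for writing. The function name and mount point are hypothetical.

```shell
# Sketch only: unmount the partition; if that fails (usually "target is busy"),
# try to remount it read-only so nothing can write to it in the meantime.
unmount_or_readonly() {
  mount_point="$1"   # hypothetical, e.g. /mnt/appdata
  if umount "$mount_point"; then
    echo "unmounted $mount_point"
  else
    mount -o remount,ro "$mount_point" && echo "remounted $mount_point read-only"
  fi
}

# Usage (as root):
#   unmount_or_readonly /mnt/appdata
```

If you end up on the read-only fallback, you still need to find the process holding the partition and unmount it properly before running recovery tools.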
3. Save what you have
Once your partition is unmounted, you can use dd or a similar tool to create a backup somewhere else without risking corruption of the data. You should restore to a different disk/partition if at all possible, but I know sometimes that isn’t possible, and /boot can come in handy in an emergency. It would have been big enough to save me if I hadn’t been working on a dedicated app-data partition.
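Here is a minimal sketch of the dd step, assuming GNU dd. The function name is made up, /dev/sdX1 is a placeholder rather than a real device name, and the destination must sit on a different partition with enough free space for the whole image.

```shell
# Sketch only: copy a block device (or any file) byte-for-byte into an image.
# dd only reads from the source; all writes go to the destination.
image_partition() {
  src="$1"    # placeholder, e.g. /dev/sdX1 (the unmounted partition)
  dest="$2"   # e.g. /boot/appdata.img, on a DIFFERENT partition
  dd if="$src" of="$dest" bs=4M
}

# Usage (as root, with the source partition unmounted):
#   image_partition /dev/sdX1 /boot/appdata.img
```

Double-check if= and of= before pressing enter; swapping them overwrites the very partition you are trying to save.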
4. Your sword of choice
It’s time to choose your data recovery tool. I tried both extundelete and testdisk/PhotoRec. extundelete got some stuff back, but not what I was looking for, and it ran into segfaults and other issues along the way. PhotoRec, on the other hand, was truly a gift from the cosmos. It worked like a dream, it was quick and easy, and it saved my sanity and my project.
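For reference, a hedged sketch of pointing PhotoRec at an image file instead of the original disk. To the best of my knowledge /log and /d are documented PhotoRec options, but check the docs for your version; the function name and paths are hypothetical. PhotoRec is menu-driven, so expect to pick the partition and file types interactively.

```shell
# Sketch only: run PhotoRec against an image rather than the live disk.
# /log writes a log file in the current directory, /d sets where the
# recovered files go (PhotoRec may append a numeric suffix to that name).
recover_from_image() {
  image="$1"     # e.g. the image made with dd in step 3
  out_dir="$2"   # must be on a different partition than the lost file
  photorec /log /d "$out_dir" "$image"
}

# Usage:
#   recover_from_image /boot/appdata.img /mnt/other/restore
```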
5. The search for gold
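Use “grep -r ./restore/directory -e 'term in your file'” to look through everything that has been deleted on the partition since the beginning of time for the file you need. Type the quotes around the search term as plain straight quotes; curly quotes pasted from a web page won’t work.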
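A slightly more targeted variant of that search, as a sketch: print only the names of the recovered files that contain the term, which is handy because recovery tools tend to hand back a big pile of files. The function name is hypothetical; the flags are standard GNU grep.

```shell
# Sketch only: list recovered files containing a given string.
# -r recurse, -l print file names only, -a treat binary files as text,
# -F match the term literally instead of as a regular expression.
find_candidates() {
  restore_dir="$1"
  term="$2"
  grep -rlaF -e "$term" "$restore_dir"
}

# Usage:
#   find_candidates ./restore/directory 'term in your file'
```

Pick a term that is distinctive for your file (a service name or an unusual image tag), otherwise you will get a lot of hits.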
It was a scary time for me, and hopefully this playbook can help some of you recover from a really stupid, preventable mistake.
Potential improvements
In hindsight, two things could have gone better here:
1. Quicker: I could have shut the containers down immediately if I had been less panicked and remembered this little trick: docker stop $(docker ps -q)
2. Export the running config: I could have used ‘docker inspect $(docker ps -q) > /path/to/other/partition’ to aid in the restoration process if I had ended up needing to reconstruct the file by hand (docker inspect needs container names or IDs as arguments). I decided to risk it for the biscuit, though, and shutting the stack down as quickly as possible was worth the potential sacrifice.
If you fight to preserve a running config of some sort, whether k8s, Docker, or other, MAKE SURE YOU WRITE IT TO ANOTHER PARTITION. It’s generally wise to give an application its own data partition, but hey, you don’t have a usable backup, so if you don’t have a partition to spare, consider using the /boot partition if you really want to save your running config.
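If you do go for the export, a minimal sketch for the Docker case could look like this. The function name and output path are hypothetical, and it has to run before you stop the containers, since docker ps -q only lists running ones.

```shell
# Sketch only: dump the full config of every running container as JSON
# to a file that lives on ANOTHER partition.
export_running_config() {
  out_file="$1"   # hypothetical, e.g. /boot/containers-inspect.json
  ids=$(docker ps -q)
  if [ -z "$ids" ]; then
    echo "no running containers" >&2
    return 1
  fi
  # Intentionally unquoted: one argument per container ID.
  docker inspect $ids > "$out_file"
}

# Usage:
#   export_running_config /boot/containers-inspect.json
```

The output is not a compose file, but it does contain the image, environment, mounts, ports, and networks of each container, which is most of what you would need to rebuild one by hand.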
If you’re considering a donation to FOSS, join me in sending a few bucks over to CGSecurity.
remove, recurse, force
wrong path, there is no backup
desperate panic
Why it wasn’t in version control is the real issue here…
I’m not denying that stupid stuff happened, nor that this was entirely preventable. There are some practical reasons, unique to large, slow-moving orgs, that explain why it wasn’t (yet) in version control.