So, imagine you are in the following situation:
You have an “enterprise” backup software like Veeam running. Now, for some reason the file-level restore doesn’t work at 2 a.m. and you start to slightly panic. You don’t notice that there is a snapshot running on the virtual machine, and the next thing you do is a disk file restore to just completely replace the disk with the wrong or faulty data.
With that, you have fucked up. Sadly, this enterprise-level software doesn’t recognize that there is an active snapshot running on the VM, and just replaces the faulty disk. This breaks the snapshot chain, because you replace the disk running an active snapshot with the same disk from an earlier point in time, which had no snapshot. What does that lead to?
- You cannot remove any snapshot anymore
- Veeam depends on snapshots, so you won’t be able to back up anymore
- This is a production system, and it is very dangerous not to have any backup
Now, how do you remove that snapshot without any risk of data loss and/or corruption?
As a disclaimer, I am by no means what some would call a “cloud” engineer (cringe, http://xkcd.com/908/). I have a basic understanding of how VMware vSphere and ESXi work, but I am no expert. This is simply a solution I have found; there might be easier ways to solve an issue like this.
vmkfstools, a tool integrated into every ESXi host, can create a copy or clone of a VMDK. There are two ways to approach this issue:
- You copy each VMDK to a new location. You use the delta file (e.g. vmname_1-000004.vmdk) as the source, so vmkfstools consolidates the disk during the copy.
- You don’t trust vmkfstools entirely and clone only a clean disk to the new location. This requires you to first get the data off the disk that has the corrupted snapshot running.
Once you have finished cloning all the disks, simply recreate the VM and attach all the cloned, corruption-free disks to the new machine.
Getting the data off the corrupted disk
With our file system and disk layout, we had only one “physical” disk with a partition table, the boot disk. Everything else was running on so-called data disks with LVM.
This made everything simpler: you can add a fresh disk to the VM that has the corrupted snapshot running. This new disk won’t have any snapshot and is therefore clean. Simply add it to the same volume group and move all the data to it online. Once complete, you can remove the faulty disk!
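As a rough sketch, the online migration inside the guest could look like this. The device names and volume group name are hypothetical (assume /dev/sdc is the freshly added clean disk, /dev/sdb the faulty one, and vg_data the existing volume group); adapt them to your layout before running anything:

```shell
# Initialize the new, snapshot-free disk as an LVM physical volume
pvcreate /dev/sdc

# Add it to the existing volume group
vgextend vg_data /dev/sdc

# Move all extents off the faulty disk while the system stays online
pvmove /dev/sdb

# Remove the now-empty faulty disk from the volume group
vgreduce vg_data /dev/sdb
pvremove /dev/sdb
```

After the pvmove completes, the faulty virtual disk holds no data anymore and can be detached from the VM.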
Sadly, downtime is required for cloning the disks, as the running virtual machine locks them. Once the server is successfully shut down, you can proceed. Use the following command to clone a disk:
vmkfstools -i <source disk (disk_file.vmdk)> <destination disk>
Clone one disk after another. You can create the new virtual machine while the clone job is running. Make sure you attach the disks to the new virtual machine in the same order, and take care of things like the disks’ SCSI addresses and resource settings.
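As a hedged example, cloning a two-disk VM might look like this (the datastore path and VM names are made up; note that the source for the boot disk is its newest delta file, so vmkfstools consolidates the whole snapshot chain into the clone):

```shell
cd /vmfs/volumes/datastore1

# Boot disk: clone from the latest delta so the chain is flattened
vmkfstools -i ./brokenvm/brokenvm-000004.vmdk ./newvm/newvm.vmdk

# Data disk that was already emptied via LVM: clone from its base vmdk
vmkfstools -i ./brokenvm/brokenvm_1.vmdk ./newvm/newvm_1.vmdk
```

Each destination file is a standalone, snapshot-free disk that you can attach to the newly created VM.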
With that, you should have a running VM again, free of corrupted snapshots!
Conclusion and tips
I am pretty sure there are easier ways to do this. I mean, I can’t be the first person something like this has happened to. If you know one, please feel free to let me know and I’ll gladly adjust. Here are some quick tips, basically a TL;DR:
- Don’t ever, ever restore a disk that has a running delta file/snapshot.
- Try not to do this kind of ops work late at night or early in the morning. Having a clear head sure helps a lot!
- If you want to keep the MAC address, make sure you shut down the original VM.
- Depending on the amount of data, plan enough time for your maintenance. The system will be offline for most of this time.
- For general peace of mind: make sure you shut the VM down in a clean state. Applications should be closed properly.
- LVM is your hero.
I hope this can help some people!
Happy panic ops!