How to Ensure Your Disaster Recovery Strategy Will Actually Work When You Need It to

Systems fail. That’s an unfortunate fact of computing. If things were otherwise, this guide would have been much shorter. However, what happens when the systems that we build as a failsafe against failure suffer failures of their own? 

Just as you watch over your production systems, you must also put forth the effort to ensure that you always have at least one complete, undamaged, and retrievable copy of your data. You cannot stow away your backup data and hope for the best. 

I took a phone call from a customer whose sole line-of-business application server lost both of its mirrored drives before they knew that they had a problem. 

Because this business closed at 5 PM and didn’t re-open until 8 AM, we had set them up to perform full daily backups that automatically ejected the tape upon completion. While trying to help them, I learned that they had hired an onsite administrator for a short time.  

During his stay, he switched them over to weekly full backups with daily incremental jobs. He also disabled the automatic eject feature. 

When he left, he neglected to train anyone on the backup system. No one knew that they were supposed to change the backup tapes daily. Every night for over a year, their backups overwrote the previous night’s data, usually only with a tiny subset of changes. 

So, when they needed it most, the backup data was not there. 

To safeguard yourself against problems such as these (and the attendant horror stories), build testing schedules into your data recovery plan. Assign staff to perform those tests. At the scheduled meetings to update the disaster recovery documentation, require that testers provide a synopsis of their test activities. 

This gives your organization external accountability over the backup process. You will have a chance to discover that no one has performed a backup without needing an emergency to reveal it.

Testing Backup Data with Restore Operations

You have a very straightforward way to uncover problems in backup: try to restore something. Most modern backup software has some built-in way to help. 

Exact steps depend upon your software. Follow these guidelines:

  • Redirect to an alternative, non-production, “sandbox” location. If your backup somehow has corrupted data, you don’t want to find out with a “test” overwrite of valid production data. If you’re ensuring that you can retrieve a virtual machine, you don’t want it to collide with the “real” system. 
  • Test restoring multiple types of data. Bring back individual files, entire SQL databases, domain controllers, virtual machines, and any other type of logical unit that you rely on. 
  • Rotate the items that you check at each testing interval. 
  • Test from more than one repository. 
  • Verify the restored information by accessing it yourself. Do not interpret a successful restore operation as proof that the data survived.  
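
Where your tooling allows scripted checks, a small PowerShell sketch can help with the “access it yourself” step at scale. The paths below are hypothetical, and the comparison is only meaningful for data that has not changed since the backup ran:

# Spot-check a random sample of restored files against their live counterparts.
# D:\RestoreSandbox and \\FileServer\Share are placeholder locations.
$sandbox    = 'D:\RestoreSandbox'
$production = '\\FileServer\Share'
$sample = Get-ChildItem -Path $sandbox -Recurse -File | Get-Random -Count 20
foreach ($file in $sample) {
    $relative = $file.FullName.Substring($sandbox.Length)
    $liveCopy = Join-Path $production $relative
    $restored = (Get-FileHash -Path $file.FullName -Algorithm MD5).Hash
    $live     = (Get-FileHash -Path $liveCopy -Algorithm MD5).Hash
    if ($restored -ne $live) { Write-Warning "Mismatch: $relative" }
}

A scripted spot-check like this supplements, but does not replace, opening and reading the restored data yourself.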

The major problem with this type of testing is its scope. You will always work with a sample of the backed-up data, not the entire set. Your priorities will naturally direct you to test the most important business components every time. Make sure to test representative items from lower-priority systems as well. 

Data corruption is sneaky. Unless your equipment suffered a major failure or someone accidentally degaussed the wrong drive pack, the odds are that you will never uncover any degradation or errors.  

Take heart: since corruption is rare, you will probably never need to restore anything that happened to become corrupted. However, do not take that as a reason to skip test restores.

We insist on multiple full copies of backup data as the primary way to protect against small-scale corruption. Unless the production data is corrupted, there is almost no chance that two distinct copies will have problems in the same place. 

The purpose of a test restore is not to try to find these minor errors. We are looking for big problems. A routine test would have caught the problem in the anecdote that opened this section.  

If someone accidentally (or maliciously) unchecked the option to back up your customer database, you will notice when you attempt a test restore. If a backup drive has a mechanical failure, you will either get nothing or blatantly corrupt data from it. 

Use manual test restores to spot-check for corruption, verify that backup covers the data that you need, and that your media contains the information that you expect. 

To shield against the ever-present threat of ransomware, only use operations that can read from the backup (not write to it) and work within an isolated sandbox. These types of tests are the only true way to verify the validity of offline media.

Testing Backup Data with Automated Operations 

Manual tests leave you with the problem of minor data corruption. Backup data sets have only increased in size through the years, compounding the problem. With the fortuitous gradual deprecation of tape, backup application vendors have seized on opportunities to add health-check routines to their software. 

They can scan through data to ensure that the bit signatures in storage match the bit signatures that they initially recorded.

Features like these call out the importance of a specific distinction in the usage of the word “automation”. It certainly applies to a process the computer performs in order to remove the burden on a human. 

 It does not necessarily mean “happens automatically”. For that connotation, stick to the word “scheduling”. In this context, do not assume that any mention of automated testing in your backup program’s interface means that it will handle everything without your help. Some programs have that capability, but this will never be a “set it and forget it” activity.

End-to-end data validation is time-consuming and places an intense load on resources. If it were quick and cheap, we would happily do it ourselves and not need the backup program’s help. Also, some validation routines block other backup operations while in progress. So, such processes need three things from you:

  1. A specific start time 
  2. Sufficient time to complete 
  3. A human-led procedure for verifying and recording the results  

In a few cases, especially at smaller organizations, there may be no major reason to avoid scheduling the start of a validation job; concerns about its impact on overlapping backup jobs simply won’t apply. If you have a slow backup system that requires multiple days to process everything, then full validation is probably not feasible.

It is better to capture as many backups as possible and rely on manual spot-checking than to allow an automated verification process to disrupt jobs. At best, these automatic checks can add some peace of mind. But they will never replace manual work.

You also have the option to create custom checks. You can use scripts or software tools to scan through temporarily restored data. They can look for problems or verify that the expected data exists. You can potentially interface them with your backup software.

For instance, you can restore data to an alternative location and have the backup application create another copy alongside it. A comparison tool can show where the data differs. Always keep ransomware top-of-mind. If you set up something like this, no process should have write access to both production data and the backup location. 

Systems administrators tend to be a clever, intelligent group. When we read guides like this, many of us think to ourselves, “I can script that!” That’s great; don’t let anything here discourage you. A virtualized domain controller that runs dcdiag on itself after a restore operation? 

A SQL Server that runs through DBCC on restored databases? Your own system for creating and validating checksums on your most important file repository? Things like that are awesome!
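
As a rough sketch of what those self-checks might look like in PowerShell (the server, database, and path names are hypothetical, and Invoke-Sqlcmd requires the SqlServer module):

# On a restored, sandboxed domain controller: run directory diagnostics and keep the output.
dcdiag /v | Out-File -FilePath 'C:\RestoreTests\dcdiag.txt'

# Against an isolated SQL Server instance holding a restored database:
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance 'SANDBOX-SQL' -Query "DBCC CHECKDB('CustomerDB') WITH NO_INFOMSGS"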

You can never have too many helping checks. However, you can never rely solely on them, either. In the event of any kind of failure that backup does not recover, management will ask, “Did you verify that yourself?” 

They will not recognize the value of your scripts. Your skills will not impress them. A solid track record of twenty years without a failure will not make any difference. Worse, if a data loss exposes your company to litigation, judges, attorneys, and jurors will care even less.

You must employ the sort of manual processes that non-technical people understand. An answer of, “That particular data set rotates to human validation every three months and the disaster hit at just the wrong time,” would help to pull attention away from you. 

At the same time, doing the work to prepare against such conditions properly can help to ensure that you never have to face them.

Remember that automated routines can only supplement manual, personal operations. They will never stand on their own. 

Geographically Distributed Clusters

Due to the logistics involved, few organizations will utilize geographically distributed clusters (sometimes called stretched clusters). Combined with synchronously replicated storage and a very high-speed site interconnect, they offer a high degree of automated protection. 

Properly configuring one requires many architectural decisions and an intimate understanding of the necessary hardware and software components. This guide will not dive that far into the topic. 

The basic concepts of a geographically distributed cluster:

  • These clusters are built specifically for business continuity. They are not an efficient solution for making resources available in two sites at once. 
  • Geographically stretched clusters must use synchronously replicated storage for effectiveness 
  • Administrators often set resources so that they operate in only one location except in the event of a failover. Individual resources are configured to run in the location closest to their users, if possible. 
  • Each location should be able to run all cluster resources 
  • Each location should have sufficient capacity to allow for local failures. As an example, if the resources on your cluster require 5 nodes to operate and you want N+1 protection, then each site requires 6 nodes. 
  • Resources must be prioritized so that, if the cluster does not have enough nodes to run everything, the most important resources remain online (a short sketch follows this list)
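
On a Microsoft failover cluster, a minimal PowerShell sketch of that prioritization might look like the following; the role and node names are hypothetical:

# Priority values: 3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start.
Import-Module FailoverClusters
(Get-ClusterGroup -Name 'SQL-VM').Priority = 3000
(Get-ClusterGroup -Name 'Reporting-VM').Priority = 1000
# Keep a role in its "home" site when possible by listing its preferred owners.
Set-ClusterOwnerNode -Group 'SQL-VM' -Owners 'SiteA-Node1','SiteA-Node2'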

If you have created such a cluster, you must periodically test it to ensure that it can meet your criteria. Because the sudden loss of an inter-site link or storage device will almost certainly trigger a resource crash, it would be best to perform these tests with only non-production resources. 

The easiest way to accomplish this goal is to schedule downtime for the entire system, take the production resources offline gracefully, and do all of your work with test resources. 

If your protected resources do not allow that much downtime, then you can use cross-cluster migration tools to evacuate the resources to other clusters during the test.

In many cases, you will not have any good options available. Alternatives include:

  • Use test systems with the same fundamental configuration as your production systems and test with those 
  • Remove a node or two from each site, create a secondary cluster from them, perform your testing, then rejoin the nodes to the production cluster

These alternatives have problems and risks. Test systems let you know how a site failure would theoretically work, but they do not prove that your production cluster will survive.

Individual nodes could have undiscovered problems, recently added resources might take you slightly over your minimum required node count, and the cluster configuration may not function as expected in a site failure.

A common problem that’s not immediately obvious without testing is that your cluster configuration might take everything offline in all sites because it can’t establish a quorum. Worse, it might keep disconnected sites online simultaneously, running on storage units that can no longer synchronize, leading to a split-brain situation. 

You will catch those problems in pre-production testing, but changing conditions can affect them (adding nodes, unplanned outages in multiple sites, etc.).  

Testing geographically distributed clusters 

When establishing the tests to run, start with probable events. Look specifically at the resources that the cluster operates and poke at their weaknesses. A few ideas:

  • Take a storage location offline without notifying the cluster 
  • Unplug network cables 
  • Disable the LAN for the cluster nodes in one site 
  • Reboot the device that connects the sites. Assuming redundant links, try them separately, then all in one site at the same time. 

Use your imagination. Also, don’t forget to perform the same sorts of tests that you would for a single site cluster (node removal, etc.). 

Coping with the challenges of geographically distributed clusters

In reality, most organizations cannot adequately test production clusters of any type. Do not use that as an excuse to do nothing. You always have some things to try. For instance, if you skip the storage tests, you can perform validation on a Microsoft failover cluster almost any time without impact. Research the clustering technologies that you use. Look to user forums or support groups for ideas. 
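
For example, on a Microsoft failover cluster, you can run the validation wizard from PowerShell and skip the disruptive storage tests; the node names here are hypothetical:

Import-Module FailoverClusters
# Validate everything except storage, which would disrupt online disks.
Test-Cluster -Node 'SiteA-Node1','SiteB-Node1' -Ignore 'Storage'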

Make sure not to over-promise on the capabilities of geographically distributed clusters. Take time to understand how to deal with conditions such as the aforementioned quorum outage. 

Above all else, take special care to understand how your storage and clustering technologies react to major failures. Do not rely on past experiences or general knowledge. Use strict change tracking and review the build at each disaster recovery update cycle. 

Backup will always be your best protection against anything that threatens your cluster. Make certain that clusters have adequate and operational backup.

Testing replication

Replication technologies are built specifically to deal with failovers, so they are not as difficult to test as geo-clusters. 

Testing almost always involves downtime, but usually has a manageable impact. Unlike geographically distributed clustering, testing failover of a resource that shares a common platform with others usually tells you enough that you don’t have to test everything.

When you first build a replication system, work through a complete failover scenario. This exercise helps you and your staff more than the technology does on its own. Replication and failover do not always work the way that administrators assume. 

If you see the entire procedure in action, then you will have a much better understanding of what would happen to you in a real-world situation. 

Document anything that seemed surprising. If you find a blocking condition, shift the parameters. Continue testing until failover works as seamlessly as possible before going into production.

Many times, small organizations set up Hyper-V Replica without a complete understanding of the technology. They follow the instructions and the prompts, and everything appears to work perfectly. Then, they try to fail over to their secondary site, and nothing works. 

On investigating these problems, we discovered that many of them were replicating a domain controller virtual machine and had no other DCs online at the secondary site during the failover. When the domain controller went offline, the secondary site could no longer authenticate anything, including the Hyper-V Replica operations. 

This is why the earlier section on configuring replication called out the importance of using application-specific replication where possible. It also points out the importance of testing; sites that tried to fail over their replicas as a test fared much better than sites that didn’t try until catastrophe struck. 

For a clear example, consider virtual machines protected by Hyper-V Replica. If you have a test virtual machine that spans the same hosts and same storage locations as your production virtual machines, then start with it. 

Provided that all conditions match, it will give you a good idea of what would happen to the production virtual machines. If you have any low-priority production virtual machines, or some that you can take offline for a while, test with those next. 

When possible, test all resources. Failing over a sample does not uncover problems like data corruption in the targets. 

Unfortunately, testing with the real systems might not catch it either, or failing back to the main site could very well ship corrupted data back to the source. Use your monitoring tools and capture a good backup before testing any production systems. 

If possible, take the source resource offline, wait for synchronization to complete, and capture a hash of the files at both locations. For file-based resources, you can use PowerShell:  

Get-FileHash -Path C:\Source\File.txt -Algorithm MD5 

Use the same command on the destination file; if the results don’t match, something is not right. Do not perform a failover until you have found and eliminated the problem.

Note: the MD5 algorithm no longer has value in security applications, but still works well for file comparisons due to its speed and the near zero likelihood of a hash collision on two slightly different copies of the same file.
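
For a resource that spans many files, the same idea extends to an entire tree. A hedged sketch, with hypothetical paths:

# Hash every file on each side, then compare; any output marks a difference to investigate.
$sourceHashes = Get-ChildItem -Path 'D:\SourceData' -Recurse -File | Get-FileHash -Algorithm MD5
$targetHashes = Get-ChildItem -Path '\\ReplicaHost\TargetData' -Recurse -File | Get-FileHash -Algorithm MD5
Compare-Object -ReferenceObject $sourceHashes -DifferenceObject $targetHashes -Property Hash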

Once you have successfully failed a resource to your alternative site, bring it online and make certain that it works as expected. Some configurations require advanced settings and comparable testing. 

To return to our Hyper-V Replica example, you can set up the replica virtual machines to use a different IP address than the source VMs. If you have done that, ensure that the replicas have the expected connectivity.

After testing at the remote site, fail the resource back to the primary. Depending on the extent of your testing, it may take some time for any changes to cross. Return it to a service state and ensure that it works as expected. Have your backup data ready in case. 

Do Not Neglect Testing

We all have so much work to do that testing often feels like low-priority busy work. We have lots of monitoring systems to tell us if something failed. If nothing has changed since the last time that we tested, would another test tell us anything? 

Regardless of our workload and the tedium of testing, we cannot afford to skip it. 

We humans tend to predict the future based on the past, which means that we naturally expect functioning systems to continue to function. However, the odds of system failure increase over time. The only sure way to find problems is through testing. 

Conclusion 

The inevitability of system failures underscores the critical importance of proactive measures. The opening anecdote highlights the consequences of neglecting backup procedures. 

Testing, both manual and automated, emerges as a vital aspect of disaster recovery plans, ensuring the efficacy of backup data. 

Geographically distributed clusters and replication technologies offer additional layers of protection, necessitating thorough testing to uncover potential pitfalls. 

The Step-by-Step Guide to Disaster Recovery

We’ve talked about cataloging personnel and items, configuring systems to protect against data loss, and setting up sites to accommodate failed-over data and dislocated employees.

Now, we need to establish the processes that people will follow during and after a disaster. We covered the topic of downtime procedures earlier. We mainly intend those for times when the system is offline but recoverable.

Your disaster recovery business process must accommodate failures of greater magnitude. You can use the downtime procedures that you developed as a starting point and as an idea generator, though.

Incident Response

Businesses encounter challenges every day. Executives and staff quickly learn how to prioritize and handle the problems that they face. On a typical day, their difficulties fall in line with normal expectations. Events outside the norm take extra time to understand and adjust to.

The time needed scales with the degree to which an occurrence skews from the normal and familiar. If staff don’t know who to contact, that compounds the problem.

To smooth the handling of emergencies, organizations need to build an incident response process. Larger organizations often have designated incident response teams. Whether assigned to an individual, a team, or collectively to everyone, incident response begins with triage.

Members of an incident response team might not know what to do, but they must know who to involve. Relaying information to the incident response team usually happens automatically as employees pass news up their reporting chain. Eventually, it reaches someone who knows how to activate the response process.

An incident response team should include at least one, preferably two, members from every department. As organizations subdivide, the response team grows. When activated, the team should collaborate as quickly as possible. 

They need to decide on questions such as the following:

  • Can a single department or subgroup handle the incident? 
  • Will this event impact other departments or subgroups? 
  • Has the problem caused downtime? 
  • Will downtime continue? 
  • Does the team need to send broad notifications to employees? 
  • Should staff reach out to customers? 
  • Who will address the problem? 
  • How will staff involved directly in a solution send updates to the response team? 
  • How will the response team update employees or customers? 

A problem that necessitates involvement from an incident response team often works much like a planned project. If you have experienced project managers, appoint one or more to serve on the team.
Effective incident response requires participation. Establish clear procedures for designating alternates. A vacation or illness should not prevent a rapid solution to an unexpected event. 

Executive Declaration 

Enacting downtime and disaster recovery processes has associated costs. Personnel cease carrying out their normal functions and shift into their alternative emergency roles. Switching from a primary data system to a replica has time and risk implications mentioned in the relevant section.
Equipment and inventory inspection and recovery efforts will accrue liabilities and debt, as will calling in contractors for any tasks. To keep things in order, categorize three levels of event response:

  • Define activities that occur during and after a crisis. Plan for these to proceed with minimal or no supervisory guidance. This level includes items such as moving everyone to safety, notifying authorities, and beginning low-impact downtime procedures. 
  • Create a “downtime” operational level. Because switching to and from downtime operations incurs time and risk, clarify that it can only happen when indicated by staff with a particular level of authority. This would not include any low-impact activities that you included in the first level. 
  • Specify a “response and recovery” operational level. This involves accounting for all personnel, relocating and failing over to alternative sites, and implementing equipment and data recovery processes

The indicated names are arbitrary; use anything that makes sense. The important part is defining responses in advance, during calm conditions, so that staff have fewer problems to solve during an emergency. Having predefined levels also helps to reduce improper reactions, such as bringing a replica online while the primary site still functions. 

Preparing and Planning for Impacted Personnel

First, your business continuity plan must cover the human aspect. It needs to provide actions and guidance, both for the people who enact the plan and for the people impacted by whatever condition caused the plan to go into effect.

Predicting user impact 

Disaster recovery plans tend to have a high degree of sterilization and focus on the business, assets, and data. While all of that occupies the bulk of the documentation, none of it is as important as the people. Employee safety needs to top all priority lists. 

The plan will include a great deal of content on what processes to follow. That will help to keep staff focused, but at all phases, everyone involved in planning needs to remember that crisis conditions look nothing like a typical day at work.
You can make some predictions on the sorts of disasters that your business would be most likely to face, but that has limited value. Most people do not know how they’ll react to a catastrophe until they face one. There is no such thing as a “normal” response. Some will focus and work well under pressure; others will not. 

People will be scared, in shock, injured, or have any of a number of other adverse responses. Afterward, the effects can linger. The death or serious injury of a coworker can traumatize others. 

While you have no way to know exactly what will happen, you can plan with the expectation that anyone who needs to put the plan into action will be operating at a disadvantage.

Tips for effective response plans:  

  • Keep all instructions short and clear; 
  • Do not assume that anyone enacting your plan understands corporate or departmental jargon and colloquialisms; 
  • Use acronyms and mnemonics for disaster response training. In documentation to follow during a response, clearly spell out any acronyms or symbols; 
  • Employ iconography. For instance, if process B depends upon the completion status of process A, use a large icon of a stop sign or similar callout at the end of process A; 
  • Where iconography does not suitably attract attention, use textual cues, such as large “Warning” boxes in a bold color and a large font.

Research or brainstorm acronyms for problems that are likely to occur and that require uncommon activities. For instance, most people have never used a fire extinguisher. You might create literature on using fire extinguishers that includes the common “PASS” acronym.

Then have pictures, or better yet, a video that matches “P” to pulling the extinguisher’s pin, “A” to aiming at the base of the fire, the first “S” to squeezing the trigger, and the final “S” to sweeping the nozzle back and forth.

If you include directions with extinguishers (highly recommended), you can have a short tag with these items spelled out. Do not assume that anyone remembers (or even attended) the training.

You can create your own acronyms. As an example, you could create a fire protocol and call it “The three Es (EEE): Extinguish, Evacuate, Escape”. Your training would expand these to “extinguish the fire if possible”, “evacuate others”, and “escape yourself”. If drilled, people have a better chance of remembering what to do when they have some simple mnemonics to work with. 

Do not overuse these memory tools. For instance, if you search the Internet for “emergency response acronyms”, you will find lists that contain government agencies, response programs, and common phrases used when response personnel communicate with others. 

People who work in disaster response full time might remember these, but no one else will. Have only a few and try to have them on printed literature near any equipment that relates to the situation that they address. 

Above all, remember that some catastrophes affect more than just your business. Some of your staff may have had their lives upended. Many will have things in their own lives to recover from. Business continuity planning must include flexibility for employees.

Working with displaced employees

Catastrophes can render a site unusable for a significant period of time. Plan in advance what the employees will do.

If you redirect staff to an alternative site, ensure that everyone knows the location. Include a reminder in the notification system. Importantly, have someone verify the viability of the site before sending everyone there. You can use an initial message that informs everyone of the situation and instructs them to wait for further notifications. Once someone deems the secondary site usable, send a follow-up notification.  

Remember that, just like surviving a disaster, an interrupted work routine causes distress. People will arrive late, get lost, and need to leave at atypical times of the day to reach appointments that normally needed only a few minutes of travel time. Plans should expect erratic attendance patterns while employees adjust.

Working with offsite employees

Many positions began transitioning to remote work years ago. The need for isolation brought on by the COVID-19 crisis dramatically accelerated that transition. 

As long as remote employees still have some system to connect to and were otherwise not impacted by the event, little changes for them. Include them in communications about the situation and remember that the conditions will have some effect on them. 

You may choose to have some employees who would normally commute to a physical location begin working from home instead. An effective transition from on-premises to at-home work requires a substantial amount of advance planning, especially if you do not currently have a formal remote work policy.  

Your organization will need to answer many questions:

  • Do employees use their own hardware? 
  • Does the company provide equipment? 
  • Does the company reimburse? 
  • How will remote employees maintain communication? 
  • Will you pay for a premium collaboration service, such as Zoom? 
  • Will you enforce a requirement of a particular service? 
  • Will work hours change? Flex? 
  • Will users connect via a VPN? VDI? Microsoft Remote Desktop sessions? A Citrix solution? Something else? 
  • Do your systems have sufficient capacity to support the potential number of remote workers? 

Some employers worry that productivity will drop from remote workers. Studies (Are Remote Workers More Productive Than In-Office Workers? & Is Working Remotely Effective? Gallup Research Says Yes) have shown that this concern has no basis. However, if the source event was a major disaster, the psychological effects and any damage to employees’ property will impact their work. 

Even without that, transitioning from the office to the home takes time. Build some adjustment flexibility into your plan. 

Notifying and accounting for employees

Your plan should already include notification trees and contact methods. Response documentation must include an accounting system. Small businesses can do this informally. 

Medium-sized businesses can require employees to check in with their supervisors, who in turn report to a central command structure. Large businesses can do the same in a tree structure or make use of dedicated call-in telephone numbers. 

Define processes for handling unreachable employees in the context of a widespread disaster. You can use interim statuses like “unknown”, but that cannot be a final disposition after attempting a single phone call. Establish a schedule for retries. When multiple attempts to locate an employee fail and you can no longer devote resources to them, report them as missing to the authorities.

To reiterate, do that only when you have reason to believe that the person might be in danger. Do not call the police if a systems administrator doesn’t answer a text message about a server crash. While that might seem obvious, make the conditions very clear in your documentation. 

Design Guidelines for Business Continuity Processes 

The overall goal of this article is to cover the role and importance of people in a disaster recovery plan. Use this information as a starting point and guidance system for building your own documentation. The actual processes to include must come entirely from your business experts. Start with these high-level points: 

  • Guidelines for managers and executives to decide between a short interruption that warrants no major response, an event that justifies switching to full downtime procedures, and a genuine disaster that requires an orchestrated response 
  • Minimal and full downtime procedures 
  • Employee immediate response, notification, and accounting procedures 
  • Relocation activities 
  • Remote work policies and practices 
  • Recovery processes


Conclusion

The recovery process portions will need a lot of space. Recovery should start only after the immediate problems have passed. It will include installing replacement systems, restoring data, ordering equipment, organizing contractors, filing insurance claims, notifying customers, and any other activities that your staff identify.  

How to use Replication to Easily Achieve Business Continuity

As costs for high-speed networking technology decline, we gain more ways to maintain operations through a catastrophe. Replication has changed disaster recovery more than anything else since the backup tape was first introduced. 

Tapes once granted us the power to conveniently move data to a safe distance from its origin. Now, we can instantly transmit changes offsite as they occur or after a short delay. 

A Short Introduction to Replication 

Replication was discussed back in an earlier article as part of a backup strategy, but in terms of disaster recovery it requires a bit more exploration. The name says most of it: replication makes a “replica”, or “copy”. “Copy” invokes the idea of backup, but they have differences.  

On the one hand, replication makes a unique, independent copy of data, just like backup. However, replicas do not have much of a historical record, nor do they have a long useful life.  

Replication involves some sort of software running within the operating system or on a smart storage platform. You start by making an initial copy, called a “seed”. The replication software then watches the original for changes and transmits them to another instance of the same software, which incorporates the changes into the replica.

Features of typical replication software:

  • Runs continuously or on a short interval schedule 
  • Functions one way at a time 
  • May act as a component of another piece of software 
  • Creates a genuine duplicate of the original, not wrapped in a format proprietary to the replication engine 
  • Replicates without human intervention; failover to replica requires intervention 

You will encounter occasional exceptions, primarily with replication systems such as Active Directory that do not treat any replica as the original. However, even in those systems, a change always occurs in one replica first, then the software transmits it to the others. 

Also, the product of a replica might be in a proprietary format, but typically only when the replication mechanism belongs to a larger program. As examples, some SQL server software has built-in replication mechanisms and some backup applications, like Hornetsecurity’s VM Backup, include a replication component. 

In those cases, the format belongs to the program, not its replication engine. 

Synchronous replication 

High-end storage systems and some software offer synchronous replication. Details vary between implementations, but all end up transmitting changes from the origin to the replica in real time.  

Synchronous replication processes have significant monetary, processing, and transmission costs. They allow for hot sites to pick up right from a failure point. Some synchronous replication systems allow for geographically distributed or “stretched” clusters. With these, you can reliably operate resources within the same cluster across distant datacenters. 

For clustered roles like databases, you can have almost zero-downtime failovers. For items such as virtual machines, a failed datacenter will cause its virtual machines to crash, but remote nodes with synchronous storage can bring the VMs back online almost immediately. 

Such protections allow you to architect an active/active design that keeps resources close to their users when all is well but to continue running in an alternative location when all is not. 

Asynchronous replication

You will find a broader offering of asynchronous replication solutions. As the name implies, they operate with delay. The replication mechanism accumulates changes at the origin for a period ranging from a few seconds to a few minutes. 

When it reaches a specified volume or time threshold, the system packages and transmits the changes to the remote point. The receiving replication system unpacks the changes and applies them to the replica. 

Asynchronous replication’s primary advantage over synchronous is cost. It can transfer, test, and acknowledge large data chunks at a measured pace instead of a rapid series of small blocks, so it reduces network load. Also, because of the convenient packaging system, some replication software will save a bit of history. 

In case the system detects a corrupted data block, it might be able to walk back the recent changes to a good point. 

Asynchronous replication can only function in active/passive mode. It does not mix with stretched clusters, although it can create a replica of a cluster at the origin site. 

Choosing synchronous or asynchronous replication 

Sometimes, you will have only one viable choice with replication. Software with a high IOPS profile may not function correctly with synchronous replication. You may uncover instances in which a software-agnostic synchronous replication does not work as well as a software package’s built-in asynchronous mechanism. 

A line-of-business vendor may prohibit supporting any installation that sits atop a synchronous replication system. In cases such as those, conditions make the decision for you. 

In other cases, you have three primary factors: 

  • Price differences 
  • Recovery point objectives (RPO) 
  • Data value 

Synchronous replication usually costs substantially more than asynchronous replication when you only compare the mechanisms. Synchronous replication also demands more from hardware, within the compute layer, the storage subsystems, and the network stack. 

Cost often sets the parameters before you even consider the other factors. Recall the discussion on RPOs from an earlier article. If a system or data set has a large RPO tolerance, then do not rush to put synchronous replication on it without some other driving force, such as stretched clusters. 

The shorter the RPO, the more you can justify synchronous replication. Asynchronous replication typically allows for very short delays, down to a few minutes or even a few seconds. If that satisfies your RPO, then prefer an asynchronous solution. 


Even with a relatively short desired RPO, low-value data won’t justify the higher cost of synchronous replication. As an example, think of a freezer unit at a food distribution company. The historical record of its temperatures has value, especially if you have outstanding litigation over food storage. 

However, the temperature of the freezer in the last five minutes before the facility collapsed in an earthquake probably does not matter to anyone. The current information matters only operationally, so it has value only while operations continue. Asynchronous replication can adequately protect this type of data. 

Avoid mixing synchronous and asynchronous replication for the same data. It might work without error, but nothing comes without cost. Replication can place a high toll on system resources. Layering replication makes it all worse and may not have any positives. 

Choosing Replication Solutions 

You will almost certainly use a mixture of replication technologies in order to achieve the best balance of support, functionality, protection levels, and resource usage. Even before looking at dedicated replication hardware or software, you have access to some replication technologies. A few that you might have right now:

  • Microsoft Active Directory 
  • Microsoft SQL Server 
  • Microsoft Exchange Server 
  • Microsoft Hyper-V Server 
  • Backup software application, as an example, Hornetsecurity’s Total Backup 
  • Some SAN and NAS devices 
  • Windows Server Datacenter Edition provides Storage Replica 
  • Windows Server 2019/2022 Standard Edition has a limited implementation of Storage Replica
     

Look at your major software servers and packages to see if any of them have replication capabilities. Prefer the most specific replication technology that satisfies your requirements. Follow this decision process:
 

  1. If you have a virtual or physical machine running software that has its own replication mechanism (like Active Directory), then use the application’s mechanism only. 
  2. If your hypervisor has a replica function and the software in the virtual machine cannot replicate itself, use the hypervisor’s replication tool. 
  3. If the machine is physical or you can’t use the hypervisor’s replication (perhaps because you do not have a target system running the same hypervisor), then use operating system replication (like Storage Replica). 
  4. If you cannot use replication in the operating system, use NAS or SAN replication.
     

If, like most organizations, you have many virtual machines running a range of server applications (like AD, Exchange, SQL, etc.), then you should decide on replication separately for each. You will not get the best results by trying to force everything into the same solution. Some things will not work under some replication configurations that other programs can use without trouble. 

The decision factors do not directly include backup replication. Most of the replication features in backup applications only make additional copies of backed-up data, not general-purpose data replicas. In that case, they only count as applications themselves (step 1) and only for the archives that they create. 

If your backup program has a general data replication feature, then you can prioritize it before or after step 4. This order of preference exists for several reasons:
 

  • If the software manufacturer went to the trouble of building a replication mechanism into their software, then it’s probably the best. Many of Microsoft’s technologies have been developed over decades. External replication cannot know the inner workings of these programs, so it will not work as effectively. 
  • Vendors will not always support their software in conjunction with certain replication technologies. For example, Microsoft does not support using Hyper-V Replica on Exchange. 
  • Replication functions provided by SANs, NAS devices, and hypervisors require the target to run the same or very similar system. If you decide to switch vendors, you’ll have to start replication processes created with their functions all over. 

 

If you cannot get a sufficient budget to maintain the same services or equipment at all locations, you may run into some last-minute or mid-stream problems. Synchronous replication might present an overriding decision point. 

You must remain wary of support concerns and other problems. In the absence of such barriers, synchronous replication moves up to the #2 preference. You will also need it anytime you intend to use fully functional stretched clusters.

Do Not Replace Backup with Replication 

Backup and replication have similar features, but you cannot use them interchangeably. If you must choose between them, always choose backup. Replication exists to enable rapid failover. Replication characteristics that preclude its use as a backup tool:
 

  • Little historical information 
  • Usually only one complete copy 
  • Limited testing ability 
  • No capability for regular offline copies 
  • Replication does not always utilize quiescing technology such as VSS 

 

If undesirable data, such as encrypting ransomware, travels to the replica, then it will probably invalidate the entire thing. You will then need to use your standard backup restoration process. Data deleted more than a short time ago will not exist anywhere in the replica’s files. 

You will always need the long-term and offline protection features of a true backup.

Considering Replication Licensing Implications 

Since replication is different from backup, its use may impose some licensing considerations. Microsoft does not consider a replica virtual machine as an “offline” or “cold” copy since the replication mechanism constantly updates it and the replica is a fully functional entity distinct from the original. 

For that reason, hosts that maintain a replica of a Windows Server virtual machine require a separate license from the source host’s license that covers the original. 

Above, we mentioned that Active Directory’s own replication serves it better than other replication types, such as Hyper-V Replica. If you use Hyper-V Replica to protect a domain controller, you must still license the replica host as though the virtual machine were online. 

So, running one distinct domain controller in each site gives you the best replication technology and makes no difference to your licensing. Note: this rule applies to any Windows Server instance in a virtual machine, not specifically to Active Directory. 

You will need to investigate the licensing rules of your software and consider them in the context of replication. This can become complicated quickly, as it can also depend on the type of replication in use (application, hypervisor, operating system, dedicated software, or hardware) and other factors. 

For instance, if you add Software Assurance to a Windows Server host license, you can replicate its virtual machines to other systems without additional licensing costs. For the most comprehensive answers, work with trained licensing specialists at authorized resellers or contact software vendors directly. 

Hornetsecurity provides full 24/7 support as part of its services to help users achieve the perfect configuration. Make use of services like this to ensure your licensing matches your requirements. 

Configuring Replication 

Despite the plethora of replication solutions, they share common configuration points. The exact steps will depend on your hardware or software, so we will give a generic overview of the process.

Establishing replication sources and targets 

Replication requires at least two endpoints capable of acting as replication partners. That requires a mirroring configuration of hardware and possibly software on each end. To begin, install and configure the hardware and, if you use a software-based mechanism, configure that as well. Necessary steps depend on your replication solution. A few examples: 

  • For generic data replication, configure the hardware or software as an endpoint using the system’s directions. 
  • For Active Directory replication, install the Active Directory Domain Services role on a system in each location. Follow the necessary steps to add them to the same domain and ensure that they have IP connectivity over your inter-site link. Use Active Directory Sites and Services or PowerShell to logically separate the sites. Active Directory will automatically set up its own replication, applying special rules for traffic that crosses sites in the expectation that they have less than gigabit speeds, that the links might have high contention, and that sites may periodically lose connection. 
  • For Microsoft SQL replication, you first need to fully install SQL on each endpoint. You will then select one of SQL Server’s many data synchronization options and configure it accordingly. 
  • Hyper-V Replica requires you to first configure each participating server to receive replicas. For clustered Hyper-V hosts, you must create the Hyper-V Replica Broker role and configure it instead of working on any individual node. Once you complete that step on all relevant systems, you can then configure individual virtual machines to replicate to specific target replica partners. 

If your backup application includes a replication function, follow its directions for setup and configuration. As mentioned previously, you will likely have a mixture of replication configurations. Create a checklist of the items to cover in order to ensure that you configure them all. 
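
To make the Hyper-V Replica case concrete, a minimal sketch of the PowerShell involved might look like the following; the server names, VM name, and storage path are hypothetical, and this assumes Kerberos authentication over port 80:

# On the target host: allow it to receive replicas.
Set-VMReplicationServer -ReplicationEnabled $true -AllowedAuthenticationType Kerberos `
    -ReplicationAllowedFromAnyServer $true -DefaultStorageLocation 'D:\Replicas'
# On the primary host: enable replication for one virtual machine, then start the first copy.
Enable-VMReplication -VMName 'LOB-App' -ReplicaServerName 'hv-target.contoso.com' `
    -ReplicaServerPort 80 -AuthenticationType Kerberos
Start-VMInitialReplication -VMName 'LOB-App'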

 

Creating an initial seed 

After enabling replication, the first thing that must happen is a complete build-up of the starting replica. That will probably amount to a very significant chunk of data. Perform some rough calculations based on the data size and the speed of your intersite connection. 

If you discover that it might take several days to finish transmission of the beginning replicas, you can create an offline initial seed. The process works like this:
 

  • Establish the replication partners 
  • Define the data or objects that will replicate from the current site and configure replication 
  • Use the application or device’s process to create an initial seed on a transportable device (such as a USB hard drive) 
  • Physically transfer the seed to the target system 
  • Establish the replica from the seed data 
  • Start the replication process 

 

Because operations will continue while the seed is in transit and building the replica, the replication system will need some time to catch up. You should not need to do anything else manually. 

Most dedicated replication software uses the term “initial seed” or something recognizably similar. Software with built-in replication typically uses other wording. For example, follow the “Install From Media” (IFM) procedure when setting up an Active Directory Domain Controller. 
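
In Hyper-V Replica, for example, an offline seed might look something like this sketch (the VM name and drive path are hypothetical):

# On the primary host: write the initial copy to portable media instead of the wire.
Start-VMInitialReplication -VMName 'LOB-App' -DestinationPath 'E:\ReplicaSeed'
# On the replica host, after the drive arrives: import the seed.
Import-VMInitialReplication -VMName 'LOB-App' -Path 'E:\ReplicaSeed'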

 

Maintaining Replicas 

Most replication technologies work unsupervised after setup. Regardless of your confidence in the tools that you use, you should set up automated monitoring to keep an eye on them. You could also rely on some sort of daily manual verification process. However, your organization probably would not want a 24-hour or weekend period to pass without viable synchronization. 

Also, the more a process tends to succeed, the less inclination tech staff will have to check on it. Your monitoring method depends on the architecture of the replication system. Set up alerts for: 

  • Windows event logs 
  • Linux error logs 
  • Unexpected service halts 
  • Inter-site connection breaks 
  • Storage capacity
     

Some systems may have an option to send notification e-mails. While you should take advantage of those, do not rely on them. If the service fails completely, it will not send anything.

It’s easier to forget about a message that you never received than to ignore one that you did. Use active external monitoring.
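
For Hyper-V Replica, a simple active check could poll replication health on a schedule. A hedged sketch:

# List any replicated VM whose health is not Normal.
Get-VM | Where-Object { $_.ReplicationMode -ne 'None' -and $_.ReplicationHealth -ne 'Normal' } |
    Select-Object Name, ReplicationMode, ReplicationHealth
# Measure-VMReplication reports detailed per-VM replication statistics.
Measure-VMReplication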

If you don’t have much experience with a particular tool, it may take some trial and error to properly balance its monitoring. As with all other things, you want it to tell you what you need to know without overwhelming you. The technology may introduce some concepts that are new to you. 

For instance, most asynchronous replication systems involve some sort of “logging” and “playback” technique. Hyper-V Replica (HVR), for example, builds changes at the source into a log file. At the designated time interval, it ships the log file to the target replica host. Once received, the replica host “replays” the contents of the log file into the replica. 

If something goes wrong with HVR, you will see symptoms in the directory that contains the replica and its log files. HVR keeps the log files for a time, but eventually cleans them up. If you’re accumulating log files, that signals a problem. If you have zero log files, you will want to investigate. In the case of HVR, you should have accompanying event log entries that provide detail. 

However, monitoring the storage location in addition to the logs gives you additional opportunities to detect a problem before it has a permanent effect. It will also set up a best practice for you in case you have a different tool that does not write to the event log or some environmental problem that causes logs to roll over more quickly than you can process them. 

Create a plan to accommodate resource discontinuation. Most replication systems will not automatically perform cleanup when you stop replicating. Whatever procedures you have developed for decommissioning applications and systems, append a process that describes how to stop and clean up replication. 

As part of discovery and initial testing, find out how to handle these situations in your replication tools. Take time to learn if and how you can safely move replicas. If replication is a subsystem of another application, then it typically follows the application’s resource moving rules. 

For example, you can use Hyper-V’s Storage Migration to move a replica virtual machine and HVR will automatically deliver replica files to the new location. 

If you follow the supported steps to move the NTDS.DIT file on any domain controller, it will not break Active Directory replication. For application-agnostic replication technologies, you may have more work. Research in advance so that you never need to figure out how to move items under pressure. 
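
As one example, with Hyper-V Replica a supported storage migration moves a replica’s files without breaking the partnership; the VM name and path here are hypothetical:

# Relocate the replica VM's files; HVR continues delivering changes to the new path.
Move-VMStorage -VMName 'LOB-App' -DestinationStoragePath 'E:\Replicas\LOB-App'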

 

Correcting Problems with Replication 

You will encounter three main problem categories with replication: 

  • Broken connections 
  • Overwhelmed destination 
  • Synchronization collisions 

Most mature replication technologies deal with broken connections gracefully. They wait until they can reach the destination again and pick up where they left off. Test new systems before deployment and learn how they cope with and report these events. Use this information to shape your monitoring plan and responses. 

With some technologies, the replica system can fall so far behind the primary that it simply gives up and breaks out of the partnership. The exact technique to recover depends entirely on the product. Check its literature for information. 

Usually, the fix involves a resynchronization, which you can target for a quieter period. Discovering the root cause is just as important as correcting the condition. 
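With Hyper-V Replica, for example, the repair is typically a resynchronization that you can schedule into a quiet window. A hedged sketch, with the VM name and the window as assumptions:

# Resume a broken partnership and resynchronize starting at 2 AM tomorrow
Resume-VMReplication -VMName 'SQL01' -Resynchronize -ResynchronizeStartTime (Get-Date '02:00').AddDays(1)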

If it resulted from a broken inter-site link, then you know why, and you probably have no recourse other than to fix it and move on. However, if the link stayed active, then simple corrective action may only set it up to fail again. Some ways to address repeatedly overwhelmed replication: 

  • Adjust the delay between transmissions. A natural instinct is to increase the delay to give the target system more time to process log files. However, it sometimes helps to reduce the interval so that the link and secondary system work with smaller files. 
  • Reduce the load on the inter-site link. 
  • Increase the speed of the inter-site link. 
  • Upgrade the target hardware.
     

The first option is the easiest but involves potentially frustrating trial and error. The last two items will likely involve capital expenditures and contractors. To discover where to focus your efforts, set up monitoring on the resources. 

Learn whether the target becomes overwhelmed because the link cannot deliver the data in time for it to be processed before the next package arrives, or because the data arrives quickly enough but the target lacks the processing speed to handle it before another arrives. The first points at the link; the second points at the target hardware. 

You need to find the bottlenecks before you start trying to fix them. A common technique for load reduction is removal of non-essential resources from the replication chain. For virtual machines, you can relocate swap data to separate virtual disks and exclude those disks from replication.
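With Hyper-V Replica, that exclusion happens when you enable replication. A minimal sketch, assuming a dedicated page-file disk and hypothetical server names and paths:

# Replicate WEB01 but leave its dedicated page-file disk out of the replication chain
Enable-VMReplication -VMName 'WEB01' -ReplicaServerName 'drhost.contoso.local' -ReplicaServerPort 80 -AuthenticationType Kerberos -ExcludedVhdPath 'D:\VMs\WEB01\pagefile.vhdx'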
 

 

Preventing “split brain” and synchronization collisions 

 

Cluster technologies use some form of external arbiter to prevent access to the same object from multiple locations, with the expectation that a completely isolated member will not come online without extraordinary steps. 

Controls might be complex, like Microsoft’s dynamic quorum, or simple, like a lock file. In contrast, replication works with linked but unique objects. Any replication partner must have the freedom to operate on its own replica even when completely isolated. The only arbiters are human operators. 

Replication functions properly when one partner processes a change to its object and transmits that change to the other partner(s). When two or more partners in replication receive changes to their local copy of the same item, you have the potential for a collision. 

Replication - Preventing Split Brain

Active/active replication systems have some capability to minimize these problems. Active Directory uses timestamps and other arbitration techniques to choose the one change that it will keep and records the others as historical changes. Active/passive replication typically does not have such robust protection. 

Consider a situation in which Site A replicates to Site B. The inter-site link drops. Site A continues operating as normal because it was the original. An operator at Site B has built a script that automatically fails over to the local replica when the link drops, on the assumption that such a drop means that Site A has gone down. 

Unfortunately, that assumption was incorrect. The script runs, resulting in both sites actively making changes to their local replicas. We call this condition “split brain”. When the link is restored, Site A will try to resume synchronization. If B’s replica is not in the condition that Site A expects, synchronization will fail with no automatic way to recover. 

Replication - Synchronization Collisions

Depending on the replication technology in play, you may have a great deal of clean-up work to look forward to. Complete recovery may not be possible. In the case of Hyper-V Replica, you will need to choose one replica as the origin and resynchronize to the other as if it were new. 

You can copy any data that you want to save out of the replica first, then put it back into the origin. File-by-file replication systems will only have trouble with competing file changes. More complex systems with no viable repair path may suffer permanent data loss. 

Even active/active mechanisms like Active Directory have some risks. It should have no problems surviving the above scenario because it was designed with those types of failures in mind. However, you can cause permanent damage to Active Directory in other ways. 

In the past, rolling a virtualized domain controller back to a previous state could cause irreparable damage to the directory. Research “USN rollback” for more information on that problem. For the purposes of this discussion, understand that you can break any kind of replication technology by using it in an unsupported fashion. 

Most such breakdowns require restoring to an earlier backup. A few best practices can keep you out of split-brain conditions: 

  • Do not automate failover for replication systems that have no automated arbitration 
  • Create a defined process for initiating failover (see the upcoming section on Business Process for Disaster Recovery for more information) 
  • Do not mix virtual machine snapshot/checkpoint technologies with replication technologies 

 

As a note on the last bullet point, Hyper-V incorporates its checkpoint technology to facilitate backup operations, including Hyper-V Replica. 

These special-purpose checkpoints pose no risk to replication. Many synchronization collisions occur because a change was made, duplicated to a replica, rolled back at the source, and then the source changed again prior to the next replication interval. 

The new changes appear to conflict with an earlier change, which throws the replica into an unknown state. Because Hyper-V’s backup and replication checkpoint functions never revert, they do not cause collisions. 

Fundamentally, replication exists to enable rapid failover to an alternative site. When used correctly, it can allow nearly uninterrupted data services even in a major catastrophe. When used incorrectly, it adds a lot of overhead at best and causes a great deal of damage at worst.
 

Leveraging Replication in Disaster Recovery 

 

While replication can address the offsite requirements of backup, it does not replace any of its other components. You cannot maintain a series of offline replicas, nor will replication software have a simple way to retrieve historical data (like an e-mail or a single file). 

Replication software will overwrite good data with corrupted data without hesitation and then delete its previous state. Replication supplements backup well, but it will never replace it. If you have sufficient funding and at least one viable alternative site, replication enhances your business continuity solution. 

 

Conclusion 

 

Replication strengthens disaster recovery by maintaining independent, continuously updated copies of your data at another site. Whether synchronous or asynchronous, it enables rapid failover, minimizing downtime during crises. 

While a valuable supplement, it does not replace backup. Where replication excels is in maintaining nearly uninterrupted data services during catastrophic events, which makes it a crucial element of a robust disaster recovery strategy.

The post How to use Replication to Easily Achieve Business Continuity appeared first on Altaro DOJO | Backup & DR.

]]>
https://www.altaro.com/backup-dr/how-to-use-replication-to-easily-achieve-business-continuity/feed/ 0
How to Back Up your Hyper-V Virtual Machines https://www.altaro.com/backup-dr/back-up-hyper-v-virtual-machines/ https://www.altaro.com/backup-dr/back-up-hyper-v-virtual-machines/#respond Sun, 28 Feb 2021 06:12:21 +0000 https://www.altaro.com/backup-dr/?p=764 Virtualization is becoming ubiquitous today as more and more companies realize the benefits of Hyper-V. But as line of business apps and mission-critical functions are moved from physical hosts to virtual ones, the idea of using “snapshots” to take backups of VMs is simply not enough. They don’t scale, can be confusing to work with, and consume resources much better left for running the production VMs. In this article, we’re going to look at what you want to do to provide a scalable and performant backup solution for your Hyper-V virtual machines running on Microsoft Windows Server 2012, 2012 R2, and 2016 hosts.

The post How to Back Up your Hyper-V Virtual Machines appeared first on Altaro DOJO | Backup & DR.

]]>

Virtualization is becoming ubiquitous today as more and more companies realize the benefits of Hyper-V. But as line-of-business apps and mission-critical functions are moved from physical hosts to virtual ones, the idea of using “snapshots” to take backups of VMs is simply not enough. They don’t scale, can be confusing to work with, and consume resources much better left for running the production VMs. In this article, we’re going to look at what you want to do to provide a scalable and performant backup solution for your Hyper-V virtual machines running on Microsoft Windows Server 2012, 2012 R2, 2016, and 2019 hosts.

Why Backing Up VMs is Different

It’s important to understand that, while a virtual disk is, at least to the host operating system, a single big file, it’s definitely not just a single big file. Even a simple VM with a single disk for everything has an image of a system volume, with the full directory structure you’d find on any system, along with supporting files for the image, changes, and so on. More advanced VMs may also include other image files on other storage arrays to handle databases or log files, or to host the applications that the VM runs. All of these are likely to be in a constant state of reads and writes while the VM is running. When a virtual machine is shut down, you can treat that data like a handful of really big files, but who wants to shut down mission-critical servers every night so you can make a backup? Anybody?

In addition to all the changes going on within the VM’s file systems, which are all reads and writes to that big file, there’s the memory image being written to the swap file of the VM, which again means a bunch of changes being written to that one big file. Trying to “grab” an image of a file that is in an almost constant state of change is what makes backing up VMs so different from backing up files. A file system can hold off committing a change to disk while a backup application locks a file for the second or two it takes to capture it, but you cannot do that when you’re trying to back up a virtual disk that may take several minutes to complete. That virtual machine needs to write multiple changes per second and cannot simply stop while it’s running. And you can’t shut down VMs each night just so you can do backups. If you have multiple machines behind a load balancer, you can take one at a time out of rotation to back things up, but that increases complexity, reduces performance, and weakens fault tolerance.

There’s also the method to consider. Physical device backups require software to be running on the host machine or access to the host’s storage over a network share. Running the software on the host may be okay in smaller environments but is not advisable: you want to preserve as much of the host’s resources as possible for the VMs to use. Agent-based backups use a small software service running on the VM host, while the heavy lifting of the backup application, handling things like compression, encryption, deduplication, and the actual jobs of backing up VMs, is done by a server separate from your VM hosts.

Choosing the right application

There are several applications on the market that can back up virtual machines, but not all have the key features you need to protect your mission-critical systems. Here is a shortlist of must-have features for getting the best performance and protection for your VMs in the most cost-effective way.

  • Live backups
    Make sure your backup solution can do backups while the VM is running. Not shut down, not paused, not taken out of the load balancer rotation. Backups should help protect you, not slow you down. Live backups mean you can back up when you want without dropping capacity or staying up late at night.
  • Continuous Data Backups
    How many changes a day do your VMs process? 10? 1000? 1M? How much work would it take to bring a machine and its config back to “current” if you had to start from a backup taken last night? Continuous Data Backup lets you back up your VMs throughout the day – as frequently as every five minutes if that’s what you need. That’s a pretty good RPO.
  • Deduplication
    Backing up multiple times a day is great for RPO but will need a LOT of storage space to do. Save space by choosing a backup solution that can do deduplication and make sure it can do it inline, not after the fact.
  • Automatic Retention Policies
    Further help keep your storage use in line by choosing a solution that handles retention policies automatically. You don’t have time to go through and dump old backups. Your solution should do that for you.
  • Historical Point-in-Time Options
    There are benefits to keeping some older backups for recovery, analysis, and other just-in-case scenarios. Systems that can take and keep yearly, quarterly, monthly, or weekly backups so you have a historical point to go back to are invaluable. This is frequently called a “grandfather-father-son” rotation scheme; make sure your solution can take care of it for you.
  • Concurrency
    Even with fast backups that can run while the VMs are live, you don’t have time to do backups serially. Unfortunately, some solutions can only do one VM at a time. Make sure the one you choose offers concurrency so you can back up multiple machines all at once.
  • Options options options
    You want options on where your backups go. Disk to disk is okay but can be costly, and if they are all in the same place, a site disaster means you’re toast. Over the wire to an offsite location is much safer, but of course, it requires that you have an offsite location. If you don’t have a second datacenter, make sure your solution supports cost-effective and secure data storage. Azure Blob Storage is a great solution offering secure storage, fault tolerance, and quick recovery, so consider a solution that can use Azure for storing your backups.

Crash Course on VSS (Brief)

We mentioned live backups in the features section. How do you take a backup of a VM that is currently running? Easy. Use a solution that works with VSS. The Volume Shadow Copy Service, or VSS, is where the magic happens. This service, a part of all currently supported versions of the Windows operating system and dating back to Windows XP, enables the operating system and compatible applications to take backups of volumes while other applications or services are actively writing data to them. That’s critical to obtaining consistent and complete backups of large data sets, including databases, complete volumes, and virtual disks. VSS coordinates all the actions needed to create a “shadow copy,” also called a snapshot, of whatever data is being backed up. VSS is useful for more than just backups. You can use it for data mining, disk-to-disk backups, fast data recovery for failed LUNs, and more. In the case of VMs, using VSS to create backups ensures that you get everything that makes up the virtual disk so that you have an application-consistent backup. That way, you have a working VM you can boot if you need to, not just a VHD you can mount to search the file system.

There are four parts to VSS:

  1. The VSS service, which is a part of Windows and arbitrates the communications between the various other components
  2. One or more VSS requestors, which are the backup applications that use VSS. These include the Windows Server Backup utility (Windows Backup), System Center Data Protection Manager, and many third-party backup software solutions.
  3. The VSS writers, which are the components that guarantee that data is complete and consistent when committed to a volume and made available for backup. Microsoft server products like Exchange and SQL Server and, yes, Hyper-V include VSS writers, as do some third-party server applications that need to rely on VSS.
  4. The VSS provider, which is what handles the shadow copy creation and maintenance. This can be software in the operating system or in drivers, or it can be a hardware solution included in a SAN. Windows’ included VSS provider uses copy-on-write to ensure all writes are committed to a shadow copy.

VSS providers can create shadow copies in three different ways.

  1. Complete copy – consider this a clone. Every single block at a given point in time is copied to another volume.
  2. Copy on write – this creates a differential copy of all changes (writes) from a given point in time going forward.
  3. Redirect on write – similar to the copy on write. This redirects all changes to a different volume.

Please see https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/ee923636(v=ws.10) if you would like a deeper dive into VSS, including information on I/O and storage costs as well as the different providers that are available. For the purposes of backing up Hyper-V virtual machines, it’s enough to understand that you want a VSS-compatible backup solution so that you can take exact point-in-time backups as and when you need them.
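Before trusting any VSS-based backup, it is worth confirming that the relevant writers are present and stable. A quick check from an elevated PowerShell prompt; the writer name below is the one Hyper-V registers on current Windows Server versions:

# List VSS writers and show the Hyper-V writer's state
vssadmin list writers | Select-String -Context 0,4 'Microsoft Hyper-V VSS Writer'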

What are Production Checkpoints? (2016 & 2019)

Hyper-V in Windows Server 2016 introduced a new type of checkpoint, called the Production Checkpoint. The original checkpoint is now referred to as a “standard checkpoint”. The Production Checkpoint is a point-in-time image of a VM that leverages the guest operating system’s backup technology to take a complete snapshot of the VHDX at that point in time. Not only is this faster than using saved-state technology, but it’s also fully supported to restore this image for all production workloads, which is a pretty big deal for things like Exchange, SQL, and Active Directory. Standard checkpoints rely upon the host and “grab the state, data, hardware settings, et al” of a running VM but are not supported for production recovery. You can use them for dev and test all you want, but don’t count on them for production if you need to maintain a supported state.
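You can inspect and change a VM’s checkpoint behavior with PowerShell. A short sketch (the VM name is hypothetical):

# Show the current checkpoint type, then require production checkpoints only
Get-VM -Name 'SQL01' | Select-Object Name, CheckpointType
Set-VM -Name 'SQL01' -CheckpointType ProductionOnly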

What is Resilient Change Tracking (aka CBT), and why is it Important?

Changed Block Tracking (CBT), or Resilient Change Tracking as Microsoft calls it, is a native change-tracking mechanism in Hyper-V 2016 and onward. This is a more flexible and better-performing method of capturing changes to a volume than VSS offers and is a feature VMware admins have enjoyed for a few years. RCT’s most important capability is tracking and capturing changes at the block level, which speeds up performance and can save 30% or more on storage requirements when taking backups of VHDXs. That equates to cost savings for storage.

Use case of a non-VM-aware backup tool vs. something like Altaro VM Backup

When it comes down to what solution to use to back up (and restore) your mission-critical virtual machines, you can use any system that can back up the files stored for Hyper-V VMs, including the VHDXs, differencing files, configuration files, and memory dumps. If you store all of those in a single location, you can just back up the directory, and Bob’s your uncle. What’s not to love? Actually, quite a bit. There are several issues to consider. Let’s take a look at a hypothetical organization that runs a mission-critical web application with a three-tier architecture. The front-end web servers, middle-tier application servers, and back-end database servers all work together and all have different backup needs, RPO/RTO targets, and retention requirements. All are running as Windows Server 2016 guests (in this case, but this would work with other supported versions) on various Hyper-V hosts.

  1. Storage requirements – Using Windows Backup or a generic third-party backup application means taking backups of every single machine in all three tiers. Even though they are all running the same operating system and at the same patch level, each VM is a unique set of files. Altaro VM Backup, however, can reduce storage costs by performing inline deduplication. Since the VMs are running and the solution is leveraging the guest operating system, files that are common amongst the VMs, like the operating system, can be deduplicated in the backups, saving tons of space.
  2. Reduced RPO – Backups of files can take a significant amount of time to start, run, and complete, increasing the elapsed time between your most recent backup and when something bad happens. Continuous Data Protection enables you to take a backup as frequently as every five minutes, ensuring a very short RPO. And that policy can easily be set at different frequencies for different VMs. Take a backup of the web servers once a day, the middle-tier servers four times a day, and the database servers every five minutes to strike a balance between RPO and storage costs.
  3. Simplicity – Many backup applications are written for file systems and don’t have as much granularity for the different needs of different sources, making the setup and maintenance of backups much more work. Altaro is purpose-built to back up virtual machines, and the interface is easy and intuitive, letting you get your backups set up quickly.

An Example Backup Job

Still not convinced? Here’s a quick walkthrough of how quick and easy it is to set up a backup of a VM using Altaro, complete with pretty pictures. We’re using our own product as an example here, but many other backup applications targeting virtual machines will work in a somewhat similar manner.

  • If you haven’t already, log onto your Hyper-V host, then download and install Altaro VM Backup from https://www.altaro.com/vm-backup/download.php. You can choose a free version that lets you back up two VMs per host forever or a thirty-day trial that lets you try everything for a month before you buy. It’s a “next, next, enter” type of install. At the end of the install, you will be prompted to launch the management console like this.

Altaro Virtual Machine Backup installation wizard

  • The console will prompt you to either connect to “This Machine” or to a “Remote Machine” since you can manage more than one instance of Altaro VM Backup from the same console. Choose “This Machine” to set up your job. You can also tick the box to log on automatically next time if you wish.
  • You can use the Quick Setup to get a backup job up and running quickly. As you can see in the console, there are three simple steps.

Add a host
  • Click Add Hyper-V/VMware Host to add your host physical server. This will switch to the management console once it enumerates the host.
  • Select the backup location you wish to use to store your backups. You can select local or network storage, as seen below.

Add a backup location

In this example, we will use a Physical Drive, so we click that and then click Next.

  • Here we can see the three physical disks connected to our Hyper-V host, and we have selected the G: drive. Note that backups will be stored at the root of the volume, under a folder named AltaroV7, unless you click “Choose Folder” to pick or create another folder.
    To complete step 2, we just need to select the drive and click Finish.

Add a physical drive location

  • Now we need to select the VM or VMs we wish to back up to this drive and simply drag-and-drop them onto the drive. Make sure you drag the VMs over the actual drive letter, not just into the middle pane.

Backup locations for your virtual machines

You can see in the console pane on the left that there are several options you can configure, including scheduling, retention, and more. For a quick backup, though, we’re just going to click down on Step 3, Take Backup.

Virtual machine backup settings

 

  • You will be prompted to save changes. Do so, then tick the box next to the VMs you wish to back up and click “Take Backup.”

Take backup in Altaro VM Backup

 

  • Please note, if you have VMs that were created on an older version of Hyper-V, they might be running configuration version 5.0. If you get an error that they cannot be backed up, you need to first shut down the VM, then update the configuration (right-click the VM in the Hyper-V console and select upgrade) and then restart the VM. You’ll be able to back it up then.

Altaro VM Backup error notification

 

  • As the backups run, you can see the status by moving your mouse over the progress icon.

Host DC2

  • When complete, you should see this.

Hosts overview

If you click the + sign on the far right, you’ll see some statistics on the backup. Notice the space savings for this particular backup!

Host status

If you need to do a restore, the steps are very similar.

  1. On the right-hand menu, click Restore.

Restore option
  2. Click the source location, and click Next.
  3. Select the VM or VMs you want to restore, and click Next.
  4. Choose the options appropriate for your restore job. Notice you can restore to another host, and the default is to restore with the NIC disconnected. Click Restore, and off it goes!

Altaro VM Backup restore wizard

Other options

There are several other options you can select. You can do granular file-level restores from file servers or item-level restores from Exchange servers.

Schedule test drills in Altaro VM Backup

You can also test and verify your backups, which is probably the single most important thing to do with any backup solution but is missed by so many other systems. The very last thing you ever want to do is find out your backups are not working only when you actually need to do a restore!

One other thing you may want to make sure you look at is the Reports section. It gives you just what you need to check and confirm operations completed successfully, without burying you in details you won’t want to read anyway. It tells you the what, when, and how, and that things were successful, or if something wasn’t, why.

Operation history

You cannot export the reports, but you can see everything you need. If the boss wants a printout, screenshots of the console should suffice.

Setting up scheduled backups and retention policies are both simple drag-and-drop operations. Pick a default schedule or create your own, drag the VMs you want covered onto it, and that’s it. It will take you longer to read this paragraph than to set up a backup and retention schedule!

Final thoughts

If your business relies upon virtual machines, then you need backups that you can rely upon. Altaro VM Backup gives you all the functionality you need to back up and restore VMs, and individual items from VM backups, with the security and flexibility every business needs, without the costs that only enterprises can afford. You owe it to yourself to get the peace of mind that Altaro VM Backup offers by downloading and installing it today.

The post How to Back Up your Hyper-V Virtual Machines appeared first on Altaro DOJO | Backup & DR.

]]>
https://www.altaro.com/backup-dr/back-up-hyper-v-virtual-machines/feed/ 0
How to back up vSphere Host Configuration Settings https://www.altaro.com/backup-dr/back-up-host-configuration/ https://www.altaro.com/backup-dr/back-up-host-configuration/#respond Wed, 28 Oct 2020 05:51:34 +0000 https://www.altaro.com/backup-dr/?p=747 Although the vast majority of VMware administrators probably back up their virtualized environments on a regular basis, backups are often centered around the virtual machines, and not necessarily on the underlying infrastructure. Even so, it’s a good idea to backup (or at least document) your ESXi host configuration. Doing so will allow you to put things back to normal following a configuration error, boot disk failure, or other minor catastrophe. Fortunately, VMware provides a native command line tool that makes the process easy.

The post How to back up vSphere Host Configuration Settings appeared first on Altaro DOJO | Backup & DR.

]]>

Although most VMware administrators probably back up their virtualized environments regularly, backups are often centered on the virtual machines and not necessarily on the underlying infrastructure. Even so, it’s a good idea to back up (or at least document) your ESXi host configuration. Doing so will allow you to put things back to normal following a configuration error, boot disk failure, or another minor catastrophe. Fortunately, VMware provides native tools that make the process easy.

Before I Begin…

The approach that you will have to use varies depending on the version of VMware that you are using. If you are running vSphere 6.x, for example, then you will need to leverage the vSphere CLI. However, this tool has been deprecated, which means that a different method is needed for those who wish to create a backup of a vSphere 7.x environment.

vSphere 6.x

The VMware tool for backing up host configurations is a command-line utility named vicfg-cfgbackup. There are two important things that you need to know about this utility before using it.

First, vicfg-cfgbackup is designed to be run in the VMware CLI environment. You can download the current version of vSphere CLI here.  This tool adds VMware support to the Windows Command Prompt. Hence, the commands that I will be discussing in this blog post should be entered into a command-line environment, not PowerShell.

The second thing that you need to know about the vicfg-cfgbackup tool is that the vicfg-cfgbackup command’s syntax varies slightly depending on whether or not you are running the command on a Windows system. For this blog post, I will be working in a Windows environment. You will therefore see me adding a .PL file extension to the end of the vicfg-cfgbackup command. The .PL extension should not be used in non-Windows environments.

The reason why the .PL extension is required in Windows environments is that the vicfg-cfgbackup tool is based on Perl. You will therefore need to have Perl installed to use the vicfg-cfgbackup tool. It is also worth noting that your Perl deployment will need to have the XML::LibXML module installed. Otherwise, you will get an error stating “Can’t locate XML/LibXML.pm in @INC (you may need to install the XML::LibXML module).”

You can download this library at https://metacpan.org/pod/XML::LibXML

If you happen to be using ActiveState Perl, then you can install the required module by entering the following command:

ppm install XML::LibXML

You can see what the process of installing the required module looks like in Figure 1.

ppm install XML::LibXML

Figure 1

 

This is how you install the XML::LibXML module if you are running ActiveState Perl.

Backing Up the Host Configuration

As previously noted, the process of backing up a vSphere host’s configuration is relatively easy. You can do it with a single command. In a Windows environment, that command is:

vicfg-cfgbackup.pl --server=<the host’s IP address> --username=root -s <output file name>

To give you a more concrete example, I have an ESXi host with an IP address of 147.100.100.224. If I wanted to create a configuration backup named 224CFG and place that backup into my PC’s C:\Data folder, then the command that I would use to do so would be:

vicfg-cfgbackup.pl --server=147.100.100.224 --username=root -s C:\Data\224cfg

As you can see in Figure 2, upon entering this command, you are prompted to enter the root password. Once you do, the firmware configuration is written to the backup file.

Host Configuration Root Password

Figure 2

 

This is how you create a backup file.

As you look at the figure above, there are a couple of things that are worth paying attention to. First, you will notice that I had to execute the command in the C:\Program Files (x86)\VMware\VMware vSphere CLI\bin folder. This is not the default folder that appears when you open the command prompt environment.

Another thing to note is that when you enter the root password, nothing appears on the screen as you are typing. This is normal behavior.

One more thing that I want to mention is that VMware allows you to use any filename that you want when creating a backup. I used 224cfg as a file name because 224 is the last part of the host’s IP address, and the cfg portion of the filename indicates that this is a configuration backup. Such a filename is fine for use in a demo environment such as the one that I am using, but in a production environment, there are two things that I recommend including in your filename.

The first thing is some sort of host identifier. Having a host identifier isn’t mandatory, but it is helpful to know which host a backup was created on. The other thing you should include in the file name is a reference to the host’s build number. Knowing the build number will be helpful if you ever have to restore the backup.

Restore vSphere Host Configuration

The single most important thing that you need to know before attempting to restore a host’s configuration data is that the host’s build number must match the build number saved within the backup file.

Suppose for a moment that you were to create a host configuration backup but don’t really have any immediate need for it. Now suppose that some time passes, and during that time, you perform a couple of software upgrades to your ESXi host. If there were suddenly a need to restore the host configuration, you would find that the restore operation would fail because the configuration backup build number is older than the host’s build number.

In a situation like that, there are two options available to you. The first option is to try to force the restoration despite the version mismatch. The vicfg-cfgbackup command supports the use of an -f flag. This flag causes vicfg-cfgbackup to ignore the version mismatch and restore the configuration anyway. Of course, in a version mismatch situation, forcing a restoration can yield unpredictable results.

The other option is to reinstall ESXi onto the host from scratch (being careful to install the correct version). You would then be able to restore the configuration backup and then update the host to the same build number it had been running before the failure. At that point, it would be an excellent idea to create a new configuration backup so that you can avoid having to work through the process that I just described if another restoration becomes necessary.

This brings up another important point. I very strongly recommend retaining copies of the installation media for all of the VMware software versions that you have installed in the past. You cannot simply assume that VMware will always make old versions of its software available for download.

This, of course, raises the question of how you can compare build numbers. There doesn’t seem to be an easy way to derive the build number from the backup file. That’s why I recommend including it as a part of the file name. You can check the host’s build number by using this command:

esxcfg-info -u

Before you restore the backup, you will need to place the host into maintenance mode. Upon doing so, you can restore your backup by using this command:

vicfg-cfgbackup.pl --server=<the host’s IP address> --username=root -l <the backup’s path and filename>

If, for example, I wanted to restore the backup that I created earlier, I could use this command:

vicfg-cfgbackup.pl --server=147.100.100.224 --username=root -l C:\Data\224cfg

You will notice that this command contains the -l switch. This switch tells vicfg-cfgbackup that you are performing a restoration, not a backup.

vSphere 7

As previously noted, VMware has deprecated the vSphere CLI (https://kb.vmware.com/s/article/78473), and will not be using it in the future. As such, vSphere 7.x uses a completely different (but easier) method for backing up and restoring an ESXi host configuration.

Rather than using the vSphere CLI, the new technique involves using the ESXi command line. In order to do so, you will need to be logged in as the root user.

The first step in the process is to synchronize the configuration with persistent storage. To do so, enter the following command:

vim-cmd hostsvc/firmware/sync_config

Once the synchronization process completes, you can perform the backup. The command used for doing so is:

vim-cmd hostsvc/firmware/backup_config

This command backs up the host configuration by creating a .TGZ bundle file. Upon the command’s completion, the console will display a URL from which you can download the bundle. At that point, you can simply open a Web browser and download the bundle file.

It is worth noting that downloading the bundle file will require you to make a slight modification to the URL that is provided. The URL includes an asterisk that takes the place of your host’s fully qualified domain name (FQDN). You will need to substitute your host’s FQDN or IP address for the asterisk. Suppose, for example, that your host’s IP address was 10.1.1.0. In that case, you would change http://*/downloads… to http://10.1.1.0/downloads…

Restoring the vSphere Host Configuration Backup

The process of restoring the backup is just as easy as that of creating the backup. As was the case for vSphere 6.x however, you must make sure that the VMware host is running the same build as was used when your backup was created.

The first step in restoring a backup is to rename your backup file. Remember, the backup is a .TGZ file. In order for you to be able to restore this file, you will need to change its name to configBundle.tgz. The filename is case sensitive.

The next thing that you will need to do is to place the VMware host server into maintenance mode. You can do so by entering the following command:

vim-cmd hostsvc/maintenance_mode_enter

Now that the host is in maintenance mode, you will need to copy your backup file to the ESXi host. As an alternative, however, you can copy the file to a datastore. To perform the actual restoration, enter the following command:

vim-cmd hostsvc/firmware/restore_config 1 /<backup file location>/configBundle.tgz

The number 1 in the command shown above forces an override of any UUID mismatch that may occur. If you do not wish to override such a mismatch, you can omit the 1. It is also worth noting that the host will be rebooted upon the command’s completion.

Conclusion

VMware makes it really easy to back up and restore a host’s configuration, regardless of which version of VMware you are using. The one thing to remember, however, is to keep your host configuration backups up to date as you update your VMware hosts to newer builds.

 

The post How to back up vSphere Host Configuration Settings appeared first on Altaro DOJO | Backup & DR.

]]>
https://www.altaro.com/backup-dr/back-up-host-configuration/feed/ 0
How to Back Up Hyper-V Host Configuration Settings https://www.altaro.com/backup-dr/backing-up-hyper-v-host-configuration-settings/ https://www.altaro.com/backup-dr/backing-up-hyper-v-host-configuration-settings/#respond Wed, 28 Oct 2020 05:45:59 +0000 https://www.altaro.com/backup-dr/?p=732 Although most Hyper-V admins meticulously back up their virtualized environments, these backups may sometimes focus on the virtual machines themselves, rather than on the underlying Hyper-V infrastructure. Even so, having a backup of an organization’s Hyper-V host servers can be useful for any number of reasons. An organization might for example, wish to undo an unwanted configuration change on a Hyper-V host, or rebuild a host following a boot disk failure.

The post How to Back Up Hyper-V Host Configuration Settings appeared first on Altaro DOJO | Backup & DR.

]]>

Although most Hyper-V admins meticulously back up their virtualized environments, these backups may sometimes focus on the virtual machines themselves rather than on the underlying Hyper-V infrastructure. Even so, having a backup of an organization’s Hyper-V host servers can be useful for any number of reasons. An organization might, for example, wish to undo an unwanted configuration change on a Hyper-V host or rebuild a host following a boot disk failure. This would be one reason why you would be looking to back up Hyper-V itself, or more specifically, its configuration.

Unfortunately, Hyper-V doesn’t include a “click here to back up the host configuration” button, nor does it include a host configuration backup utility like VMware does. Even so, there are several different options for backing up (or at least documenting) a Hyper-V host’s configuration. I will discuss two of the more effective options in this article.

Create an Image Backup

The first option for backing up the host’s configuration is to perform an image backup. An image backup (at least within the context of the technique that I am about to explain) cannot be used to restore individual files and folders. However, this type of backup works really well for reverting a Hyper-V host to a previous state.

Before you can create a system image, you will need to install the Windows Server Backup feature onto your Hyper-V server. You can do this from the Server Manager’s Add Roles and Features Wizard by choosing the Windows Server Backup option from the wizard’s Add Features screen, as shown in Figure 1. Once Windows Server Backup has been installed, you can access it from the Server Manager’s Tools menu.

Add roles and features wizard

Figure 1

You will need to install the Windows Server Backup feature.

From within the Windows Server Backup console, click on the Local Backup container, and then click on the Backup Once link. This will cause Windows to launch the Backup Once Wizard. Click Next to accept the default option on the Backup Options screen. You will then be taken to the Select Backup Configuration screen. Choose the Custom option, and then click Next. You will now be prompted to select the items that you want to back up. Be sure to choose the Bare Metal Recovery option, the System State, the EFI System Partition, and the C: drive, as shown in Figure 2. Since you are not backing up individual virtual machines, there is no need to include volumes containing your VMs. Now work through the remainder of the wizard, and create the backup.
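If you would rather script this than click through the wizard, a roughly equivalent one-time backup can be taken with wbadmin; this sketch assumes E: is your backup target volume (-allCritical covers the bare metal recovery items, and -quiet suppresses prompts):

wbadmin start backup -backupTarget:E: -allCritical -systemState -quiet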

Select items for backup

Figure 2

Select the items that you wish to include in the backup.

If you need to restore the host, you can do so by booting from the Windows Server installation media. Upon doing so, select your language, and then click the Repair Your Computer link found on the following screen, as shown in Figure 3.

Windows Server 2016 installation screen

Figure 3

Click the Repair Your Computer link.

Now, click the Troubleshoot icon displayed on the following screen, and then click the System Image Recovery icon, shown in Figure 4. Then follow the prompts to restore a system image, as shown in Figure 5.

Windows Server advanced installation options

Figure 4

Click on the System Image Recovery option.

System image backup recovery

Figure 5

Select the system image that you want to restore.

It probably goes without saying, but Windows Server Backup is a lightweight backup utility that will work in a pinch; it should not be used for your day-to-day backups. Look to a third party for that.

Using PowerShell

The previously described method uses backup and restore to reimage the Hyper-V server’s boot drive. While this technique will certainly return the Hyper-V server to a previous configuration, a full-blown backup and restore is going to be overkill unless the server is experiencing significant problems.

A better solution may be to use PowerShell to retrieve the various Hyper-V host configuration settings and write them to a file that you can use as a configuration baseline. If there were ever to be a problem with a Hyper-V host’s configuration, you could use the exact same cmdlets (which can be scripted) to review the server’s current configuration. These configuration items can then be compared to the baseline configuration, thereby allowing you to reconfigure the server to use known good settings.

Unfortunately, there isn’t one PowerShell cmdlet that retrieves every possible configuration item. You will have to use a variety of cmdlets. Before I show you some of these cmdlets, I need to take a moment and talk about file redirection.

The output from a PowerShell cmdlet can be redirected to a file. To create a text file, you would want to append the following line of code to your first PowerShell statement:

| Out-File -Encoding Ascii <filename>

For example, if you were trying to write a list of your server’s VM’s to a text file, you could use a command that looks something like this:

Get-VM | Out-File -Encoding ASCII VMs.txt

If you look at Figure 6, you can see what this command looks like in action. In the screen capture, I have used the Get-Content cmdlet to display the contents of the text file.

Get-Content cmdlet

Figure 6

I have written a list of VMs to a text file.

For subsequent commands that output Hyper-V configuration information, you will need to append that information to the file created by your first command. Otherwise, the file will be overwritten. The easiest way to accomplish this is to add the -append parameter to the Out-File cmdlet. If, for some reason, I wanted to append another list of VMs to my existing text file, for example, I could do so by using this command:

Get-VM | Out-File -Encoding ASCII -append VMs.txt

You can see how this works in Figure 7.

Out-File cmdlet

Figure 7

This is how you append data to an existing text file.

So, what types of data should you focus on exporting? Although not a comprehensive list, it is a good idea to make sure that you focus on networking, storage, and general Hyper-V configuration.

Storage

When it comes to exporting your Hyper-V host’s storage configuration, you will have to consider the type of storage that your server is using. You would use a different technique to export information about direct-attached storage than you would use to export information about remote storage, for example. A few of the more basic cmdlets that might come in handy when documenting your server’s storage configuration include:

Get-Disk

Get-Partition

Get-Volume

You can see examples of these cmdlets in Figure 8.

Figure 8

These are a few cmdlets that you can use to retrieve storage configuration information.

Networking

PowerShell makes it possible to export a plethora of network configuration data. At the very least, though, you should probably export data related to your host’s physical network configuration and to the Hyper-V virtual switch. Here are some commands that you may find useful:

Get-WMIObject -Class Win32_NetworkAdapterConfiguration -Filter IPEnabled=True

Get-VMSwitch

As you can see in Figure 9, the first command shows how the physical network adapters are configured, while the second command displays the Hyper-V virtual switch configuration.

Figure 9

You can use PowerShell to retrieve the physical network adapter configuration and information about your Hyper-V virtual switches.

Hyper-V Host Information

When it comes to retrieving Hyper-V host information, the main command that you should use is Get-VMHost. Keep in mind that this command shows very little information by default, so you should use the Select-Object cmdlet to show all of the available attributes, as shown in Figure 10.

Figure 10

PowerShell displays the Hyper-V host’s configuration.

One thing to keep in mind is that the Get-VMHost cmdlet will not necessarily tell you everything you need to know about your Hyper-V host. You may need to run supplementary commands depending on how the host is set up. For example, if the host belongs to a cluster, you can use the Get-VMHostCluster and the Get-Cluster cmdlets to retrieve cluster-specific data.
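Pulling those pieces together, a minimal baseline-export sketch might look like the following. The output path is an assumption, and you should extend the list of cmdlets to match your own environment:

# Write a Hyper-V host configuration baseline to a single text file
$baseline = 'C:\Baselines\hv-host-baseline.txt'
New-Item -ItemType Directory -Force -Path (Split-Path $baseline) | Out-Null
Get-VMHost | Select-Object * | Out-File -Encoding ASCII $baseline
Get-VMSwitch | Out-File -Encoding ASCII -Append $baseline
Get-WMIObject -Class Win32_NetworkAdapterConfiguration -Filter IPEnabled=True | Out-File -Encoding ASCII -Append $baseline
Get-Disk | Out-File -Encoding ASCII -Append $baseline
Get-Partition | Out-File -Encoding ASCII -Append $baseline
Get-Volume | Out-File -Encoding ASCII -Append $baseline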

Conclusion

As you can see, there isn’t a seamless and straightforward way to back up Hyper-V’s host configuration. With a little imagination, though, you can protect your configuration by either creating an image backup or by exporting configuration information through PowerShell.

The post How to Back Up Hyper-V Host Configuration Settings appeared first on Altaro DOJO | Backup & DR.

]]>
https://www.altaro.com/backup-dr/backing-up-hyper-v-host-configuration-settings/feed/ 0
Backup and Recovery Tricks using Windows Task Scheduler https://www.altaro.com/backup-dr/backup-recovery-windows-task-scheduler/ https://www.altaro.com/backup-dr/backup-recovery-windows-task-scheduler/#respond Wed, 28 Oct 2020 05:20:55 +0000 https://www.altaro.com/backup-dr/?p=715 Task Scheduler is a built-in Windows utility which does not often get the respect it deserves as a powerful automation engine. Although it only has a few features, Task Scheduler’s ability to run any script at a scheduled time or because of an event is very useful.

The post Backup and Recovery Tricks using Windows Task Scheduler appeared first on Altaro DOJO | Backup & DR.

]]>

Task Scheduler is a built-in Windows utility that does not often get the respect it deserves as a powerful automation tool.  Although it only has a few features, Task Scheduler’s ability to run any script at a scheduled time or be triggered via an event is very useful!

This blog will show you how you can utilize Task Scheduler to make your backups and recoveries more successful, even if you use a basic utility like Windows Server Backup. In fact, Windows Server Backup, along with other leading backup providers like Altaro, uses the Task Scheduler engine by calling the TaskSchd.h APIs.

Specifically, we will cover how to use Task Scheduler to automate your backup, recovery, and administrative tasks before you even need to start the backup itself.

Task Scheduler for Backups Overview

Let’s review a few relevant features of Task Scheduler, which can be used to improve your backup and recovery procedures:

  • Trigger – This defines an action or time which causes a task to begin. There are two trigger types we can use, including:
    • On a schedule – This will let you run a task at a specific time daily, weekly or monthly, such as creating a daily backup. You can also repeat the task multiple times a day in increments of 5 minutes, 10 minutes, 15 minutes, 30 minutes, or once an hour.
    • On an event – This trigger will let you run a task if a specific event is detected in an event log, such as a failed backup attempt.
Task Scheduler trigger configuration

Figure 1: Task Scheduler is configured to trigger an action if Event 4 (Backup Successful) is written to the event log.

This task will then monitor the event log, and if a matching Event ID is discovered, it will trigger an action.

Backup event viewer

Figure 2: Event Viewer detects a backup event and causes Task Scheduler to trigger a task.

  • Action – This defines what happens once the task is triggered.
    • Start a program – this is almost always used when a task is triggered. It will either start a program or run a script, so you can use this to launch your backup provider once your prechecks have passed.  Since you can run any program with a command-line interface, it gives you the flexibility to use traditional scripting languages or PowerShell.  You can also send files or pass variables into these scripts.
New action configuration in Task Scheduler

Figure 3: Task Scheduler is triggered to run PowerShell and then provide the location of a script

    • Send an email – This feature will connect to an SMTP mail server and send an email to an administrator. Microsoft has deprecated support for this feature, which means that it should work, but Microsoft will not update it and no longer provides technical support for it.

For additional information about Task Scheduler, visit Microsoft’s documentation About the Task Scheduler.
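One wrinkle worth knowing: the ScheduledTasks PowerShell module’s New-ScheduledTaskTrigger cmdlet only creates time- and logon-based triggers, so a practical way to script an event trigger like the one shown in Figures 1 and 2 is the schtasks utility. The task name, script path, and event query below are examples only:

schtasks /Create /TN "OnBackupSuccess" /SC ONEVENT /EC Microsoft-Windows-Backup /MO "*[System[(EventID=4)]]" /TR "powershell.exe -NoProfile -File C:\Scripts\after-backup.ps1" /RU SYSTEM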

Running a PowerShell Script with Task Scheduler

Since Windows Server uses PowerShell as its primary scripting language, this section offers guidance for running PowerShell from Task Scheduler.  If you are integrating your script with any third-party API, such as your backup provider, you should ensure that they have an API or PowerShell cmdlets that you can use to automate your backup tasks.  Altaro offers both an API and PowerShell, making it easy to perform various automated tasks, such as applying a configuration template, taking an offsite copy, and restoring a VM.

For any task which requires you to run a PowerShell script, you have to be able to launch PowerShell and send it the file path of the script to run.  For Windows or Windows Server, you will usually use the following parameters from the Start a program option:

  • Program/script: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
  • Add arguments:
    • A command, such as -Command "Start-WBBackup -Policy $Policy -Async"
    • A script file, such as -File C:\Users\Admin\Scripts\Backup\backup.ps1
  • Start in: The directory in which to start the command, such as C:\Users\Admin\Scripts\Backup

If you are using Windows Server Backup with PowerShell, you can find complete documentation here.
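Putting the pieces together, here is a hedged sketch that registers a daily 11 PM task to run a backup script as SYSTEM. The task name, time, and script path are assumptions:

# Register a nightly task that launches a PowerShell backup script
$action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-NoProfile -File C:\Users\Admin\Scripts\Backup\backup.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 11pm
Register-ScheduledTask -TaskName 'NightlyVMBackup' -Action $action -Trigger $trigger -User 'SYSTEM' -RunLevel Highest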

Creating a Backup with Task Scheduler

The first collection of tasks you should consider are related to creating a backup.  While every backup provider offers this fundamental feature, they may not allow you to automate pre-backup testing and post-backup verification tasks.  Remember that for a backup to be completed successfully, dozens of sequential actions need to happen correctly before and after the backup.  You can use the following checklist against your system, but be sure to consider any other custom steps in your standard backup workflow.  The order suggests which actions can be run concurrently (1a, 1b, etc.) and which should be consecutive (4, 5, 6, etc.), but you should reorder to whatever makes sense for your infrastructure. The backup process can be triggered by either a manual request, an event, a previous task, or a scheduled time.

| Order | Component | Type | Trigger | Action |
| --- | --- | --- | --- | --- |
| 1a | Storage | Backup storage health check | Backup started | Test storage is online & healthy; test disk has available space |
| 1b | Virtual Machine | VM health check | Backup started | Test VM is online & healthy; test backup provider has access to guest OS or VHD |
| 2a | Application | Application health check | VM health check completed | Test application is online & healthy |
| 2b | VSS Writer | VSS health check | VM health check completed | Test VSS writer is online & healthy |
| 2c | Backup Software | Backup software health check | VM health check completed | Test VSS provider is online & healthy |
| 3a | Network | Network availability check | All health checks complete | Test backup network is online; test backup network has bandwidth |
| 4 | Network | Optimize network access | Network availability check complete | Prioritize traffic in your virtual and physical networks for the backup network and its traffic using QoS and network prioritization |
| 5 | Backup Software | Start Backup | Optimize network access complete | Start the backup; the provider should take care of application-level tasks such as quiescing the traffic |
| 6 | Backup Software | Monitor Backup | Error during backup | Monitor the backup provider’s event log; send a message to the admin |
| 7 | Backup Software | Verify Backup | Backup complete | Send a message to the admin |
| 8 | Network | Deoptimize network access | Verify backup complete | Deprioritize traffic in your virtual and physical networks for the backup network and its traffic using QoS and network prioritization |
| 9 | Compliance | Document Backup | Verify backup complete | Document that the backup completed through your compliance procedure |
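As a concrete starting point, here is a minimal sketch of the row-1a storage health check that hands off to Windows Server Backup once the prechecks pass. The drive letter, free-space threshold, and use of Start-WBBackup are assumptions; substitute your own provider’s cmdlets as needed:

# Abort the backup early if the backup volume is missing, unhealthy, or nearly full
$minFreeGB = 100    # hypothetical threshold
$vol = Get-Volume -DriveLetter E -ErrorAction SilentlyContinue
if (-not $vol -or $vol.HealthStatus -ne 'Healthy') { throw 'Backup volume is missing or unhealthy.' }
if (($vol.SizeRemaining / 1GB) -lt $minFreeGB) { throw "Less than $minFreeGB GB free on the backup volume." }
# Prechecks passed; start the backup (step 5 in the table above)
Start-WBBackup -Policy (Get-WBPolicy) -Async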

Managing your Backups with Task Scheduler

Task Scheduler can help you with regular backup management and maintenance tasks.  Most enterprise backup providers will provide similar functionality, such as Altaro’s VM Backup, which allows you to replicate your backups to Microsoft Azure.

| Component | Type | Trigger | Action |
|---|---|---|---|
| Storage | Available space for backups | At regular intervals and before each backup | Verify there is enough free disk space; send an alert/email to an admin or automatically procure additional storage |
| Storage | Replication | On a schedule or after a successful backup | Replicate the backup file to a secondary or offsite storage location |
| Compliance | Backup retention | When a backup is no longer needed or must be deleted after a set time | Delete the backup |
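To illustrate the replication and retention rows, here is a minimal sketch.  The source path, offsite share, and 30-day retention window are assumptions, not recommendations:

  # Replicate backup files to a secondary location (paths are examples)
  $source  = 'E:\Backups'
  $offsite = '\\OffsiteServer\Backups'
  Copy-Item -Path (Join-Path $source '*') -Destination $offsite -Recurse -Force

  # Retention: delete backup files older than 30 days
  $cutoff = (Get-Date).AddDays(-30)
  Get-ChildItem -Path $source -Recurse -File |
      Where-Object { $_.LastWriteTime -lt $cutoff } |
      Remove-Item -WhatIf   # drop -WhatIf once you have verified the selection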

Triggering a Recovery from Backup with Task Scheduler

Whenever there is an outage, you will want to verify which services were affected and whether any data loss occurred.  While these tests are running, you may also want to initiate parts of the recovery process to minimize the total service outage.  It is important to automate both detection and recovery, because any manual step will slow down the process. You can find more best practices in Altaro's blog post on how to Recover and Restore your Backups Faster after a Disaster.  The following order and steps may vary based on your infrastructure and recovery solution.
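Note that the built-in New-ScheduledTaskTrigger cmdlet does not expose event-based triggers, so one way to launch the failure-detection step (step 1 in the table below) from an event log entry is the classic schtasks.exe utility.  In this sketch the task name, script path, and event query (cluster event ID 1069, a resource failure) are all assumptions:

  # Register a task that fires when a cluster resource failure appears in the System log
  schtasks.exe /Create /TN 'Detect-Cluster-Failure' `
      /TR 'powershell.exe -NoProfile -File C:\Users\Admin\Scripts\Recovery\verify-failure.ps1' `
      /SC ONEVENT /EC System `
      /MO "*[System[Provider[@Name='Microsoft-Windows-FailoverClustering'] and EventID=1069]]"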

| Order | Component | Type | Trigger | Action |
|---|---|---|---|---|
| 1 | Monitoring Software (SCOM or Event Log) | Failure detection | Failure event detected | Verify failure; send alert/email to an admin; verify data loss |
| 2a | Monitoring Software (SCOM or Event Log) | Failure verification | Failure event verified | Send alert/email to an admin |
| 2b | Monitoring Software (SCOM or Event Log) | Failure verification | False failure detected | Cancel the recovery process |
| 3 | Backup Software | Verify data loss | Failure verification complete | Determine whether the data loss is acceptable or whether to use the last good backup |
| 4 | Backup Software | Find the best backup | Data loss determined | Find the best available backup |
| 5 | Storage | Prepare storage | Backup ready for recovery | Determine the fastest media; prepare server or VM storage for the recovery file |
| 6 | Network | Network availability check | All health checks complete | Test backup network is online; test backup network has bandwidth |
| 7 | Network | Optimize network access | Network availability check complete | Prioritize traffic in your virtual and physical networks for the backup network using QoS and network prioritization |
| 8 | Backup Software | Recover backup | Storage ready for recovery | Restore backup to server or VM |
| 9 | Virtual Machine | Create VM | Recovery storage ready | Attach disk for the recovery file; provide VM specification, increase the startup memory, increase the VM priority; create VM; start VM |
| 10 | Application | Start application | VM ready | Start the restored application |
| 11 | Backup Software | Verify recovery is successful | Application ready | Send alert/email to an admin |
| 12a | Network Prioritization | Deoptimize network access | Verify recovery is successful | Deprioritize traffic in your virtual and physical networks for the backup network using QoS and network prioritization |
| 12b | Compliance | Post-Recovery Triage | Verify recovery is successful | Collect all event logs and other data required for failure analysis |
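As an illustration of steps 4 and 8, the following sketch uses the Windows Server Backup cmdlets to pick the most recent backup set and restore a file from it.  The file paths are placeholders, and a third-party provider's cmdlets will differ:

  # Find the best (here: most recent) available backup set
  Import-Module WindowsServerBackup
  $latest = Get-WBBackupSet | Sort-Object -Property BackupTime | Select-Object -Last 1

  # Restore a file from that set to an alternate location (paths are examples)
  Start-WBFileRecovery -BackupSet $latest -SourcePath 'D:\Data\orders.db' `
      -TargetPath 'D:\Restored' -Option CreateCopyIfExists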

If, at any time during this process, you need to terminate a running task, such as after detecting a false failure alert, you can use the Stop-ScheduledTask cmdlet to stop any Task Scheduler task.  Keep in mind that this terminates the task only if it is still managed by Task Scheduler.  If the task has already triggered some other action, such as running a PowerShell script, then that other process must also be terminated separately.
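For example (the task path and name are hypothetical):

  # Cancel the running recovery task; any process it already spawned keeps running
  Stop-ScheduledTask -TaskPath '\Backup\' -TaskName 'Recover-VM'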

Task Scheduler offers simple yet powerful tools for managing backups and recovery.  Whether you depend on Task Scheduler or a third-party backup solution to manage your critical data, make sure that your backup infrastructure is also highly available and resilient.  Windows Server Task Scheduler can even be deployed on a Failover Cluster so that its tasks fail over between nodes and will always run.  Check out the documentation on how to use PowerShell to manage Clustered Scheduled Tasks.  With these Task Scheduler tips and tricks, you will be able to add more resiliency to your backup and recovery workflow.

The post Backup and Recovery Tricks using Windows Task Scheduler appeared first on Altaro DOJO | Backup & DR.

Configuring Network Prioritization for Cluster Backup and Recovery https://www.altaro.com/backup-dr/configuring-network-prioritization-cluster-backups-recovery/ Wed, 28 Oct 2020 05:02:52 +0000

One of the most underutilized features in Windows Server Failover Clustering is the ability to configure and prioritize the network traffic of Hyper-V virtual machines (VMs) and their applications.  This is mainly because the feature is hidden from the Failover Cluster Manager GUI and must be accessed using PowerShell.  However, many admins consider it critical for optimizing cluster traffic and building resiliency into their infrastructure.  This blog post shares the best practices I learned while I was an engineer on Microsoft's Clustering Team, focusing on how to optimize your backup and recovery traffic when you really need it.

Remember that your organization is using a failover cluster to provide high availability for your important VMs and applications.  A key tenet of clustering is that all the hardware must be redundant to avoid any single point of failure.  This means that in addition to having multiple servers (hosts) and shared redundant storage, there are multiple networks connecting each host to the rest of the datacenter.  If any cluster network is overused, traffic on that network may be blocked or delayed, which can have adverse effects on the system, such as triggering a false failover.  By using network prioritization to define the importance of each type of traffic, you can eliminate bottlenecks, maximize traffic flow, and ensure that critical data and applications receive the bandwidth and resources they need.

In an ideal world, your cluster should have at least four networks, each with a primary role.  If any of these networks fails, its traffic will be rerouted through a different network.  If you have fewer than four networks, you can combine roles; however, you generally want to separate and prioritize the following types of traffic:

  1. Cluster and Cluster Shared Volumes (CSV) – The cluster network provides the critical communication path the nodes need to interface with each other, perform health checks, and route traffic. This network should be assigned the highest priority, because if it is unavailable or nodes cannot communicate with each other, the cluster could pause, trigger a live migration, or fail over, any of which can incur downtime.
  2. Storage – If you are using Ethernet-based connections to access your shared storage, such as SMB, iSCSI, or Fibre Channel over Ethernet (FCoE), you want to separate this network and make it your second-highest priority. Since the performance of your VMs or applications may be limited by their ability to access storage, this network should generally be optimized and dedicated to only this type of traffic.
  3. Public and Applications – This network should be used exclusively for connecting the clustered workloads or VMs to their applications or users. If this network is accessible from the public Internet, it may be subject to denial-of-service (DoS) attacks, which flood the network and block other traffic from getting through.  For this reason, you should always separate this network so that any interference will not trigger changes within your cluster.
  4. Management & Backup – The lowest-priority network is usually the management network, which supports high-throughput yet infrequent traffic. This network is used for live migration traffic to copy large amounts of memory between hosts; backup and recovery traffic to copy large amounts of critical data; patching traffic to distribute new updates to the hosts; and deployment traffic to copy OS image files, virtual hard disks (VHDs), or snapshots to VMs.  It is usually recommended as the lowest priority since it is only used occasionally for specific tasks, and it is a good candidate to serve as the fallback network for all the others.

Once you have made your determinations based on your hardware, you are ready to assign a priority to each network. Follow the guidance in Altaro's blog post on Hyper-V Network Prioritization and Binding Order.  At a high level, prioritization happens by assigning a value ("Metric") to each network, with the lowest value being the highest priority.  For example, you may have assigned your four networks as follows (a PowerShell sketch for setting these values appears after the list):

  • Cluster Network = 1000
  • Storage Network = 2000
  • Public Network = 3000
  • Management & Backup Network = 4000
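Here is a minimal sketch of assigning those metrics with the FailoverClusters module; the network names are placeholders for whatever Get-ClusterNetwork reports in your own cluster:

  # Assign explicit metrics; setting Metric also disables automatic metric assignment
  Import-Module FailoverClusters
  (Get-ClusterNetwork -Name 'Cluster Network').Metric             = 1000
  (Get-ClusterNetwork -Name 'Storage Network').Metric             = 2000
  (Get-ClusterNetwork -Name 'Public Network').Metric              = 3000
  (Get-ClusterNetwork -Name 'Management & Backup Network').Metric = 4000

  # Verify the resulting priority order
  Get-ClusterNetwork | Sort-Object Metric | Format-Table Name, Metric, AutoMetric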

Based on these recommendations, the backup network is the lowest-priority network because its traffic is irregular, but does that really make sense for such a critical function?  Ensuring that every backup completes is important to prevent data loss and to keep your recovery point objective (RPO) low.  Ensuring that you can restore a backup quickly, which drives your recovery time objective (RTO), is even more important for the business, but when that traffic uses the lowest-priority network, it can be challenging to force it through.  This should be especially worrying after a catastrophic failure that causes the cluster to restart, because the cluster will prioritize its own cluster and storage traffic.

The easiest solution is to have a separate network entirely dedicated to backups and to assign it as the second-highest-priority network.  If you have a fifth network, that is easy:

  • Cluster Network = 1000
  • [NEW] Backup/Recovery Network = 1500
  • Storage Network = 2000
  • Public Network = 3000
  • Management Network = 4000

Or, if you are not using a network for Ethernet-based storage, you can repurpose that priority slot, so the configuration would look like:

  • Cluster Network = 1000
  • [UPDATED] Backup/Recovery Network = 2000
  • Public Network = 3000
  • Management Network = 4000

However, most Hyper-V hosts have only four network interfaces and use Ethernet-based storage.  In this case, you can dynamically change the prioritization of the networks using a PowerShell script.  This is easy to do when you are taking a regularly scheduled backup, whether hourly, daily, or weekly.  A few minutes before the backup task begins, run a script to switch your backup network to the second-highest priority.  Then, several minutes after the backup has successfully completed, restore the original priority order with a second script, as shown in the table and the sketch that follow.

| Before Backup / Recovery | During Backup / Recovery | After Backup / Recovery |
|---|---|---|
| Cluster Network = 1000 | Cluster Network = 1000 | Cluster Network = 1000 |
| Storage Network = 2000 | [UPDATED] Management & Backup Network = 1500 | Storage Network = 2000 |
| Public Network = 3000 | Storage Network = 2000 | Public Network = 3000 |
| Management & Backup Network = 4000 | Public Network = 3000 | [UPDATED] Management & Backup Network = 4000 |
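A minimal sketch of the two scripts, again assuming the FailoverClusters module and a hypothetical network name:

  # Before the backup window: promote the backup network to second-highest priority
  Import-Module FailoverClusters
  (Get-ClusterNetwork -Name 'Management & Backup Network').Metric = 1500

  # ...the backup runs here...

  # After the backup completes: demote it back to the lowest priority
  (Get-ClusterNetwork -Name 'Management & Backup Network').Metric = 4000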

If you have detected that you need to restore a backup to a cluster, then you will follow a similar workflow of dynamically adjusting the prioritization.  This is a little more challenging as it will not automatically be scheduled and must be triggered only when a recovery is actually needed.  Whether this task is run automatically (recommended) or requires manual intervention, make sure that this script has been written and tested in advance.  Testing is critical to ensure regular backups and easy recovery, so if dynamic network adjustment is part of your backup plan, make sure that you are checking the states and priorities of your cluster networks before, during, and after this process.

Even if you follow the recommendation to set your management network as your lowest priority due to the irregularity of its traffic, you now know how to quickly adjust the priority when needed.  This will give you the best chance to optimize your backups and restore them as quickly as possible when a disaster strikes.

The post Configuring Network Prioritization for Cluster Backup and Recovery appeared first on Altaro DOJO | Backup & DR.
