Disaster Recovery Plan

Louise Cato

Below is an outline of the Disaster Recovery Process followed by Delib. This process is in place 24/7, 365 days of the year.

Here's a summary of the stages, and the target timescales. Each stage is detailed later in this article.

Stage | When?
1. Detection & definition | When notified by our monitoring systems or a customer
2. On-call team alerted | As soon as possible after detection of a critical issue
3. Initial investigation & assessment | Within 1.5 hours of detection
4. Customer notification | Within 2 hours of detection, as specified in our Service Level Agreement
5. Resolution | Depends on the complexity of the problem; our target resolution times for both product and infrastructure issues are detailed in our Service Level Agreement
6. Report & document | Within 1 working day of resolution
7. Review & retrospective | Within 3 working days of resolution
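
As an illustration only, the fixed timescales in the summary above can be turned into concrete deadlines. The sketch below is not Delib's tooling: the function and constant names are hypothetical, and it assumes a working day means Monday to Friday with public holidays ignored. Resolution targets are left out because they are defined in the Service Level Agreement rather than in this article.

```python
from datetime import datetime, timedelta

# Targets taken from the summary of stages; the names are illustrative only.
INVESTIGATION_TARGET = timedelta(hours=1, minutes=30)
NOTIFICATION_TARGET = timedelta(hours=2)


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance by whole working days, skipping Saturdays and Sundays.

    Assumption: a working day is Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def stage_deadlines(detected_at: datetime, resolved_at: datetime) -> dict[str, datetime]:
    """Return the latest target time for each stage that has a fixed timescale."""
    return {
        "initial_investigation": detected_at + INVESTIGATION_TARGET,
        "customer_notification": detected_at + NOTIFICATION_TARGET,
        "report": add_working_days(resolved_at, 1),
        "retrospective": add_working_days(resolved_at, 3),
    }
```

For an issue detected on a Friday morning and resolved the same afternoon, this sketch places the report deadline on the following Monday and the retrospective deadline on the following Wednesday.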

 

1. Detection & definition


Detection: we'll be made aware of a critical issue by one of the following methods:

(a) Automated alert from our monitoring systems

(b) Non-automated internal detection (e.g. system admin checks, investigations arising from third-party security announcements)

(c) Non-automated customer or end-user report (including Zendesk support requests)


Definition: an issue is treated as critical if the answer to any of the following questions is yes:

1. Has a customer site been unavailable to the general public for more than 5 minutes?

2. Is there a reproducible issue which prevents a user from entering or submitting data?

3. Is there a reproducible issue which causes unavoidable or unexpected data loss?

4. Are spambots successfully posting content which is visible to the public?

5. Is there a bug or security vulnerability that constitutes a realistic threat to privacy? 
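
The five questions above can be read as a simple checklist. The following is a minimal sketch of that reading, assuming that a "yes" to any single question is enough to make an issue critical; the class and field names are hypothetical and do not describe any real Delib system.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    # Hypothetical field names; each one maps to a question in the definition.
    minutes_unavailable_to_public: float = 0.0    # question 1
    blocks_data_entry_or_submission: bool = False  # question 2 (reproducible)
    causes_data_loss: bool = False                 # question 3 (reproducible)
    spam_visible_to_public: bool = False           # question 4
    realistic_privacy_threat: bool = False         # question 5


def is_critical(report: IssueReport) -> bool:
    """Assumption: a single 'yes' to any question makes the issue critical."""
    return any([
        report.minutes_unavailable_to_public > 5,
        report.blocks_data_entry_or_submission,
        report.causes_data_loss,
        report.spam_visible_to_public,
        report.realistic_privacy_threat,
    ])
```

For example, `is_critical(IssueReport(spam_visible_to_public=True))` evaluates to `True`, while a site that was unavailable for only three minutes and has no other problems would not be classed as critical.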

2. On-call team alerted

1. If a critical error has been picked up by one of our monitoring systems, the team will be alerted by email and text message. Unavailability lasting five minutes or longer is reported.

2. The on-call team will include at least one technical team member and one account manager or other customer-facing team member.

When? As soon as possible after detection of critical issue
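
The article does not describe how the monitoring systems work internally, so the following is only a hedged sketch of one way the "five minutes or longer" rule could be checked. It assumes a hypothetical list of periodic availability probes, each recorded as a time and whether the site responded.

```python
from datetime import datetime, timedelta

# Threshold taken from the text: unavailability lasting five minutes or longer.
ALERT_THRESHOLD = timedelta(minutes=5)


def should_alert(checks: list[tuple[datetime, bool]]) -> bool:
    """Return True if any continuous run of failed probes spans the threshold.

    `checks` is a chronological list of (checked_at, site_was_up) results.
    The probe format is an assumption made for this example.
    """
    down_since = None
    for checked_at, site_was_up in checks:
        if site_was_up:
            down_since = None  # the site recovered, so reset the outage timer
            continue
        if down_since is None:
            down_since = checked_at  # first failed probe of this outage
        if checked_at - down_since >= ALERT_THRESHOLD:
            return True
    return False
```

In this sketch a brief outage that recovers before five minutes have passed resets the timer, so only sustained unavailability would trigger the email and text message alert.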

 

3. Initial investigation and assessment

The technical lead aims to establish the cause of the server issues, assess the severity and likely duration of the service interruption, and communicate this to the account manager.

Ideally, this will include: 

1. Identification of the root cause

2. An assessment of the severity and scale of the problem, including which customers are affected

3. An estimated time to resolution

When? Within 1.5 hours of detection

4. Customer notification

The account manager will contact affected customer(s) to inform them of the service interruption, and that Delib are actively investigating the problem.

This communication will most likely be by email, but the method depends on the severity of the incident. Any wider-reaching issues may be posted as a homepage announcement on delib.zendesk.com in the first instance, ahead of any direct communication. The Delib Service Status Twitter account may also be used.

When? Our Service Level Agreement gives a maximum initial response time for critical errors of 2 hours

5. Resolution

Once the technical on-call lead has assessed the problem, they will report back to the on-call account manager as follows:

(1) If the problem can be easily solved, it will be fixed. The technical lead will report back to the account manager, and document the problem and solution.

or

(2) If the issue is more complex, a resolution plan is put in place to address the service outage. This may require more technical team members to be contacted, or the investigation to be continued in office hours. An interim report summarising the expected cause and the steps to resolution will be provided to the account manager.

In both cases, any information we have will be communicated to affected customers by the on-call account manager. The account manager will continue to keep all affected customers updated with progress until we reach a resolution to the issue.

When? This depends on the complexity of the problem, but our target resolution times are set out in our Service Level Agreement.

6. Reporting, documentation and tidying up

Once the problem has been resolved, the account manager will provide a written report for affected customers and for Delib's own reference. This will include:

  • How the problem was detected
  • The scope of the problem and how it may have affected end-user interaction
  • The root cause
  • Steps to resolution, including any measures put in place to mitigate the risk of repeat occurrence
  • Total downtime
  • Any service credit or other compensation offered by Delib, should the error have caused us to miss our Service Level Agreement targets

When? All of this should happen within 1 working day of the resolution of the issue.
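
To make the required contents of the report concrete, here is a minimal sketch of a record holding the six items listed above. The class, its field names and the plain-text layout are all assumptions made for illustration; the article does not prescribe a report format.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class IncidentReport:
    # Hypothetical field names, one per item the written report should include.
    detection_method: str
    scope_and_user_impact: str
    root_cause: str
    resolution_steps: list[str]
    total_downtime: timedelta
    compensation: Optional[str] = None  # only relevant if SLA targets were missed

    def to_text(self) -> str:
        """Render the report as plain text, one labelled entry per item."""
        steps = "\n".join(f"  - {step}" for step in self.resolution_steps)
        compensation = self.compensation or "Not applicable"
        return "\n".join([
            f"How the problem was detected: {self.detection_method}",
            f"Scope and effect on end users: {self.scope_and_user_impact}",
            f"Root cause: {self.root_cause}",
            "Steps to resolution:",
            steps,
            f"Total downtime: {self.total_downtime}",
            f"Service credit or compensation: {compensation}",
        ])
```

Keeping the items as required fields means a report cannot be created with, say, the root cause missing, which mirrors the checklist nature of the list.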

7. Review and retrospective

Once the error has been resolved, Delib will hold a retrospective to identify any long-term countermeasures that can be put in place to prevent a recurrence of the issue.

This disaster recovery process is also reviewed to identify any improvements that can be made.

When? Within 3 working days of resolution

 

Other information:

Would we ever take sites offline?

Any decision to take sites offline will be an informed one made by Delib's Managing Director, who will be given a full brief of the situation by the on-call team. We will ask ourselves some specific questions to determine whether this may be necessary:

  • If the site stays online, could users submit data that gets lost without them knowing?
  • If the site stays online, could any existing data loss or corruption be made worse?
  • If the site stays online, is there a possibility of the loss or exposure of any personal information?
  • Conversely, if the site is taken offline, could any existing data loss be exacerbated?

This is a last resort for us, and we would never take sites offline unless leaving them online would pose more of a risk to the customer(s) or their respondents.