Below is an outline of the Disaster Recovery Process followed by Delib. This process is in place 24/7, 365 days of the year.
Here's a summary of the stages, and the target timescales. Each stage is detailed later in this article.
|Stage|Target timescale|
|---|---|
|1. Detection & definition|When notified by our monitoring systems or a customer|
|2. On-call team alerted|As soon as possible after detection of a critical issue|
|3. Initial investigation & assessment|Within 1.5 hours of detection|
|4. Customer notification|Within 2 hours of detection, as specified in our Service Level Agreement|
|5. Resolution|Depends on the complexity of the problem; our target resolution times for both product and infrastructure issues are detailed in our Service Level Agreement|
|6. Report & document|Within 1 working day of resolution|
|7. Review & retrospective|Within 3 working days of resolution|
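The two detection-based targets in the summary (stages 3 and 4) can be expressed as concrete deadlines. The following is a minimal Python sketch for illustration only, not a description of Delib's tooling: the function name `detection_deadlines` and the dictionary keys are hypothetical, and only the 1.5-hour and 2-hour figures come from the summary.

```python
from datetime import datetime, timedelta

# Targets taken from the summary of stages; all names here are illustrative.
INVESTIGATION_TARGET = timedelta(hours=1, minutes=30)
NOTIFICATION_TARGET = timedelta(hours=2)


def detection_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Return the latest target time for stages 3 and 4, given the detection time."""
    return {
        "initial_investigation": detected_at + INVESTIGATION_TARGET,
        "customer_notification": detected_at + NOTIFICATION_TARGET,
    }


if __name__ == "__main__":
    detected = datetime(2024, 3, 4, 22, 15)
    for stage, deadline in detection_deadlines(detected).items():
        print(f"{stage}: {deadline:%Y-%m-%d %H:%M}")
```

Because the process runs around the clock, these two deadlines are plain elapsed-time offsets and do not need to account for office hours.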
1. Detection & definition
Detection: we'll be made aware of a critical issue by one of the following methods:
(a) Automated alert from our monitoring systems
(b) Non-automated internal detection (e.g. system admin checks, investigations arising from third-party security announcements)
(c) Non-automated customer or end-user report (including Zendesk support requests)
Definition: we use the following questions to decide whether an issue is critical:
1. Has a customer site been unavailable to the general public for more than 5 minutes?
2. Is there a reproducible issue which prevents a user from entering or submitting data?
3. Is there a reproducible issue which causes unavoidable or unexpected data loss?
4. Are spambots successfully posting content which is visible to the public?
5. Is there a bug or security vulnerability that constitutes a realistic threat to privacy?
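The five questions above can be read as a simple checklist. The sketch below is one possible encoding in Python, under the assumption that a "yes" to any single question is enough to make an issue critical (the text does not state this explicitly). The class `IssueReport`, its field names, and the function `is_critical` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    """Answers to the five questions above; field names are hypothetical."""
    public_downtime_minutes: float = 0.0
    blocks_data_entry: bool = False
    causes_data_loss: bool = False
    spam_visible_to_public: bool = False
    realistic_privacy_threat: bool = False


def is_critical(report: IssueReport) -> bool:
    """Assumption: a 'yes' to any single question makes the issue critical."""
    return any([
        report.public_downtime_minutes > 5,  # question 1: more than 5 minutes
        report.blocks_data_entry,            # question 2
        report.causes_data_loss,             # question 3
        report.spam_visible_to_public,       # question 4
        report.realistic_privacy_threat,     # question 5
    ])
```

For example, `is_critical(IssueReport(spam_visible_to_public=True))` evaluates to `True`, while a report with only four minutes of downtime and no other flags set does not.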
2. On-call team alerted
1. If a critical error has been picked up by one of our monitoring systems, the team will be alerted by email and text message. Unavailability lasting five minutes or longer is reported.
2. The on-call team will include at least one technical team member and one account manager or other customer-facing team member.
When? As soon as possible after detection of critical issue
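To illustrate the five-minute reporting threshold, here is a minimal Python sketch of how a monitoring check might decide when to alert. It assumes availability is sampled as a chronological list of `(timestamp, site_is_up)` pairs; that data shape and the function name `should_alert` are hypothetical, and the monitoring systems Delib actually uses are not described in this article.

```python
from datetime import datetime, timedelta

REPORT_THRESHOLD = timedelta(minutes=5)  # from the text: five minutes or longer


def should_alert(checks: list[tuple[datetime, bool]]) -> bool:
    """Return True if the current unbroken run of failed checks spans >= 5 minutes.

    `checks` is a hypothetical list of (timestamp, site_is_up) samples in
    chronological order. Only an outage still ongoing at the latest check counts.
    """
    outage_start = None
    for timestamp, site_is_up in checks:
        if site_is_up:
            outage_start = None       # site recovered, so reset the run
        elif outage_start is None:
            outage_start = timestamp  # first failed check of a new run
    if outage_start is None:
        return False
    latest_timestamp = checks[-1][0]
    return latest_timestamp - outage_start >= REPORT_THRESHOLD
```

A real monitoring system would also need to handle missed samples and notification delivery; this sketch covers only the threshold comparison.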
3. Initial investigation and assessment
The technical lead aims to establish the cause of the server issues, assess the severity and likely duration of the service interruption, and communicate this to the account manager.
Ideally, this will include:
1. Identification of the root cause
2. An assessment of the severity and scale of the problem, including which customers are affected
3. An estimated time to resolution
When? Within 1.5 hours of detection
4. Customer notification
The account manager will contact affected customer(s) to inform them of the service interruption, and that Delib are actively investigating the problem.
This communication will most likely be by email, but the channel depends on the severity of the incident. Wider-reaching issues may be posted as a homepage announcement on delib.zendesk.com in the first instance, ahead of any direct communication. The Delib Service Status Twitter account may also be used.
When? Our Service Level Agreement gives a maximum initial response time for critical errors of 2 hours.
5. Resolution
Once the technical on-call lead has assessed the problem, they will report back to the on-call account manager as follows:
(1) If the problem can be easily solved, it will be fixed. The technical lead will report back to the account manager, and document the problem and solution.
(2) If the issue is more complex, a resolution plan is put in place to address the service outage. This may require more technical team members to be contacted, or the investigation to be continued in office hours. An interim report, summarising the expected cause and the steps to resolution, will be provided to the account manager.
In both cases, any information we have will be communicated to affected customers by the on-call account manager. The account manager will continue to keep all affected customers updated with progress until we reach a resolution to the issue.
When? This depends on the complexity of the problem, but our target resolution times are set out in our Service Level Agreement.
6. Reporting, documentation and tidying up
Once the problem has been resolved, the account manager will provide a written report for affected customers and for Delib's own reference. This will include:
- How the problem was detected
- The scope of the problem and how it may have affected end-user interaction
- The root cause
- Steps to resolution, including any measures put in place to mitigate the risk of repeat occurrence
- Total downtime
- Any service credit or other compensation offered by Delib, should the error have caused us to miss our Service Level Agreement targets
When? All of this should happen within 1 working day of the resolution of the issue.
7. Review and retrospective
Once the error has been resolved, Delib will hold a retrospective to identify any long-term counter-measures that can be put in place to prevent a recurrence of the issue.
This disaster recovery process is also reviewed to identify any improvements that can be made.
When? Within 3 working days of resolution
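The reporting and retrospective targets are counted in working days from resolution. The Python sketch below shows one way to compute them, assuming that working days are Monday to Friday and ignoring public holidays; that assumption, along with the names `add_working_days` and `follow_up_deadlines`, is illustrative and not taken from Delib's Service Level Agreement.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def follow_up_deadlines(resolved_on: date) -> dict[str, date]:
    """Target dates for stage 6 (1 working day) and stage 7 (3 working days)."""
    return {
        "report_and_document": add_working_days(resolved_on, 1),
        "review_and_retrospective": add_working_days(resolved_on, 3),
    }
```

Under these assumptions, an issue resolved on Friday 8 March 2024 would have its report due on Monday 11 March and its retrospective due by Wednesday 13 March.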
Would we ever take sites offline?
Taking sites offline would be the informed decision of Delib's Managing Director, who will be given a full brief of the situation by the on-call team. We will ask ourselves some specific questions to determine whether this may be necessary:
- If the site stays online, could users submit data that gets lost without them knowing?
- If the site stays online, could any existing data loss or corruption be made worse?
- If the site stays online, is there a possibility of the loss or exposure of any personal information?
- Conversely, if the site is taken offline, could any existing data loss be exacerbated?
This is a last resort for us, and we would never take sites offline unless leaving them online would pose more of a risk to the customer(s) or their respondents.