Non-Functional Requirements (NFRs) Framework for Software Systems - Best Practice: Consider Resilience Non-Functional Requirements (NFRs)

Chapter 14. Best Practice: Consider Resilience Non-Functional Requirements (NFRs)

Overview

Resilience Non-Functional Requirements (NFRs) define how a software system should withstand, absorb, isolate, and recover from failures, overloads, dependency interruptions, abnormal inputs, infrastructure degradation, and other adverse operating conditions. Resilience is closely related to availability, reliability, recoverability, safety, and operability, but it focuses specifically on how the system behaves while stress, failure, or degraded conditions are occurring.

Resilience requirements should identify what failures are expected, which functions must continue, which functions may degrade, how users and operators are informed, how dependent systems are protected, and what evidence proves that the system can operate safely and predictably when conditions are not ideal.

Best Practice: Define fault-tolerance non-functional requirements

Description

Fault-tolerance NFRs define the degree to which a system continues operating when components, services, nodes, data paths, infrastructure elements, or dependencies fail. These requirements should clarify which failures the system must tolerate, which capabilities must remain available, what level of degradation is acceptable, and how recovery or rerouting occurs.

Benefits

Fault-tolerance requirements reduce the risk that a single failing component causes broad service disruption. They also guide architecture decisions for redundancy, isolation, state management, failover, retries, and operational response.

Example non-functional requirements

The software system shall continue processing read-only user requests when one non-primary application node fails, provided at least one healthy node remains available.

Validation method: Validate through controlled node-failure testing in a production-like environment and verify that requests continue to route to healthy nodes.
Example validation evidence: Node-failure test report, load balancer health-check logs, application monitoring dashboard, incident simulation notes, and test approval record.
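As an illustration only, the routing behavior this requirement describes can be sketched as a round-robin router that skips unhealthy nodes. The names `Node`, `ReadRouter`, and `NoHealthyNodeError` are hypothetical and not part of the framework. In a real deployment, health state would normally come from load balancer health checks, not from a flag set by hand.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class NoHealthyNodeError(RuntimeError):
    """Raised when every node in the pool is marked unhealthy."""


class ReadRouter:
    """Round-robin router for read-only requests that skips unhealthy nodes."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._cursor = 0

    def route(self) -> Node:
        # Try each node at most once, starting from the current cursor.
        for offset in range(len(self._nodes)):
            index = (self._cursor + offset) % len(self._nodes)
            node = self._nodes[index]
            if node.healthy:
                self._cursor = index + 1
                return node
        raise NoHealthyNodeError("no healthy node available")


nodes = [Node("app-1"), Node("app-2"), Node("app-3")]
router = ReadRouter(nodes)
nodes[1].healthy = False  # simulate the controlled node failure
served_by = [router.route().name for _ in range(4)]
```

A node-failure test could assert that `served_by` never contains the failed node while requests keep succeeding.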

The software system shall prevent failure in a non-critical reporting component from interrupting critical transaction submission.

Validation method: Validate by disabling or faulting the reporting component during transaction testing and confirming that critical transaction submission remains successful.
Example validation evidence: Fault-injection test results, transaction success report, component dependency map, monitoring screenshots, and architecture review approval.

Typical stakeholders include solution architects, application architects, SRE teams, platform engineers, developers, QA teams, operations teams, product owners, and business continuity stakeholders.

Fault-tolerance NFRs are defined during architecture and design; implemented during development and platform build; and validated during integration testing, resilience testing, failure injection, release readiness, and recurring operational drills.

Best Practice: Define degraded-mode operating non-functional requirements

Description

Degraded-mode operating NFRs define how a system should behave when it cannot deliver full functionality but can still deliver partial, limited, delayed, or read-only capability. These requirements should identify which capabilities are essential, which capabilities may be temporarily disabled, and how users, operators, and dependent systems are informed.

Benefits

Degraded-mode requirements allow teams to design graceful failure behavior instead of allowing uncontrolled errors, user confusion, or unnecessary total outages. They help preserve critical business functions during partial service disruption.

Example non-functional requirements

If the recommendation service is unavailable, the shopping application shall continue to support product search, product detail viewing, cart updates, and checkout without displaying recommendation content.

Validation method: Validate by simulating recommendation service unavailability and confirming that critical shopping workflows complete successfully without user-facing system errors.
Example validation evidence: Degraded-mode test report, user journey test results, application logs, screenshots of fallback behavior, and product owner approval.
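A minimal sketch of the fallback behavior, assuming a hypothetical `render_product_page` function and a recommender callable that raises when the service is down. None of these names come from the framework. The point is that the page still renders, with recommendations omitted instead of shown as an error.

```python
class RecommendationUnavailable(Exception):
    """Raised by the (hypothetical) recommendation client during an outage."""


def failing_recommender(product_id):
    # Stand-in for the real client; always fails to simulate the outage.
    raise RecommendationUnavailable("recommendation service unavailable")


def render_product_page(product_id, recommender):
    page = {
        "product_id": product_id,
        "details": f"Details for {product_id}",
        "recommendations": [],  # left empty in degraded mode
        "degraded": False,      # flag for operators and monitoring
    }
    try:
        page["recommendations"] = recommender(product_id)
    except (RecommendationUnavailable, TimeoutError):
        # Degrade gracefully: keep the page, drop the optional content.
        page["degraded"] = True
    return page


page = render_product_page("sku-123", failing_recommender)
```

The `degraded` flag is one possible way to give operators visibility into fallback use; logging or a metric would serve the same purpose.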

If the external credit-check provider is unavailable, the application shall hold submitted requests in a pending-review state and notify authorized operations users for manual review.

Validation method: Validate through integration failure testing and verify pending-state creation, user notification, operations notification, and later recovery or replay behavior.
Example validation evidence: Integration failure test report, pending queue record, notification evidence, operations dashboard screenshot, and replay/recovery test evidence.
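The pending-review behavior might look like the following sketch. `CreditCheckWorkflow`, `ProviderUnavailable`, and the callables passed in are illustrative assumptions; the real provider interface, notification channel, and persistence of the pending queue are not specified by the requirement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    PENDING_REVIEW = "pending_review"


class ProviderUnavailable(Exception):
    """Raised when the external credit-check provider cannot be reached."""


@dataclass
class Application:
    applicant_id: str
    status: Optional[Status] = None


class CreditCheckWorkflow:
    def __init__(self, provider, notify_operations):
        self._provider = provider          # callable: applicant_id -> bool
        self._notify = notify_operations   # callable: message -> None
        self.pending = []                  # in-memory here; durable in practice

    def submit(self, application):
        try:
            approved = self._provider(application.applicant_id)
        except ProviderUnavailable:
            # Hold the request and alert operations instead of failing it.
            application.status = Status.PENDING_REVIEW
            self.pending.append(application)
            self._notify(f"manual review needed: {application.applicant_id}")
            return application
        application.status = Status.APPROVED if approved else Status.DECLINED
        return application

    def replay_pending(self):
        """Re-run held requests once the provider has recovered."""
        still_pending = []
        for application in self.pending:
            try:
                approved = self._provider(application.applicant_id)
            except ProviderUnavailable:
                still_pending.append(application)
                continue
            application.status = Status.APPROVED if approved else Status.DECLINED
        self.pending = still_pending
```

An integration failure test would toggle the provider between failing and healthy, then check pending-state creation, the operations alert, and the result of `replay_pending`.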

Typical stakeholders include product owners, business owners, UX designers, operations teams, SRE teams, architects, developers, customer support, and compliance stakeholders when manual review or regulated workflows are involved.

Degraded-mode NFRs are defined during requirements, architecture, and UX design; implemented during development; and validated during integration testing, user acceptance testing, resilience testing, incident simulation, and operational readiness review.

Best Practice: Define throttling, circuit-breaker, and back-pressure non-functional requirements

Description

Throttling, circuit-breaker, and back-pressure NFRs define how a system limits traffic, stops repeated failing calls, slows producers, protects consumers, and prevents overload from cascading across services or integrations. These requirements should clarify limits, thresholds, retry behavior, queue behavior, rejection behavior, alerts, and recovery conditions.

Benefits

These requirements help protect systems from overload, cascading failure, uncontrolled retries, queue exhaustion, and dependency collapse. They also make high-load behavior more predictable for users, operators, and integrated systems.

Example non-functional requirements

The payment API shall reject requests above the approved tenant-specific rate limit with a documented HTTP 429 response and shall include retry-after guidance when appropriate.

Validation method: Validate through rate-limit testing that exceeds the approved threshold and verifies response code, response payload, retry guidance, logging, and alerting behavior.
Example validation evidence: Rate-limit test results, API gateway configuration, HTTP response samples, logs, alerts, and approved API contract documentation.
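One way to picture the rate-limit behavior is a fixed-window counter per tenant that returns a status code and headers. This is a sketch under assumptions: the class name `TenantRateLimiter`, the 60-second window, and the default limit are all hypothetical, and production systems often enforce limits at an API gateway using other algorithms such as token buckets.

```python
import math
import time


class TenantRateLimiter:
    """Fixed-window request counter with a separate limit per tenant."""

    def __init__(self, limits, default_limit=10, window_seconds=60,
                 clock=time.monotonic):
        self._limits = dict(limits)       # tenant_id -> requests per window
        self._default_limit = default_limit
        self._window = window_seconds
        self._clock = clock               # injectable for deterministic tests
        self._windows = {}                # tenant_id -> (window_start, used)

    def check(self, tenant_id):
        """Return (status_code, headers) for one incoming request."""
        now = self._clock()
        window_start, used = self._windows.get(tenant_id, (now, 0))
        if now - window_start >= self._window:
            window_start, used = now, 0   # start a fresh window
        limit = self._limits.get(tenant_id, self._default_limit)
        if used >= limit:
            # Tell the client how long until the current window ends.
            retry_after = max(1, math.ceil(window_start + self._window - now))
            self._windows[tenant_id] = (window_start, used)
            return 429, {"Retry-After": str(retry_after)}
        self._windows[tenant_id] = (window_start, used + 1)
        return 200, {}
```

Injecting the clock lets a rate-limit test exceed the threshold and check the 429 response and retry guidance without waiting for real time to pass.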

The order service shall open a circuit breaker after five consecutive timeout failures from the inventory service within one minute and shall attempt recovery using the approved half-open retry policy.

Validation method: Validate through dependency timeout simulation and verify circuit breaker state transitions, fallback behavior, recovery behavior, and operator visibility.
Example validation evidence: Circuit-breaker test report, service logs, trace records, resilience dashboard, dependency simulation configuration, and SRE review approval.
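The state transitions in this requirement can be sketched as a small circuit breaker. The threshold of five timeouts and the one-minute window come from the example requirement; the 30-second cooldown and the single-trial half-open policy are assumptions standing in for "the approved half-open retry policy", which the requirement does not define. `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, failure_window=60.0,
                 cooldown=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._window = failure_window
        self._cooldown = cooldown         # assumed value, not from the NFR
        self._clock = clock
        self._timeouts = deque()          # timestamps of consecutive timeouts
        self._opened_at = None
        self.state = "closed"

    def call(self, func):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self._cooldown:
                raise CircuitOpenError("call rejected while circuit is open")
            self.state = "half_open"      # cooldown elapsed: allow a trial call
        try:
            result = func()
        except TimeoutError:
            self._record_timeout(now)
            raise
        # Other exceptions propagate unchanged and are not counted here.
        self._timeouts.clear()            # a success breaks the consecutive run
        self.state = "closed"
        return result

    def _record_timeout(self, now):
        if self.state == "half_open":
            self._trip(now)               # failed trial: reopen immediately
            return
        self._timeouts.append(now)
        while self._timeouts and now - self._timeouts[0] > self._window:
            self._timeouts.popleft()      # drop timeouts outside the window
        if len(self._timeouts) >= self._threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._timeouts.clear()
```

A dependency timeout simulation would drive the breaker through closed, open, half-open, and closed again, and confirm that each transition is visible to operators.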

Typical stakeholders include application architects, integration architects, API owners, developers, platform engineers, SRE teams, operations teams, product owners, and dependent system owners.

These NFRs are defined during architecture and integration design; implemented during development and platform configuration; and validated during load testing, failure injection, integration testing, release readiness, and production monitoring.

Best Practice: Define dependency isolation non-functional requirements

Description

Dependency isolation NFRs define how failures, slowness, overload, data errors, or operational issues in one dependency are prevented from spreading to unrelated components, services, tenants, workflows, or business capabilities. These requirements may address bulkheads, queues, timeouts, resource pools, tenant isolation, network segmentation, data partitioning, and fallback paths.

Benefits

Dependency isolation requirements reduce blast radius and make failures easier to contain, diagnose, and recover from. They also support safer multi-tenant operation, safer integration design, and better operational accountability.

Example non-functional requirements

Failure of the notification service shall not prevent completion of the primary transaction workflow; failed notifications shall be queued for retry or routed to an approved exception process.

Validation method: Validate by disabling the notification service during transaction testing and confirming that primary transactions commit successfully and notification exceptions are captured.
Example validation evidence: Transaction test report, queue records, exception records, application logs, trace data, and operational runbook evidence.
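To make the isolation concrete, the sketch below commits the primary transaction first and treats notification as a best-effort step whose failures are queued. `TransactionService` and `NotificationError` are hypothetical names, and the in-memory queue stands in for whatever durable queue or exception process is approved.

```python
from collections import deque


class NotificationError(Exception):
    """Raised when the notification service cannot accept a message."""


class TransactionService:
    def __init__(self, send_notification):
        self._send = send_notification    # callable: transaction_id -> None
        self.committed = []
        self.retry_queue = deque()

    def submit(self, transaction_id):
        # The primary workflow commits before any notification is attempted.
        self.committed.append(transaction_id)
        try:
            self._send(transaction_id)
        except NotificationError:
            # Capture the failure for retry; do not fail the transaction.
            self.retry_queue.append(transaction_id)
        return "committed"

    def drain_retry_queue(self):
        """Retry each queued notification once; requeue any that still fail."""
        for _ in range(len(self.retry_queue)):
            transaction_id = self.retry_queue.popleft()
            try:
                self._send(transaction_id)
            except NotificationError:
                self.retry_queue.append(transaction_id)
```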

Workloads for one customer tenant shall not consume shared processing resources in a way that prevents other tenants from completing approved critical transactions within defined service targets.

Validation method: Validate through multi-tenant load testing, resource contention testing, and tenant-level monitoring under approved high-load scenarios.
Example validation evidence: Multi-tenant load test report, resource utilization dashboards, tenant-level service metrics, capacity analysis, and architecture review record.
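A bulkhead is one of the isolation techniques named in the description above. As a hedged sketch, a per-tenant concurrency cap can be built from semaphores so that one tenant exhausting its slots does not block another. `TenantBulkhead` and the limit of two concurrent jobs are assumptions for illustration; real limits would come from capacity analysis.

```python
import threading


class TenantBulkhead:
    """Caps concurrent work per tenant so tenants cannot starve each other."""

    def __init__(self, max_concurrent_per_tenant):
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._semaphores = {}             # tenant_id -> BoundedSemaphore

    def _semaphore_for(self, tenant_id):
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(
                    self._limit
                )
            return self._semaphores[tenant_id]

    def try_acquire(self, tenant_id) -> bool:
        # Non-blocking: reject immediately when the tenant is at its cap.
        return self._semaphore_for(tenant_id).acquire(blocking=False)

    def release(self, tenant_id):
        self._semaphore_for(tenant_id).release()


bulkhead = TenantBulkhead(max_concurrent_per_tenant=2)
```

Rejecting instead of blocking is a design choice here: a caller that gets `False` can return a throttling response to the noisy tenant while other tenants continue unaffected.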

Typical stakeholders include solution architects, data architects, integration architects, platform engineers, SRE teams, developers, operations teams, security architects, product owners, and tenant/customer representatives when applicable.

Dependency isolation NFRs are defined during architecture, platform design, data design, and integration design, and are validated during integration testing, load testing, resilience testing, tenant testing, production monitoring, and incident review.

Best Practice: Define resilience testing non-functional requirements

Description

Resilience testing NFRs define the scenarios, environments, tools, frequencies, roles, controls, and evidence required to prove that the system can withstand and recover from defined failure or degradation conditions. These requirements may include failure injection, chaos testing, dependency simulation, load with failure, regional impairment, queue buildup, and degraded-mode workflow tests.

Benefits

Resilience testing reduces uncertainty about how a system behaves during failure. It also helps teams find hidden coupling, weak runbooks, untested fallback paths, false monitoring assumptions, and recovery gaps before real incidents occur.

Example non-functional requirements

The software system shall complete approved resilience tests for critical user journeys before production release and after major architecture changes.

Validation method: Validate by reviewing the approved resilience test plan, executed test results, critical journey coverage, defects, remediation status, and release approval.
Example validation evidence: Resilience test plan, executed test report, critical journey coverage matrix, defect records, remediation evidence, and release readiness approval.

Each critical external dependency shall have at least one documented and executed failure simulation test before production release.

Validation method: Validate by comparing the dependency inventory to executed failure simulation records and confirming coverage for each critical dependency.
Example validation evidence: Dependency inventory, failure simulation results, trace logs, fallback screenshots, monitoring evidence, and architecture signoff.
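The comparison described in the validation method amounts to a set difference. The sketch below assumes the dependency inventory and the simulation records are available as lists of dictionaries; the field names `name`, `critical`, `dependency`, and `executed` are hypothetical.

```python
def uncovered_critical_dependencies(inventory, simulation_records):
    """Return critical dependencies that have no executed failure simulation."""
    critical = {dep["name"] for dep in inventory if dep["critical"]}
    covered = {
        record["dependency"]
        for record in simulation_records
        if record["executed"]  # documented-but-not-run tests do not count
    }
    return sorted(critical - covered)


inventory = [
    {"name": "payment-gateway", "critical": True},
    {"name": "credit-check", "critical": True},
    {"name": "marketing-feed", "critical": False},
]
simulation_records = [
    {"dependency": "payment-gateway", "executed": True},
    {"dependency": "credit-check", "executed": False},
]
gaps = uncovered_critical_dependencies(inventory, simulation_records)
```

An empty result would indicate full coverage; any names returned are release blockers under this requirement.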

Typical stakeholders include SRE teams, QA teams, architects, developers, platform engineers, operations teams, product owners, business continuity teams, and dependency owners.

Resilience testing NFRs are defined during test strategy and architecture, and are validated during integration testing, performance testing, pre-production readiness, disaster recovery preparation, post-release drills, and recurring operational exercises.

Best Practice: Define resilience validation and evidence non-functional requirements

Description

Resilience validation and evidence NFRs define the proof required to show that resilience requirements have been satisfied. Evidence may include test results, monitoring records, incident simulations, architecture reviews, trace data, failure-injection results, runbook execution records, and post-test approvals.

Benefits

Clear validation and evidence expectations make resilience claims auditable and repeatable. They also help teams avoid relying on architecture intent alone when actual system behavior has not been proven.

Example non-functional requirements

Each critical resilience NFR shall identify the validation method, validation environment, required evidence, validation owner, and approval stakeholder before release readiness review.

Validation method: Validate through requirements review and release readiness review, confirming that every critical resilience NFR includes the required validation fields and completed evidence.
Example validation evidence: Requirements traceability matrix, validation plan, release readiness checklist, evidence repository links, and approval record.
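A requirements review of this kind can be partly automated. The following sketch checks that an NFR record carries the five fields the requirement lists. The dictionary layout and the field keys are assumptions; a real traceability tool would have its own schema.

```python
REQUIRED_FIELDS = (
    "validation_method",
    "validation_environment",
    "required_evidence",
    "validation_owner",
    "approval_stakeholder",
)


def missing_validation_fields(nfr):
    """Return the required fields that are absent or blank on one NFR record."""
    missing = []
    for field_name in REQUIRED_FIELDS:
        value = nfr.get(field_name)
        if value is None or not str(value).strip():
            missing.append(field_name)
    return missing


nfr = {
    "id": "RES-014",  # hypothetical identifier
    "validation_method": "Dependency timeout simulation",
    "validation_environment": "Pre-production",
    "required_evidence": "Circuit-breaker test report",
    "validation_owner": "",
}
gaps = missing_validation_fields(nfr)
```

Running this over every critical resilience NFR before the release readiness review would give a simple list of records that still need attention.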

Resilience validation evidence shall be retained for each major production release and shall be available for architecture, operations, risk, and audit review.

Validation method: Validate by sampling release records and confirming that required resilience evidence exists, is accessible, and maps to approved NFRs.
Example validation evidence: Release evidence repository, resilience test results, traceability matrix, review notes, retention record, and audit sampling evidence.

Typical stakeholders include architects, SRE teams, QA teams, operations teams, product owners, risk stakeholders, audit stakeholders, and governance bodies.

Resilience validation and evidence NFRs are defined during requirements and test planning, and are validated during release readiness, operational readiness, incident simulation, post-incident review, governance review, and audit review.

How to cite this page

When referencing this page in academic work, internal standards, or external publications, include the page title, IF4IT as publisher, the URL, and your access date.

Example (informal web citation):

International Foundation for Information Technology (IF4IT). Best Practice: Consider Resilience Non-Functional Requirements (NFRs) | Non-Functional Requirements (NFRs) Framework for Software Systems. https://if4it.org/best-practices/non-functional-requirements-nfrs-framework-for-software-systems/best-practice-consider-resilience-non-functional-requirements-nfrs/ (accessed 2026-06-24).
