Skip to content

Failure Detector

Energized with the Coffee ☕, let's get back to modeling distributed systems. After the two phase commit protocol, the next protocol that we will jump to is a simple broadcast-based failure detector!

By this point in the tutorial, we have gotten familiar with the P language and most of its features. So, working through this example should be super fast!

How to use this example

We assume that you have cloned the P repository locally.

git clone https://github.com/p-org/P.git

The recommended way to work through this example is to open the P/Tutorial folder in IntelliJ side-by-side a browser using which you can simultaneously read the description for each example and browse the P program in IntelliJ.

To know more about P language primitives used in the example, please look them up in the language manual.

System: We consider a simple failure detector that basically broadcasts ping messages to all the nodes in the system and uses a timer to wait for pong responses from all nodes. If a node does not respond with a pong message after multiple attempts (either because of network failure or node failure), the failure detector marks the node as down and notifies the clients about the nodes that are potentially down. We use this example to show how to model network message loss in P and discuss how to model other types of network behaviours.

Placeholder

Correctness Specification: We would like to check - using a liveness specification - that if the failure injector shuts down a particular node then the failure detector always eventually detects the node failure and notifies the client.

P Project

The 4_FailureDetector folder contains the source code for the FailureDetector project. Please feel free to read details about the recommended P program structure and P project file.

Models

The P models (PSrc) for the FailureDetector example consist of four files:

[Expand]: Let's walk through FailureDetector.p
  • (L1 - L4) → Event ePing and ePong are used to communicate between the FailureDetector and the Node state machines (manual: event declaration).
  • (L6) → Event eNotifyNodesDown is used by the FailureDetector to inform the clients about the nodes that are potentially down.
  • (L14 - L129) → Declares the FailureDetector state machine (manual: P state machine). The key points to note in the FailureDetector machine are the usage of the Timer machine to model the usage of OS timer, the usage of ReliableBroadCast, and the usage of UnReliableBroadCast defined in NetworkFunctions.p.
  • Node.p: Implements the Node machine.
[Expand]: Let's walk through Node.p
  • (L4 - L14) → Declares the Node state machine. The Node machine responds with a ePong message on receiving a ePing message from the FailureDetector. On receiving a eShutDown message from the FailureInjector, the machine halts itself.
  • Client.p: Declares the Client machine.
[Expand]: Let's walk through Client.p

The Client machine is a dummy machine that gets a set of alive nodes when the system starts and maintains the set of currently alive nodes by removing the nodes that are marked as down by the FailureDetector.

[Expand]: Let's walk through FailureDetectorModules.p

Declares the FailureDetector module which is the union of the module consisting of the FailureDetector, Node, and Client machines and the Timer module.

Specifications

The P Specification (PSpec) for the FailureDetector is implemented in ReliableFailureDetector.p. We define a simple ReliableFailureDetector liveness specification to assert that all nodes that have been shutdown by the failure injector will eventually be detected by the failure detector as failed nodes.

[Expand]: Let's walk through ReliableFailureDetector.p
  • (L6 - L57) → Declares the ReliableFailureDetector liveness monitor. ReliableFailureDetector spec machine basically maintains two sets nodesDownDetected (nodes that are detected as down by the detector) and nodesShutdownAndNotDetected (nodes that are shutdown by the failure injector but not yet detected). ReliableFailureDetector monitor observes the eNotifyNodesDown and eShutDown events to update these maps and move between the hot state (unstable state) and non-hot states. The system is in a hot state if there are nodes that are shutdown but not yet detected by the failure detector. The system violates a liveness specification if any of its execution paths terminates in a hot state.
  • To understand the semantics of the P spec machines and the details about liveness monitors, please read the manual: p monitors.

Test Scenarios

The test scenarios folder in P has two parts: TestDrivers and TestScripts. TestDrivers are collections of state machines that implement the test harnesses (or environment state machines) for different test scenarios. TestScripts are collections of test cases that are automatically run by the P checker.

The test scenarios folder for FailureDetector (PTst) consists of two files TestDriver.p and TestScript.p.

[Expand]: Let's walk through TestDriver.p

This file consists of a single test driver machine that sets up the system under test given the number of nodes and clients in the system. The SetupSystemWithFailureInjector function creates the clients, nodes, failure injector and the failure detector machines.

[Expand]: Let's walk through TestScript.p

There is a single testcase (TestFailureDetector) defined for the FailureDetector system. The test case asserts the ReliableFailureDetector specification on a system which is a composition of the FailureDetector, FailureInjector, and the test-driver TestMultipleClients.

Compiling FailureDetector

Run the following command to compile the FailureDetector project:

p compile
Expected Output
$ p compile

.. Searching for a P project file *.pproj locally in the current folder
.. Found P project file: P/Tutorial/4_FailureDetector/FailureDetector.pproj
----------------------------------------
==== Loading project file: P/Tutorial/4_FailureDetector/FailureDetector.pproj
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/FailureDetectorModules.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/Client.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/FailureDetector.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/Node.p
....... includes p file: P/Tutorial/4_FailureDetector/PSpec/ReliableFailureDetector.p
....... includes p file: P/Tutorial/4_FailureDetector/PTst/TestDriver.p
....... includes p file: P/Tutorial/4_FailureDetector/PTst/TestScript.p
==== Loading project file: P/Tutorial/Common/Timer/Timer.pproj
....... includes p file: P/Tutorial/Common/Timer/PSrc/Timer.p
....... includes p file: P/Tutorial/Common/Timer/PSrc/TimerModules.p
==== Loading project file: P/Tutorial/Common/FailureInjector/FailureInjector.pproj
....... includes p file: P/Tutorial/Common/FailureInjector/PSrc/NetworkFunctions.p
....... includes p file: P/Tutorial/Common/FailureInjector/PSrc/FailureInjector.p
----------------------------------------
Parsing ...
Type checking ...
Code generation ...
Generated FailureDetector.cs.
----------------------------------------
Compiling FailureDetector...
MSBuild version 17.3.1+2badb37d1 for .NET
Determining projects to restore...
Restored P/Tutorial/4_FailureDetector/PGenerated/CSharp/FailureDetector.csproj (in 93 ms).
FailureDetector -> P/Tutorial/4_FailureDetector/PGenerated/CSharp/net6.0/FailureDetector.dll

Build succeeded.
0 Warning(s)
0 Error(s)

Time Elapsed 00:00:04.42


----------------------------------------
~~ [PTool]: Thanks for using P! ~~

Checking FailureDetector

There is only a single test case in the FailureDetector project and we can directly run the test case for 10,000 schedules:

p check -s 10000

Discussion: Modeling Message Reordering

(to be added soon)

Exercise Problem

What did we learn through this example?

In this example, we saw how to use data nondeterminism to model message loss and unreliable sends. We also discussed how to model other types of network nondeterminism.