Failure Detector
Energized with the Coffee , let's get back to modeling distributed systems. After the two phase commit protocol, the next protocol that we will jump to is a simple broadcast-based failure detector!
By this point in the tutorial, we have gotten familiar with the P language and most of its features. So, working through this example should be super fast!
How to use this example
We assume that you have cloned the P repository locally.
git clone https://github.com/p-org/P.git
The recommended way to work through this example is to open the P/Tutorial folder in IntelliJ side-by-side a browser using which you can simultaneously read the description for each example and browse the P program in IntelliJ.
To know more about P language primitives used in the example, please look them up in the language manual.
System: We consider a simple failure detector that basically broadcasts ping messages to all the nodes in the system and uses a timer to wait for pong responses from all nodes. If a node does not respond with a pong message after multiple attempts (either because of network failure or node failure), the failure detector marks the node as down and notifies the clients about the nodes that are potentially down. We use this example to show how to model network message loss in P and discuss how to model other types of network behaviours.
Correctness Specification: We would like to check - using a liveness specification - that if the failure injector shuts down a particular node then the failure detector always eventually detects the node failure and notifies the client.
P Project
The 4_FailureDetector folder contains the source code for the FailureDetector project. Please feel free to read details about the recommended P program structure and P project file.
Models
The P models (PSrc) for the FailureDetector example consist of four files:
- FailureDetector.p: Implements the
FailureDetector
machine.
[Expand]: Let's walk through FailureDetector.p
- (L1 - L4) → Event
ePing
andePong
are used to communicate between theFailureDetector
and theNode
state machines (manual: event declaration). - (L6) → Event
eNotifyNodesDown
is used by the FailureDetector to inform the clients about the nodes that are potentially down. - (L14 - L129) → Declares the
FailureDetector
state machine (manual: P state machine). The key points to note in theFailureDetector
machine are the usage of the Timer machine to model the usage of OS timer, the usage of ReliableBroadCast, and the usage of UnReliableBroadCast defined in NetworkFunctions.p.
- Node.p: Implements the
Node
machine.
[Expand]: Let's walk through Node.p
- (L4 - L14) → Declares the
Node
state machine. TheNode
machine responds with aePong
message on receiving aePing
message from theFailureDetector
. On receiving aeShutDown
message from theFailureInjector
, the machine halts itself.
- Client.p: Declares the
Client
machine.
[Expand]: Let's walk through Client.p
The Client
machine is a dummy machine that gets a set of alive nodes when the system starts and maintains the set of currently alive nodes by removing the nodes that are marked as down by the FailureDetector
.
- FailureDetectorModules.p: Declares the
FailureDetector
module.
[Expand]: Let's walk through FailureDetectorModules.p
Declares the FailureDetector
module which is the union of the module consisting of the FailureDetector
, Node
, and Client
machines and the Timer
module.
Specifications
The P Specification (PSpec) for the FailureDetector is implemented in ReliableFailureDetector.p. We define a simple ReliableFailureDetector
liveness specification to assert that all nodes that have been shutdown
by the failure injector will eventually be detected by the failure detector as failed nodes.
[Expand]: Let's walk through ReliableFailureDetector.p
- (L6 - L57) → Declares the
ReliableFailureDetector
liveness monitor.ReliableFailureDetector
spec machine basically maintains two setsnodesDownDetected
(nodes that are detected as down by the detector) andnodesShutdownAndNotDetected
(nodes that are shutdown by the failure injector but not yet detected).ReliableFailureDetector
monitor observes theeNotifyNodesDown
andeShutDown
events to update these maps and move between thehot
state (unstable state) and non-hot states. The system is in a hot state if there are nodes that are shutdown but not yet detected by the failure detector. The system violates a liveness specification if any of its execution paths terminates in a hot state. - To understand the semantics of the P spec machines and the details about liveness monitors, please read the manual: p monitors.
Test Scenarios
The test scenarios folder in P has two parts: TestDrivers and TestScripts. TestDrivers are collections of state machines that implement the test harnesses (or environment state machines) for different test scenarios. TestScripts are collections of test cases that are automatically run by the P checker.
The test scenarios folder for FailureDetector (PTst) consists of two files TestDriver.p and TestScript.p.
[Expand]: Let's walk through TestDriver.p
This file consists of a single test driver machine that sets up the system under test given the number of nodes and clients in the system. The SetupSystemWithFailureInjector
function creates the clients, nodes, failure injector and the failure detector machines.
[Expand]: Let's walk through TestScript.p
The test script contains two test cases:
-
Basic test case (TestFailureDetector) that asserts the
ReliableFailureDetector
specification on a system composed of theFailureDetector
,FailureInjector
, and test-driverTestMultipleClients
. -
Parameterized test case (tcTest_BalancedLoad) that systematically explores different system configurations:
This test:test param (numNodes in [3, 4, 5], numClients in [2, 3, 4]) tcTest_BalancedLoad: assume (numClients <= numNodes); // Only test valid configurations assert ReliableFailureDetector in union { TestWithConfig }, FailureDetector, FailureInjector;
- Uses parameters declared in TestDriver.p:
param numNodes: int; param numClients: int;
- Tests different system sizes (3-5 nodes) and monitoring loads (2-4 clients)
- Uses
assume
to ensure clients don't outnumber nodes for better monitoring distribution - Tests each valid configuration with the
ReliableFailureDetector
specification
- Uses parameters declared in TestDriver.p:
Compiling FailureDetector
Run the following command to compile the FailureDetector project:
p compile
Expected Output
$ p compile
.. Searching for a P project file *.pproj locally in the current folder
.. Found P project file: P/Tutorial/4_FailureDetector/FailureDetector.pproj
----------------------------------------
==== Loading project file: P/Tutorial/4_FailureDetector/FailureDetector.pproj
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/FailureDetectorModules.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/Client.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/FailureDetector.p
....... includes p file: P/Tutorial/4_FailureDetector/PSrc/Node.p
....... includes p file: P/Tutorial/4_FailureDetector/PSpec/ReliableFailureDetector.p
....... includes p file: P/Tutorial/4_FailureDetector/PTst/TestDriver.p
....... includes p file: P/Tutorial/4_FailureDetector/PTst/TestScript.p
==== Loading project file: P/Tutorial/Common/Timer/Timer.pproj
....... includes p file: P/Tutorial/Common/Timer/PSrc/Timer.p
....... includes p file: P/Tutorial/Common/Timer/PSrc/TimerModules.p
==== Loading project file: P/Tutorial/Common/FailureInjector/FailureInjector.pproj
....... includes p file: P/Tutorial/Common/FailureInjector/PSrc/NetworkFunctions.p
....... includes p file: P/Tutorial/Common/FailureInjector/PSrc/FailureInjector.p
----------------------------------------
Parsing ...
Type checking ...
Code generation ...
Generated FailureDetector.cs.
----------------------------------------
Compiling FailureDetector...
MSBuild version 17.3.1+2badb37d1 for .NET
Determining projects to restore...
Restored P/Tutorial/4_FailureDetector/PGenerated/CSharp/FailureDetector.csproj (in 93 ms).
FailureDetector -> P/Tutorial/4_FailureDetector/PGenerated/CSharp/net6.0/FailureDetector.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:04.42
----------------------------------------
~~ [PTool]: Thanks for using P! ~~
Checking FailureDetector
You can get the list of test cases defined in the FailureDetector project by running the P Checker:
p check
Expected Output
$ p check
.. Searching for a P compiled file locally in folder ./PGenerated/
.. Found a P compiled file: ./PGenerated/PChecker/net8.0/FailureDetector.dll
.. Checking ./PGenerated/PChecker/net8.0/FailureDetector.dll
Error: We found '9' test cases. Please provide a more precise name of the test case you wish to check using (--testcase | -tc).
Possible options are:
tcTest_FailureDetector
tcTest_BalancedLoad___numNodes_3__numClients_2
tcTest_BalancedLoad___numNodes_4__numClients_2
tcTest_BalancedLoad___numNodes_5__numClients_2
tcTest_BalancedLoad___numNodes_3__numClients_3
tcTest_BalancedLoad___numNodes_4__numClients_3
tcTest_BalancedLoad___numNodes_5__numClients_3
tcTest_BalancedLoad___numNodes_4__numClients_4
tcTest_BalancedLoad___numNodes_5__numClients_4
~~ [PTool]: Thanks for using P! ~~
Above shows the test cases defined in the FailureDetector project and you can specify which
test case to run by using the -tc
parameter along with the -s
parameter for the number of schedules to explore.
Check the tcTest_FailureDetector
test case for 10,000 schedules:
p check -tc tcTest_FailureDetector -s 10000
Discussion: Modeling Message Reordering
(to be added soon)
Exercise Problem
What did we learn through this example?
In this example, we saw how to use data nondeterminism to model message loss and unreliable sends. We also discussed how to model other types of network nondeterminism.