# Lazy x86-TSO Memory Model for C++ Verification

Vladimír Štill



Masaryk University

Brno, Czech Republic

26th February 2018

- design of parallel programs is hard
- easy to make mistakes data races, deadlocks



- design of parallel programs is hard
- easy to make mistakes data races, deadlocks
- memory behaviour is very complex
  - effects of caches, out-of-order and speculative execution



- design of parallel programs is hard
- easy to make mistakes data races, deadlocks
- memory behaviour is very complex

effects of caches, out-of-order and speculative execution

```
int x, y = 0;
void thread0() {
    y = 1;
    int a = x;
}
void thread1() {
    x = 1;
    int b = y;
}
```

■ is it possible to end with a == 0 && b == 0?



- design of parallel programs is hard
- easy to make mistakes data races, deadlocks
- memory behaviour is very complex

effects of caches, out-of-order and speculative execution

```
int x, y = 0;
void thread0() {
    y = 1;
    int a = x;
}
void thread1() {
    x = 1;
    int b = y;
}
```

■ is it possible to end with a == 0 && b == 0? yes





### $\blacksquare$ C and C++



- C and C++
- program is translated into LLVM intermediate language
- LLVM is executed by the model checker
- exploration of all possible runs of the program



## • C and C++

- program is translated into LLVM intermediate language
- LLVM is executed by the model checker
- exploration of all possible runs of the program
  - actually of some representants of classes of equivalent runs



## • C and C++

- program is translated into LLVM intermediate language
- LLVM is executed by the model checker
- exploration of all possible runs of the program
  - actually of some representants of classes of equivalent runs
- detect assertions, memory errors, compiler traps, ...



void thread1() {
 x = 1;
 int b = y;
 int c = x;
}



### int x, y = 0; void thread0() { y = 1; int a = x; }

void thread1() {
 x = 1;
 int b = y;
 int c = x;
}
thread 1
 x = 1;



thread 0



store buffer of t. 0



































void thread1() {
 x = 1;
 int b = y;
 int c = x;
}
thread 1



y = 1; load x;

thread 0

store buffer of t. 0



### int x, y = 0;void thread0() { y = 1;int a = x;}

void thread1() { x = 1;int b = y;int c = x;} thread 1 x = 1;



thread 0

у



1 store buffer of t. 0









х

y







- memory is significantly slower than processor cores
- processor has caches to speed up execution



- memory is significantly slower than processor cores
- processor has caches to speed up execution
- optimizations of cache coherency protocols
  - $\rightarrow$  observable effects



- memory is significantly slower than processor cores
- processor has caches to speed up execution
- optimizations of cache coherency protocols
  - $\rightarrow$  observable effects
- reordering of instructions might be also observable (not on x86)



- memory is significantly slower than processor cores
- processor has caches to speed up execution
- optimizations of cache coherency protocols → observable effects
- reordering of instructions might be also observable (not on x86)
- overall behaviour described by a (relaxed) memory model



- memory is significantly slower than processor cores
- processor has caches to speed up execution
- optimizations of cache coherency protocols → observable effects
- reordering of instructions might be also observable (not on x86)
- overall behaviour described by a (relaxed) memory model
- now: x86-TSO memory model



- memory is significantly slower than processor cores
- processor has caches to speed up execution
- optimizations of cache coherency protocols → observable effects
- reordering of instructions might be also observable (not on x86)
- overall behaviour described by a (relaxed) memory model
- now: x86-TSO memory model
  - stores are performed to store buffer
  - core-local FIFO buffers
  - entries flushed eventually to the memory



- encode the memory model into the program
- verify it using a verifier without memory model support
  - e.g. DIVINE, a lot of other verifiers
  - program transformation instead of modification of the verifier



- $\blacksquare$  encode the memory model into the program
- verify it using a verifier without memory model support
  - e.g. DIVINE, a lot of other verifiers
  - program transformation instead of modification of the verifier

```
x = 1;
int a = y;
```

```
\sim \rightarrow
```

```
_store( &x, 1 );
int a = _load( &y );
```

load, \_store simulate the memory model(more complex in practice)

# 出

#### program transformation

- can be improved with static analysis
- memory model independent
- most complexity is technical

#### program transformation

- can be improved with static analysis
- memory model independent
- most complexity is technical

#### memory operations

- memory model dependent
- rather complex (theoretically & technically)
- $\blacksquare$  impact efficiency a lot  $\rightarrow$  the main aim of my work
  - efficient data structures (time & memory)
  - amount of nondeterminism
- bounded reordering of (effects of) instructions



## State Space Explosion





# State Space Explosion







### Inot all memory is actually accessible by more than one thread (shared)



- I not all memory is actually accessible by more than one thread (shared)
- 2 not all shared memory is actually *accessed* by more that one thread



- not all memory is actually accessible by more than one thread (shared)
- Inot all shared memory is actually accessed by more that one thread
- even memory accessed by more than one threads is usually not accessed by all of them all the time



- not all memory is actually accessible by more than one thread (shared)
- Inot all shared memory is actually accessed by more that one thread
- even memory accessed by more than one threads is usually not accessed by all of them all the time
- DIVINE's state space reduction uses these observations



- Inot all memory is actually accessible by more than one thread (shared)
- Inot all shared memory is actually accessed by more that one thread
- even memory accessed by more than one threads is usually not accessed by all of them all the time
- DIVINE's state space reduction uses these observations
- but relaxed memory simulation has to be adapted to support this



- i.e. when someone tries to load given address
- $\blacksquare$  need to simulate all outcomes  $\rightarrow$  nondeterminism in load



- i.e. when someone tries to load given address
- $\blacksquare$  need to simulate all outcomes  $\rightarrow$  nondeterminism in load
- how to handle other entries in store buffer?



- i.e. when someone tries to load given address
- $\blacksquare$  need to simulate all outcomes  $\rightarrow$  nondeterminism in load
- how to handle other entries in store buffer?
- memory barriers and compare-and-swap/read-modify-write not fully lazy



- i.e. when someone tries to load given address
- $\blacksquare$  need to simulate all outcomes  $\rightarrow$  nondeterminism in load
- how to handle other entries in store buffer?
- memory barriers and compare-and-swap/read-modify-write not fully lazy
  - flushing of local store buffer can nondeterministically flush entries from other buffers
  - fully lazy barriers would show down DiOS



the lazy simulation of x86-TSO store buffers mostly done

- one known missing corner case
  - in sequence of stores to unrelated addresses
  - solution will probably increase laziness and therefore performance
- probably some space for speed improvement



#### current delays caused by interaction with state space reductions



current delays caused by interaction with state space reductions

- store buffers look like shared memory for reduction
- ensure reduction does not see every operation as visible



current delays caused by interaction with state space reductions

- store buffers look like shared memory for reduction
- ensure reduction does not see every operation as visible
- not everything in memory model implementation can be hidden
   "real" loads/stores



- 1 finish implementation
- 2 add more tests and benchmarks
- 3 compare with other tools
- 4 publish



- 1 finish implementation
- 2 add more tests and benchmarks
- 3 compare with other tools
- 4 publish
- 5 SV-COMP demo category?
- **6** (optimize & improve & publish)<sup>+</sup>



- **1** finish implementation
- 2 add more tests and benchmarks
- 3 compare with other tools
- 4 publish
- 5 SV-COMP demo category?
- **6** (optimize & improve & publish)<sup>+</sup>

That is all... Thank You!