TSO and IBM System/370

By Gepost door pveentjer

The IBM System/370 (IBM 370) is mainframe from the 70s.
You would probably wonder why I would discuss a machine a few people have heard of and even fewer people will ever need to work with. To understand abstract memory models like those from Java and C++ and to understand performance implications, I find it useful to have some insights into hardware memory models, how they are implemented, and especially how they are related. Total Store Order (TSO) is the memory model that is used on the X86, but there are ISA's like the SPARC V8/v9 that offer TSO. The problem that TSO solves with respect to sequential consistency (SC) is that when a CPU does a write and there is a write miss, the CPU needs to wait for the cache line to be invalidated on all CPUs that use this cache line before it can write to the cache-line. As a consequence, the performance of the CPU is reduced because it can't execute other instructions since this could lead to loads/stores being performed out of order in the global memory order.

So instead of blocking, since non-speculative stores are going to be written to the cache anyway, the stores are written to a store buffer that sits between the CPU and the cache. The stores in the store buffer are in program order and will be committed to the cache in program order.

Because stores are buffered, this can lead to loads/stores being applied in the global memory order in a different order than the program order. A reordering that is possible is that an older store is reordered with a newer load to a different address as demonstrated by the store buffering litmus test.
Initial: X=0, Y=0 CPU1: X=1 r1=Y CPU2: Y=1 r2=X
Can it be that r1=0, r2=0? With SC this can't happen because no reordering is allowed and therefore an earlier load can't be reordered with a later store to a different address. But with TSO this can happen due to store buffers; even though the store retires before the load retires, the store could be placed in the memory order after the load was placed in the memory order.

With SC you get a total order over all loads/stores. But because loads and stores to different addresses can be reordered, TSO will only provide a total order over all stores because stores are atomic and not reordered with other stores. The total order over all stores can be demonstrated with the Independent read of independent writes (IRIW) litmus test:

Initial: X=0, Y=0 CPU1: X=1 CPU2: Y=1 CPU3: r1=X r2=Y CPU4: r3=Y r4=X
Can it be that r1=1, r2=0, r3=1, r4=0; so could it be that the CPUs see the changes to different addresses issued by different CPUs in different orders? With TSO this isn't allowed and this is done by making the load and store atomic. In a future post, I'll explain why this will give a total order over the stores. Is it allowed for an earlier load to be reordered with a later store to a different address? This can be expressed using the load buffering litmus test:
Initial: X=0, Y=0 CPU1: r1=X Y=1 CPU2: r2=Y X=1
Is the outcome r1=1 and r2=1 possible on TSO? No, an older load can't be reordered with a newer store to a different address. Think about it; the load is globally performed (read from cache) before it retires. The store will retire after the load and the store will only be globally performed (written to cache) after it retires. So an earlier load to a different address will be globally performed before a later store. Is it allowed for an earlier load to be reordered with a later load to a different address? And the same goes for 2 stores? This is expressed using the message passing litmus test:
Initial: X=0, Y=0 CPU1: X=1 Y=1 CPU2: r1=Y r2=X
Can it be that r1=1 and r2=0?
With TSO this can't happen because stores can't be reordered with other stores and loads can't be reordered with other loads. The Intel X86 does make use of speculative out-of-order execution of loads, but if it detects a potential reordering, it will flush the pipeline and try again. This situation is called a memory order violation. In the "Load/Load and Store/Store reordering" section, I explain that 2 loads can't be reordered. Well, I lied because there is a very important edge case. What happens where there is a store to A followed by a load from A followed by a load from B. Once the store of A is added to the store buffer, the load of A has a few options. The first option would be to ignore the store buffer and the load from the cache. The big problem is that the load would not see the previous store and we end up with a violation of the program order. Since this isn't acceptable, a better approach is to let the load look in the store buffer for a matching store. This is called store-to- load-forwarding.

But this leads to an ordering problem because the load of A and B can now be reordered in the global memory order which is demonstrated with the following test:

Initial: X=0, Y=0 CPU1: X=1 r1=X r2=Y CPU2: Y=1 r3=Y r4=X
Can it be that r1=1, r3=1, and r2=0 and r4=0? So could it be that the r2=Y jumps before r1=X? And could r4=Y jump before r3=Y? With TSO this is allowed because X=1 could be globally performed after r2=Y and r1=X can only be globally performed after the X=1 is globally performed, so it could be that r1=X is globally performed after r2=Y. A load can only be globally performed after the store it reads from is globally performed. So effectively the store of X in the store buffer carries the load of X in the global memory order after the load of Y.

Keep in mind that globally performed isn't the same as executed because the r1=X could very well be executed before r2=Y.

So how does the IBM 370 fit into the picture? The IBM 370 provides TSO but with 1 addition. If there is a store followed by a load to the same address, it requires that the store becomes globally visible before the load can be performed. What is the point if the same value is returned for the load as with TSO? It will prevent those 2 loads can be reordered!

I'll explain it using the previous example.

Initial: X=0, Y=0 CPU1: X=1 r1=X r2=Y CPU2: Y=1 r3=Y r4=X
With TSO, the r1=X and r2=Y can be reordered due to STLF. But with IBM 370, the load of r1=X can only be globally performed after the X=1 is globally performed and the load of r2=Y can only be globally performed after r1=X is globally performed, so this forces the loads to be performed in order. And as a consequence we can't end up with r1=1, r2=0, r3=1,r4=0. TSO gives a total order over all stores because stores are atomic and can't be reordered with other stores. With IBM System/370, we also get a total order over all loads because loads are atomic and can't be reordered with other loads. So IBM 370 gives a total order over all loads and a total order over all stores. So it sits nicely between SC and TSO.

For more information check Shared Memory Consistency Models: A Tutorial.