About Cache Coherence, Atomic Operation, Memory Ordering, Memory Barrier, Volatile

The impetus for this article was a Weibo question posted by Dr. He Dengcheng (何登成), shown below:

I could not quite work it out on my own, so I went through his earlier slide decks and other people's blog posts, and took these notes; most of the content comes from Dr. He's slides. The discussion of the Weibo question itself is saved for the end.

The question actually touches on a large number of concepts, many of which I have not fully understood either. I will skip minor details here as much as possible; see the links in the appendix for more.

I. Cache Coherence

1. What is a cache? A cache line?

cache: Small, fast storage used to improve average access time to slow memory.

cache line: The minimum unit of cache that can be loaded from or stored to memory.

2. Cache Write Policy

– Write Back

•Dirty data is written only to the cache, and flushed to memory later;

– Write Through

•Dirty data is written through to memory immediately;

– Write Invalidate (used by most systems)

•On a write, an Invalidate message is sent at the same time, so that the same cache line in every other CPU's L1/L2 cache is invalidated;

– Write Update

•On a write, the same cache line in other CPUs' L1/L2 caches is updated at the same time;

3. Cache Coherence

On a multi-core processor each core has its own cache (with multiple cache levels, L3 is usually shared among the cores), so cache coherence problems arise.

False cache line sharing: When one processor modifies a value in its cache, other processors cannot use the old value anymore. That memory location will be invalidated in all of the caches. Furthermore, since caches operate on the granularity of cache lines and not individual bytes, the entire cache line will be invalidated in all caches!

Cache coherence protocols (MESI, MOESI) operate between the CPU caches and memory. Data that lives in registers, or in the buffers between registers and the L1 cache (the store buffers and load buffers discussed below), does not take part in the coherence protocol.

II. Atomic Operation

•An operation acting on shared memory is atomic if it completes in a single step relative to other threads. When an atomic store is performed on a shared variable, no other thread can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. 

1. Mapping high-level languages to assembly instructions

A high-level language (such as C/C++) must be compiled into assembly before it can execute, so a few simple mappings exist between high-level statements and assembly instructions.

•Simple Write

– Write to Memory                     

– Atomic

•Simple Read 

–Read from Memory                                                     

–Atomic (note: strictly, reading a into register eax is atomic; the assignment b = a compiles to two assembly instructions and is not atomic as a whole)

•Read-Modify-Write(RMW)                  

– Read from Memory

– Modify                                                       

– Write to Memory

– Non-Atomic

•Read/Write 64 Bits on 32 Bits Systems

– Write:Non-Atomic

– Read:Non-Atomic

2. The harm of non-atomic reads/writes (reading/writing a 64-bit value on a 32-bit machine, as in the figure above)

•Half Write

– After `mov dword ptr [c], 2` executes, c briefly sits in a half-written state;

•Half Read

– If c is half-written, a read of c observes a half read;

•Composite Write

– If two threads write c at the same time, one completing and one half-writing, the final value of c is a combination of the two threads' writes;

•The harm

– A half read breaks the program's decision logic; a composite write corrupts the data.

3. How to eliminate non-atomic reads/writes?

•Intel/AMD CPU platform guarantees (see each vendor's manuals)

– Aligned 2-/4-byte simple reads/writes -> atomic

– Aligned 8-byte: depends on the CPU model -> usually atomic

– Unaligned 2-/4-/8-byte: depends on the CPU model -> avoid where possible

•RMW operations

Prefer the atomic-operation functions shipped with the OS or compiler; they wrap the per-CPU details nicely and are easier to use:

– Windows Synchronization Functions

– Linux  Built-in Functions for Atomic Memory Access

– C++ 11 Atomic Operations Library

4. Atomic Instructions and Lock

•Atomic Instructions

– Common instructions: CMPXCHG, XCHG, XADD, ...

– CMPXCHG (compare-and-exchange)

•Compares the contents of Operand1 (Reg/Mem) with EAX; if they are equal, copies Operand2 (Reg) into Operand1; if not, loads the contents of Operand1 into EAX;

•As an atomic RMW operation, when Operand1 is memory, the CMPXCHG instruction additionally needs the LOCK prefix;

•Lock Instruction

– LOCK is an instruction prefix that can precede many instructions; it guarantees that the memory operand of the prefixed instruction belongs exclusively to the current CPU while the instruction executes;

– Intel's Description about Lock Instruction

– Lock with CMPXCHG

– x++ can be written in assembly as: __asm LOCK inc dword ptr [x]

III. Memory Ordering (Reordering)

1. Reordering

Reads and writes do not always happen in the order that you have written them in your code.

A user program may be reordered both at compile time and at execution time;

•Why reordering?

– Performance

•Reordering principle

– In single threaded programs from the programmer's point of view, all operations appear to have been executed in the order specified, with all inconsistencies hidden by hardware.

– A program must have the same observable behavior before and after reordering (from the perspective of a single thread)

2. Reordering types

•Examples

– Example 1: after compiler optimization, the assignments to A and B are reordered; reordering that happens at compile time is called Compiler Reordering;

– Example 2: a load is hoisted earlier; reordering that happens at execution time is called CPU Memory Reordering;

•Assume X and Y are initialized to 0. Depending on how p1 and p2 interleave, without any reordering the possible (r1, r2) outcomes are (0, 1) (p1 first), (1, 0) (p2 first), and (1, 1) (concurrent); (0, 0) should never occur. In practice, running the test many times does produce (0, 0), because the CPU hoists the load ahead of the store. See the test code linked here.

3. CPU Memory Ordering/Reordering

The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores) through the system bus to system memory. (From Intel System Programming Guide 8.2)

– Why does the CPU reorder?

•Latency: L1 ~4 clks; L2 ~10 clks; L3 ~20 clks; memory ~200 clks -> a huge latency gap

•The relative priority of reads and writes during execution (a central concern in CPU design)

•CPU Memory Reordering Types

– LoadLoad, LoadStore, StoreLoad, StoreStore

•How does the CPU implement memory reordering?

– Load/store buffers; line-fill buffers / write-combining buffers; invalidate message queues; ...

4. CPU Memory Models

Memory consistency models describe how threads may interact through shared memory consistently. 

There are many types of memory reordering, and not all types of reordering occur equally often. It all depends on the processor you're targeting and/or the toolchain you're using for development.

•The main CPU memory models

–Programming Order                    -> Stronger Memory Model

–Sequential Consistency 

–Strict Consistency  

–Data Dependency Order             -> Weaker Memory Model

–...

5. Intel X86/64 Memory Model

•In a single-processor system for memory regions defined as write-back cacheable.

– Reads are not reordered with other reads.

– Writes are not reordered with older reads.

– Writes to memory are not reordered with other writes.

– Reads may be reordered with older writes to different locations but not with older writes to the same location.

•In a multiple-processor system

– Individual processors use the same ordering principles as in a single-processor system.

– Writes by a single processor are observed in the same order by all processors.

– Writes from an individual processor are NOT ordered with respect to the writes from other processors.

– Memory ordering obeys causality (memory ordering respects transitive visibility).

– Any two stores are seen in a consistent order by processors other than those performing the stores.

•Interpretation

– For ordinary memory operations, only StoreLoad reordering is possible;

– LoadLoad, LoadStore, and StoreStore reordering cannot occur;

– The writes of one processor are seen in the same order by every other processor;

– Writes issued by different processors have no ordering guarantee relative to one another;

•StoreLoad Reordering Problem 

– Failure of Dekker’s algorithm

– Memory Reordering Caught in the Act

IV. Memory Barrier

A memory barrier is a type of barrier instruction which causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that certain operations are guaranteed to be performed before the barrier, and others after.

1. Compiler Memory Barrier

Compiler reordering improves performance, but sometimes (especially in parallel programming) we do not want the compiler to reorder our code. We need a mechanism to tell the compiler not to reorder: the compiler memory barrier.

As the name suggests, a compiler memory barrier is a barrier directive that stops the compiler from reordering across it;

•Compiler Memory Barrier Instructions

– GNU

•asm volatile("" ::: "memory");

•__asm__ __volatile__ ("" ::: "memory");

– Intel ECC Compiler

•__memory_barrier();

– Microsoft Visual C++

•_ReadWriteBarrier();

•With a compiler memory barrier in place, the accesses are not reordered even under -O2 optimization.

•Note:

– A compiler memory barrier is only a marker telling the compiler not to reorder the code above and below it.

– It vanishes from the compiled assembly, so the CPU cannot perceive it; this is where it differs from the CPU memory barriers discussed next;

2. CPU Memory Barrier

Where a compiler memory barrier tells the compiler not to reorder instructions at compile time, a CPU memory barrier tells the CPU not to exchange the order of two memory-accessing instructions at execution time;

Note: because the CPU must perceive it at execution time, a CPU memory barrier is a real instruction and appears in the compiled assembly;

(1) The four basic CPU memory barriers

– LoadLoad Barrier

– LoadStore Barrier

– StoreLoad Barrier

– StoreStore Barrier

(2) Coarser-grained CPU memory barriers

– Store Barrier (Write Barrier)

•All stores before a store barrier must complete before the barrier executes; all stores after it may begin only once the barrier instruction has finished;

•A store barrier constrains only stores (writes); it has no effect on loads;

– Load Barrier (Read Barrier)

•The same as the store barrier, with every rule applied to loads instead;

– Full Barrier

•Load barrier + store barrier: no operation on either side of a full barrier may be moved across it;

(3) Memory Barrier Instructions in CPU

•x86, x86-64, amd64

– lfence:  Load Barrier

– sfence:  Store Barrier

– mfence:  Full Barrier

•PowerPC

– sync:  Full Barrier

•MIPS

– sync:  Full Barrier

•Itanium

– mf:  Full Barrier

•ARMv7

– dmb

– dsb

– isb

3. Use CPU Memory Barrier Instructions (x86)

•Only CPU Memory Barrier

– asm volatile("mfence");

•CPU + Compiler Memory Barrier

– asm volatile("mfence" ::: "memory");

•Use Memory Barrier in C/C++

4. Use Lock Instructions to Implement a Memory Barrier

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

Since reads and writes cannot be reordered across locked instructions, every instruction carrying the LOCK prefix forms a natural full memory barrier;

•lock addl

– asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

addl $0,0(%%esp)  ->  does nothing

lock;  ->  acts as a CPU memory barrier

"memory"  ->  acts as a compiler memory barrier

•xchg

– asm volatile("xchgl (%0),%0" ::: "memory")

Question: why doesn't xchg need a LOCK prefix?

Answer: The LOCK prefix is automatically assumed for the XCHG instruction.

•lock cmpxchg

5. Memory Barriers in Compiler & OS

•Linux(x86,x86-64)

– smp_rmb()

– smp_wmb()

– smp_mb()

•Windows(x86,x86-64)

– MemoryBarrier()

6. X86 Memory Ordering with Memory Barrier

•In a single-processor system for memory regions defined as write-back cacheable.

– Note: newly added rules (compared with the list in III.5 above):

– Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

– Reads cannot pass earlier LFENCE and MFENCE instructions.

– Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

– LFENCE instructions cannot pass earlier reads.

– SFENCE instructions cannot pass earlier writes.

– MFENCE instructions cannot pass earlier reads or writes.

•In a multiple-processor system

– Note: newly added rule (compared with the list in III.5 above):

– Locked instructions have a total order.

V. Read Acquire vs Write Release

1. Read Acquire and Write Release

– Two special memory barriers.

– Definition

•A read-acquire executes before all reads and writes by the same thread that follow it in program order.

•A write-release executes after all reads and writes by the same thread that precede it in program order.

2. What Read Acquire and Write Release do

•Read Acquire and Write Release Barriers   

– Read Acquire

•LoadLoad + LoadStore Barrier

– Write Release

•LoadStore + StoreStore Barrier              

•Interpretation

– Read-acquire + write-release semantics are the foundation of every lock implementation (spinlock, mutex, rwlock, ...). The region bracketed by [read-acquire, write-release] forms a critical section, and instructions inside the critical section are guaranteed not to run outside it. For this reason read-acquire is also called lock-acquire, and write-release is also called unlock-release;

3. How to Implement Read Acquire/Write Release?

•Intel X86, X86-64

– Full Memory Barrier

•mfence

•locked instruction

•Compiler and OS

– Linux

•smp_mb()

– Windows

•Functions with Acquire/Release Semantics

•InterlockedIncrementAcquire ()...

4. Extension: StoreLoad Reordering

•Questions

– Of the four reorderings (LoadLoad, LoadStore, StoreLoad, StoreStore), why does the Intel CPU retain only StoreLoad?

– Why are the LoadLoad/LoadStore/StoreStore barriers called lightweight barriers, while the StoreLoad barrier is called an expensive barrier?

•On PowerPC, the lwsync (short for "lightweight sync") instruction acts as all three #LoadLoad, #LoadStore and #StoreStore barriers at the same time, yet is less expensive than the sync instruction, which includes a #StoreLoad barrier.

•Answers

– The store buffer;

– Stores can complete asynchronously without stalling instruction execution;

– Loads can only be satisfied synchronously;

•Note

– On Intel CPUs, loads carry acquire semantics and stores carry release semantics by default;

VI. Volatile: C/C++ vs Java

•Volatile

– literally: volatile, unstable, ...

•Volatile in C/C++

The volatile keyword is used on variables that may be modified simultaneously by other threads. This warns the compiler to fetch them fresh each time, rather than caching them in registers. (read/write actually happens)

– No reordering occurs between different volatile reads/writes. (only volatile variables guarantee no reordering)

• Volatile in Java

– (the same as in C/C++), plus:

– Volatile reads and writes establish a happens-before relationship, much like acquiring and releasing a mutex. (no reordering takes place)

• Volatile read: acquire semantics; volatile write: release semantics;

– Writes and reads of volatile long and double values are always atomic.

• The Java Language Specification Java SE 7 Edition:Chapter 17.7

•Examples       

– int answer = 0;

– bool volatile ready;

•Question

– What will Thread2 output for answer?

– C/C++

• answer = 42 or 0; both are possible;

– Java

• answer = 42; the only possible result;

– Why?

•Java's volatile write has release semantics -> ready = true is guaranteed to execute after answer = 42;

The discussion above shows that operations mixing C/C++ volatile variables with non-volatile variables may have their order exchanged by compiler optimization. Even if answer is also declared volatile, which does prevent the compiler from reordering, there is no guarantee that the CPU will not reorder the instructions on every platform, so volatile must be used with great care, as described below.

• C/C++ and "volatile"

When writing single-threaded code, declaring a variable “volatile” can be very useful. The compiler will not omit or reorder accesses to volatile locations. Combine that with the sequential consistency provided by the hardware, and you’re guaranteed that the loads and stores will appear to happen in the expected order.

However, accesses to volatile storage may be reordered with non-volatile accesses, so you have to be careful in multi-threaded uniprocessor environments (explicit compiler reorder barriers may be required). There are no atomicity guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at all in multi-threaded SMP environments. The C and C++ language standards are being updated to address this with built-in atomic operations.

If you think you need to declare something “volatile”, that is a strong indicator that you should be using one of the atomic operations instead.

VII. Discussion of the Weibo Question

The test code is as follows:

// g++ -o reorder -O3 reorder.c -lpthread
// Run in X86 CPU (Intel/AMD)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>

#define N 100000


sem_t sem_ac, sem_bc, sem_fin;

static volatile int A;
static volatile int B;
static volatile int C;

void *setAC(void *args)
{
    for (;;)
    {
        sem_wait(&sem_ac);
        A = 1;
        __sync_synchronize();
        C = B;
        sem_post(&sem_fin);
    }
}

void *setBC(void *args)
{
    for (;;)
    {
        sem_wait(&sem_bc);
        B = 1;
        __sync_synchronize();
        C = A;
        sem_post(&sem_fin);
    }
}

int main(int argc, char *argv[])
{
    pthread_t t1, t2;
    int got = 0, iters = 0;

    sem_init(&sem_ac, 0, 0);
    sem_init(&sem_bc, 0, 0);
    sem_init(&sem_fin, 0, 0);

    pthread_create(&t1, NULL, setAC, NULL);
    pthread_create(&t2, NULL, setBC, NULL);

    for (;; iters++)
    {
        A = B = C = 0;
        sem_post(&sem_ac);
        sem_post(&sem_bc);
        sem_wait(&sem_fin);
        sem_wait(&sem_fin);
        if (C == 0)
        {
            printf("got %d reorders after %d iterations\n", ++got, iters);
        }
    }

    return 0;
}

Note: __sync_synchronize() is a full memory barrier provided by gcc.

On a multi-core processor, C == 0 can be observed whether or not the memory barrier is present.

Without the barrier there are two causes: one is instruction reordering (e.g. A = 1 swapped with C = B); the other appears when the two threads are scheduled across multiple cores. Each core has its own registers, and C = B compiles to two assembly instructions, so C == 0 can arise much as in the single-core case below; I will not repeat the details.

With the barrier, only the effect of the threads running across multiple cores remains.

On a single-core processor (or with the threads pinned to one core), C == 0 can still be observed even with the barrier, though with far lower probability.

A = 1;
 80486d4:       c7 05 3c a0 04 08 01    movl   $0x1,0x804a03c
 
  C = B;
 80486ea:       a1 38 a0 04 08          mov    0x804a038,%eax
 80486ef:       a3 34 a0 04 08          mov    %eax,0x804a034
B = 1;
 8048694:       c7 05 38 a0 04 08 01    movl   $0x1,0x804a038
 
 C = A;
 80486aa:       a1 3c a0 04 08          mov    0x804a03c,%eax
 80486af:       a3 34 a0 04 08          mov    %eax,0x804a034

Suppose thread 1 runs up to and including `mov 0x804a038,%eax`, leaving 0 in its eax; the scheduler then switches to thread 2, which runs to the end of an iteration, so C = 1; control then returns to thread 1, which executes `mov %eax,0x804a034`, overwriting C with 0 again. A context switch within such a short window is rare, which is why C == 0 appears with very low probability.

Appendix links:
