Let's start with a piece of code:
```cpp
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

static const int64_t MAX_THREAD_NUM = 128;
static int64_t n = 0;
static int64_t loop_count = 0;

// Pack the struct so that v sits at offset 60 instead of the naturally aligned 64.
#pragma pack (1)
struct data
{
  int32_t pad[15];
  int64_t v;
};
#pragma pack ()

// Align the whole struct to a 64-byte cache line: v then occupies bytes 60..67
// and straddles the boundary between two cache lines.
static data value __attribute__((aligned(64)));
static int64_t counter[MAX_THREAD_NUM];

// Thread body: repeatedly flip every bit of value.v (deliberately unsynchronized).
void *worker(void *arg)
{
  int64_t *cnt = static_cast<int64_t *>(arg);
  for (int64_t i = 0; i < loop_count; ++i) {
    const int64_t t = value.v;
    // Count reads that saw neither all zeros nor all ones, i.e. torn values.
    if (t != 0L && t != ~0L) {
      *cnt += 1;
    }
    value.v = ~t;
    // Compiler barrier: forces value.v to be reloaded on every iteration.
    asm volatile("" ::: "memory");
  }
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t threads[MAX_THREAD_NUM];

  /* Check arguments to program */
  if (argc != 3) {
    fprintf(stderr, "USAGE: %s <threads> <loopcount>\n", argv[0]);
    exit(1);
  }

  /* Parse arguments; don't bother with format checking */
  n = std::min<int64_t>(atol(argv[1]), MAX_THREAD_NUM);
  loop_count = atol(argv[2]);

  /* Start the threads */
  for (int64_t i = 0L; i < n; ++i) {
    pthread_create(&threads[i], NULL, worker, &counter[i]);
  }

  int64_t count = 0L;
  for (int64_t i = 0L; i < n; ++i) {
    pthread_join(threads[i], NULL);
    count += counter[i];
  }

  printf("data size: %lu\n", (unsigned long)sizeof(value));
  printf("data addr: %lX\n", (unsigned long)&value.v);
  printf("final: %016lX\n", (unsigned long)value.v);
  return 0;
}
```
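The logic is simple: several threads run in parallel, each repeatedly flipping every bit of a global variable. What do you think the final result will be?
At first glance there seems to be nothing to think about. Even if the concurrent flips conflict with each other, there should be only two possible outcomes: all zeros or all ones. Let's run it and see (be sure to run it on a multi-core machine):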
```
[jingyan.kfy@OceanBase224006 test]$ ./alignment 24 10000
data size: 68
data addr: 6016FC
final: FFFFFFFF00000000
```
The final result turns out to be half ones and half zeros! Isn't that amazing?
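The reason is actually simple. The program sets up an unusual alignment that places this variable across two cache lines (look closely at the `#pragma pack (1)` and `__attribute__((aligned(64)))` lines in the code). This triggers a counterintuitive fact: a single memory-access instruction of the CPU is executed as two separate memory operations.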
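To make the layout concrete, here is a minimal sketch that recomputes where `v` lands. It assumes 64-byte cache lines; `packed_data`, `kCacheLineSize`, `cache_line_of`, and `straddles_cache_line` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on most current x86-64 CPUs.
static const std::size_t kCacheLineSize = 64;

#pragma pack (1)
struct packed_data {   // same layout as `data` in the article
  int32_t pad[15];     // 15 * 4 = 60 bytes
  int64_t v;           // packed directly behind pad, at offset 60
};
#pragma pack ()

// Index of the cache line holding the byte at `offset`, counted from the
// start of a 64-byte-aligned object.
inline std::size_t cache_line_of(std::size_t offset) {
  return offset / kCacheLineSize;
}

// True if the `size` bytes starting at `offset` touch more than one line.
inline bool straddles_cache_line(std::size_t offset, std::size_t size) {
  assert(size > 0);
  return cache_line_of(offset) != cache_line_of(offset + size - 1);
}
```

With these definitions, `offsetof(packed_data, v)` is 60 and `sizeof(packed_data)` is 68, so `v` covers bytes 60 to 67: its first four bytes sit in line 0 and its last four in line 1, and `straddles_cache_line(60, 8)` returns true.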
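If you have read my previous article, this should be easy to understand. The basic unit of cache coherence is the cache line: to write to memory, a CPU must hold the cache line in the Exclusive state. If a variable is spread over two different cache lines, nothing guarantees that acquiring those two lines happens atomically. The same applies to reads.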
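The following is only a toy model, not a description of what the hardware literally does: it treats the 64-bit value as two 32-bit halves, one per cache line, and replays one possible interleaving of two writers by hand. The names `split_u64`, `store_low`, `store_high`, `load`, and `torn_store_example` are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: a 64-bit value whose two halves live in different
// cache lines, so each half is written by a separate memory operation.
struct split_u64 {
  uint32_t low;   // low 4 bytes, in the first cache line (little-endian)
  uint32_t high;  // high 4 bytes, in the second cache line
};

inline void store_low(split_u64 &x, uint64_t v) {
  x.low = static_cast<uint32_t>(v);
}

inline void store_high(split_u64 &x, uint64_t v) {
  x.high = static_cast<uint32_t>(v >> 32);
}

inline uint64_t load(const split_u64 &x) {
  return (static_cast<uint64_t>(x.high) << 32) | x.low;
}

// Replays one interleaving of writer A (stores all ones) and writer B
// (stores zero), each store split into a low and a high part.
inline uint64_t torn_store_example() {
  split_u64 x = {0, 0};
  const uint64_t a = ~static_cast<uint64_t>(0);
  const uint64_t b = 0;
  assert(load(x) == 0);
  store_low(x, a);   // A owns the first line and writes its low half
  store_low(x, b);   // B takes over the first line and overwrites the low half
  store_high(x, b);  // B writes its high half into the second line
  store_high(x, a);  // A finally gets the second line and writes its high half
  return load(x);    // low half comes from B, high half from A
}
```

In this interleaving the low half ends up holding B's zeros and the high half A's ones, so `torn_store_example()` returns `0xFFFFFFFF00000000`, the same pattern as the program's output above.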
Intel's documentation [1] confirms this:
Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type:
- Instructions that read or write a single byte.
- Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
- Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
- Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
All locked instructions (the implicitly locked xchg instruction and other read-modify-write instructions with a lock prefix) are an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of memory type and alignment.
Other instructions may be implemented with multiple memory accesses. From a memory-ordering point of view, there are no guarantees regarding the relative order in which the constituent memory accesses are made. There is also no guarantee that the constituent operations of a store are executed in the same order as the constituent operations of a load.
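As you can see, Intel guarantees atomicity only for accesses to variables that satisfy these alignment rules, and these rules ensure that a variable never spans more than one cache line.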
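The second half of that claim can be checked by brute force. The sketch below assumes 64-byte cache lines and uses hypothetical helper names (`is_naturally_aligned`, `within_one_cache_line`, `aligned_accesses_never_straddle`); it tests every address in a range for the four access sizes listed by Intel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
static const uintptr_t kLine = 64;

// Natural alignment as in the quoted rules: the address is a multiple of the size.
inline bool is_naturally_aligned(uintptr_t addr, std::size_t size) {
  assert(size > 0);
  return addr % size == 0;
}

// True if the first and last byte of the access fall into the same cache line.
inline bool within_one_cache_line(uintptr_t addr, std::size_t size) {
  assert(size > 0);
  return addr / kLine == (addr + size - 1) / kLine;
}

// Checks every address in [0, limit) for access sizes 1, 2, 4 and 8 bytes:
// a naturally aligned access must never touch two cache lines.
inline bool aligned_accesses_never_straddle(uintptr_t limit) {
  const std::size_t sizes[] = {1, 2, 4, 8};
  for (std::size_t s = 0; s < 4; ++s) {
    for (uintptr_t addr = 0; addr < limit; ++addr) {
      if (is_naturally_aligned(addr, sizes[s]) &&
          !within_one_cache_line(addr, sizes[s])) {
        return false;
      }
    }
  }
  return true;
}
```

The address printed by the test program, `0x6016FC`, fails both checks for an 8-byte access: it is not a multiple of 8, and its last byte lies at `0x601703`, in the cache line that starts at `0x601700`.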
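So what should we do? It is actually simple: GCC's default variable alignment satisfies Intel's alignment requirements, so under normal circumstances this "anomaly" never occurs. But once you start laying out memory yourself, you have to be careful, because nobody will align your variables for you any more. You are on your own.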
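One common way to stay safe when carving objects out of a raw buffer is to round the address up to the object's natural alignment first. This is a minimal sketch under that assumption; `align_up` and `place_int64` are hypothetical helpers, not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Rounds `addr` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (addr + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: creates a zero-initialized int64_t at the first
// 8-byte-aligned address at or after buf + offset. Returns NULL if the
// object would not fit inside [buf, buf + len).
inline int64_t *place_int64(unsigned char *buf, std::size_t len,
                            std::size_t offset) {
  const uintptr_t base = reinterpret_cast<uintptr_t>(buf);
  const uintptr_t want = align_up(base + offset, sizeof(int64_t));
  if (want + sizeof(int64_t) > base + len) {
    return NULL;
  }
  // Placement new starts the lifetime of an int64_t at the aligned address.
  return new (reinterpret_cast<void *>(want)) int64_t(0);
}
```

For example, `align_up(60, 8)` yields 64, so a request for offset 60 in a cache-line-aligned buffer is moved to the start of the next line instead of straddling the boundary, at the cost of four bytes of padding.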
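So when you do low-level systems programming, make sure you understand the temperament of the hardware, and be careful, careful, and careful again.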
References
- [1] Intel® 64 Architecture Memory Ordering White Paper