多核并发编程中的cache line对齐问题
先看一段代码:
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <algorithm>
using namespace std;
static const int64_t MAX_THREAD_NUM = 128;
static int64_t n = 0;
static int64_t loop_count = 0;
#pragma pack (1)
struct data
{
int32_t pad[15];
int64_t v;
};
#pragma pack ()
static data value __attribute__((aligned(64)));
static int64_t counter[MAX_THREAD_NUM];
void worker(int *cnt)
{
for (int64_t i = 0; i < loop_count; ++i) {
const int64_t t = value.v;
if (t != 0L && t != ~0L) {
*cnt += 1;
}
value.v = ~t;
asm volatile("" ::: "memory");
}
}
int main(int argc, char *argv[])
{
pthread_t threads[MAX_THREAD_NUM];
/* Check arguments to program*/
if(argc != 3) {
fprintf(stderr, "USAGE: %s <threads> <loopcount>\n", argv[0]);
exit(1);
}
/* Parse argument */
n = min(atol(argv[1]), MAX_THREAD_NUM);
loop_count = atol(argv[2]); /* Don't bother with format checking */
/* Start the threads */
for (int64_t i = 0L; i < n; ++i) {
pthread_create(&threads[i], NULL, (void* (*)(void*))worker, &counter[i]);
}
int64_t count = 0L;
for (int64_t i = 0L; i < n; ++i) {
pthread_join(threads[i], NULL);
count += counter[i];
}
printf("data size: %lu\n", sizeof(value));
printf("data addr: %lX\n", (unsigned long)&value.v);
printf("final: %016lX\n", value.v);
return 0;
}
这段代码的逻辑很简单,开多个线程并行执行一个不断对全局变量取反的操作,你觉得最后的结果会是什么呢?
简单理解似乎没什么可考虑的,不断取反即使并发产生冲突,但结果也只有两个情况:全0或者全1,运行一下看看结果(一定要在多核机器上运行):
[jingyan.kfy@OceanBase224006 test]$ ./alignment 24 10000
data size: 68
data addr: 6016FC
final: FFFFFFFF00000000
最后的结构居然是一半1和一半0!是不是很神奇~
出现这种结果的原因其实很简单,我在程序中设置了特殊的对齐,把这个变量放在了跨越两个cacheline的位置(仔细看代码中高亮的部分)。这样的设置会引发一个反直觉的事实:CPU的一条访存指令是分成两个访存操作执行的。
如果你看过我的前一篇文章,那你应该会很容易理解这个现象:Cache-Coherence的基本单元就是cache line,为了写内存,CPU必须Exclusive的占有这个cache line,而如果一个变量分布在两个不同的cache line上,那么cache line的争用过程是没有原子性保证的。读的过程也是类似的。
这一点在Intel的文档1中也得到了验证:
Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type:
- Instructions that read or write a single byte.
- Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
- Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
- Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
All locked instructions (the implicitly locked xchg instruction and other read-modify-write instructions with a lock prefix) are an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of memory type and alignment.
Other instructions may be implemented with multiple memory accesses. From a memory- ordering point of view, there are no guarantees regarding the relative order in which the constituent memory accesses are made. There is also no guarantee that the constituent operations of a store are executed in the same order as the constituent operations of a load.
可以看到Intel只保证了满足对齐规则的变量的访存操作原子性,这样的对齐规则保证变量不会跨越多个cache line。
那么我们该怎么办呢?其实很简单,Gcc默认的变量对齐是符合Intel的对齐要求的,所以正常情况下这种“异常”是完全不会发生的。但是当你自行操作内存的时候就一定要注意了:因为这个时候没有人会再来帮你对变量进行对齐了,You are on your own。
所以在进行底层系统编程的时候,一定要了解硬件的脾性,小心小心再小心。
参考资料
-
Intel® 64 Architecture Memory Ordering White Paper ↩
Comments