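This post records some notes on cmpxchg/cmpxchg64 in the Linux kernel. The hardware mechanisms behind the underlying implementation are left for a later analysis.

The kernel version referenced here is v5.0.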

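Basic principle

cmpxchg(void* ptr, int old, int new) compares old with the value that ptr points to:

  • If they are equal, new is written to the location that ptr points to, and old is returned
  • If they are not equal, nothing is written and the value that ptr points to is returned

In both cases the return value is the value found at ptr before the operation. The whole operation is atomic.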
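The two cases above can be modeled in user space. The following is a minimal sketch that uses C11 atomics rather than the kernel macro; my_cmpxchg and my_cmpxchg_demo are hypothetical names introduced only for illustration, and the sketch handles int only.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * User-space model of cmpxchg() for int, built on C11 atomics.
 * my_cmpxchg is a hypothetical name, not the kernel macro.
 */
static int my_cmpxchg(_Atomic int *ptr, int old, int new_val)
{
    int expected = old;

    /*
     * On success *ptr becomes new_val and expected still equals old.
     * On failure *ptr is untouched and expected is overwritten with
     * the value currently stored in *ptr.
     */
    atomic_compare_exchange_strong(ptr, &expected, new_val);
    return expected;
}

/* Exercises both cases; returns 0 when they behave as described. */
static int my_cmpxchg_demo(void)
{
    _Atomic int v = 5;

    /* Match: v is 5, so v becomes 7 and the old value 5 is returned. */
    assert(my_cmpxchg(&v, 5, 7) == 5);
    assert(atomic_load(&v) == 7);

    /* Mismatch: v is 7, not 5, so v is unchanged and 7 is returned. */
    assert(my_cmpxchg(&v, 5, 9) == 7);
    assert(atomic_load(&v) == 7);
    return 0;
}
```

A caller can tell success from failure by comparing the return value with old, which is the same test the kernel code later in this post uses.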

From the point of view of a Linux kernel programmer, compare-and-swap has the following prototype:

T cmpxchg(T *ptr, T old, T new);

where T can be either an integer type that is at most as wide as a pointer, or a pointer type. In order to support such polymorphism, cmpxchg() is defined as a macro rather than a function, but the macro is written carefully to avoid evaluating its arguments multiple times. Linux also has a cmpxchg64() macro that takes 64-bit integers as the arguments, but it may not be available on all 32-bit platforms.
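Since T may be a pointer type, one common use is pushing a node onto a shared singly linked list without a lock. The sketch below is a user-space approximation, assuming C11 atomic_compare_exchange_weak as a stand-in for the kernel's cmpxchg(); struct node, list_push and list_push_demo are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Push n onto the list whose head pointer is *head, without a lock. */
static void list_push(_Atomic(struct node *) *head, struct node *n)
{
    struct node *first = atomic_load(head);

    do {
        /* Link the new node in front of the head we last observed. */
        n->next = first;
        /*
         * Publish n only if the head is still `first`. On failure,
         * `first` is refreshed with the current head and we retry.
         */
    } while (!atomic_compare_exchange_weak(head, &first, n));
}

/* Single-threaded check of the resulting list order; returns 0 on success. */
static int list_push_demo(void)
{
    _Atomic(struct node *) head = NULL;
    struct node a = { .value = 1, .next = NULL };
    struct node b = { .value = 2, .next = NULL };

    list_push(&head, &a);
    list_push(&head, &b);

    /* The most recently pushed node is the head and points at the older one. */
    assert(atomic_load(&head) == &b);
    assert(b.next == &a);
    assert(a.next == NULL);
    return 0;
}
```

The loop only retries when another thread changed the head between the load and the compare-and-swap (or when the weak variant fails spuriously), so an uncontended push completes in one pass.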

Example

/* Posted-Interrupt Descriptor */
struct pi_desc {
    u32 pir[8]; /* Posted interrupt requested */
    union {
        struct {
            /* bit 256 - Outstanding Notification */
            u16 on : 1,
            /* bit 257 - Suppress Notification */
                sn : 1,
            /* bit 271:258 - Reserved */
                rsvd_1 : 14;
            /* bit 279:272 - Notification Vector */
            u8 nv;
            /* bit 287:280 - Reserved */
            u8 rsvd_2;
            /* bit 319:288 - Notification Destination */
            u32 ndst;
        };
        u64 control;
    };
    u32 rsvd[6];
} __aligned(64);

static void __pi_post_block(struct kvm_vcpu *vcpu)
{
    struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
    struct pi_desc old, new;
    unsigned int dest;

    do {
        old.control = new.control = pi_desc->control;
        WARN(old.nv != POSTED_INTR_WAKEUP_VECTOR,
             "Wakeup handler not enabled while the VCPU is blocked\n");

        dest = cpu_physical_id(vcpu->cpu);

        if (x2apic_enabled())
            new.ndst = dest;
        else
            new.ndst = (dest << 8) & 0xFF00;

        /* set 'NV' to 'notification vector' */
        new.nv = POSTED_INTR_VECTOR;
    } while (cmpxchg64(&pi_desc->control, old.control,
                       new.control) != old.control);
    ...
}

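Note that the ndst and nv fields are both part of the control member of pi_desc: the anonymous union overlays the bit-fields on the same 64 bits as u64 control.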

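In the snippet above, old.control is of type u64. While lines 31 to 43 of the listing (the body of the do-while loop) execute, the IOMMU hardware may change the on (Outstanding Notification) bit, which means the contents of pi_desc->control can change.
Lines 44 to 45 (the cmpxchg64() call in the loop condition) therefore check whether pi_desc->control still equals old.control:

  • If they are equal (the IOMMU hardware did not change on in the meantime), new.control is written to pi_desc->control and old.control is returned, so the do-while loop exits
  • If they are not equal (the IOMMU hardware changed on in the meantime), nothing is written and the current value of pi_desc->control is returned, so the do-while loop runs again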
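The read, modify, compare-and-swap loop described above can be sketched in user space. This is only an approximation under stated assumptions: C11 atomic_compare_exchange_weak stands in for cmpxchg64(), the mask and shift macros are hypothetical names derived from the bit positions in the pi_desc comments (bit 256 of the descriptor is bit 0 of control), and control_set_nv_ndst and the vector values used in the demo are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical field layout inside the 64-bit control word. */
#define CTRL_ON         (1ULL << 0)   /* Outstanding Notification */
#define CTRL_NV_SHIFT   16            /* Notification Vector, 8 bits */
#define CTRL_NV_MASK    (0xFFULL << CTRL_NV_SHIFT)
#define CTRL_NDST_SHIFT 32            /* Notification Destination, 32 bits */
#define CTRL_NDST_MASK  (0xFFFFFFFFULL << CTRL_NDST_SHIFT)

/* Atomically replace nv and ndst while preserving every other bit. */
static void control_set_nv_ndst(_Atomic uint64_t *control,
                                uint8_t nv, uint32_t ndst)
{
    uint64_t old = atomic_load(control);
    uint64_t new_val;

    do {
        /* Start from the snapshot and rewrite only the two fields. */
        new_val = old & ~(CTRL_NV_MASK | CTRL_NDST_MASK);
        new_val |= (uint64_t)nv << CTRL_NV_SHIFT;
        new_val |= (uint64_t)ndst << CTRL_NDST_SHIFT;
        /*
         * If another agent changed the word (for example the ON bit)
         * after the snapshot, the exchange fails, `old` is refreshed
         * with the current value, and new_val is rebuilt from it.
         */
    } while (!atomic_compare_exchange_weak(control, &old, new_val));
}

/* Single-threaded check that unrelated bits survive; returns 0 on success. */
static int control_demo(void)
{
    /* ON set, nv = 0xF1 (arbitrary), ndst = 0. */
    _Atomic uint64_t control = CTRL_ON | (0xF1ULL << CTRL_NV_SHIFT);

    control_set_nv_ndst(&control, 0xF2, 3);

    assert(atomic_load(&control) ==
           (CTRL_ON | (0xF2ULL << CTRL_NV_SHIFT) | (3ULL << CTRL_NDST_SHIFT)));
    return 0;
}
```

Rebuilding new_val from the refreshed snapshot on every pass is what keeps a concurrent change to the ON bit from being overwritten with a stale value.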

References:

  1. The cmpxchg function in the Linux kernel
  2. An analysis of the Linux kernel CMPXCHG function
  3. Atomic - Reference Count
  4. LWN: Lockless programming patterns - introducing compare-and-swap!
  5. Lockless patterns: an introduction to compare-and-swap