I wrote a little benchmark:
#include <boost/shared_ptr.hpp>   // pulls in boost::detail::spinlock
#include <boost/thread.hpp>
#include <boost/timer.hpp>
#include <iostream>
#include <vector>
#include <stdlib.h>               // wcstoul
#include <string.h>               // memset
#include <tchar.h>                // _tmain, _TCHAR (MSVC)

using namespace std;

enum { BufferSize = 1<<24, SLsPerCacheLine = 1 };

int ibuffer[BufferSize];

using boost::detail::spinlock;

size_t nslp = 41;     // spinlock pool size; overridable from the command line
spinlock* pslp = 0;

spinlock& getSpinlock(size_t h)
{
    return pslp[ (h % nslp) * SLsPerCacheLine ];
}

void threadFunc(int offset)
{
    const size_t mask = BufferSize - 1;
    for (size_t ii = 0, index = (offset & mask); ii < BufferSize; ++ii, index = ((index + 1) & mask))
    {
        spinlock& sl = getSpinlock(index);
        sl.lock();
        ibuffer[index] += 1;
        sl.unlock();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    if (argc > 1)
    {
        size_t n = wcstoul(argv[1], NULL, 10);
        if (n > 0)
            nslp = n;
    }
    cout << "Using pool size: " << nslp << endl;
    cout << "sizeof(spinlock): " << sizeof(spinlock) << endl;
    cout << "SLsPerCacheLine: " << int(SLsPerCacheLine) << endl;

    const size_t num = nslp * SLsPerCacheLine;
    pslp = new spinlock[num];
    // boost::detail::spinlock is a POD; all-zero bytes == unlocked
    for (size_t ii = 0; ii < num; ii++)
        memset(pslp + ii, 0, sizeof(*pslp));

    const size_t nThreads = 4;
    boost::thread* ppThreads[nThreads];
    const int offset[nThreads] = { 17, 101, 229, 1023 };

    boost::timer timer;
    for (size_t ii = 0; ii < nThreads; ii++)
        ppThreads[ii] = new boost::thread(threadFunc, offset[ii]);
    for (size_t ii = 0; ii < nThreads; ii++)
        ppThreads[ii]->join();
    cout << "Elapsed time: " << timer.elapsed() << endl;

    for (size_t ii = 0; ii < nThreads; ii++)
        delete ppThreads[ii];
    delete[] pslp;
    return 0;
}
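For reference, the pool size is taken from the first command-line argument; a run might look like this (the executable name and printed values are illustrative, not measured):

    spinlock_bench.exe 83
    Using pool size: 83
    sizeof(spinlock): 4
    SLsPerCacheLine: 1
    Elapsed time: ...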
I compiled two versions of the code, one with SLsPerCacheLine==1 and the other with SLsPerCacheLine==8. 32-bit, optimized, built with MSVS 2010, run on a 4-core Xeon W3520 @ 2.67 GHz (HyperThreading disabled).

I had difficulty getting consistent results from these tests: spurious timing variations of up to 50% were not uncommon. On average, though, the SLsPerCacheLine==8 version was 25-30% faster than the SLsPerCacheLine==1 version with a spinlock table of size 41.

I would be interested to see how this scales with a larger number of cores, NUMA, HyperThreading, and so on. I don't currently have access to that kind of hardware.
Posted on 2018-05-24 17:41:17
Briefly:
You are right that these spinlocks share the same cache line when they could be in different cache lines, a.k.a. false sharing. And there may well be some performance to be gained by allocating the locks in different cache lines (but see below).
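As a rough illustration of the padding idea (my sketch, not the benchmark code above; it assumes a C++11 compiler and a 64-byte cache line), each lock can be forced onto its own line with alignas instead of scaling the pool index by SLsPerCacheLine:

#include <atomic>

// One lock per cache line: alignas(64) pads each entry so neighboring
// pool slots cannot falsely share a line (64 is the assumed line size).
struct alignas(64) PaddedSpinlock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // clear == unlocked

    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ; // spin
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};

static_assert(sizeof(PaddedSpinlock) == 64, "one lock per cache line");

PaddedSpinlock pool[41];   // same pool size as the benchmark's default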
First, some background:

Classic spinloops are test-and-test-and-set:
loop:
    tmp := load( memory_location_of_sw_lock )
    if( is_locked(tmp) ) goto loop
    was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
                                        value_that_will_get_sw_lock )
    if( is_locked(was_locked) ) goto loop
got_the_sw_lock:
    ...
    ... critical section
    ...
    // release the lock, e.g.
    memory_location_of_sw_lock := 0
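In C++11 terms the same structure might be written like this (a sketch, not boost's implementation; lock_word == 0 means unlocked):

#include <atomic>

void ttas_lock(std::atomic<int>& lock_word)
{
    for (;;)
    {
        // "test": ordinary loads that spin on the locally cached copy
        while (lock_word.load(std::memory_order_relaxed) != 0)
            ; // spin
        // "test-and-set": one hardware atomic RMW to try to take the lock
        if (lock_word.exchange(1, std::memory_order_acquire) == 0)
            return; // got_the_sw_lock
    }
}

void ttas_unlock(std::atomic<int>& lock_word)
{
    lock_word.store(0, std::memory_order_release); // release the lock
}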
There are also plain test-and-set spinlocks, which look like:
loop:
    was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
                                        value_that_will_get_sw_lock )
    if( is_locked(was_locked) ) goto loop
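In the same C++11 terms (again a sketch), note that here every iteration is a hardware atomic RMW, i.e. a write, even while the lock is held by someone else:

#include <atomic>

void tas_lock(std::atomic<int>& lock_word)
{
    while (lock_word.exchange(1, std::memory_order_acquire) != 0)
        ; // spin: each failed exchange is still a write
}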
On most modern memory systems with caches, write-through or write-back, either of these can perform very badly, causing an O(N^2) burst of cache misses upon (software) lock contention. E.g.:
Lock is held by P0
P1-PN are spinning in test-and-test-and-set spinloops waiting for it.
P0 releases the lock, e.g. by writing it to 0
P1's "test-" instruction reads the lock
...
PN's "test-" instruction reads the lock
All of P1-PN see the lock as released,
so they fall out of the "test-" part of the spinloop which uses ordinary instructions
to the test-and-set part of the spinloop, which uses a hardware atomic RMW like LOCK INC mem
P1's locked RMW happens, and acquires the spinlock for P1
It invalidates the copies of the lock's cache line in all the other processors' caches
P1 goes away and does something with the data protected by the lock
P2's locked RMW fails to acquire the spinlock
It invalidates all other caches because it has a write
P2 falls back into its test- spinloop
P3's locked RMW fails
It invalidates all other caches because it has a write
P3 falls back into its test- spinloop
...
PN's locked RMW fails
Now, at the very least, all of the remaining P2..PN processors must take an ordinary unlocked cache miss for their test- spinloop. That already means at least N+(N-1) cache misses. It can be considerably worse, because each write by a requester that fails to acquire the lock can trigger all of the other spinners to take an unlocked read again. That is, depending on timing, you may get:
1 release the lock
N reads to test- the lock
1 locked atomic RMW to acquire the lock
N reads...
for each of the remaining N-1 processors
    1 locked RMW that fails to acquire the lock
    N reads
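Adding that up (a back-of-the-envelope count, assuming every read and every RMW misses and each miss costs one cache-line transfer):

      (1 + N)              release + N test- reads
    + (1 + N)              winning RMW + N more reads
    + (N - 1) * (1 + N)    each loser: 1 failed RMW + N reads
    = (N + 1)^2            i.e. O(N^2) transfers per lock handoff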