首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >错误共享boost :: detail :: spinlock_pool是什么?

错误共享boost :: detail :: spinlock_pool是什么?
EN

Stack Overflow用户
提问于 2018-05-24 08:11:13
回答 1查看 0关注 0票数 0

我写了一个小基准程序:

代码语言:javascript
复制
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <boost/timer.hpp>
#include <iostream>
#include <vector>
#include <stdlib.h>

using namespace std;

enum { BufferSize = 1<<24,  SLsPerCacheLine = 1 };

int          ibuffer[BufferSize];

using boost::detail::spinlock;
size_t nslp = 41;
spinlock* pslp = 0;

spinlock& getSpinlock(size_t h)
{
  return pslp[ (h%nslp) * SLsPerCacheLine ];
}


void threadFunc(int offset)
{
  const size_t mask = BufferSize-1;
  for (size_t ii=0, index=(offset&mask); ii<BufferSize; ++ii, index=((index+1)&mask))
  {
    spinlock& sl = getSpinlock(index);
    sl.lock();
    ibuffer[index] += 1;
    sl.unlock();
  }
};


int _tmain(int argc, _TCHAR* argv[])
{
  if ( argc>1 )
  {
    size_t n = wcstoul(argv[1], NULL, 10);
    if ( n>0 )
    {
      nslp = n;
    }
  }

  cout << "Using pool size: "<< nslp << endl;
  cout << "sizeof(spinlock): "<< sizeof(spinlock) << endl;
  cout << "SLsPerCacheLine: "<< int(SLsPerCacheLine) << endl;
  const size_t num = nslp * SLsPerCacheLine;
  pslp = new spinlock[num ];
  for (size_t ii=0; ii<num ; ii++)
  { memset(pslp+ii,0,sizeof(*pslp)); }

  const size_t nThreads = 4;
  boost::thread* ppThreads[nThreads];
  const int offset[nThreads] = { 17, 101, 229, 1023 };

  boost::timer timer;

  for (size_t ii=0; ii<nThreads; ii++)
  { ppThreads[ii] = new boost::thread(threadFunc, offset[ii]); }

  for (size_t ii=0; ii<nThreads; ii++)
  { ppThreads[ii]->join(); }

  cout << "Elapsed time: " << timer.elapsed() << endl;

  for (size_t ii=0; ii<nThreads; ii++)
  { delete ppThreads[ii]; }

  delete[] pslp;

  return 0;
}

我编译了两个版本的代码,一个是SLsPerCacheLine==1,另一个是SLsPerCacheLine==8。采用MSVS 2010进行优化的32bit,采用4核Xeon W3520 @ 2.67Ghz(禁用HyperThreading)运行。

我很难从这些测试中获得一致的结果 - 偶尔观察到高达50%的虚假定时变化。但是平均而言,SLsPerCacheLine==8版本比SLsPerCacheLine==1带有大小41的螺旋锁定表的版本要快25-30%。

看看如何通过更大数量的内核,NUMA,超线程等等来扩展,我会感兴趣的。我目前不能访问那种硬件。

EN

回答 1

Stack Overflow用户

发布于 2018-05-24 17:41:17

简要

你是对的,因为这些自旋锁共享相同的缓存行,当它们可能位于不同的缓存行时。又名虚假分享。通过在不同的缓存行中分配锁,可能会有一些性能增益(但请参见下文)。

首先,一些背景:

经典的spinloops是测试和测试设置的:

代码语言:javascript
复制
    loop: 
         tmp := load( memory_location_of_sw_lock )
         if( is_locked(tmp) ) goto loop
         was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock, 
                                             value_that_will_get_sw_lock )
         if( was_locked(tmp) ) goto loop
     got_the_sw_lock:
          ...
          ... critical section
          ...
          // release the lock, e.g.
          memory_location_of_sw_lock := 0

还有测试和设置的自旋锁,看起来像

代码语言:javascript
复制
    loop: 
         was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock, 
                                             value_that_will_get_sw_lock )
         if( was_locked(tmp) ) goto loop

在大多数具有缓存,直写或回写的现代内存系统中,这些性能可能会表现非常差。

O(N ^ 2)突发缓存未命中(软件)锁争用

代码语言:javascript
复制
Lock is held by P0
P1-PN are spinning in test-and-test-and-set spinloops waiting for it.

P0 releases the lock, e.g. by writing it to 0

P1's "test-" instruction reads the lock
...
PN's "test-" instruction reads the lock

All of P1-PN see the lock as released, 
   so they fall out of the "test-" part of the spinloop which uses ordinary instructions
   to the test-and-set part of the spinloop, which uses a hardware atomic RMW like LOCK INC mem

P1's locked RMW happens, and acquires the spinlock for P1
   It invalidates all other cache lines
   P1 goes away and does something with the data protected by the lock

P2's locked RMW fails to acquire the spinlock
   It invalidates all other caches because it has a write
   P1 falls back into its test- spinloop

P3's locked RMW fails
   It invalidates all other caches because it has a write
   P1 falls back into its test- spinloop

...

PN's locked RMW fails

现在,至少,所有其余的P2..PN处理器都必须为其测试 - 自旋循环执行普通的解锁缓存未命中。这意味着至少有N +(N-1)个缓存未命中。这可能会更糟糕,因为每次写入时,服务可能无法获得锁定,从而触发所有其他服务进行解锁读取。也就是说,根据时间的不同,你可能会得到

代码语言:javascript
复制
   1 release the lock
   N reads to test- the lock
   1 locked atomic RMW to acquire the lock
   N reads...
   for each of the remaining N-1 processors
      1 locked RMW that fails to acquire the lock
      N reads 
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/-100003362

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档