
Converting an RGBA image to an RGB image

Stack Overflow user
Asked on 2021-11-02 13:20:43

I am trying to convert an RGBA image to an RGB image (8-bit unsigned integer per channel). At first I used OpenCV with the following function:

m_bufferMat.data = (uchar*) (ptr1);
m_bufferMat.convertTo(m_bufferMat, CV_8UC3);

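However, the rest of the application does not need OpenCV, so I tried to convert the image myself to avoid linking against and including the OpenCV library. The fastest approach I could think of is to walk through the buffer and copy only the first 3 bytes of each pixel into another buffer, like this: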

for(int i = 0; i < width * height; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
}

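But this requires a copy, which is probably not very fast. The OpenCV function is 1.5 times faster than my own. Does anyone know why? Can I implement a function that does not need to copy?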

1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-11-02 20:04:34

There are many optimizations that can be done. Here is a test bench program to try them out, along with some example optimizations:

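#include <cstdio>      // printf
#include <cstdlib>     // rand
#include <cstring>     // memset, memcmp, memcpy
#include <iostream>
#include <string>
#include <vector>
#include <intrin.h>    // MSVC-specific: __rdtsc and the SSE intrinsics (GCC/Clang use <x86intrin.h>)
#include <functional>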

volatile int width = 1920;
volatile int height = 1080;

unsigned char* src = new unsigned char[width * height * 4];
unsigned char* dst = new unsigned char[width * height * 3];
unsigned char* refDst = new unsigned char[width * height * 3];


void DefaultFunc() {
  auto ptr1 = src;
  auto ptr2 = dst;
  for (int i = 0; i < width * height; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
  }
}

void NPreCalculatedFunc() {
  auto ptr1 = src;
  auto ptr2 = dst;
  auto n = width * height;
  for (int i = 0; i < n; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
  }
}

void ReadFullPixelFunc() {
  unsigned int* ptr1 = (unsigned int*)src;
  auto ptr2 = dst;
  auto n = width * height;
  for (int i = 0; i < n; i++) {
    auto srcPix = *(ptr1++);
    *(ptr2++) = srcPix & 0xff;
    *(ptr2++) = (srcPix >> 8) & 0xff;
    *(ptr2++) = (srcPix >> 16) & 0xff;
  }
}
  

void ReadAndWriteFullPixelFunc() {
  unsigned int* ptr1 = (unsigned int*)src;
  unsigned int* ptr2 = (unsigned int*)dst;
  auto n = width * height / 4; 
  unsigned int writeBuf = 0;
  for (int i = n; i; i--) {   
    // by reading 4 pixels, we get to store 3 unsigned ints
    auto srcPix = *(ptr1++);    
    writeBuf = srcPix & 0x00ffffff;
    srcPix = *(ptr1++);
    writeBuf |= srcPix << 24;
    *(ptr2++) = writeBuf;
    
    writeBuf = (srcPix >> 8) & 0xffff;
    srcPix = *(ptr1++);
    writeBuf |= (srcPix << 16);
    *(ptr2++) = writeBuf;

    writeBuf = (srcPix >> 16) & 0xff;
    srcPix = *(ptr1++);
    writeBuf |= (srcPix << 8);
    *(ptr2++) = writeBuf;
  }
  // todo: if width * height is not divisible by 4, process the last max 3 pixels here with the unoptimized loop
}

void ReadAndWriteFullPixelXmmFunc() {
  unsigned int* ptr1 = (unsigned int*)src;
  unsigned int* ptr2 = (unsigned int*)dst;
  auto n = width * height / 4;
  __m128i reorder = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);
  for (int i = n; i; i--) {        
    auto srcPix4_ro = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)ptr1), reorder);    // read 4 source pixels, remove alpha bytes, pack to low 12 bytes of srcPix4
    ptr1 += 4;
    _mm_storel_epi64((__m128i*)ptr2, srcPix4_ro); // store the first 8 of the 12 packed bytes
    ptr2 += 2;
    auto shifted = _mm_bsrli_si128(srcPix4_ro, 8);
    _mm_storeu_si32(ptr2, shifted); // store the remaining 4 packed bytes
    ptr2 += 1;    
  }
  // todo: if width * height is not divisible by 4, process the last max 3 pixels here with the unoptimized loop
}



unsigned long long PrintShortestTime(std::function<void()> f, const char *label, unsigned long long refTime) {
  unsigned long long minTicks = ~0ull;
  memset(dst, 0, width * height * 3);
  for (int i = 0; i < 500; i++) {
    auto start = __rdtsc();
    f();
    auto end = __rdtsc();
    auto duration = end - start;
    if (duration < minTicks) {
      minTicks = duration;
    }
  }
  if (memcmp(refDst, dst, width * height * 3)) { // test that we got the right answer
    printf("Fail - result does not equal reference!\n");
  }
  printf("%s : %llu clock cycles - %0.3lf x base implementation time\n", label, minTicks, refTime ? ((double)minTicks/(double)refTime):1.0);
  return minTicks;
}

int main() {
  for (int i = 0; i < width * height * 4; i++) {
    src[i] = rand() & 0xff;
  }
  DefaultFunc();
  memcpy(refDst, dst, width * height * 3);

  auto refTime = PrintShortestTime(DefaultFunc, "default, unoptimized", 0);  
  PrintShortestTime(NPreCalculatedFunc, "n precalculated", refTime);
  PrintShortestTime(ReadFullPixelFunc, "n precalculated, reading 1 pixel at a time", refTime);    
  PrintShortestTime(ReadAndWriteFullPixelFunc, "reading and writing ints at a time", refTime);
  PrintShortestTime(ReadAndWriteFullPixelXmmFunc, "with xmm intrinsincs", refTime);
}

For me, with Visual Studio targeting x64 or x86, the last version takes about 0.4 times as long as the base version:

default, unoptimized : 7511848 clock cycles - 1.000 x base implementation time
n precalculated : 7383696 clock cycles - 0.983 x base implementation time
n precalculated, reading 1 pixel at a time : 7354644 clock cycles - 0.979 x base implementation time
reading and writing ints at a time : 4613816 clock cycles - 0.614 x base implementation time
with xmm intrinsincs : 3036824 clock cycles - 0.404 x base implementation time
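
The todo comments in the test bench leave the tail unhandled when width * height is not divisible by 4. As a minimal sketch of one way to close that gap, the function below repeats the "reading and writing ints at a time" idea and finishes with the plain byte loop. The names ConvertRgbaToRgb and RgbaToRgb are hypothetical and not part of the original answer. The sketch assumes a little-endian CPU, and it uses memcpy for the 32-bit loads and stores so that it does not depend on pointer alignment; it has not been timed against the versions above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: packs pixelCount RGBA pixels into RGB.
// Assumes little-endian byte order (R is the lowest byte of each 32-bit pixel).
void ConvertRgbaToRgb(const unsigned char* src, unsigned char* dst, std::size_t pixelCount) {
  const std::size_t blocks = pixelCount / 4;
  for (std::size_t i = 0; i < blocks; i++) {
    std::uint32_t p[4];
    std::memcpy(p, src, sizeof p);  // load 4 RGBA pixels (16 bytes)
    src += 16;

    std::uint32_t out[3];
    out[0] = (p[0] & 0x00ffffffu) | (p[1] << 24);     // R0 G0 B0 R1
    out[1] = ((p[1] >> 8) & 0xffffu) | (p[2] << 16);  // G1 B1 R2 G2
    out[2] = ((p[2] >> 16) & 0xffu) | (p[3] << 8);    // B2 R3 G3 B3
    std::memcpy(dst, out, sizeof out);  // store 4 RGB pixels (12 bytes)
    dst += 12;
  }
  // Tail: at most 3 remaining pixels, copied byte by byte.
  for (std::size_t i = blocks * 4; i < pixelCount; i++) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    src += 4;
    dst += 3;
  }
}

// Hypothetical convenience wrapper that allocates the output buffer.
std::vector<unsigned char> RgbaToRgb(const std::vector<unsigned char>& rgba) {
  assert(rgba.size() % 4 == 0);
  std::vector<unsigned char> rgb(rgba.size() / 4 * 3);
  ConvertRgbaToRgb(rgba.data(), rgb.data(), rgba.size() / 4);
  return rgb;
}
```

Passing the pixel count as a parameter, instead of reading the global width and height, also keeps the loop bound in a local variable, which is what the "n precalculated" variant does by hand.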

It may be possible to optimize further by unrolling the loop and writing to memory in larger chunks.
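
One possible reading of that suggestion, offered only as an untimed sketch: process 16 pixels per iteration so that the 48 output bytes can be written with three full 16-byte stores instead of an 8-byte and a 4-byte store per 4 pixels. The names ConvertRgbaToRgbWide, RgbaToRgbWide and the RGBA2RGB_HAVE_SSSE3 macro are hypothetical. The SIMD path is compiled only when SSSE3 is known to be available (for example with -mssse3 on GCC/Clang, or MSVC targeting x86/x64); otherwise the function falls back to the plain byte loop, which also handles the tail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSSE3__) || (defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86)))
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#define RGBA2RGB_HAVE_SSSE3 1
#endif

// Hypothetical helper: packs pixelCount RGBA pixels into RGB, 16 pixels at a time.
void ConvertRgbaToRgbWide(const unsigned char* src, unsigned char* dst, std::size_t pixelCount) {
  std::size_t i = 0;
#ifdef RGBA2RGB_HAVE_SSSE3
  // Same mask as in the test bench: drop every 4th byte, pack 12 bytes low,
  // zero the upper 4 bytes (an index with the high bit set yields 0).
  const __m128i pack = _mm_set_epi8(-1, -1, -1, -1, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);
  for (; i + 16 <= pixelCount; i += 16) {
    const __m128i* in = reinterpret_cast<const __m128i*>(src + i * 4);
    __m128i a = _mm_shuffle_epi8(_mm_loadu_si128(in + 0), pack);
    __m128i b = _mm_shuffle_epi8(_mm_loadu_si128(in + 1), pack);
    __m128i c = _mm_shuffle_epi8(_mm_loadu_si128(in + 2), pack);
    __m128i d = _mm_shuffle_epi8(_mm_loadu_si128(in + 3), pack);

    __m128i* out = reinterpret_cast<__m128i*>(dst + i * 3);
    // a[0..11] + b[0..3]
    _mm_storeu_si128(out + 0, _mm_or_si128(a, _mm_slli_si128(b, 12)));
    // b[4..11] + c[0..7]
    _mm_storeu_si128(out + 1, _mm_or_si128(_mm_srli_si128(b, 4), _mm_slli_si128(c, 8)));
    // c[8..11] + d[0..11]
    _mm_storeu_si128(out + 2, _mm_or_si128(_mm_srli_si128(c, 8), _mm_slli_si128(d, 4)));
  }
#endif
  // Remaining pixels (or all of them when SSSE3 is not enabled).
  for (; i < pixelCount; i++) {
    dst[i * 3 + 0] = src[i * 4 + 0];
    dst[i * 3 + 1] = src[i * 4 + 1];
    dst[i * 3 + 2] = src[i * 4 + 2];
  }
}

// Hypothetical convenience wrapper that allocates the output buffer.
std::vector<unsigned char> RgbaToRgbWide(const std::vector<unsigned char>& rgba) {
  assert(rgba.size() % 4 == 0);
  std::vector<unsigned char> rgb(rgba.size() / 4 * 3);
  ConvertRgbaToRgbWide(rgba.data(), rgb.data(), rgba.size() / 4);
  return rgb;
}
```

Whether this is actually faster than the 4-pixel version depends on the compiler and on memory bandwidth, so it is worth adding it to the test bench above and measuring before relying on it.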

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/69811198
