I don't think there's a way to do this in standard C/C++ better than what you already have. What I'd do is write up a simple assembly wrapper that returned the result you want.

You're not asking about Windows, but as an example: even though Windows has an API that sounds like it does what you want (a 32 by 32 bit multiply that yields the full 64 bit result), it implements the multiply as a macro that does exactly what you're already doing:

<pre><code>#define UInt32x32To64( a, b ) (ULONGLONG)((ULONGLONG)(DWORD)(a) * (DWORD)(b))
</code></pre>
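
The assembly wrapper suggested above could be sketched roughly as follows with GCC-style inline assembly. This is only a minimal sketch, assuming GCC or Clang targeting x86; the name `umulhi32_asm` is hypothetical, and on any other compiler or architecture it falls back to the same 64-bit cast the macro uses.

```c
#include <stdint.h>

/* Hypothetical wrapper (not a real API): returns the high 32 bits of the
   64-bit product x * y. */
static inline uint32_t umulhi32_asm(uint32_t x, uint32_t y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    /* MUL r/m32 multiplies EAX by the operand and leaves the product in EDX:EAX. */
    __asm__("mull %3"
            : "=a"(lo), "=d"(hi)   /* outputs: low half in EAX, high half in EDX */
            : "a"(x), "rm"(y)      /* inputs: x in EAX, y in a register or memory */
            : "cc");               /* MUL modifies the flags */
    (void)lo;                      /* the low half is not needed here */
    return hi;
#else
    /* Portable fallback: widen to 64 bits, multiply, keep the top half. */
    return (uint32_t)(((uint64_t)x * y) >> 32);
#endif
}
```

Inline assembly can stop the compiler from constant-folding or scheduling around the multiply, so it is worth measuring this against the plain cast before adopting it.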

On 32-bit Intel, a multiply writes its output to two registers. That is, the full 64 bits are available whether you want them or not. It's just a question of whether the compiler is smart enough to take advantage of that.

Modern compilers do amazing things, so my suggestion is to experiment with optimization flags some more, at least on Intel. You would think that the optimizer might know that the processor produces a 64 bit value from 32 by 32 bits.

That said, at some point I tried to get the compiler to use the remainder as well as the quotient from a single division, but the old Microsoft compiler from 1998 was not smart enough to realize that the same instruction produces both results.
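
The division case mentioned above (one instruction yielding two results) can be written so the compiler has every chance to reuse a single instruction: compute both values from the same operands, right next to each other. A minimal sketch in standard C follows; `udivmod32` and its result struct are hypothetical names for illustration, and whether a single `div` is actually emitted depends on the compiler and its flags.

```c
#include <stdint.h>

/* Hypothetical result type holding both outputs of one division. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod32_result;

/* Computes n / d and n % d together; d must be nonzero. */
static inline udivmod32_result udivmod32(uint32_t n, uint32_t d)
{
    udivmod32_result r;
    r.quot = n / d;  /* same operands as the next line, so an optimizer */
    r.rem  = n % d;  /* may serve both from one divide instruction      */
    return r;
}
```

The standard library's `div` function offers the same pairing for signed `int` values.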

gcc 4.3.2, with -O1 optimisation or higher, translates your function, exactly as you wrote it, into IA32 assembly like this:

<pre><code>umulhi32:
 pushl %ebp
 movl %esp, %ebp
 movl 12(%ebp), %eax
 mull 8(%ebp)
 movl %edx, %eax
 popl %ebp
 ret
</code></pre>

That is just a single 32 bit <code>mull</code>, with the high 32 bits of the result (from <code>%edx</code>) moved into the return value.

That's what you wanted, right? Sounds like you just need to turn up the optimisation on your compiler ;) It's possible you could push the compiler in the right direction by eliminating the intermediate variable:

<pre><code>unsigned int umulhi32(unsigned int x, unsigned int y)
{
 return (unsigned int)(((unsigned long long)x * y)&gt;&gt;32);
}
</code></pre>
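
To check that the one-liner really returns the top half of the product, it can be compared against a version that uses only 32-bit arithmetic on 16-bit halves. This is a rough sketch for testing, not a faster alternative; `umulhi32_ref` is a hypothetical name, and the fixed-width types from `<stdint.h>` are an assumption in place of `unsigned int` and `unsigned long long`.

```c
#include <stdint.h>

/* The one-liner from above, written with fixed-width types. */
static inline uint32_t umulhi32(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* Hypothetical reference: the same result using only 32-bit arithmetic. */
static inline uint32_t umulhi32_ref(uint32_t x, uint32_t y)
{
    uint32_t xl = x & 0xFFFFu, xh = x >> 16;
    uint32_t yl = y & 0xFFFFu, yh = y >> 16;

    uint32_t ll = xl * yl;  /* contributes to bits  0..31 */
    uint32_t lh = xl * yh;  /* contributes to bits 16..47 */
    uint32_t hl = xh * yl;  /* contributes to bits 16..47 */
    uint32_t hh = xh * yh;  /* contributes to bits 32..63 */

    /* Sum everything that lands in bits 16..31; its carry goes to bit 32. */
    uint32_t cross = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    return hh + (lh >> 16) + (hl >> 16) + (cross >> 16);
}
```

The reference needs four multiplies and several shifts and adds, which is why the widening cast is the better source form when the compiler maps it to a single instruction.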

Many CPUs have single assembly opcodes for returning the high order bits of a 32 bit integer multiplication. Normally multiplying two 32 bit integers produces a 64 bit result, but this is truncated to the low 32 bits if you store it in a 32 bit integer.

For example, on PowerPC, the <a href="http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.aixassem/doc/alangref/mulhw.htm" rel="noreferrer">mulhw</a> opcode returns the high 32 bits of the 64 bit result of a 32x32 bit multiply in one clock. This is exactly what I'm looking for, but in a more portable form. There's a similar intrinsic, <code>__umulhi()</code>, in NVIDIA CUDA.

In C/C++, is there an efficient way to return the high order bits of the 32x32 multiply?
Currently I compute it by casting to 64 bits, something like:

<pre><code>unsigned int umulhi32(unsigned int x, unsigned int y)
{
 unsigned long long xx=x;
 xx*=y;
 return (unsigned int)(xx&gt;&gt;32);
}
</code></pre>

but this is over 11 times slower than a regular 32 by 32 multiply because I'm using overkill 64 bit math even for the multiply.

Is there a faster way to compute the high order bits?

This is clearly not best solved with a BigInteger library (which is overkill and will have huge overhead).

SSE seems to have <a href="http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc241.htm" rel="noreferrer">PMULHUW</a>, a 16x16 -> top 16 bit version of this, but not a 32x32 -> top 32 version like I'm looking for.

Efficient computation of the high order bits of a 32 bit integer multiplication
