[PY] Python 3 Bytecode Obfuscation

sidiot

Published 2023-08-31 13:15:37

Preface

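I first ran into bytecode obfuscation at a school CTF, and I was completely baffled: why could uncompyle6 no longer decompile the pyc? I let the matter drop at the time. Today's post is in memory of a RE challenge from the DASCTF Oct X 吉林工师 magic-themed contest, "魔法叠加" (Magic Stacking). The challenge author really was fiendish 💩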

What is a pyc file?

Simply put, a pyc file is a Python bytecode file.

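As everyone knows, Python is a cross-platform interpreted language. "Cross-platform" here means that the pyc file the interpreter produces when it processes (strictly speaking, compiles) a Python file can run on multiple platforms. Shipping only the pyc is also a way to hide the source code.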

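In fact, Python is a fully object-oriented language. When the interpreter compiles a Python file it produces a bytecode object, PyCodeObject, and a pyc file can be thought of as the persisted form of that PyCodeObject. When Python source is run, the interpreter first turns the code into a PyCodeObject and keeps it in memory for execution.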
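A minimal sketch of the relationship described above, using only the standard library: compile() returns a code object (the Python-level view of PyCodeObject), and marshal serializes it in the same way the body of a pyc is serialized. The source string and the "<demo>" filename are placeholders chosen for illustration.

```python
import marshal
import types

# Compile a tiny piece of source into a code object kept in memory.
source = "print('Hello World!')"
code_obj = compile(source, "<demo>", "exec")

# Serialize it the way the body of a pyc file is serialized...
payload = marshal.dumps(code_obj)

# ...and load it back, as the interpreter does when it reads a pyc.
restored = marshal.loads(payload)

print(type(code_obj).__name__)                # name of the code object type
print(restored.co_code == code_obj.co_code)   # bytecode survives the round trip
```

The marshal format is specific to a Python version, which is one reason a pyc is tied to the interpreter that produced it.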

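You may occasionally see files with the .pyo extension. These hold bytecode generated with the interpreter's optimization switches (-O / -OO) enabled. Compared with a pyc file, a pyo file is only smaller; the running speed is almost the same. (Since Python 3.5 the .pyo extension is no longer used; optimized bytecode is stored as .opt-1.pyc / .opt-2.pyc instead.)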

The pyc version number

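When Python generates a pyc file it writes a magic number into it to identify the bytecode version the file corresponds to.

Note that a pyc file can only be run by an interpreter whose magic number matches that of the interpreter that generated it.

The version numbers are recorded explicitly in \Lib\importlib\_bootstrap_external.py under the Python installation directory: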

# Magic word to reject .pyc files generated by other Python versions.
# It should change for each incompatible change to the bytecode.
#
# The value of CR and LF is incorporated so if you ever read or write
# a .pyc file in text mode the magic number will be wrong; also, the
# Apple MPW compiler swaps their values, botching string constants.
#
# There were a variety of old schemes for setting the magic number.
# The current working scheme is to increment the previous value by
# 10.
#
# Starting with the adoption of PEP 3147 in Python 3.2, every bump in magic
# number also includes a new "magic tag", i.e. a human readable string used
# to represent the magic number in __pycache__ directories.  When you change
# the magic number, you must also set a new unique magic tag.  Generally this
# can be named after the Python major version of the magic number bump, but
# it can really be anything, as long as it's different than anything else
# that's come before.  The tags are included in the following table, starting
# with Python 3.2a0.
#
# Known values:
#     Python 1.5:   20121
#     Python 1.5.1: 20121
#     Python 1.5.2: 20121
#     Python 1.6:   50428
#     Python 2.0:   50823
#     Python 2.0.1: 50823
#     Python 2.1:   60202
#     Python 2.1.1: 60202
#     Python 2.1.2: 60202
#     Python 2.2:   60717
#     Python 2.3a0: 62011
#     Python 2.3a0: 62021
#     Python 2.3a0: 62011 (!)
#     Python 2.4a0: 62041
#     Python 2.4a3: 62051
#     Python 2.4b1: 62061
#     Python 2.5a0: 62071
#     Python 2.5a0: 62081 (ast-branch)
#     Python 2.5a0: 62091 (with)
#     Python 2.5a0: 62092 (changed WITH_CLEANUP opcode)
#     Python 2.5b3: 62101 (fix wrong code: for x, in ...)
#     Python 2.5b3: 62111 (fix wrong code: x += yield)
#     Python 2.5c1: 62121 (fix wrong lnotab with for loops and
#                          storing constants that should have been removed)
#     Python 2.5c2: 62131 (fix wrong code: for x, in ... in listcomp/genexp)
#     Python 2.6a0: 62151 (peephole optimizations and STORE_MAP opcode)
#     Python 2.6a1: 62161 (WITH_CLEANUP optimization)
#     Python 2.7a0: 62171 (optimize list comprehensions/change LIST_APPEND)
#     Python 2.7a0: 62181 (optimize conditional branches:
#                          introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE)
#     Python 2.7a0  62191 (introduce SETUP_WITH)
#     Python 2.7a0  62201 (introduce BUILD_SET)
#     Python 2.7a0  62211 (introduce MAP_ADD and SET_ADD)
#     Python 3000:   3000
#                    3010 (removed UNARY_CONVERT)
#                    3020 (added BUILD_SET)
#                    3030 (added keyword-only parameters)
#                    3040 (added signature annotations)
#                    3050 (print becomes a function)
#                    3060 (PEP 3115 metaclass syntax)
#                    3061 (string literals become unicode)
#                    3071 (PEP 3109 raise changes)
#                    3081 (PEP 3137 make __file__ and __name__ unicode)
#                    3091 (kill str8 interning)
#                    3101 (merge from 2.6a0, see 62151)
#                    3103 (__file__ points to source file)
#     Python 3.0a4: 3111 (WITH_CLEANUP optimization).
#     Python 3.0b1: 3131 (lexical exception stacking, including POP_EXCEPT
                          #3021)
#     Python 3.1a1: 3141 (optimize list, set and dict comprehensions:
#                         change LIST_APPEND and SET_ADD, add MAP_ADD #2183)
#     Python 3.1a1: 3151 (optimize conditional branches:
#                         introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE
                          #4715)
#     Python 3.2a1: 3160 (add SETUP_WITH #6101)
#                   tag: cpython-32
#     Python 3.2a2: 3170 (add DUP_TOP_TWO, remove DUP_TOPX and ROT_FOUR #9225)
#                   tag: cpython-32
#     Python 3.2a3  3180 (add DELETE_DEREF #4617)
#     Python 3.3a1  3190 (__class__ super closure changed)
#     Python 3.3a1  3200 (PEP 3155 __qualname__ added #13448)
#     Python 3.3a1  3210 (added size modulo 2**32 to the pyc header #13645)
#     Python 3.3a2  3220 (changed PEP 380 implementation #14230)
#     Python 3.3a4  3230 (revert changes to implicit __class__ closure #14857)
#     Python 3.4a1  3250 (evaluate positional default arguments before
#                        keyword-only defaults #16967)
#     Python 3.4a1  3260 (add LOAD_CLASSDEREF; allow locals of class to override
#                        free vars #17853)
#     Python 3.4a1  3270 (various tweaks to the __class__ closure #12370)
#     Python 3.4a1  3280 (remove implicit class argument)
#     Python 3.4a4  3290 (changes to __qualname__ computation #19301)
#     Python 3.4a4  3300 (more changes to __qualname__ computation #19301)
#     Python 3.4rc2 3310 (alter __qualname__ computation #20625)
#     Python 3.5a1  3320 (PEP 465: Matrix multiplication operator #21176)
#     Python 3.5b1  3330 (PEP 448: Additional Unpacking Generalizations #2292)
#     Python 3.5b2  3340 (fix dictionary display evaluation order #11205)
#     Python 3.5b3  3350 (add GET_YIELD_FROM_ITER opcode #24400)
#     Python 3.5.2  3351 (fix BUILD_MAP_UNPACK_WITH_CALL opcode #27286)
#     Python 3.6a0  3360 (add FORMAT_VALUE opcode #25483)
#     Python 3.6a1  3361 (lineno delta of code.co_lnotab becomes signed #26107)
#     Python 3.6a2  3370 (16 bit wordcode #26647)
#     Python 3.6a2  3371 (add BUILD_CONST_KEY_MAP opcode #27140)
#     Python 3.6a2  3372 (MAKE_FUNCTION simplification, remove MAKE_CLOSURE
#                         #27095)
#     Python 3.6b1  3373 (add BUILD_STRING opcode #27078)
#     Python 3.6b1  3375 (add SETUP_ANNOTATIONS and STORE_ANNOTATION opcodes
#                         #27985)
#     Python 3.6b1  3376 (simplify CALL_FUNCTIONs & BUILD_MAP_UNPACK_WITH_CALL
                          #27213)
#     Python 3.6b1  3377 (set __class__ cell from type.__new__ #23722)
#     Python 3.6b2  3378 (add BUILD_TUPLE_UNPACK_WITH_CALL #28257)
#     Python 3.6rc1 3379 (more thorough __class__ validation #23722)
#     Python 3.7a1  3390 (add LOAD_METHOD and CALL_METHOD opcodes #26110)
#     Python 3.7a2  3391 (update GET_AITER #31709)
#     Python 3.7a4  3392 (PEP 552: Deterministic pycs #31650)
#     Python 3.7b1  3393 (remove STORE_ANNOTATION opcode #32550)
#     Python 3.7b5  3394 (restored docstring as the first stmt in the body;
#                         this might affected the first line number #32911)
#     Python 3.8a1  3400 (move frame block handling to compiler #17611)
#     Python 3.8a1  3401 (add END_ASYNC_FOR #33041)
#     Python 3.8a1  3410 (PEP570 Python Positional-Only Parameters #36540)
#     Python 3.8b2  3411 (Reverse evaluation order of key: value in dict
#                         comprehensions #35224)
#     Python 3.8b2  3412 (Swap the position of positional args and positional
#                         only args in ast.arguments #37593)
#     Python 3.8b4  3413 (Fix "break" and "continue" in "finally" #37830)
#
# MAGIC must change whenever the bytecode emitted by the compiler may no
# longer be understood by older implementations of the eval loop (usually
# due to the addition of new opcodes).
#
# Whenever MAGIC_NUMBER is changed, the ranges in the magic_values array
# in PC/launcher.c must also be updated.
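To relate the table above to the bytes found at the start of a pyc, here is a small sketch. importlib.util.MAGIC_NUMBER is the standard-library constant for the running interpreter; the helper names magic_to_bytes and bytes_to_magic are my own and purely illustrative.

```python
import importlib.util
import struct


def magic_to_bytes(number: int) -> bytes:
    """Turn a table entry such as 3394 into the 4 bytes stored in a pyc."""
    # 2-byte little-endian version number followed by CR LF.
    return struct.pack("<H", number) + b"\r\n"


def bytes_to_magic(header: bytes) -> int:
    """Recover the table entry from the first 4 bytes of a pyc."""
    if header[2:4] != b"\r\n":
        raise ValueError("not a pyc magic header")
    return struct.unpack("<H", header[:2])[0]


# Magic number of the interpreter running this script.
current = bytes_to_magic(importlib.util.MAGIC_NUMBER)
print(current, importlib.util.MAGIC_NUMBER.hex())

# Two entries from the table above.
print(magic_to_bytes(62211).hex())  # Python 2.7a0
print(magic_to_bytes(3394).hex())   # Python 3.7b5
```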

Basic layout of a pyc file

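A pyc file generally consists of three parts:

The first 4 bytes are the magic int that identifies the pyc version (a 2-byte version number followed by \r\n). In Python 2 the magic values are defined in Python/import.c; in Python 3 they live in Lib/importlib/_bootstrap_external.py, as quoted above;
The next four bytes are another int, a timestamp (seconds since 1970-01-01). As detailed below, it is the modification time of the source file, and newer Python versions store additional fields here;
After that comes a serialized PyCodeObject (the structure is defined in Include/code.h); the serialization routines are defined in Python/marshal.c.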

Some more detail on the second item:

What follows the magic header is really information about the source file, and it differs considerably between Python versions.

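In Python 2 this part is only 4 bytes: the Unix timestamp (second precision) of the source file's modification time, written little-endian. For example, (1586087865).to_bytes(4, 'little').hex() gives b9c7895e, i.e. the bytes b9c7 895e.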

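Python 3.3 through 3.6 append 4 more bytes after the timestamp that hold the size of the source file in bytes, again little-endian (the magic number table above records this change at 3210). If the source file is 87 bytes long, 5700 0000 is written, so together with the modification time the source information is b9c7 895e 5700 0000.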

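Starting with Python 3.7, hash-based pyc files are supported: Python can check not only the timestamp but also a hash to decide whether the source has changed. To support hash checking, another 4 bytes were added to the source information, making it 12 bytes in total.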

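Hash checking is not enabled by default (it can be turned on by passing invalidation_mode=PycInvalidationMode.CHECKED_HASH to the compile function of the py_compile module). When it is disabled, the first 4 bytes are 0000 0000 and the remaining 8 bytes are the source modification time and size, exactly as in 3.6. When it is enabled, the first 4 bytes become 0100 0000 (unchecked hash) or 0300 0000 (checked hash) and the remaining 8 bytes are a hash of the source file.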
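The Python 3.7+ header layout described above can be sketched as a small parser. The function name parse_pyc_header and the dictionary keys are my own choices; the sample header is assembled from the example values used in this section (magic 3394, the timestamp b9c7 895e and the size 5700 0000), not taken from a real file.

```python
import struct


def parse_pyc_header(data: bytes) -> dict:
    """Split the 16-byte header of a Python 3.7+ pyc into its fields."""
    if len(data) < 16:
        raise ValueError("a Python 3.7+ pyc header is 16 bytes long")
    magic = struct.unpack_from("<H", data, 0)[0]
    flags = struct.unpack_from("<I", data, 4)[0]
    header = {"magic": magic, "flags": flags}
    if flags & 0x01:
        # Hash-based pyc: the last 8 bytes are the source hash.
        header["source_hash"] = data[8:16]
        header["check_source"] = bool(flags & 0x02)
    else:
        # Timestamp-based pyc: mtime and size, both little-endian.
        mtime, size = struct.unpack_from("<II", data, 8)
        header["mtime"] = mtime
        header["source_size"] = size
    return header


# Hypothetical header built from the example values in the text.
sample = bytes.fromhex("420d0d0a" "00000000" "b9c7895e" "57000000")
fields = parse_pyc_header(sample)
print(fields)
```

For a real file, pass the first 16 bytes read in binary mode, for example open(path, "rb").read(16).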

Now straight to an example, taken from a blog post by bi0x:

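The first 4 bytes are the magic number, and the first two of them are the interpreter's version number:
- Here the first two bytes encode 62211, i.e. the bytecode interpreter of Python 2.7a0;
- Note that this is little-endian (as stored in memory), with the high byte last: the bytes on disk are 03 F3, which reads as 0xF303;
The four bytes after the magic number are the timestamp, here 0x5EC652B0, and after that comes the Python code object, PyCodeObject;
The code object starts with one byte giving the object type, here TYPE_CODE, whose value is 0x63;
The next four bytes are the number of arguments, i.e. the value of co_argcount;
The following four bytes are the number of local variables, co_nlocals;
The following four bytes are the stack size, co_stacksize;
The following four bytes are co_flags;
Then comes co_code, the compiled bytecode itself:
- co_code also starts with a type byte, here TYPE_STRING, which is 0x73;
- The next four bytes give the length of co_code, followed by the bytecode itself. The length here is 0xA7, so the following 167 bytes are all bytecode;
After the bytecode of co_code comes co_consts, the constants that are used:
- It starts with TYPE_TUPLE, marking a tuple:
- The next four bytes are the element count, here 0x23 (35). After that, each element is a type byte paired with its value, 0x23 such pairs in total:
  - The first byte of each pair is the element type, e.g. 0x69 for TYPE_INT, followed by the corresponding value;
The rest likewise follows the remaining fields of the structure.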
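The walk-through above can be followed in code. This is only a sketch of the Python 2.7 layout as described here (8-byte header, then TYPE_CODE, four 4-byte ints, then co_code as TYPE_STRING); it stops before co_consts. The function name and the sample bytes are made up for illustration, with the sample using the magic and timestamp values quoted above.

```python
import struct


def parse_py27_prefix(data: bytes) -> dict:
    """Read the header and the first code-object fields of a Python 2.7 pyc."""
    magic = struct.unpack_from("<H", data, 0)[0]
    timestamp = struct.unpack_from("<I", data, 4)[0]
    if data[8:9] != b"c":  # TYPE_CODE, 0x63
        raise ValueError("expected TYPE_CODE after the header")
    argcount, nlocals, stacksize, flags = struct.unpack_from("<4I", data, 9)
    pos = 9 + 16
    if data[pos:pos + 1] != b"s":  # TYPE_STRING, 0x73
        raise ValueError("expected TYPE_STRING for co_code")
    code_len = struct.unpack_from("<I", data, pos + 1)[0]
    co_code = data[pos + 5:pos + 5 + code_len]
    return {
        "magic": magic,
        "timestamp": timestamp,
        "co_argcount": argcount,
        "co_nlocals": nlocals,
        "co_stacksize": stacksize,
        "co_flags": flags,
        "co_code": co_code,
    }


# Hypothetical sample: magic 62211, the timestamp quoted above, and a
# 4-byte co_code (LOAD_CONST 0; RETURN_VALUE in Python 2.7 encoding).
sample = (
    b"\x03\xf3\r\n"
    + struct.pack("<I", 0x5EC652B0)
    + b"c"
    + struct.pack("<4I", 0, 0, 1, 64)
    + b"s"
    + struct.pack("<I", 4)
    + b"d\x00\x00S"
)
info = parse_py27_prefix(sample)
print(info)
```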

pyc obfuscation

Approach:

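Because ready-made tools such as uncompyle6 can turn a pyc file back into Python code, a reverse engineer does not even need to understand the pyc format. The idea behind obfuscating a pyc is therefore to fool decompilers like uncompyle6 into believing the instruction sequence is invalid, without affecting execution on the real Python virtual machine.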

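The Python virtual machine decides the control flow from the opcode sequence stored in the co_code field of PyCodeObject. One obfuscation technique is thus to modify that opcode sequence: add instructions that load an out-of-range variable, then use other instructions to jump over the ones that would fail. Nothing goes wrong at run time, but the decompiler can no longer do its job.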

For example,

0 LOAD_NAME                0 (print)
2 LOAD_CONST               0 ('Hello World! --idiot.')
4 CALL_FUNCTION            1
6 POP_TOP

The above is an opcode sequence from a pyc generated by Python 3.7. Now consider putting two instructions in front of it as obfuscation.

0 JUMP_ABSOLUTE            4
2 LOAD_CONST               255
4 LOAD_NAME                0 (print)
6 LOAD_CONST               0 ('Hello World! --idiot.')
8 CALL_FUNCTION            1
10 POP_TOP

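Here JUMP_ABSOLUTE 4 jumps straight to the instruction at offset 4, so the second inserted instruction, LOAD_CONST 255, is never executed and therefore raises no error. To a decompiler, however, it is an error, and decompilation fails outright.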
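The two-instruction prefix above can be reproduced on raw bytes. The sketch below assumes the Python 3.7 opcode numbers (JUMP_ABSOLUTE = 113, LOAD_CONST = 100, LOAD_NAME = 101, CALL_FUNCTION = 131, POP_TOP = 1) and the 2-byte wordcode format, where jump targets are byte offsets. It works purely on bytes, so it does not depend on the interpreter running it; add_junk_prefix is a name of my own. It also assumes the original code contains no absolute jumps, which would otherwise need their targets shifted.

```python
# Opcode numbers as in Python 3.7 (an assumption; other versions differ).
POP_TOP = 1
LOAD_CONST = 100
LOAD_NAME = 101
JUMP_ABSOLUTE = 113
CALL_FUNCTION = 131


def add_junk_prefix(co_code: bytes) -> bytes:
    """Prepend 'JUMP_ABSOLUTE 4; LOAD_CONST 255' to a wordcode sequence."""
    junk = bytes([LOAD_CONST, 255])  # never executed, index out of range
    # Jump over the jump itself (2 bytes) plus the junk.
    jump = bytes([JUMP_ABSOLUTE, 2 + len(junk)])
    return jump + junk + co_code


# The four original instructions from the listing above.
original = bytes([LOAD_NAME, 0, LOAD_CONST, 0, CALL_FUNCTION, 1, POP_TOP, 0])
obfuscated = add_junk_prefix(original)
print(obfuscated.hex())
```

Writing the result back into a pyc additionally requires updating the length stored in front of co_code in the marshalled data.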

Implementation:

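Following the idea above, we can insert many such instructions, any invalid instructions at all (random data works too), and skip over them with JUMP instructions; the JUMP_ABSOLUTE above is just a simple example. We can even jump into fake branches that we add ourselves and from there on to the real branch (compare the idea behind ROP).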

The JUMP-related opcodes in Python are:

'JUMP_FORWARD',
'JUMP_IF_FALSE_OR_POP',
'JUMP_IF_TRUE_OR_POP',
'JUMP_ABSOLUTE',
'POP_JUMP_IF_FALSE',
'POP_JUMP_IF_TRUE',

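In principle all six can be used, but in practice JUMP_FORWARD and JUMP_ABSOLUTE are the convenient ones, because the other JUMP instructions depend on the value currently on top of the stack (they can be used too, but achieving the same effect may take more instructions).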

Problems you may run into when adding obfuscation instructions:

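First, the Python version: before Python 3.6 instructions are variable-length, while from 3.6 on they are fixed-length, so different versions need different handling;
Because instructions have been added, the absolute jump instructions that were already there become invalid, so their offsets have to be recomputed;
A variable-length instruction has a two-byte argument, whereas a fixed-length instruction has only one byte, which may not be enough. In that case the EXTENDED_ARG instruction can extend the argument, up to four bytes;
It is best not to let an obfuscating jump go from inside a loop to outside it or the other way round. Ideally, use the instruction offsets and line numbers in the co_lnotab field to decide where to insert, and do not insert between instructions that belong to the same line; this avoids some potential problems.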
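Two of the points above, fixing up absolute jumps and widening arguments with EXTENDED_ARG, can be sketched as follows. The opcode numbers are those of Python 3.7 (an assumption), the code targets the fixed-length wordcode where jump arguments are byte offsets, and both helper names are hypothetical. The set of absolute jumps only covers the opcodes listed above; other opcodes with absolute targets in a given version would have to be added. To stay short, the fix-up refuses input that already uses EXTENDED_ARG or whose shifted target no longer fits in one byte.

```python
# Opcode numbers as in Python 3.7 (an assumption; other versions differ).
JUMP_IF_FALSE_OR_POP = 111
JUMP_IF_TRUE_OR_POP = 112
JUMP_ABSOLUTE = 113
POP_JUMP_IF_FALSE = 114
POP_JUMP_IF_TRUE = 115
EXTENDED_ARG = 144

ABSOLUTE_JUMPS = {
    JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP, JUMP_ABSOLUTE,
    POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE,
}


def shift_absolute_jumps(co_code: bytes, shift: int) -> bytes:
    """Add `shift` to every absolute jump target after a prefix was inserted."""
    out = bytearray(co_code)
    for i in range(0, len(out), 2):
        op, arg = out[i], out[i + 1]
        if op == EXTENDED_ARG:
            raise NotImplementedError("EXTENDED_ARG is not handled in this sketch")
        if op in ABSOLUTE_JUMPS:
            if arg + shift > 0xFF:
                raise OverflowError("target needs EXTENDED_ARG after shifting")
            out[i + 1] = arg + shift
    return bytes(out)


def encode_instruction(op: int, arg: int) -> bytes:
    """Encode one instruction, emitting EXTENDED_ARG prefixes when needed."""
    if not 0 <= arg <= 0xFFFFFFFF:
        raise ValueError("argument must fit in four bytes")
    parts = []
    for bits in (24, 16, 8):
        high = (arg >> bits) & 0xFF
        if high or parts:
            parts += [EXTENDED_ARG, high]
    parts += [op, arg & 0xFF]
    return bytes(parts)


# LOAD_CONST 0; POP_JUMP_IF_FALSE 6; LOAD_CONST 1; RETURN_VALUE
code = bytes([100, 0, POP_JUMP_IF_FALSE, 6, 100, 1, 83, 0])
shifted = shift_absolute_jumps(code, 4)   # a 4-byte prefix was inserted
wide_jump = encode_instruction(JUMP_ABSOLUTE, 0x1234)
print(shifted.hex(), wide_jump.hex())
```

A complete tool would re-encode every instruction and iterate until the offsets stabilize, since adding an EXTENDED_ARG prefix itself moves later instructions.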

Solving the challenge

The challenge description reads:

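A trainee magical girl discovered that magic can boost her own status, so she created a spell called "Dazzling · Status Boost · Super · up · Invincible · Shining Aura!" ((please wrap the md5 of the content inside flag{} in DASCTF{}))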

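The download is an archive; unpacking it gives a pyc and a txt. My blind guess was that the flag had been encrypted and written to the txt, so I decompiled the pyc,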

which raised an AssertionError. Odd, and it told me nothing, so I opened the file in 010 Editor,

The magic header is 03 F3, which means Python 2.7, so uncompyle2 should be the tool to use. I tried it and still got an AssertionError,

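That puzzled me. Looking at the binary more closely, I suddenly noticed that the structure of this pyc looked like something produced by Python 3, so I swapped in a Python 3 magic header and, sure enough, the decompiler accepted it. That was the first nasty trick,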
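Swapping the magic header can be done in a hex editor, or with a few lines of code as sketched below. The function name replace_magic is my own, and 3394 (Python 3.7) is only a placeholder taken from the magic number table; the post does not record which Python 3 magic the challenge file actually needed.

```python
import struct


def replace_magic(pyc_bytes: bytes, magic_number: int) -> bytes:
    """Return the pyc with its first 4 bytes replaced by another magic."""
    new_magic = struct.pack("<H", magic_number) + b"\r\n"
    return new_magic + pyc_bytes[4:]


# Hypothetical stand-in for the challenge file: a Python 2.7 magic
# followed by arbitrary bytes.
fake_pyc = b"\x03\xf3\r\n" + b"rest-of-file"
patched = replace_magic(fake_pyc, 3394)  # placeholder version, see above
print(patched[:4].hex())
```

On a real file, read it with open(path, "rb").read() and write the patched bytes to a new file so the original stays untouched.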

But that was not all. The second nasty trick: the author added an obfuscation at the very start. Mercifully, it was not hard,

The obfuscation is this odd jump instruction, much like a junk instruction; just remove it from the binary,

Before:

After:

After that it decompiles cleanly with uncompyle6 -o magic.py magic.pyc; the source is:

import struct
O0O00O00O00O0O00O = [
 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '#', '$', '%', '&', '(', ')', '*', '+', ',', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '~', '"']

def encode(O000O00000OO00OOO):
    """"""
    OOOO00OOO00O000OO = 0
    OOOOOOOOOO00O0OOO = 0
    OO0OOO000000OOOOO = ''
    for O0O0OO0OOOOOOOO00 in range(len(O000O00000OO00OOO)):
        O000O0OOOOO00O0O0 = O000O00000OO00OOO[O0O0OO0OOOOOOOO00:O0O0OO0OOOOOOOO00 + 1]
        OOOO00OOO00O000OO |= struct.unpack('B', O000O0OOOOO00O0O0)[0] << OOOOOOOOOO00O0OOO
        OOOOOOOOOO00O0OOO += 8
        if OOOOOOOOOO00O0OOO > 13:
            OO00O0OO00OOO000O = OOOO00OOO00O000OO & 8191
            if OO00O0OO00OOO000O > 88:
                OOOO00OOO00O000OO >>= 13
                OOOOOOOOOO00O0OOO -= 13
            else:
                OO00O0OO00OOO000O = OOOO00OOO00O000OO & 16383
                OOOO00OOO00O000OO >>= 14
                OOOOOOOOOO00O0OOO -= 14
            OO0OOO000000OOOOO += O0O00O00O00O0O0O0[(OO00O0OO00OOO000O % 91)] + O0O00O00O00O0O0O0[(OO00O0OO00OOO000O // 91)]

    if OOOOOOOOOO00O0OOO:
        OO0OOO000000OOOOO += O0O00O00O00O0O0O0[(OOOO00OOO00O000OO % 91)]
        if OOOOOOOOOO00O0OOO > 7 or OOOO00OOO00O000OO > 90:
            OO0OOO000000OOOOO += O0O00O00O00O0O0O0[(OOOO00OOO00O000OO // 91)]
    return OO0OOO000000OOOOO


O0O00O00O00O0O0O0 = []
OO000O00O00O0O0O0 = []
O0O0O0O0000O0O00O = input('plz input O0O0O0O0000O0O00O:\n')
for i in range(0, 52):
    O0O00O00O00O0O0O0 = O0O00O00O00O0O00O[i:] + O0O00O00O00O0O00O[0:i]
    O0O0O0O0000O0O00O = encode(O0O0O0O0000O0O00O.encode('utf-8'))

dic = open('./00.txt', 'a')
dic.write(O0O0O0O0000O0O00O)
dic.close

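This is the third nasty trick of the challenge. Are these variable names meant for human eyes? I was dumbfounded; nobody can put up with this without renaming the variables,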
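With the variables renamed, the decompiled logic reads as a Base91-style encoder applied 52 times, each round with the alphabet rotated by one more position. The sketch below is my own readable rewrite of the encoding part; all names (BASE_ALPHABET, base91_encode, stacked_encode) are chosen for illustration, and the file handling of the original is left out.

```python
BASE_ALPHABET = list(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!#$%&()*+,./:;<=>?@[]^_`{|}~\""
)


def base91_encode(data: bytes, alphabet) -> str:
    """Encode bytes with the given 91-character alphabet."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits in acc
    out = []
    for byte in data:
        acc |= byte << nbits
        nbits += 8
        if nbits > 13:
            value = acc & 8191            # try 13 bits first
            if value > 88:
                acc >>= 13
                nbits -= 13
            else:
                value = acc & 16383       # otherwise take 14 bits
                acc >>= 14
                nbits -= 14
            out.append(alphabet[value % 91] + alphabet[value // 91])
    if nbits:
        out.append(alphabet[acc % 91])
        if nbits > 7 or acc > 90:
            out.append(alphabet[acc // 91])
    return "".join(out)


def stacked_encode(text: str, rounds: int = 52) -> str:
    """Apply the encoder repeatedly, rotating the alphabet each round."""
    for i in range(rounds):
        rotated = BASE_ALPHABET[i:] + BASE_ALPHABET[:i]
        text = base91_encode(text.encode("utf-8"), rotated)
    return text


print(base91_encode(b"test", BASE_ALPHABET))
print(len(stacked_encode("flag{demo}")))
```

Decoding therefore means undoing the rounds in reverse order, starting with the alphabet rotated by 51.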

The PoC script:

import struct
def decode(encoded_str):
    ''' Decode Base91 string to a bytearray '''
    v = -1
    b = 0
    n = 0
    out = b''
    for strletter in encoded_str:
        t = struct.pack('B', strletter)
        if not t in decode_table:
            continue
        c = decode_table[t]
        if v < 0:
            v = c
        else:
            v += c * 91
            b |= v << n
            n += 13 if (v & 8191) > 88 else 14
            while True:
                out += struct.pack('B', b & 255)
                b >>= 8
                n -= 8
                if not n > 7:
                    break
            v = -1
    if v + 1:
        out += struct.pack('B', (b | v << n) & 255)
    return out
keyMaps = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
           'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '#', '$',
           '%', '&', '(', ')', '*', '+', ',', '.', '/', ':', ';', '<', '=',
           '>', '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '~', '"']
allMaps = []
for i in range(0, 52):
    allMaps.append(keyMaps[i:] + keyMaps[0:i])
dic = open(r"./00.txt", "rb")
flags = dic.read()
dic.close()
allMaps.reverse()
for i in range(0, 52):
    decode_table = dict((v.encode('utf-8'), k) for k, v in enumerate(allMaps[i]))
    flags = decode(flags)
    print(flags[:8])
print(flags)

# flag{L@y3r_6Y_1ayer_by_layer_by_layer}