Why does json.Marshal base64-encode []byte values?

fliter

Published 2024-02-05 17:56:09


Collected in the column: 旅途散记

By default, json.Marshal base64-encodes values of type []byte.

base64.go:

package main

import (
 "encoding/json"
 "fmt"
)

// By default, json.Marshal base64-encodes []byte fields (the encoder source contains the base64 logic).
// On Unmarshal, only a []byte destination restores the original bytes;
// an interface{} destination receives the base64 text as a string.

type test1 struct {
 X string
 Y []byte
}
type test2 struct {
 X string
 Y interface{}
}

func main() {
 a := test1{X: "geek", Y: []byte("geek")}
 fmt.Println("Original a:", a)

 b, _ := json.Marshal(a)
 fmt.Println("b after Marshal:", string(b))

 var c test1
 var d test2
 json.Unmarshal(b, &c)
 json.Unmarshal(b, &d)
 fmt.Println("Unmarshal b, receiving the []byte field as []byte:", c)
 fmt.Println("Unmarshal b, receiving the []byte field as interface{}:", d)
}

Run online[1]

Output:

Original a: {geek [103 101 101 107]}
b after Marshal: {"X":"geek","Y":"Z2Vlaw=="}
Unmarshal b, receiving the []byte field as []byte: {geek [103 101 101 107]}
Unmarshal b, receiving the []byte field as interface{}: {geek Z2Vlaw==}
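If the data has to be received through interface{} (for example in a generic map), the base64 text can still be decoded by hand. The following is a minimal sketch under that assumption; decodeY is a hypothetical helper name, and the field name "Y" is taken from the example above.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// decodeY unmarshals raw into a generic map and manually base64-decodes
// the string stored under "Y". decodeY is a hypothetical helper.
func decodeY(raw []byte) ([]byte, error) {
	var m map[string]interface{}
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	s, ok := m["Y"].(string)
	if !ok {
		return nil, fmt.Errorf("field Y is %T, not a string", m["Y"])
	}
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	y, err := decodeY([]byte(`{"X":"geek","Y":"Z2Vlaw=="}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(y)) // geek
}
```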

src/encoding/json/encode.go[2]

func encodeByteSlice(e *encodeState, v reflect.Value, _ encOpts) {
 if v.IsNil() {
  e.WriteString("null")
  return
 }
 s := v.Bytes()
 e.WriteByte('"')
 encodedLen := base64.StdEncoding.EncodedLen(len(s))
 if encodedLen <= len(e.scratch) {
  // If the encoded bytes fit in e.scratch, avoid an extra
  // allocation and use the cheaper Encoding.Encode.
  dst := e.scratch[:encodedLen]
  base64.StdEncoding.Encode(dst, s)
  e.Write(dst)
 } else if encodedLen <= 1024 {
  // The encoded bytes are short enough to allocate for, and
  // Encoding.Encode is still cheaper.
  dst := make([]byte, encodedLen)
  base64.StdEncoding.Encode(dst, s)
  e.Write(dst)
 } else {
  // The encoded bytes are too long to cheaply allocate, and
  // Encoding.Encode is no longer noticeably cheaper.
  enc := base64.NewEncoder(base64.StdEncoding, e)
  enc.Write(s)
  enc.Close()
 }
 e.WriteByte('"')
}
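The nil check at the top of encodeByteSlice has a visible consequence: a nil slice and an empty slice produce different JSON. A small sketch to illustrate this (marshalBytes is a hypothetical helper, not part of encoding/json):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalBytes returns the JSON text produced for a bare []byte value.
func marshalBytes(b []byte) string {
	out, err := json.Marshal(b)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(marshalBytes(nil))            // null
	fmt.Println(marshalBytes([]byte{}))       // ""
	fmt.Println(marshalBytes([]byte("geek"))) // "Z2Vlaw=="
}
```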

json.Unmarshal performs the matching reverse step: when the destination is a []byte, the JSON string is base64-decoded. See src/encoding/json/decode.go[3].

Java takes a similar approach and provides DatatypeConverter for converting between binary data and Base64 text.

Why does it work this way?

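The JSON format itself has no support for binary data. Binary data has to be escaped or encoded so that it can be placed inside a JSON string element.

For that reason, encoding/json always encodes []byte as base64 instead of writing the bytes out directly as a UTF-8 string.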

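The reason is that the JSON specification does not allow some ASCII characters to appear unescaped in a string. Following the answer quoted below, the 33 ASCII control characters[4] ([0..31] and 127) together with " and \ must be excluded, which leaves 128-35 = 93 characters. (Strictly speaking, RFC 8259 only requires escaping U+0000 through U+001F, " and \; the count of 93 follows the quoted answer.)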

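Base64[5] is a way of representing binary data using 64 printable characters. Those characters are the letters A-Z and a-z and the digits 0-9, 62 in total, plus two more printable symbols that differ between variants.

In other words, base64 can turn any byte sequence into text built only from A-Z, a-z, 0-9 and two variant-dependent printable symbols, 64 characters in all. That sidesteps the problem of the 35 special characters that do not conform to the JSON specification.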
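To see what goes wrong without base64, the sketch below marshals raw bytes as a plain Go string instead of as []byte and then reads them back. roundTripAsString is a hypothetical helper, and the two input bytes are chosen only to hit an invalid UTF-8 byte and a control character.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTripAsString marshals data as a JSON string (not as []byte),
// unmarshals it again, and returns the JSON text plus the bytes that come back.
func roundTripAsString(data []byte) (string, []byte, error) {
	out, err := json.Marshal(string(data))
	if err != nil {
		return "", nil, err
	}
	var s string
	if err := json.Unmarshal(out, &s); err != nil {
		return "", nil, err
	}
	return string(out), []byte(s), nil
}

func main() {
	data := []byte{0xff, 0x00} // invalid UTF-8 byte followed by a control character
	text, back, err := roundTripAsString(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)                      // "\ufffd\u0000"
	fmt.Printf("% x -> % x\n", data, back) // ff 00 -> ef bf bd 00
}
```

The invalid byte is replaced by U+FFFD during encoding, so the original data cannot be recovered, whereas the base64 path round-trips exactly.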

For details, see the following explanation:

The problem with UTF-8 is that it is not the most space efficient encoding. Also, some random binary byte sequences are invalid UTF-8 encoding. So you can't just interpret a random binary byte sequence as some UTF-8 data because it will be invalid UTF-8 encoding. The benefit of this constraint on the UTF-8 encoding is that it makes it robust and possible to locate where multi-byte chars start and end, whatever byte we start looking at.

As a consequence, if encoding a byte value in the range [0..127] would need only one byte in UTF-8 encoding, encoding a byte value in the range [128..255] would require 2 bytes ! Worse than that. In JSON, control chars, " and \ are not allowed to appear in a string. So the binary data would require some transformation to be properly encoded.

Let's see. If we assume uniformly distributed random byte values in our binary data then, on average, half of the bytes would be encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would have 150% of the initial size.

Base64 encoding grows only to 133% of the initial size. So Base64 encoding is more efficient.

What about using another Base encoding ? In UTF-8, encoding the 128 ASCII values is the most space efficient. In 8 bits you can store 7 bits. So if we cut the binary data in 7 bit chunks to store them in each byte of an UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick because JSON doesn't allow some ASCII chars. The 33 control characters of ASCII ( [0..31] and 127) and the " and \ must be excluded. This leaves us only 128-35 = 93 chars.

So in theory we could define a Base93 encoding which would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as a Base64 encoding. Base64 only requires cutting the input byte sequence into 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.

This is why I came independently to the common conclusion that Base64 is indeed the best choice to encode binary data in JSON. My answer presents a justification for it. I agree it isn't very attractive from the performance point of view, but consider also the benefit of using JSON with its human-readable string representation that is easy to manipulate in all programming languages.

If performance is critical then a pure binary encoding should be considered as a replacement for JSON. But with JSON my conclusion is that Base64 is the best.
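The percentages in the quoted answer can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation: expansion is a hypothetical helper that assumes each output character occupies 8 bits and carries log2(alphabet size) bits of payload.

```go
package main

import (
	"fmt"
	"math"
)

// expansion returns encoded size divided by original size for an encoding
// that stores log2(alphabet) bits of payload in each 8-bit output character.
func expansion(alphabet float64) float64 {
	return 8 / math.Log2(alphabet)
}

func main() {
	fmt.Printf("Base64:  %.0f%%\n", 100*expansion(64))  // 133%
	fmt.Printf("Base93:  %.0f%%\n", 100*expansion(93))  // 122%
	fmt.Printf("Base128: %.0f%%\n", 100*expansion(128)) // 114%
}
```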

Image source: Go-Json encoding and decoding[6] (recommended reading).

The resulting problem and how to address it

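Base64-encoding []byte solves the problem that the bytes, once turned into a string, might not conform to the JSON specification. At the same time, base64 makes the encoded data consistently about one third larger than the original (see the Base64 entry for details), which adds to the cost of both storage and transmission.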
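The one-third growth is easy to measure. The sketch below marshals zero-filled slices of a few arbitrary sizes and reports the resulting JSON length; jsonSize is a hypothetical helper, and the extra 2 bytes in each result are the surrounding quotes.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// jsonSize returns the length of the JSON produced for n zero bytes held in a []byte.
func jsonSize(n int) (int, error) {
	out, err := json.Marshal(make([]byte, n))
	if err != nil {
		return 0, err
	}
	return len(out), nil
}

func main() {
	for _, n := range []int{3, 300, 3000} {
		size, err := jsonSize(n)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%d raw bytes -> %d base64 chars, %d bytes of JSON\n",
			n, base64.StdEncoding.EncodedLen(n), size)
	}
}
```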

Whether there is a better approach is discussed here: binary-data-in-json-string-something-better-than-base64[7]

Further reading: Base64 variants

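Standard Base64, however, is not suited to being placed directly in a URL, because URL encoders turn the / and + characters of standard Base64 into %XX sequences, and those % signs need yet another conversion when stored in a database, since ANSI SQL uses % as a wildcard. To solve this, a URL-oriented Base64 variant can be used: it omits the trailing = padding and replaces + and / with - and _ respectively. This removes the conversions otherwise needed during URL encoding/decoding and database storage, avoids the length growth those conversions cause, and unifies the format of object identifiers across databases, forms, and so on. Another variant, intended for regular expressions, replaces + and / with ! and -, because +, * and the [ and ] used in IRCu can all have special meaning in regular expressions. Still other variants replace +/ with _- or ._ (for identifier names in programming languages), with .- (for Nmtoken in XML), or even with _: (for Name in XML).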
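Go's encoding/base64 package ships the URL-oriented variant as base64.RawURLEncoding (URL-safe alphabet, no padding). A minimal sketch comparing it with the standard encoding; encodings is a hypothetical helper and the two input bytes are chosen so that both + and / show up:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodings returns data encoded with the standard alphabet and with the
// unpadded URL-safe alphabet.
func encodings(data []byte) (std, rawURL string) {
	return base64.StdEncoding.EncodeToString(data),
		base64.RawURLEncoding.EncodeToString(data)
}

func main() {
	std, rawURL := encodings([]byte{0xfb, 0xff})
	fmt.Println(std)    // +/8=
	fmt.Println(rawURL) // -_8
}
```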

That is why code like the following[8] shows up in many projects:

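package TLSSigAPI

import (
 "encoding/base64"
 "strings"
)

// base64urlEncode encodes data with standard Base64, then swaps the three
// URL-unfriendly characters: + becomes *, / becomes -, and = becomes _.
// Note that this is a project-specific alphabet, not the RFC 4648 URL-safe one.
func base64urlEncode(data []byte) string {
 str := base64.StdEncoding.EncodeToString(data)
 str = strings.ReplaceAll(str, "+", "*")
 str = strings.ReplaceAll(str, "/", "-")
 str = strings.ReplaceAll(str, "=", "_")
 return str
}

// base64urlDecode reverses base64urlEncode: it restores the standard
// characters and then decodes with standard Base64.
func base64urlDecode(str string) ([]byte, error) {
 str = strings.ReplaceAll(str, "_", "=")
 str = strings.ReplaceAll(str, "-", "/")
 str = strings.ReplaceAll(str, "*", "+")
 return base64.StdEncoding.DecodeString(str)
}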
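For a quick round-trip check of this character-swapping scheme, the same mapping can be written with strings.NewReplacer. This is a sketch, not the project's code: encodeSig and decodeSig are hypothetical names, and the mapping (+ to *, / to -, = to _) is copied from the snippet above.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

var (
	// toSafe swaps the characters that are awkward in URLs; fromSafe undoes it.
	toSafe   = strings.NewReplacer("+", "*", "/", "-", "=", "_")
	fromSafe = strings.NewReplacer("*", "+", "-", "/", "_", "=")
)

// encodeSig applies standard Base64 and then the character swap.
func encodeSig(data []byte) string {
	return toSafe.Replace(base64.StdEncoding.EncodeToString(data))
}

// decodeSig restores the standard alphabet and decodes.
func decodeSig(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(fromSafe.Replace(s))
}

func main() {
	enc := encodeSig([]byte{0xfb, 0xff})
	dec, err := decodeSig(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(enc)         // *-8_
	fmt.Printf("% x\n", dec) // fb ff
}
```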

References

[1] Run online: https://go.dev/play/p/T3ZP76gOxEP

[2] src/encoding/json/encode.go: https://gitee.com/cuishuang/go1.17beta/blob/master/src/encoding/json/encode.go#L834

[3] src/encoding/json/decode.go: https://gitee.com/cuishuang/go1.17beta/blob/master/src/encoding/json/decode.go#L950

[4] 33 control characters: https://baike.baidu.com/item/ASCII/309296

[5] Base64: https://zh.m.wikipedia.org/zh-hans/Base64

[6] Go-Json encoding and decoding: https://blog.csdn.net/gusand/article/details/97337255

[7] binary-data-in-json-string-something-better-than-base64: https://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64

[8] Similar code: https://github.com/tencentyun/tls-sig-api-golang/blob/master/base64url.go


Originally published: 2024-02-03. In case of infringement, contact cloudcommunity@tencent.com for removal.
