python源码阅读笔记之字符串对象

三、字符串对象
在讲字符串对象时,我大致的讲讲int类型的小问题:
1.对象缓冲池的概念,这也是为了避免大量的内存申请释放开销
2.小量的整数实际上都是缓存在一个内存池里[-5,257],其中的每一个整数都是共享的
#define NSMALLPOSINTS           257
#define NSMALLNEGINTS           5
/* References to small integers are saved in this array so that they
   can be shared.
   The integers that are saved are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
3.大整数呢,就是给定内存空间,需要的时候再使用了

还是一样的从注释开始:
/* PyStringObject_SIZE gives the basic size of a string; any memory allocation
   for a string of length n should request PyStringObject_SIZE + n bytes.

   Using PyStringObject_SIZE instead of sizeof(PyStringObject) saves
   3 bytes per string allocation on a typical system.
*/
#define PyStringObject_SIZE (offsetof(PyStringObject, ob_sval) + 1)

因为字符串是不可变对象,所以在C这里都保存着它的基本信息。

在看看2.7里的string类型:
PyTypeObject PyString_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",
    PyStringObject_SIZE,
    sizeof(char),
    string_dealloc,                             /* tp_dealloc */
    (printfunc)string_print,                    /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_compare */
    string_repr,                                /* tp_repr */
    &string_as_number,                          /* tp_as_number */
    &string_as_sequence,                        /* tp_as_sequence */
    &string_as_mapping,                         /* tp_as_mapping */
    (hashfunc)string_hash,                      /* tp_hash */
    0,                                          /* tp_call */
    string_str,                                 /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    &string_as_buffer,                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_CHECKTYPES |
        Py_TPFLAGS_BASETYPE | Py_TPFLAGS_STRING_SUBCLASS |
        Py_TPFLAGS_HAVE_NEWBUFFER,              /* tp_flags */
    string_doc,                                 /* tp_doc */
    0,                                          /* tp_traverse */
    0,                                          /* tp_clear */
    (richcmpfunc)string_richcompare,            /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    string_methods,                             /* tp_methods */
    0,                                          /* tp_members */
    0,                                          /* tp_getset */
    &PyBaseString_Type,                         /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    0,                                          /* tp_init */
    0,                                          /* tp_alloc */
    string_new,                                 /* tp_new */
    PyObject_Del,                               /* tp_free */
};

结构体定义如下:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

在结构体的注释很清楚的告诉我们:这里面包含了四个对象,ob_sval是个一个字符的数组,
保存着字符串,ob_shash牵扯到hash值,与字典有关。
hash的算法如下:
static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

#ifdef Py_DEBUG
    assert(_Py_HashSecret_Initialized);
#endif
    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    /*
      We make the hash of the empty string be 0, rather than using
      (prefix ^ suffix), since this slightly obfuscates the hash secret
    */
    if (len == 0) {
        a->ob_shash = 0;
        return 0;
    }
    p = (unsigned char *) a->ob_sval;
    x = _Py_HashSecret.prefix;
    x ^= *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    x ^= _Py_HashSecret.suffix;
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}

这里指的注意的是,在计算过程中使用位运算计算,这是为了栈溢出。

ob_sstate与string的intern机制有关。
具体实现如下:
void
PyString_InternInPlace(PyObject **p)
{
    register PyStringObject *s = (PyStringObject *)(*p);
    PyObject *t;
    if (s == NULL || !PyString_Check(s))
        Py_FatalError("PyString_InternInPlace: strings only please!");
    /* If it's a string subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyString_CheckExact(s))
        return;
    if (PyString_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    t = PyDict_GetItem(interned, (PyObject *)s);
    if (t) {
        Py_INCREF(t);
        Py_DECREF(*p);
        *p = t;
        return;
    }

    if (PyDict_SetItem(interned, (PyObject *)s, (PyObject *)s) < 0) {
        PyErr_Clear();
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The string deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    PyString_CHECK_INTERNED(s) = SSTATE_INTERNED_MORTAL;
}
在被intern之后的字符串,全局就只会存在一个PyStringObject对象。并且当判断两个字符串相同时,
只需要看PyObject *相同即可。至于引用计数,从注释来看,
    /* The two references in interned are not counted by refcnt.
       The string deallocator will take care of this */
 
具体机理:在一个新的字符串创建的时候,PyStringObject都会创建,然后再会去检查intern机制,
如果interned相同,最后销毁这个temp.

PyObject *
PyString_FromString(const char *str)
{
    register size_t size;
    register PyStringObject *op;

    assert(str != NULL);
    size = strlen(str);
    if (size > PY_SSIZE_T_MAX - PyStringObject_SIZE) {
        PyErr_SetString(PyExc_OverflowError,
            "string is too long for a Python string");
        return NULL;
    }
    if (size == 0 && (op = nullstring) != NULL) {
#ifdef COUNT_ALLOCS
        null_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }
    if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
#ifdef COUNT_ALLOCS
        one_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }

    /* Inline PyObject_NewVar */
    op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);
    if (op == NULL)
        return PyErr_NoMemory();
    PyObject_INIT_VAR(op, &PyString_Type, size);
    op->ob_shash = -1;
    op->ob_sstate = SSTATE_NOT_INTERNED;
    Py_MEMCPY(op->ob_sval, str, size+1);
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

size > PY_SSIZE_T_MAX - PyStringObject_SIZE会先做判断,决定是否超过内存。
如果是空字符串,nullstring这个会去处理。建立完后,结果如下:
ref  type  ob_size  ob_hash  ob_sstate  ob_sval  字符串 '\0'

在stringobject.c还包含了很多字符串操作,比如split操作:
PyDoc_STRVAR(split__doc__,
"S.split([sep [,maxsplit]]) -> list of strings\n\
\n\
Return a list of the words in the string S, using sep as the\n\
delimiter string.  If maxsplit is given, at most maxsplit\n\
splits are done. If sep is not specified or is None, any\n\
whitespace string is a separator and empty strings are removed\n\
from the result.");

static PyObject *
string_split(PyStringObject *self, PyObject *args)
{
    Py_ssize_t len = PyString_GET_SIZE(self), n;
    Py_ssize_t maxsplit = -1;
    const char *s = PyString_AS_STRING(self), *sub;
    PyObject *subobj = Py_None;

    if (!PyArg_ParseTuple(args, "|On:split", &subobj, &maxsplit))
        return NULL;
    if (maxsplit < 0)
        maxsplit = PY_SSIZE_T_MAX;
    if (subobj == Py_None)
        return 
        _whitespace((PyObject*) self, s, len, maxsplit);
    if (PyString_Check(subobj)) {
        sub = PyString_AS_STRING(subobj);
        n = PyString_GET_SIZE(subobj);
    }
#ifdef Py_USING_UNICODE
    else if (PyUnicode_Check(subobj))
        return PyUnicode_Split((PyObject *)self, subobj, maxsplit);
#endif
    else if (PyObject_AsCharBuffer(subobj, &sub, &n))
        return NULL;

    return stringlib_split((PyObject*) self, s, len, sub, n, maxsplit);
}


原文发布于微信公众号 - 鸿的学习笔记(shujuxuexizhilu)

原文发表时间:2017-07-21

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏张善友的专栏

Quartz.net官方开发指南 第六课 : CronTrigger

如果你需要像日历那样按日程来触发任务,而不是像SimpleTrigger 那样每隔特定的间隔时间触发,CronTriggers通常比SimpleTrigger更...

2029
来自专栏鸿的学习笔记

python源码阅读笔记之函数的运行机制

841
来自专栏码匠的流水账

聊聊pg jdbc的queryTimeout及next方法

本文主要介绍一下pg jdbc statement的queryTimeout及resultSet的next方法

3051

.NET中的密钥加密

本教程将演示如何通过System.Security.Cryptography在.NET Framework 1.1中实现对称加密/密钥加密。

5898
来自专栏跟着阿笨一起玩NET

ASP.NET 存储过程操作

存储过程是存放在数据库服务器上的预先编译好的sql语句。使用存储过程,可以直接在数据库中存储并运行功能强大的任务。存储过程在第一应用程序执行时进行语法检查和编...

1291
来自专栏码匠的流水账

聊聊rocketmq的PushConsumerImpl

io/openmessaging/rocketmq/consumer/PushConsumerImpl.java

1662
来自专栏码匠的流水账

聊聊jesque的WorkerImpl与WorkerPool

Resque是一个使用redis来创建后台任务的ruby组件。而jesque是其java版本。通常用来做延时队列。

721
来自专栏别先生

一脸懵逼学习Hadoop中的序列化机制——流量求和统计MapReduce的程序开发案例——流量求和统计排序

一:序列化概念 序列化(Serialization)是指把结构化对象转化为字节流。 反序列化(Deserialization)是序列化的逆过程。即把字节流转回...

25710
来自专栏owent

POJ 2192 Zipper HDU 2059 龟兔赛跑

731
来自专栏分布式系统进阶

Influxdb中Select查询请求结果涉及到的一些数据结构

相当于c里面的链表元素,itr指向下一个元素的指针,buf表示当前元素,即FloatPoint类型的链表的迭代器.

1422

扫码关注云+社区

领取腾讯云代金券