流畅的Python-2
2023-08-03 14:28:26

Chap2. An Array of Sequences

0x01 Overview of Built-In Sequences

按存储的元素类型来分:

  • Container sequences

    hold items of different types, including nested containers.

    list、tuple、collections.deque

  • Flat sequences

    hold items of one simple type

    str、bytes、array.array

A container sequence holds reference to the objects it contains, which may be of any type

while a flat sequence stores the value of its contents in its own memory space

image-20230701124635659

Every Python object in memory has a header with metadata.

Take float for example. It has a value field and two metadata fields

  • ob_refcnt: the object’s reference count
  • ob_type: a pointer to the object’s type
  • ob_fval: a C double holding the value of the float

On a 64-bit Python build, each of those fields takes 8 bytes.

按存储的元素值是否可变来分:

  • Mutable sequences

    list、bytearray、array.array、collections.deque

  • Immutable sequences

    tuple、str、bytes

0x02 List Comprehensions and Generator Expressions

listcomps:列表推导式

genexps:生成器表达式

listcomps

列表推导式提供了一种更优雅的方式来生成列表

1
2
3
4
5
symbols = 'ABC'
codes = []
for symbol in symbols:
codes.append(ord(symbol))
codes # [65, 66, 67]
1
2
3
symbols = 'ABC'
codes = [ord(symbol) for symbol in symbols]
codes # [65, 66, 67]

listcomps or genexps, both have a local scope to hold the variables assigned in the for clause

However, variables assigned with the “Walrus operator” := remain accessible after those comprehensions or expressions return

1
2
3
4
5
6
7
x = 'ABC'
codes1 = [ord(x) for x in x]
print(x) # ABC

codes2 = [last := ord(_) for _ in x]
print(last) # 67
print(_) # NameError: name '_' is not defined
  • Listcomps VS map and filter

使用map和filter能实现和Listcomps一样的功能,但可读性差多了

1
2
3
symbols = '£¢$¥₣'
beyond_ascii = [ord(s) for s in symbols if ord(s) > 127]
print(beyond_ascii) # [65505, 65504, 65509, 8355]
1
2
3
symbols = '£¢$¥₣'
beyond_ascii = list(filter(lambda c: c > 127, map(ord, symbols)))
print(beyond_ascii) # [65505, 65504, 65509, 8355]
  • cartesian products

利用列表推导式来生成笛卡尔乘积

1
2
3
4
5
6
colors = ['balck', 'white']
sizes = ['S', 'M', 'L']
shirts = [(color, size) for size in sizes
for color in colors]
print(shirts)
# [('balck', 'S'), ('white', 'S'), ('balck', 'M'), ('white', 'M'), ('balck', 'L'), ('white', 'L')]

genexps

A generator expression saves memory because it yields items one by one using the iterator protocol

生成器表达式的语法和列表推导式一样,不过是用圆括号parentheses而不是中括号brackets

1
2
3
4
5
import array

symbols = '£¢$¥₣'
tuple(ord(symbol) for symbol in symbols)
array.array('I', (ord(symbol) for symbol in symbols))

若函数参数只有一个,则传递生成器表达式时不需要加圆括号

当数据量大时建议使用生成器表达式来节省内存空间

1
2
3
4
5
colors = ['balck', 'white']
sizes = ['S', 'M', 'L']
for shirt in (f'{c} {s}' for c in colors
for s in sizes):
print(shirt)

0x03 Tuples

Tuple is an immutable container sequence

Tuples do double duty:

  • immutable lists 不可变列表
  • records with no field names 没有字段名的记录

tuples as records

when using a tuple as a collection of fields, the number of items is usually fixed and their order is always important

1
2
3
4
5
6
7
8
coordinates = (39.91, 116.4026)
traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'), ('ESP', 'XDA205856')]
# unpacking an tuple
city, year, pop, area = ('Tokyo', 2003, 32_450, 8014) # parallel assignment
for passport in traveler_ids:
print("%s/%s" % passport)
for country, _ in traveler_ids:
print(country)

In general, using _ as a dummy variable is just a convention.

However, in a match/case statement, _ is a wildcard that matches any value

In the Python console, the result of the preceding command is assigned to _

tuples as immutable lists

相同长度的元组和列表,前者占用的内存空间更少

元组的长度和元素引用不可变

1
2
3
4
5
6
a = (10, 'alpha', [1, 2])
b = (10, 'alpha', [1, 2])
print(a == b) # True
b[-1].append(3)
print(a == b) # False
print(b) # (10, 'alpha', [1, 2, 3])

image-20230701150456848

The content of the tuple itself is immutable, but that only means the references held by the tuple will always point to the same object. However, if one of the referenced object is mutable—like a list—its content may change.

注意:元组中的可变元素很可能是导致bug的来源

An object is hashable only if its value cannot ever change

若一个元组里面包含可变元素,则其不能作为字典的键或插入一个集合

可以通过下面代码来判断元组是否有固定值

1
2
3
4
5
6
7
8
9
10
11
12
def is_fixed(o):
try:
hash(o)
except TypeError:
return False
return True


tf = (10, 'alpha', (1, 2))
tm = (10, 'alpha', [1, 2])
is_fixed(tf) # True
is_fixed(tm) # unhashable type: 'list'

0x04 Unpacking Sequences and Iterables

Unpacking avoids unnecessary and error-prone use of indexes to extract elements from sequences.

unpacking used in

  • parallel assignment
1
a, b = (1, 2)
  • swap values
1
a, b = b, a
  • prefix an argument with * when calling a function
1
2
3
4
5
divmod(20, 8)  # Return the tuple (x//y, x%y)
t = (20, 8)
quotient, remainder = divmod(*t) # break the tuple to parameters
print(quotient) # 2
print(remainder) # 4
  • return multiple values
1
2
3
4
import os

_, filename = os.path.split("/~/.ssh/id_rsa.pub")
print(filename) # id_rsa.pub

定义函数时可以用*args去获取任意多余的参数

1
2
3
4
5
def func(a, b, c, d, *rest):
print(a, b, c, d, rest)


func(*[1, 2], 3, *range(4, 7))

在并行赋值也可以使用*,并且*可以出现在任意位置

1
2
3
4
5
6
7
8
9
10
11
12
13
a, b, *rest = range(5)
a, b, rest
# (0, 1, [2, 3, 4])

a, b, *rest = range(2)
# 0 1 []

a, *body, c = range(5)
# (0, [1, 2, 3], 4)
*head, a, b = range(5)
# ([0, 1, 2], 3, 4)
*head, a, b = range(2)
# ([], 0, 1)

定义列表、元组、集合字面量时

1
2
[*range(4), 4]  # [0, 1, 2, 3, 4]
{*range(4), 4, *(5, 6, 7)} # {0, 1, 2, 3, 4, 5, 6, 7}

嵌套结构

1
2
3
4
5
6
7
8
9
10
11
12
13
metro_areas = [
('Tokyo', 'JP', 36.933, (35.689722, 139.691667)),
('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)),
('Mexico City', 'MX', 20.142, (19.433333, -99.133333)),
('New York-Newark', 'US', 20.104, (40.808611, -74.020386)),
('São Paulo', 'BR', 19.649, (-23.547778, -46.635833)),
]


print(f'{"":15} | {"latitude":>9} | {"longitude":>9}')
for name, _, _, (lat, lon) in metro_areas:
if lon <= 0:
print(f'{name:15} | {lat:9.4f} | {lon:9.4f}')

0x05 Pattern Matching with Sequences

Python 3.10 终于支持switch/case了,不过它叫match/case

它比其他语言的switch/case更强大,因为提供了destructing——a more advanced form of unpacking

1
2
3
4
5
6
7
8
9
10
11
12
metro_areas = [
('Tokyo', 'JP', 36.933, (35.689722, 139.691667)),
('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)),
('Mexico City', 'MX', 20.142, (19.433333, -99.133333)),
('New York-Newark', 'US', 20.104, (40.808611, -74.020386)),
('São Paulo', 'BR', 19.649, (-23.547778, -46.635833)),
]
print(f'{"":15} | {"latitude":>9} | {"longitude":>9}')
for record in metro_areas:
match record:
case [name, _, _, (lat, lon)] if lon <= 0:
print(f'{name:15} | {lat:9.4f} | {lon:9.4f}')

_匹配该位置的单项,但不会绑定该项的值,_可以出现多次

A sequence pattern matches the subject if:

  1. The subject is a sequence
  2. The subject and the pattern have the same number of items
  3. Each corresponding item matches, including nested items

Sequence patterns写成圆括号parentheses或方括号square brackets都一样

对于strbytesbytearray,在match/case下会被认为是单一的值,而不是sequence

当需要处理这类数据时,需要先转化为序列对象

1
2
3
4
5
6
7
8
9
10
phone = '1008611'
match tuple(phone):
case ['1', *rest]:
print(1)
case ['2', *rest]:
print(2)
case ['3', *rest]:
print(3)
case _:
print('default')

也可以在匹配检测类型

1
case [str(name), _, _, (float(lat), float(lon)]:

这里并非调用了strfloat构造器,而是这个语法会在运行时进行类型检查

另外,若我们想匹配的目标序列以字符串开头,以嵌套的两个浮点数组成的序列结尾,可以使用*_通配

1
case [str(name), *_, (float(lat), float(lon)]:

0x06 Slicing

执行seq[start:stop:step]实际上调用的是seq.__getitem__(slice(start, stop, step))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
invoice = """
0.....6.................................40........52...55........
1909 Pimoroni PiBrella $17.50 3 $52.50
1489 6mm Tactile Switch x20 $4.95 2 $9.90
1510 Panavise Jr. - PV-201 $28.00 1 $28.00
1601 PiTFT Mini Kit 320x240 $34.95 1 $34.95
"""
SKU = slice(0, 6)
DESCRIPTION = slice(6, 40)
UNIT_PRICE = slice(40, 52)
QUANTITY = slice(52, 55)
ITEM_TOTAL = slice(55, None)
line_items = invoice.split('\n')[2:]
for item in line_items:
print(item[UNIT_PRICE], item[DESCRIPTION])

0x07 sort or sorted

The list.sort method sorts a list in place——i.e. without making a copy

Python API convention: functions or methods that change an object in place should return None to make it clear to the caller that the receiver was changed and no new object was created.

e.g. random.shuffle(s) returns None

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
>>> fruits = ['grape', 'raspberry', 'apple', 'banana']
>>> sorted(fruits)
['apple', 'banana', 'grape', 'raspberry']
>>> fruits
['grape', 'raspberry', 'apple', 'banana']
>>> sorted(fruits, reverse=True)
['raspberry', 'grape', 'banana', 'apple']
>>> sorted(fruits, key=len)
['grape', 'apple', 'banana', 'raspberry']
>>> sorted(fruits, key=len, reverse=True)
['raspberry', 'banana', 'grape', 'apple']
>>> fruits
['grape', 'raspberry', 'apple', 'banana']
>>> fruits.sort()
>>> fruits
['apple', 'banana', 'grape', 'raspberry']

0x08 when a list is not the answer

The list type is flexible and easy to use, but depending on specific requirements, there are better options.

Arrays

If a list only contains numbers, array.array is a more efficient replacement.

A python array is as lean as a C array.

image-20230704104833110

An array of float value does not hold full-fledged float instances, but only the packed bytes representing their matched valued. When creating an array, you provide a typecode, a letter to determine the underlying C type used to store each item in the array.

Arrays support all mutable sequence operations (including .pop, .insert, and .extend), as well as additional methods for fast loading and saving, such as .frombytes and .tofile.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from array import array
from random import random

floats1 = array('d', (random() for i in range(10 ** 7))) # genexp
print(floats1[-1]) # 0.009781869418644673
fp = open('floats.bin', 'wb')
floats1.tofile(fp)
fp.close()
floats2 = array('d')
fp = open('floats.bin', 'rb')
floats2.fromfile(fp, 10 ** 7)
fp.close()
print(floats2[-1]) # 0.009781869418644673
print(floats1 == floats2) # True

array.tofile and array.fromfile are easy and fast to use.

Type code C Type Python Type Minimum size in bytes
'c' char character 1
'b' signed char int 1
'B' unsigned char int 1
'u' Py_UNICODE Unicode character 2
'h' signed short int 2
'H' unsigned short int 2
'i' signed int int 2
'I' unsigned int long 2
'l' signed long int 4
'L' unsigned long long 4
'f' float float 4
'd' double float 8

Memory Views

The built-in memoryview class is a shared-memory sequence type that lets you handle slices of arrays without copying bytes. It was inspired by the NumPy library.

Using notation similar to the array module, the memoryview.cast method lets you change the way multiple bytes are read or written as units without moving bits around.

1
2
3
4
5
6
7
8
9
10
11
12
from array import array

octets = array('B', range(6))
m1 = memoryview(octets)
print(m1.tolist()) # [0, 1, 2, 3, 4, 5]
m2 = m1.cast('B', [2, 3])
print(m2.tolist()) # [[0, 1, 2], [3, 4, 5]]
m3 = m1.cast('B', [3, 2])
print(m3.tolist()) # [[0, 1], [2, 3], [4, 5]]
m2[1, 1] = 44
m3[1, 1] = 33
print(octets) # array('B', [0, 1, 2, 33, 44, 5])

Numpy

pass

Deques

Inserting and removing from the head of a list (the 0-index end) is costly because the entire list must be shifted in memory.

The class collections.deque is a thread-safe double-ended queue designed for fast inserting and removing from both ends.

If a bounded deque is full, when you add a new item, it discards an item from the opposite end.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from collections import deque

dq = deque(range(10), maxlen=10)
print(dq) # deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)
dq.rotate()
print(dq) # deque([9, 0, 1, 2, 3, 4, 5, 6, 7, 8], maxlen=10)
dq.rotate(4)
print(dq) # deque([5, 6, 7, 8, 9, 0, 1, 2, 3, 4], maxlen=10)
dq.appendleft(-1)
print(dq) # deque([-1, 5, 6, 7, 8, 9, 0, 1, 2, 3], maxlen=10)
dq.extend([11, 22, 33])
print(dq) # deque([7, 8, 9, 0, 1, 2, 3, 11, 22, 33], maxlen=10)
dq.extendleft([10, 20, 30, 40])
print(dq) # deque([40, 30, 20, 10, 7, 8, 9, 0, 1, 2], maxlen=10)