Estrutura de dados e collections em Python

André Ericson

@_aericson

gihub.com/aericson

"We read Knuth so you don't have to. "

Tim Peters

Big-O notation O(n)

  • O(1) - Independentemente do número de elementos, sempre demora o mesmo tempo.
  • O(n) - Cresce na mesma proporção que o número de elementos.
  • O(n²) - 4, 9, 16, 25... (Ex: BubbleSort).

"Premature optimization is the root of all evil"

Donald Knuth

List

In [1]:
li = [1, 2, 3, 4]
print(li)
[1, 2, 3, 4]
In [2]:
import sys
li = list(range(20))
print(sys.getsizeof(li))
288
In [3]:
li.append(1)
print(sys.getsizeof(li))
288

List

List

List - O(1)

In [4]:
small_list = [x for x in range(3)]
large_list = [x for x in range(5000000)]
In [5]:
%timeit -qo small_list[2]
Out[5]:
<TimeitResult : 10000000 loops, best of 3: 55.5 ns per loop>
In [7]:
%timeit -qo large_list[400000]
Out[7]:
<TimeitResult : 10000000 loops, best of 3: 57.1 ns per loop>

List - O(1)

  • small_list[2] = 0
  • len(small_list)
  • small_list.append(5)
  • small_list.pop()
  • small_list.clear()

List - O(n)

  • small_list == another_list
  • small_list[3:7] = [9, 2]
  • del small_list[1]
  • small_list.remove(...)
  • small_list.pop(0)
  • for v in small_list:

List - O(n)

In [8]:
def pop_all_from_left_to_right(li):
    while li:
        val = li.pop(0)     
        # do something with val
%timeit -qo pop_all_from_left_to_right(list(range(100000)))  
Out[8]:
<TimeitResult : 1 loop, best of 3: 3.21 s per loop>

O()!

In [1]:
def pop_all_from_left_to_right_better(li):
    rev_li = list(reversed(li))
    while rev_li:
        val = rev_li.pop()
        # do something with val
    return reversed(rev_li)
%timeit pop_all_from_left_to_right_better(list(range(100000)))
10 loops, best of 3: 16.1 ms per loop

O(n)!

Tuples

In [10]:
a_tuple = (1, 2, 3)
print(a_tuple)
(1, 2, 3)
In [13]:
a_tuple[1] = 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-c9b81d58aa32> in <module>()
----> 1 a_tuple[1] = 1

TypeError: 'tuple' object does not support item assignment
In [14]:
sys.getsizeof(tuple(range(5)))
Out[14]:
88
In [15]:
sys.getsizeof(tuple(range(6)))
Out[15]:
96

Sets

In [16]:
a_set = {1, 5, 3, 'foo'}
print(a_set)
{3, 1, 'foo', 5}

Sets

  • intersecção (&), união (|), diferença (-) e diferença simétrica (^)
In [17]:
a_set.add([1,2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-4f58d274d540> in <module>()
----> 1 a_set.add([1,2])

TypeError: unhashable type: 'list'
In [18]:
a_set.add(tuple([1, 2]))
a_set
Out[18]:
{3, 1, 'foo', 5, (1, 2)}

Sets vs List

set list
len O(1) O(1)
append/add O(1) O(1)
in/not in O(1) O(N)
remove O(1) O(N)
pop O(1) O(N)
In [19]:
import random
a_small_list = [random.randint(0, 500) for _ in range(500)]
a_large_list = [random.randint(0, 500) for _ in range(500000)]
In [20]:
def without_dupes(a_list):
    seen = set()
    items = []
    for elem in a_list:
        if elem not in seen:
            seen.add(elem)
            items.append(elem)
    return items

%timeit without_dupes(a_small_list)
%timeit without_dupes(a_large_list)
10000 loops, best of 3: 157 µs per loop
10 loops, best of 3: 112 ms per loop

O()!

Dict

In [21]:
a_dict = {'a': 1, 'd': 3, 'c':2}
print(a_dict)
{'a': 1, 'c': 2, 'd': 3}
In [22]:
a_dict[[1,2]] = 3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-b9249f86eb14> in <module>()
----> 1 a_dict[[1,2]] = 3

TypeError: unhashable type: 'list'
In [23]:
a_dict[(1,2)] = 3
In [24]:
a_dict[frozenset([3, 4, 5])] = 'foo'
print(a_dict)
{(1, 2): 3, 'a': 1, 'c': 2, frozenset({3, 4, 5}): 'foo', 'd': 3}
  • Rápido, quase tudo é O(1)!
  • Amortizado

Collections

Counter

In [25]:
from collections import Counter
In [26]:
c = Counter(['a', 'b', 'c', 'd', 'a', 'd', 'b'])
print(c)
Counter({'b': 2, 'a': 2, 'd': 2, 'c': 1})
In [27]:
c['a'] += 1
print(c)
Counter({'a': 3, 'b': 2, 'd': 2, 'c': 1})
In [28]:
print(c['e'])
0
In [29]:
print(c.most_common(3))
[('a', 3), ('b', 2), ('d', 2)]
In [30]:
c.update(['a', 'b', 'b', 'd'])
print(c)
Counter({'b': 4, 'a': 4, 'd': 3, 'c': 1})

Deque

  • lists são lentas para pop(0) e insert(0, v) --> O(n)
  • acesso O(n)
In [31]:
from collections import deque
d = deque([1,2,3])
print(d)
deque([1, 2, 3])
In [32]:
d.append(4)
print(d)
deque([1, 2, 3, 4])
In [33]:
d.appendleft(5)
print(d)
deque([5, 1, 2, 3, 4])
In [34]:
print(d.pop(), d.popleft())
print(d)
4 5
deque([1, 2, 3])
In [35]:
d = deque([1,2,3], maxlen=3)
print(d)
deque([1, 2, 3], maxlen=3)
In [36]:
d.append(4)
print(d)      
deque([2, 3, 4], maxlen=3)
In [37]:
d = deque(range(50000))
def loop_with_for(deq):
    for elem in deq:
        elem

def loop_with_index(deq):
    for i in range(len(deq)):
        deq[i]

%timeit loop_with_for(d)
%timeit loop_with_index(d)
1000 loops, best of 3: 1.66 ms per loop
10 loops, best of 3: 29.4 ms per loop

Namedtuple

In [38]:
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(2, 3)
print(p)
print(p.x, p.y)
Point(x=2, y=3)
2 3
In [39]:
Point(*(2,3))
Out[39]:
Point(x=2, y=3)
In [40]:
print(p[0:1])
(2,)

OrderedDict

In [41]:
from collections import OrderedDict
ord_dict = OrderedDict()
ord_dict['c'] = 3
ord_dict['a'] = 1
print(ord_dict)
OrderedDict([('c', 3), ('a', 1)])
In [42]:
ord_dict = OrderedDict({'b': 1, 'a': 2})
print(ord_dict)
OrderedDict([('b', 1), ('a', 2)])
In [43]:
ord_dict = OrderedDict([('b', 1), ('a', 2)])
print(ord_dict)
OrderedDict([('b', 1), ('a', 2)])

Dict in Python 3.6

  • Compact dict 20-25% menos memoria.
  • dict vai manter a ordem de criação.
  • Mas você não deve depender disso.
  • Continue usando OrderDict.

Também na stdlib

  • heapq
  • array
  • struct
  • bytes (immutable) e bytearray (mutable)
  • ChainMap
  • defaultdict

Ainda não achou o que precisa?

Boltons: https://boltons.readthedocs.org/

Boltons

  • dictutils - Mapping types (OMD)
  • namedutils - Lightweight containers
  • queueutils - Priority queues
  • setutils - IndexedSet type

Abstract Base Classes (ABC)

In [44]:
class Dicto(dict):
    def __getitem__(self, k):
        return 31

d = Dicto()
print(d[2])
31
In [45]:
print(d.get(2))
None

Abstract Base Classes (ABC)

In [46]:
from collections.abc import MutableMapping

class CaseInsensitiveDict(MutableMapping):

    def __init__(self):
        self.d = {}

    def __delitem__(self, i):
        del self.d[i]

    def __setitem__(self, k, v):
        self.d[k.lower()] = v

    def __getitem__(self, k):
        return self.d[k.lower()]

    def __iter__(self):
        return iter(self.d)

    def __len__(self):
        return len(self.d)

Abstract Base Classes (ABC)

In [47]:
d = CaseInsensitiveDict()
d['eggs'] = 'ovos'
d['ham'] = 'presunto'

for i in d:
    print(i + ': ' + d[i])
eggs: ovos
ham: presunto
In [48]:
d.get('HaM')
Out[48]:
'presunto'

Computação Ciêntifica

In [49]:
x_list = [float(x) for x in range(100000)]
sys.getsizeof(x_list) + len(x_list) * sys.getsizeof(float())
Out[49]:
3224464
In [50]:
import array 
x_array = array.array('d', [float(x) for x in range(100000)])
sys.getsizeof(x_array)
Out[50]:
800064
In [51]:
%timeit sum(x_list)
1000 loops, best of 3: 747 µs per loop
In [52]:
%timeit sum(x_array)
100 loops, best of 3: 4.06 ms per loop
In [53]:
import numpy
x_narray = numpy.array([float(x) for x in range(100000)])
%timeit -qo numpy.sum(x_narray)
Out[53]:
<TimeitResult : 10000 loops, best of 3: 61.3 µs per loop>

Dúvidas?

Slides: http://bit.ly/pybr12-vinta