Python list Get Item not O(1)? - Stack Overflow

I am trying to optimize a piece of code and I saw a lot of time spent on picking a random index out of a list.

Using the following timeit benchmark, I am seeing that Get Item against a larger list can take ~3x longer, while generating the random index takes constant time:

import timeit
setup = f"""
import random
test_10m = [x for x in range(10000000)]
test_1m = [x for x in range(1000000)]
test_10k = [x for x in range(10000)]
test_1k = [x for x in range(1000)]
"""

print(timeit.timeit('test_10m[int(random.random()*10000000)]', number=1000000, setup=setup))
print(timeit.timeit('test_1m[int(random.random()*1000000)]', number=1000000, setup=setup))
print(timeit.timeit('test_10k[int(random.random()*10000)]', number=1000000, setup=setup))
print(timeit.timeit('test_1k[int(random.random()*1000)]', number=1000000, setup=setup))
print(timeit.timeit('int(random.random()*10000000)', number=1000000, setup=setup))
print(timeit.timeit('int(random.random()*1000000)', number=1000000, setup=setup))
print(timeit.timeit('int(random.random()*10000)', number=1000000, setup=setup))
print(timeit.timeit('int(random.random()*1000)', number=1000000, setup=setup))

Example output (Python 3.8.6):

0.7138307300047018
0.5209437379962765
0.2407058280077763
0.22641731501789764

0.21460772102000192
0.21099105197936296
0.20940051099751145
0.21421014302177355

I am getting similar results in Python 3.9.21.

Questions:

  • How can the wiki claim O(1) for Get Item? I.e., what am I missing?

  • Is there a faster way to get a random value out of a large list (large >= 100000 for now)?

asked Feb 23 at 11:14 by urban
  • Are you also timing the creation of the lists? – Markus Hirsimäki, Feb 23 at 11:21
  • Random access to large lists simply causes more cache misses. See What every programmer should know about memory; Part 2: CPU caches (a short sketch of the effect follows these comments). – Homer512, Feb 23 at 11:39
  • ...which you're also not using. – jonrsharpe, Feb 23 at 12:03
  • @Homer512 Technically, it is not just CPU caches (though mostly), but also paging and DRAM effects. Indeed, the timings continue to increase even when the data no longer fits in the CPU caches. For example, a 100M list is still significantly slower than a 10M one. On a mainstream 64-bit CPython/OS, the 10M list takes about 400 MB (80 MB of pointers) while the 100M list takes about 4 GB (800 MB of pointers). Both are far bigger than mainstream CPU caches. Meanwhile, the 100M list is 30% slower on my machine (i5-9600KF CPU with 9 MiB of L3 cache + 2 x 3200 MHz DDR4 CL16 DIMMs). – Jérôme Richard, Feb 23 at 22:57
  • @Homer512 Fortunately, this famous document also talks about paging and how DRAM works, so the other chapters are relevant for this question too, and it is a (really) good idea for the OP (and others) to read them ;) – Jérôme Richard, Feb 23 at 23:03
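To make the cache-miss explanation from the comments concrete, here is a minimal sketch (not part of the original thread; exact numbers vary by machine). Both loops perform the same number of item lookups, but the random pattern touches pointers scattered across roughly 80 MB, while the sequential walk stays in a small, cache-friendly region:

import random
import timeit

n = 10_000_000
data = list(range(n))

k = 100_000
seq_idx = list(range(k))                            # contiguous, cache-friendly
rnd_idx = [random.randrange(n) for _ in range(k)]   # scattered across ~80 MB of pointers

# Same number of __getitem__ calls; only the access pattern differs.
t_seq = timeit.timeit('for i in seq_idx: data[i]', number=100, globals=globals())
t_rnd = timeit.timeit('for i in rnd_idx: data[i]', number=100, globals=globals())
print(f"sequential: {t_seq:.3f}s  random: {t_rnd:.3f}s")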

1 Answer


I was able to reproduce O(1) behaviour for getting an item from a list. I think the benchmark in the OP measures several nested statements together, and this leads to the differing results: the int cast, the creation of the random number, and the item access itself.

Unfortunately timeit doesn't offer a nice way to pass variable parameters to the timed statement, so I used string templating to focus the measurement on the item-getting step only; the random indices are generated in advance so that their cost does not affect the timing.

Note that the timeit functions accept the statement either as a string or as a callable (a callable-based variant is sketched after Workbench 2 below).

import timeit
import random
from statistics import mean


SEED = 123456
random.seed(SEED)
print(f"{SEED = }")


setup = f"""
test_10m = [x for x in range(10000000)]
test_1m = [x for x in range(1000000)]
test_10k = [x for x in range(10000)]
test_1k = [x for x in range(1000)]
"""

# statement templates; the random index is substituted in before timing
template_stmts = (
    'test_10m[{}]',
    'test_1m[{}]',
    'test_10k[{}]',
    'test_1k[{}]',
)

Workbench 1: repeated access to the same index

for stmt, size in zip(template_stmts, [10000000, 1000000, 10000, 1000]):
    # exclude the computation of the index from the measurement
    r = int(random.random()*size)

    # statement to benchmark: pure item access at a fixed index
    t = timeit.timeit(stmt.format(r), number=10**6, setup=setup)
    print(f"array size: {size: <10}  time: {t:.5f}s")

Output

array size: 10000000    time: 0.01524s
array size: 1000000     time: 0.01456s
array size: 10000       time: 0.01432s
array size: 1000        time: 0.01549s

Workbench 2: repeated access to random indices (slower method; note the reduced number of repetitions)

from statistics import mean


# "restart" the seed to generate same values as before
random.seed(SEED) # !<- important for comparison of different methods

for stmt, size in zip(template_stmts, [10000000, 1000000, 10000, 1000]):
    # a fresh random index for every timeit call; setup rebuilds the lists
    # on each call, but setup time is not included in the measured result
    ts = [timeit.timeit(stmt.format(int(random.random()*size)), number=10**2, setup=setup)
          for _ in range(10**2)]
    print(f"array size: {size: <10}  time: {min(ts):f}s  mean: {mean(ts):f}s")

Output

array size: 10000000    time: 3.4719996619969606e-06s  mean: 4.67159996e-06s
array size: 1000000     time: 3.5260000004200265e-06s  mean: 4.50958998e-06s
array size: 10000       time: 3.453998942859471e-06s   mean: 4.18842002e-06s
array size: 1000        time: 3.0489991331705824e-06s  mean: 3.94016999e-06s
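For completeness, the callable form mentioned above allows precomputing all the random indices and timing only the item accesses in bulk, with no string templating. A possible sketch along those lines (list size and repetition count chosen arbitrarily):

import random
import timeit

test_10m = [x for x in range(10000000)]
indices = [random.randrange(10000000) for _ in range(10**6)]

def access_all(lst=test_10m, idx=indices):
    # pure item-getting over precomputed random indices;
    # default arguments bind the names as fast locals
    for i in idx:
        lst[i]

t = timeit.timeit(access_all, number=10)
print(f"per access: {t / (10 * len(indices)) * 1e9:.1f} ns")

The per-access figure includes the loop overhead, but running this for the different list sizes makes the relative growth across sizes easy to compare.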
