python - What is the fastest way to generate alternating boolean sequences in NumPy? - Stack Overflow

I want to create a 1D array of length n in which every element is either 0 or 1. The array should contain alternating runs of 0s and 1s, with every full run the same length as every other run: every run of 1s is followed by a run of 0s of the same length and vice versa, so the pattern is periodic.

For example, if we start with five 0s, the next run is guaranteed to be five 1s, and then five 0s, and so on.

To better illustrate what I mean:

In [65]: (np.arange(100) // 5) & 1
Out[65]:
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

The runs can also be shifted, meaning we don't have a full run at the start:

In [66]: ((np.arange(100) - 3) // 7) & 1
Out[66]:
array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

As you can see, I have already found a way to do it; in fact I have found three ways, but all are flawed. The approach above handles shifted runs but is slow; another is faster but doesn't allow shifts.

In [82]: %timeit (np.arange(524288) // 16) & 1
6.45 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [83]: range1 = np.arange(524288)

In [84]: %timeit (range1 // 16) & 1
3.14 ms ± 201 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [85]: %timeit np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384)
81.6 μs ± 843 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [86]: %timeit np.repeat([0, 1], 262144).reshape(32, 16384).T.flatten()
5.42 ms ± 74.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [87]: np.array_equal((range1 // 16) & 1, np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384))
Out[87]: True

In [88]: np.array_equal(np.repeat([0, 1], 262144).reshape(32, 16384).T.flatten(), np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384))
Out[88]: True

Is there a way faster than the np.tile-based solution that also allows shifts?


For a fair comparison I have made the code blocks output the same result, and for completeness I have added another, inefficient method.


Another method:

In [89]: arr = np.zeros(524288, dtype=np.uint8)

In [90]: %timeit arr = np.zeros(524288, dtype=np.uint8)
19.9 μs ± 156 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [91]: arr[262144:] = 1

In [92]: %timeit arr[262144:] = 1
9.91 μs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [93]: %timeit arr.reshape(32, 16384).T.flatten()
932 μs ± 11.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [94]: %timeit arr.reshape(32, 16384).T
406 ns ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [95]: %timeit list(arr.reshape(32, 16384).T.flat)
24.7 ms ± 242 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, np.repeat is extremely slow; creating the array and assigning 1 to half the values is very fast; and arr.reshape(32, 16384).T is extremely fast, but flattening the result is very slow.
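The gap between the last two timings comes down to views versus copies. A quick check (a minimal sketch, not part of the question, using np.shares_memory) makes this visible:

```python
import numpy as np

# The transpose is a lazy view: no bytes move, which is why it times in
# nanoseconds. flatten() must materialize a new C-contiguous copy of all
# 524288 bytes, which is why it takes closer to a millisecond.
arr = np.zeros(524288, dtype=np.uint8)
t = arr.reshape(32, 16384).T
assert np.shares_memory(t, arr)          # still the same buffer
assert not t.flags['C_CONTIGUOUS']       # but no longer contiguous
flat = t.flatten()
assert not np.shares_memory(flat, arr)   # flatten() made an independent copy
```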

Asked Mar 24 at 16:52 by Ξένη Γήινος; edited Mar 24 at 20:54 by mkrieger1.
  • I have found the methods a few weeks ago, and failed to find a better method since then. I didn't ask this question because I don't know if there is a duplicate, I can't find it. – Ξένη Γήινος Commented Mar 24 at 17:01
  • This risks being an X/Y question if, for instance, your array here is being used as a mask. What is it actually being used for? – Reinderien Commented Mar 25 at 12:18
  • You should stick with the np.tile based solution. np.tile is quite fast, and it is difficult to outperform it. The reason your current solution is slow is that the input array is too small for np.tile. Instead, create a larger (e.g. repeat-count x 64) chunk, and then use np.tile to expand it. – ken Commented Mar 26 at 3:54
  • Additionally, np.tile is a wrapper API implemented in Python. If you're looking for microsecond scale optimizations, you may be able to get slightly faster performance by examining the internal implementation of np.tile and extracting only the parts you need. If I remember correctly, it uses np.repeat internally. – ken Commented Mar 26 at 3:56
  • This is definitely an XY problem based on OPs now deleted secondary question, in which these arrays are being stacked/copied to form an image. As Jerome's answer shows, a significant proportion of the time is spent in allocating temporary arrays (if these arrays are to be stacked/copied) so you would be better off allocating a single output array and modifiying the rows/columns of that, rather than allocating a new array for each row/column and stacking/copying into your output array. – Nin17 Commented Mar 26 at 21:15

3 Answers

Since you are certainly on Windows (which is not mentioned in the question but is critical for performance) and you likely use PIP packages, your Numpy implementation has been compiled with MSVC, which does not use the CPU's SIMD units for many functions, including np.tile. This means your Numpy package is pretty inefficient.

One simple solution is not to use such a package, but a build of Numpy compiled with Clang (or not to use Windows), as mentioned in a past answer. Most functions should then be automagically faster with no effort.

Another solution is to use a trick (SLP vectorization) to bypass this fundamental limitation. The idea is to reduce the number of instructions executed by operating on wider items. Here is the code:

arr = np.zeros(524288, dtype=np.uint8)
tmp = arr.view(np.uint64)
tmp[2::4] = 0x01010101_01010101
tmp[3::4] = 0x01010101_01010101

It takes 47 μs on my machine (with a i5-9600KF CPU, Numpy 2.1.3 on Windows). This is a bit faster than the 57 μs of the np.tile solution (which is the standard way of doing that in Numpy). All other proposed solutions are slower.

Note that the two store operations are sub-optimal, especially for larger arrays. That being said, on Windows a vectorised single store is slower (due to Numpy internals, more specifically generators).
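To sanity-check the view trick, one can compare its output against the np.tile construction from the question (a minimal check, not part of the original answer):

```python
import numpy as np

# Each 32-byte period maps to four uint64 words; words 2 and 3 cover
# bytes 16..31, and 0x01010101_01010101 sets every one of those 8 bytes
# to 1. Since all bytes of the constant are equal, the result does not
# depend on endianness.
arr = np.zeros(524288, dtype=np.uint8)
tmp = arr.view(np.uint64)
tmp[2::4] = 0x01010101_01010101
tmp[3::4] = 0x01010101_01010101

expected = np.tile(
    np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]),
    16384,
)
assert np.array_equal(arr, expected)
```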

I advise you to choose the first solution instead of using Numpy tricks for the sake of performance. If you cannot, another solution is simply to use Cython or Numba for that:

import numba as nb
import numpy as np

@nb.njit('(uint32,uint32)')
def gen(size, repeat):
    res = np.ones((repeat, size), dtype=np.uint8)
    for i in range(repeat):
        for j in range(size // 2):
            res[i, j] = 0
    return res.reshape(-1)

The Numba code only takes 35 µs on my machine. It is cleaner and should be faster on large arrays. It is also easy to debug/maintain. Note that the array allocation takes 40% of the time (14 µs)...
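For readers without Numba installed, the same logic can be restated in plain NumPy (a sketch under my own naming: gen_numpy is not part of the answer; it replaces the inner loop with one slice assignment):

```python
import numpy as np

# gen(size, repeat) fills a (repeat, size) array with ones, zeroes the
# first half of each row, and flattens: `repeat` runs of size//2 zeros
# followed by size//2 ones (size assumed even).
def gen_numpy(size: int, repeat: int) -> np.ndarray:
    res = np.ones((repeat, size), dtype=np.uint8)
    res[:, : size // 2] = 0
    return res.reshape(-1)

expected = np.tile(
    np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]),
    16384,
)
assert np.array_equal(gen_numpy(32, 16384), expected)
```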

How about:

import numpy as np
import math
def generate_runs(n, m, shift=0):
    n_runs = math.ceil((n + shift) / (2*m))
    arr = np.tile(np.asarray([0]*m + [1]*m, dtype=np.uint8), n_runs)
    return arr[shift:shift + n]

%timeit generate_runs(524288, 16)
# 24.8 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The key idea is to round up to the next integer multiple of runs, generate them using tile (or any other way of doing the same job faster), then use indexing to take only the part you need.

You didn't specify what the interface should be, but this may shift in the direction opposite to what you were expecting. If so, you can compensate for that by providing 2*m - desired_shift for the shift argument (or you can adjust the function to do that for you).
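For example, to reproduce the question's shifted pattern ((np.arange(100) - 3) // 7) & 1, which starts with a partial run of three 1s, one can pass 2*m - 3 as the shift (the function is repeated here so the example is self-contained):

```python
import math
import numpy as np

def generate_runs(n, m, shift=0):
    n_runs = math.ceil((n + shift) / (2 * m))
    arr = np.tile(np.asarray([0] * m + [1] * m, dtype=np.uint8), n_runs)
    return arr[shift : shift + n]

# shift drops leading elements of the initial 0-run, so a desired
# partial run of k ones at the start corresponds to shift = 2*m - k.
expected = ((np.arange(100) - 3) // 7) & 1
assert np.array_equal(generate_runs(100, 7, shift=2 * 7 - 3), expected)
```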


~~This saves memory if you don't need to mutate the output.~~ Turns out this doesn't generate a view like I expected, never mind.

import numpy as np
import math
def generate_runs(n, m, shift=0):
    n_runs = math.ceil((n + shift) / (2*m))
    arr = np.asarray([0]*m + [1]*m, dtype=np.uint8)
    arr = np.broadcast_to(arr, (n_runs, arr.size)).ravel()
    return arr[shift:shift + n]

generate_runs(524288, 16)

But it turns out to be a bit slower.
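Why it isn't a view can be checked directly: np.broadcast_to does return a stride-0 view, but ravel() on a non-contiguous array has to copy (a small check of standard NumPy semantics, not part of the original answer):

```python
import numpy as np

# broadcast_to repeats the row without copying: the row stride is 0, so
# all "rows" alias the same 2*m bytes of the base array.
base = np.asarray([0] * 4 + [1] * 4, dtype=np.uint8)
b = np.broadcast_to(base, (3, base.size))
assert b.strides[0] == 0            # rows share one buffer
assert np.shares_memory(b, base)

# ravel() cannot express that aliasing as a contiguous 1D array, so it
# materializes a copy, which is where the memory saving is lost.
r = b.ravel()
assert not np.shares_memory(r, base)
```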

This is what I ended up doing:

import numba as nb
import numpy as np


@nb.njit(cache=True, parallel=True)
def rotating_bools(
    length: int, period: int, shift: int = 0, first: int = 0
) -> np.ndarray:
    shift %= period
    cycle = 2 * period
    total = length + shift
    total += (-total % cycle)
    second = first ^ 1
    result = np.full(total, first, dtype=np.uint8)
    for i in nb.prange(total // cycle):
        start = i * cycle + period
        result[start : start + period] = second

    return result[shift : shift + length]

I used the same idea as Matt Haberland: round the length up to the next multiple of two periods. But since the sequence is alternating, a shift greater than the period is meaningless, so I take the shift modulo the period. This has the nice side effect of handling negative shifts correctly.

I have tested the methods from the answers. Broadcasting to a view is fast, but only somewhat, and it only works with periods that are powers of 2; it gets extremely complicated if the period isn't a power of 2. The slight performance gain doesn't justify the time cost to implement and maintain.

So I chose to use Numba. The second solution from Jérôme Richard is inefficient, but it is very easy to understand and optimize.

I just did what I would do in plain Python: a for loop with slice assignment to set the values. But instead of a plain range, which runs serially, I use numba.prange and slap a numba.njit decorator onto the function to parallelize the loop.

Creating a one-dimensional array with numpy.full also ensures fast, contiguous memory access.
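A serial sketch of the same algorithm in plain NumPy (hedged: rotating_bools_np is my name; the real version replaces this loop with numba.prange) shows the rounding and the modulo behaviour, including negative shifts:

```python
import numpy as np

def rotating_bools_np(length, period, shift=0, first=0):
    shift %= period                  # Python's % maps negative shifts
    cycle = 2 * period               # into [0, period) as well
    total = length + shift
    total += -total % cycle          # round up to a whole number of cycles
    result = np.full(total, first, dtype=np.uint8)
    for start in range(period, total, cycle):
        result[start : start + period] = first ^ 1
    return result[shift : shift + length]

assert np.array_equal(rotating_bools_np(100, 5), (np.arange(100) // 5) & 1)
# The question's shifted example starts mid-run-of-1s, hence first=1 here;
# shifts of -3 and 11 are congruent modulo 7 and give the same output.
expected = ((np.arange(100) - 3) // 7) & 1
assert np.array_equal(rotating_bools_np(100, 7, shift=-3, first=1), expected)
assert np.array_equal(rotating_bools_np(100, 7, shift=11, first=1), expected)
```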

Performance:

In [313]: %timeit gen(16, 32768)
148 μs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [314]: %timeit generate_runs(524288, 16)
82.9 μs ± 1.1 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [315]: %timeit rotating_bools(524288, 16)
66.9 μs ± 709 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

There isn't much of a performance gain, but it is definitely noticeable. Now I am trying to make it run on CUDA to make it even faster.
