python - How to efficiently use NumPy's StringDType for string operations (e.g., joining strings)? - Stack Overflow

admin•2025-04-18 17:47:30•questions•1 view

I'm trying to perform string operations with NumPy's StringDType. As an example, I've attempted to join strings with a separator row-wise. In the past, NumPy's string operations were somewhat slower than Python list comprehensions, and I was hoping that the introduction of NumPy's StringDType (which supports variable-length strings) would have improved them. However, I haven't been able to achieve a significant performance boost so far.

Are there better options for efficiently performing operations like string joining using NumPy's StringDType?

Here's a sample code where I test several methods that try to leverage vectorized operations:

import timeit
from functools import reduce
import numpy as np
import polars as pl
from numpy.dtypes import StringDType

def interleave_separator(arr, sep=', '):
    """Interleave a separator into a 2D array column-wise (costly helper)."""
    nrows, ncols = arr.shape
    interleaved = np.empty((nrows, 2 * ncols - 1), dtype=StringDType)
    interleaved[:, ::2] = arr
    interleaved[:, 1::2] = sep
    return interleaved

def strings_join_py(arr, sep=', '):
    """Python list comprehension."""
    return [sep.join(a) for a in arr]

def strings_join_pl(arr, sep=', '):
    """Polars join series of lists."""
    return arr.list.join(separator=sep)

def strings_join_np1(arr, sep=', '):
    """Numpy interleave separator and apply sum."""
    return np.sum(interleave_separator(arr, sep), axis=1)

def strings_join_np2(arr, sep=', '):
    """Numpy interleave separator and apply add.reduce."""
    return np.strings.add.reduce(interleave_separator(arr, sep), axis=1)

def strings_join_np3(arr, sep=', '):
    """Numpy/Python accumulate over columns row-wise."""
    return reduce(lambda x, y: x + sep + y, arr.T)

Check results:

np.random.seed(999)
choices = ["apple", "banana", "cherry", "salad"]
arr = np.random.choice(choices, size=(3, 3)).astype(StringDType)
sep = ", "

print('2D array:')
print(arr)
# [['apple' 'apple' 'banana']
#  ['banana' 'apple' 'banana']
#  ['salad' 'salad' 'banana']]
print('1D array joined by separator:')
print(strings_join_py(arr.tolist(), sep))
print(strings_join_pl(pl.Series(arr.tolist()), sep))
print(strings_join_np1(arr, sep))
print(strings_join_np2(arr, sep))
print(strings_join_np3(arr, sep))
# ['apple, apple, banana'
#  'banana, apple, banana'
#  'salad, salad, banana']

Run benchmarks:

np.random.seed(999)
choices = ["apple", "banana", "cherry", "salad"]
arr = np.random.choice(choices, size=(100_000, 10)).astype(StringDType)
lst = arr.tolist()
ser = pl.Series(lst)
sep = ", "

baseline = timeit.timeit(lambda: strings_join_py(lst, sep), number=5)
time_pl = timeit.timeit(lambda: strings_join_pl(ser, sep), number=5)
time_np1 = timeit.timeit(lambda: strings_join_np1(arr, sep), number=5)
time_np2 = timeit.timeit(lambda: strings_join_np2(arr, sep), number=5)
time_np3 = timeit.timeit(lambda: strings_join_np3(arr, sep), number=5)

print("Ratio compared to Python list comprehension (>1 is faster)")
print(f"pl:  {baseline/time_pl:.2f}")
print(f"np1: {baseline/time_np1:.2f}")
print(f"np2: {baseline/time_np2:.2f}")
print(f"np3: {baseline/time_np3:.2f}")
# pl:  1.11  # Polars Series.list.join
# np1: 0.14  # interleaved np.sum
# np2: 0.14  # interleaved np.add.reduce
# np3: 0.19  # reduce np.add
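
A single timeit.timeit call can be noisy, so the ratios above may shift a little between runs. One possible way to make them more stable is to repeat each measurement and keep the fastest run. The sketch below uses only the standard library; `best_time` and `speedup` are hypothetical helper names of mine, and the small list of words is just placeholder data, not the benchmark from this question.

```python
import timeit


def best_time(func, repeat=5, number=1):
    """Return the fastest per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def speedup(baseline_func, candidate_func, **kwargs):
    """Ratio baseline/candidate; values above 1 mean the candidate is faster."""
    return best_time(baseline_func, **kwargs) / best_time(candidate_func, **kwargs)


# Placeholder data: 1000 rows of three short words each.
words = [["apple", "banana", "cherry"]] * 1000


def join_rows():
    return [", ".join(row) for row in words]


t = best_time(join_rows)
print(f"best of 5: {t:.6f} s per call")
```

With the functions from this question, a call such as speedup(lambda: strings_join_py(lst, sep), lambda: strings_join_np3(arr, sep)) would replace the manual baseline/time_np3 division.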

Edit - Here’s a benchmark with an array shape of (500,000, 2):

# Ratio compared to Python list comprehension (>1 is faster)
# pl:  0.61
# np1: 0.31
# np2: 0.31
# np3: 1.57

The data type itself seems to perform well (see np3), but at the moment there do not seem to be enough specialized functions (such as a vectorized join) to make it convenient to use.

Edit: Observation

I've observed that NumPy's string ufunc (np.strings.add) is quite efficient if there aren’t many intermediate results to compute. As the number of accumulated columns increases, its efficiency declines compared to a Python list comprehension.

Here's a small benchmark showing the impact of intermediate results as the number of accumulated columns rises:

# Ratio compared to Python list comprehension (>1 is faster)
# Py_list_comp / np.strings.add:  0.77 - (shape (500000, 2))
# Py_list_comp / np.strings.add:  0.04 - (shape (1000, 1000))
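
To illustrate why the number of intermediate results matters, here is a rough cost model. It is an assumption of mine, not a measurement: it only counts how many characters get written per row, and it ignores allocation costs, Python overhead and how NumPy stores strings internally. The function names and the values word_len=5 and sep_len=2 are illustrative only.

```python
def chars_written_accumulate(ncols, word_len, sep_len):
    """Characters written per row by reduce(lambda x, y: x + sep + y, ...).

    Each step creates two new strings: `x + sep`, then `(x + sep) + y`.
    """
    total = 0
    acc_len = word_len  # running string after the first column
    for _ in range(ncols - 1):
        with_sep = acc_len + sep_len   # intermediate from `x + sep`
        acc_len = with_sep + word_len  # intermediate from `... + y`
        total += with_sep + acc_len
    return total


def chars_written_join(ncols, word_len, sep_len):
    """Characters written per row by a single sep.join(row)."""
    return ncols * word_len + (ncols - 1) * sep_len


for ncols in (2, 10, 1000):
    acc = chars_written_accumulate(ncols, word_len=5, sep_len=2)
    one_pass = chars_written_join(ncols, word_len=5, sep_len=2)
    print(f"{ncols:>5} columns: accumulate/join = {acc / one_pass:.1f}")
```

Under this model the accumulated version writes a number of characters that grows quadratically with the column count, while a single join grows linearly. That would be consistent with the trend in the two ratios above, although the model makes no claim about the actual timings.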

