Slow performance with large dataset processing in Python - Stack Overflow

Code processing millions of records is slow. How can I optimize it to calculate the average word length more efficiently?

import time

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        for word in words:
            total_length += len(word)
            word_count += 1
    return total_length / word_count


records = ["This is a sample record." * 10] * 10**6

start_time = time.time()
avg_length = get_avg_word_length(records)
end_time = time.time()

print(f"Average word length: {avg_length}")
print(f"Time taken: {end_time - start_time} seconds")

asked Mar 10 at 8:45 by SkyPigeon
  • Note: this will count punctuation in word length which might not be what you want. – user19077881 Commented Mar 10 at 9:30
  • The way you're constructing your sample data doesn't make much sense. Just print(records[0]) to see why – Adon Bilivit Commented Mar 10 at 14:26

2 Answers

This drops the time from 6.5 seconds to 2.5 seconds for me:

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        word_count += len(words)
        total_length += len(''.join(words))
    return total_length / word_count

And 2.1 seconds by batching:

from itertools import batched

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for batch in batched(records, 100):
        words = ' '.join(batch).split()
        word_count += len(words)
        total_length += len(''.join(words))
    return total_length / word_count
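
Note that itertools.batched is only available on Python 3.12+. On older versions, an equivalent chunking helper built on islice can stand in for it (a rough sketch; batched_fallback is just a name for illustration):

from itertools import islice

def batched_fallback(iterable, n):
    # Yield successive tuples of at most n items from the iterable,
    # mimicking itertools.batched for older Python versions.
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk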

I also tried words = ' '.join(records).split(), i.e. joining all records so there is no Python-level loop at all. But that used a lot of memory, which became a problem. With ten times fewer records it took 0.4 seconds, so with the full amount it would probably take at least 4 seconds anyway, even if it didn't run into memory trouble.
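
For reference, that all-at-once variant looks roughly like this (only a sketch; it needs memory for both the joined string and the full word list at the same time):

def get_avg_word_length_join_all(records):
    # Join every record into one huge string and split it once.
    # Memory-heavy: the joined string and the word list both exist at
    # the same time, which is what caused the trouble described above.
    words = ' '.join(records).split()
    return len(''.join(words)) / len(words)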

Assuming that there can only be one whitespace character between words, and it is exactly a space, and there is at least one word in each record, an efficient solution would be something like this:

def get_avg_word_length(records):
    total_length = sum(map(len, records))
    total_spaces = sum(record.count(" ") for record in records)
    total_words = len(records) + total_spaces

    return (total_length - total_spaces) / total_words

It drops the time from 4.3 seconds to 0.3 seconds for me. The idea is simple:

  1. If each record has at least one word, we can take len(records) as the starting word count (otherwise use sum(map(bool, records)), which gives the number of non-empty records).
  2. Since consecutive words are separated by exactly one space, every space adds exactly one more word: total_words = len(records) + total_spaces, where total_spaces = sum(record.count(" ") for record in records) is the total number of spaces across all records.
  3. The total of all word lengths is then the combined length sum(map(len, records)) minus the number of spaces, i.e. the sum of the record lengths with the spaces taken out.

The advantage of this approach is that it avoids expensive memory operations: no new strings or word lists are built. And because sum, map, len and str.count do their looping at the C level, very little work happens in the Python interpreter. That is why the solution is so fast.
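
As a quick sanity check (illustrative, not part of the measurements above), the formula agrees with a plain word-by-word loop on a small input that satisfies the single-space and non-empty assumptions:

sample = ["This is a sample record." * 10] * 3

# Plain loop, same as the question's code
lengths = [len(w) for r in sample for w in r.split()]
loop_avg = sum(lengths) / len(lengths)

# Space-counting formula from above
total_length = sum(map(len, sample))
total_spaces = sum(r.count(" ") for r in sample)
formula_avg = (total_length - total_spaces) / (len(sample) + total_spaces)

assert loop_avg == formula_avg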
