Slow performance with large dataset processing in Python - Stack Overflow

Code processing millions of records is slow. How can I optimize it to calculate the average word length more efficiently?

import time

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        for word in words:
            total_length += len(word)
            word_count += 1
    return total_length / word_count


records = ["This is a sample record." * 10] * 10**6

start_time = time.time()
avg_length = get_avg_word_length(records)
end_time = time.time()

print(f"Average word length: {avg_length}")
print(f"Time taken: {end_time - start_time} seconds")

asked Mar 10 at 8:45 by SkyPigeon
  • Note: this will count punctuation in word length which might not be what you want. – user19077881 Commented Mar 10 at 9:30
  • The way you're constructing your sample data doesn't make much sense. Just print(records[0]) to see why – Adon Bilivit Commented Mar 10 at 14:26

2 Answers

This drops the time from 6.5 seconds to 2.5 seconds for me:

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        word_count += len(words)
        total_length += len(''.join(words))
    return total_length / word_count

And 2.1 seconds by batching:

from itertools import batched

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for batch in batched(records, 100):
        words = ' '.join(batch).split()
        word_count += len(words)
        total_length += len(''.join(words))
    return total_length / word_count
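
Note that itertools.batched is only available on Python 3.12+. On older versions, an equivalent chunking helper built on islice can stand in for it (a rough sketch; batched_fallback is just a name for illustration):

from itertools import islice

def batched_fallback(iterable, n):
    # Yield successive tuples of at most n items from the iterable,
    # mimicking itertools.batched for older Python versions.
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk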

I also tried words = ' '.join(records).split(), i.e. joining all records so there is no Python-level loop at all. But that used a lot of memory, which became a problem. With ten times fewer records it took 0.4 seconds, so with the full amount it would probably take at least 4 seconds anyway, even if it didn't run into memory trouble.
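
For reference, that all-at-once variant looks roughly like this (only a sketch; it needs memory for both the joined string and the full word list at the same time):

def get_avg_word_length_join_all(records):
    # Join every record into one huge string and split it once.
    # Memory-heavy: the joined string and the word list both exist at
    # the same time, which is what caused the trouble described above.
    words = ' '.join(records).split()
    return len(''.join(words)) / len(words)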

Assuming that there can only be one whitespace character between words, and it is exactly a space, and there is at least one word in each record, an efficient solution would be something like this:

def get_avg_word_length(records):
    total_length = sum(map(len, records))
    total_spaces = sum(record.count(" ") for record in records)
    total_words = len(records) + total_spaces

    return (total_length - total_spaces) / total_words

It drops the time from 4.3 seconds to 0.3 seconds for me. The idea is simple:

  1. If each record has at least one word, we can take len(records) as the starting word count (otherwise use sum(map(bool, records)), which gives the number of non-empty records).
  2. Since consecutive words are separated by exactly one space, every space adds exactly one more word: total_words = len(records) + total_spaces, where total_spaces = sum(record.count(" ") for record in records) is the total number of spaces across all records.
  3. The total of all word lengths is then the combined length sum(map(len, records)) minus the number of spaces, i.e. the sum of the record lengths with the spaces taken out.

The advantage of this approach is that it avoids expensive memory operations: no new strings or word lists are built. And because sum, map, len and str.count do their looping at the C level, very little work happens in the Python interpreter. That is why the solution is so fast.
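
As a quick sanity check (illustrative, not part of the measurements above), the formula agrees with a plain word-by-word loop on a small input that satisfies the single-space and non-empty assumptions:

sample = ["This is a sample record." * 10] * 3

# Plain loop, same as the question's code
lengths = [len(w) for r in sample for w in r.split()]
loop_avg = sum(lengths) / len(lengths)

# Space-counting formula from above
total_length = sum(map(len, sample))
total_spaces = sum(r.count(" ") for r in sample)
formula_avg = (total_length - total_spaces) / (len(sample) + total_spaces)

assert loop_avg == formula_avg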
