My code processes millions of records and is slow. How can I optimize it to calculate the average word length more efficiently?
import time

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        for word in words:
            total_length += len(word)
            word_count += 1
    return total_length / word_count

records = ["This is a sample record." * 10] * 10**6

start_time = time.time()
avg_length = get_avg_word_length(records)
end_time = time.time()
print(f"Average word length: {avg_length}")
print(f"Time taken: {end_time - start_time} seconds")
asked Mar 10 at 8:45 by SkyPigeon
- Note: this will count punctuation in word length, which might not be what you want. – user19077881 Commented Mar 10 at 9:30
- The way you're constructing your sample data doesn't make much sense. Just print(records[0]) to see why. – Adon Bilivit Commented Mar 10 at 14:26
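
For context on that second comment, a minimal sketch of what the sample data actually looks like (the record string is repeated with no separator, so sentences run together):

record = "This is a sample record." * 10
print(record[:60])
# This is a sample record.This is a sample record.This is a sa
# split() therefore sees "record.This" as a single 11-character word.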
2 Answers
This drops the time from 6.5 seconds to 2.5 seconds for me:
def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        word_count += len(words)             # count all words of the record in one step
        total_length += len(''.join(words))  # character count with whitespace removed
    return total_length / word_count
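
A variant worth trying (a sketch, not benchmarked here) skips building the joined string and sums the word lengths with built-ins instead:

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for record in records:
        words = record.split()
        word_count += len(words)
        total_length += sum(map(len, words))  # no intermediate joined string
    return total_length / word_count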
And 2.1 seconds by batching:
from itertools import batched  # Python 3.12+

def get_avg_word_length(records):
    total_length = 0
    word_count = 0
    for batch in batched(records, 100):
        words = ' '.join(batch).split()      # split 100 records with a single call
        word_count += len(words)
        total_length += len(''.join(words))
    return total_length / word_count
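
Note that itertools.batched is only available on Python 3.12+; on older versions a small stand-in (a sketch) does the same job:

from itertools import islice

def batched(iterable, n):
    # Yield successive n-sized tuples; the final tuple may be shorter.
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk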
I also tried words = ' '.join(records).split(), i.e., joining all records first so there is no Python-level loop at all. But that used a lot of memory, which became a problem. With ten times fewer records it took 0.4 seconds, so with the full amount it would likely take at least 4 seconds anyway, even if it didn't run into memory trouble.
Assuming that words are separated by exactly one space character (no other whitespace), and that each record contains at least one word, an efficient solution would be something like this:
def get_avg_word_length(records):
    total_length = sum(map(len, records))                        # all characters, spaces included
    total_spaces = sum(record.count(" ") for record in records)  # one space per word boundary
    total_words = len(records) + total_spaces                    # first word + one more per space
    return (total_length - total_spaces) / total_words
It drops the time from 4.3 seconds to 0.3 seconds for me. The idea is simple:
- If each record has at least one word, we can immediately take len(records) as the initial word count (otherwise use sum(map(bool, records)), which gives the number of non-empty records).
- Since consecutive words are separated by exactly one space, each space corresponds to exactly one additional word: total_words = len(records) + total_spaces, where total_spaces = sum(record.count(" ") for record in records) is the total number of spaces across all records.
- The total sum of word lengths is the cumulative length sum(map(len, records)) minus the number of spaces; it is simply the sum of record lengths with the spaces excluded.
The advantage of this approach is that it avoids expensive memory operations: no intermediate lists of words are built at all. And because the counting is done with built-ins in a functional style, most of the work happens at the C level. That is why the solution is so fast.
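
A quick sanity check (a sketch, assuming single-space-separated records with at least one word each) confirms the counting version matches the straightforward loop:

records = ["This is a sample record."] * 3
# Each record has 5 words with total length 4 + 2 + 1 + 6 + 7 = 20
# (the trailing period counts toward "record.", as noted in the comments).
assert get_avg_word_length(records) == 20 / 5  # 4.0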