tokenize - How to detect out-of-vocabulary words in a prompt - Stack Overflow

admin • 2025-04-22 22:13:32 • questions

I need to detect words that an LLM has no knowledge of, so that I can add a RAG-based definition of each such word to the prompt. For example:

For the prompt "What is the best way to achieve slubalisme using the new fabridocium product?", slubalisme and fabridocium should be highlighted as unknown words.

What is the best way to achieve this?

What I've tried:

Tokenizer-based: checking whether the model's tokenizer splits the word into multiple pieces. This is not accurate, because many known words are also split into multiple pieces by the tokenizer, so there are a lot of false positives.
Comparing against a vocabulary list: prone to spelling issues, since a misspelled known word is flagged as unknown.
Prompting an LLM: works OK but is really inefficient.
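
To make the vocabulary-list attempt concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not a definitive solution: KNOWN_WORDS is a tiny hypothetical stand-in for a real word list, find_unknown_words is a hypothetical helper name, and the fuzzy matching with difflib is one possible way to soften the spelling problem rather than something a tokenizer or LLM library provides.

```python
import difflib
import re

# Hypothetical stand-in vocabulary; a real one would be a large word list.
KNOWN_WORDS = {
    "what", "is", "the", "best", "way", "to",
    "achieve", "using", "new", "product",
}


def find_unknown_words(prompt, known_words=KNOWN_WORDS, cutoff=0.8):
    """Return words with no exact or close match in known_words.

    Words are lowercased and returned once each, in order of appearance.
    """
    vocabulary = sorted(known_words)
    unknown = []
    for word in re.findall(r"[A-Za-z]+", prompt.lower()):
        if word in known_words or word in unknown:
            continue
        # Assumption: a close match means a misspelled known word,
        # so it is not reported as unknown.
        if difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff):
            continue
        unknown.append(word)
    return unknown


prompt = "What is the best way to achieve slubalisme using the new fabridocium product ?"
print(find_unknown_words(prompt))  # ['slubalisme', 'fabridocium']
```

The cutoff is a trade-off: a lower value tolerates more misspellings but risks hiding a genuinely unknown word that happens to resemble a known one. This also only tells you whether a word appears in the list, not whether the LLM actually knows it.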
