tokenize - How to detect out-of-vocabulary words in a prompt - Stack Overflow

I need to detect words that an LLM has no knowledge of, so that I can add a RAG-based definition of each such word to the prompt. For example:

The prompt "What is the best way to achieve slubalisme using the new fabridocium product?" should highlight slubalisme and fabridocium as unknown words.

What is the best way to achieve this?

What I've tried:

  • Tokenizer-based: checking whether the model's tokenizer splits the word into multiple pieces. This is not accurate, because many known words are also split into multiple pieces, so there are a lot of false positives.
  • Comparing against a vocabulary list: prone to spelling issues.
  • Prompting an LLM: works OK, but is really inefficient.
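
To make the vocabulary-list attempt concrete, here is a minimal sketch of it with some tolerance for spelling errors, using only Python's standard library. The names `KNOWN_WORDS` and `find_unknown_words`, the tiny word list, and the `0.8` similarity cutoff are all assumptions for illustration; a real setup would load a much larger vocabulary and tune the cutoff.

```python
import difflib
import re

# Hypothetical reference vocabulary (assumption): in practice this would be
# a large word list, not a handful of words.
KNOWN_WORDS = {
    "what", "is", "the", "best", "way", "to",
    "achieve", "using", "new", "product",
}


def find_unknown_words(prompt, known_words=KNOWN_WORDS, cutoff=0.8):
    """Return words that are neither in known_words nor a near match of one.

    cutoff is the difflib similarity threshold (0..1); 0.8 is an arbitrary
    starting point, not a recommended value.
    """
    unknown = []
    for word in re.findall(r"[A-Za-z]+", prompt.lower()):
        if word in known_words:
            continue
        # Treat a close match as a misspelling of a known word, not as OOV.
        if difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff):
            continue
        if word not in unknown:
            unknown.append(word)
    return unknown


prompt = "What is the best way to achieve slubalisme using the new fabridocium product ?"
print(find_unknown_words(prompt))  # ['slubalisme', 'fabridocium']
```

With this sketch, a misspelling such as "achive" is matched to "achieve" and is not flagged. The trade-off is that a genuinely unknown word that happens to resemble a known one would be missed, and fuzzy matching against a very large list is slow without an index.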
