javascript - How to use whole word regex search for Devanagari text? - Stack Overflow

My HTML code with Devanagari words

<html>
<head>
<title>TODO</title>
<meta charset="UTF-8">
</head>
<body>
    मंत्री मुख्यमंत्री 
</body>
    <script src="jquery-1.11.0.min.js"></script>
    <script src="xregexp_20.js"></script>
    <script src="addons/unicode/unicode-base.js"></script>
    <script src="addons/unicode/unicode-scripts.js"></script>
    <script src="my.js"></script>
</html>

My javascript code

var html = document.getElementsByTagName("html")[0];
var fullpage_content = html.innerHTML;

var regex = RegExp("मंत्री", "g");
var count = fullpage_content.match(regex);
console.log("count in page : " + count+ ", " + count.length);

// use of the word boundary \b, which does not work with Devanagari characters
regex = RegExp("\\bमंत्री\\b", "g");
count = fullpage_content.match(regex);
console.log("count in page : " + count);

regex = XRegExp("मंत्री");
var match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);

// XRegExp's word boundary \b does not work here either
regex = XRegExp("\\bमंत्री\\b");
match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);

Output of js (on Chrome)

count in page : मंत्री,मंत्री, 2

count in page : null

count in page : मंत्री,मंत्री, 2

count in page : , 0

A whole-word search should return one match, but both RegExp and XRegExp are failing me. I need some help.

Asked Apr 23, 2014 at 6:42 by user3563136. Edited Jun 20, 2020 at 9:12.
  • Can you give us a fiddle for this? – Prabhat Jain Commented Apr 23, 2014 at 6:54
  • @PrabhatJain I created one myself. You can have a look at jsfiddle/es63p – Abhas Tandon Commented Apr 23, 2014 at 7:29
  • Can you please check your fiddle and see whether it helps? – Prabhat Jain Commented Apr 23, 2014 at 7:58
  • Here is the updated fiddle: jsfiddle/es63p/3. BTW, I think the person who asked this question wants to use the XRegExp library. The script is not returning the actual word count. – Abhas Tandon Commented Apr 23, 2014 at 8:40

4 Answers

Using this regexp I can get a match on मंत्री but exclude मुख्यमंत्री:

var regex = XRegExp("(?:^|\\P{L})मंत्री(?=\\P{L}|$)");

What this does is match मंत्री if it:

  1. Is at the beginning of the string or preceded by a character which Unicode considers a non-Letter, and

  2. Is at the end of the string or followed by a character which Unicode considers a non-Letter.
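The two conditions above can be sketched without XRegExp, assuming a runtime that supports ES2018 Unicode property escapes (the `u` flag). The helper name `countWholeWord` is hypothetical and is not part of the original answer:

```javascript
// Minimal sketch, assuming native Unicode property escapes (ES2018 `u` flag)
// instead of the XRegExp library used in the answer.
// `countWholeWord` is a hypothetical helper name.
function countWholeWord(text, word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Condition 1: start of string or a preceding non-Letter.
  // Condition 2: a following non-Letter or end of string (lookahead, not consumed).
  const pattern = new RegExp("(?:^|\\P{L})" + escaped + "(?=\\P{L}|$)", "gu");
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

console.log(countWholeWord("मंत्री मुख्यमंत्री", "मंत्री")); // 1
```

Because the trailing condition is a lookahead, a single separator between two occurrences (as in "मंत्री,मंत्री") can still serve as the leading non-Letter of the next match.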

Note that this is slightly different from what \b does, because \b treats digits as word characters. For instance, /\bmantri\b/ won't match mantri123 because 1, 2, and 3 are considered to be part of the word and thus do not mark a word boundary. If you want something that emulates \b then this would do it:

var regex = XRegExp("(?:^|[^\\p{L}\\p{N}])मंत्री(?=[^\\p{L}\\p{N}]|$)");

The difference with the first regexp is that with this one मंत्री cannot be preceded or followed by a digit.
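The difference between the two patterns can be checked with a short sketch. It assumes native Unicode property escapes (the `u` flag) rather than XRegExp, and the variable names are illustrative only:

```javascript
// Minimal sketch, assuming native ES2018 Unicode property escapes.
// Letters-only boundary: a digit counts as a separator.
const lettersOnly = /(?:^|\P{L})मंत्री(?=\P{L}|$)/u;
// Letters-and-numbers boundary: closer to what \b does for ASCII words.
const lettersAndDigits = /(?:^|[^\p{L}\p{N}])मंत्री(?=[^\p{L}\p{N}]|$)/u;
// The built-in \b only knows ASCII word characters, so it never fires
// between a space and a Devanagari letter.
const asciiBoundary = /\bमंत्री\b/;

console.log(lettersOnly.test("मंत्री123"));          // true
console.log(lettersAndDigits.test("मंत्री123"));     // false
console.log(lettersAndDigits.test("मंत्री मुख्यमंत्री")); // true
console.log(asciiBoundary.test("मंत्री मुख्यमंत्री"));    // false
```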

I've used a positive lookahead at the end of the regular expression so that the character following your word is excluded from the results. JavaScript regular expressions did not support lookbehind at the time of writing, so if there is a character before मंत्री, it will appear in the results. You'll have to decide what you want to do with this character for your specific application.
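One way to deal with the leading character is to wrap the word in a capture group and read group 1 instead of the full match. The sketch below assumes native Unicode property escapes and `String.prototype.matchAll` (ES2020). In engines that implement ES2018 lookbehind, a negative lookbehind is another option; both variants are illustrations, not part of the original answer:

```javascript
// Minimal sketch, assuming ES2018 property escapes and lookbehind,
// plus ES2020 String.prototype.matchAll.
const text = "मंत्री, मंत्री";

// Variant 1: capture group around the word; the separator stays in m[0].
const pattern = /(?:^|[^\p{L}\p{N}])(मंत्री)(?=[^\p{L}\p{N}]|$)/gu;
const fullMatches = [...text.matchAll(pattern)].map((m) => m[0]);
const words = [...text.matchAll(pattern)].map((m) => m[1]);

console.log(fullMatches); // [ 'मंत्री', ' मंत्री' ]  (second match includes the space)
console.log(words);       // [ 'मंत्री', 'मंत्री' ]

// Variant 2: lookbehind and lookahead, so nothing but the word is consumed.
const lookaround = /(?<![\p{L}\p{N}])मंत्री(?![\p{L}\p{N}])/gu;
const lookbehindMatches = text.match(lookaround);

console.log(lookbehindMatches); // [ 'मंत्री', 'मंत्री' ]
```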

regex = XRegExp("(?:^|[^\\p{Devanagari}\\p{L}])मंत्री(?=[^\\p{Devanagari}\\p{L}]|$)");

This solved it, thanks to Louis in particular. I tested it against a more rigorous test case before finalizing:

मंत्री मंत्रीमंत्री मंत्रीमं ममंत्री मंत्री मंत्री मंत्री. .मंत्री मंत्री- <मंत्री मंत्री> मंत्री, ,मंत्री ,मंत्री, मंत्री,मंत्री, ,मंत्री,मंत्री,

मंत्री, मंत्री

मंत्री,मंत्री मंत्री मुख्यमंत्री
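One case where adding the script class appears to matter is a word followed by a Devanagari vowel sign. Such signs are combining marks rather than Letters, so a letters-only boundary treats them as separators. The sketch below assumes native property escapes, where `\p{Script=Devanagari}` stands in for XRegExp's `\p{Devanagari}`, and searches for the shorter word मंत्र as an illustration that is not taken from the original post:

```javascript
// Minimal sketch, assuming native ES2018 Unicode property escapes.
// \p{Script=Devanagari} is used here in place of XRegExp's \p{Devanagari}.
const lettersOnly = /(?:^|\P{L})मंत्र(?=\P{L}|$)/u;
const scriptAware =
  /(?:^|[^\p{Script=Devanagari}\p{L}])मंत्र(?=[^\p{Script=Devanagari}\p{L}]|$)/u;

// In "मंत्री" the search word is followed by the vowel sign ी (a mark, not a Letter).
console.log(lettersOnly.test("मंत्री"));   // true  (false positive)
console.log(scriptAware.test("मंत्री"));   // false
console.log(scriptAware.test("मंत्र है")); // true
```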

Add this to the fiddle and check whether it helps:

alert(fullpage_content);
//match(/मंत्री/g);
alert("मंत्री मुख्यमंत्री".match(/मंत्री/g));

If you assume that each word is followed by one or more whitespace characters as a word break, then the following JavaScript regular expression will give you the correct result:

console.log("count inline without xRegExp:" + "मंत्री मुख्यमंत्री".match(/मंत्री\s+/g));
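A caveat with the whitespace pattern is that it does not check what precedes the word, so it also matches the tail of मुख्यमंत्री whenever that word is followed by whitespace (as it is in the question's HTML body). Under the same whitespace-only assumption, a rough alternative is to split the text into tokens and compare each one; `countByTokens` is a hypothetical helper name:

```javascript
// Minimal sketch, assuming words are separated only by whitespace.
// `countByTokens` is a hypothetical helper, not from the original answer.
function countByTokens(text, word) {
  // Split on runs of whitespace and keep tokens that equal the word exactly.
  return text.split(/\s+/).filter((token) => token === word).length;
}

const body = "मंत्री मुख्यमंत्री "; // note the trailing space, as in the question's HTML
const suffixMatches = body.match(/मंत्री\s+/g).length;

console.log(suffixMatches);              // 2 (the tail of मुख्यमंत्री also matches)
console.log(countByTokens(body, "मंत्री")); // 1
```

This sketch does not handle words adjacent to punctuation such as commas or angle brackets; for those cases the Unicode-property patterns are a better fit.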
