javascript - regular expression to match hashtags in both left to right and right to left languages - Stack Overflow

I use the following code to find words that start with hashtags:

var regex = /(?:^|\W)#(\w+)(?!\w)/g;

but it only matches English words and cannot match hashtags in other languages such as Arabic. So how can I find hashtags in a text like this:

this is a simple #text
هذا #نص بسیط

asked Oct 10, 2020 at 9:37 by user6931342

3 Answers

If the hashtag text should not itself contain a #, you could use a negated character class [^\s#], which matches any character except whitespace and #. An alternation | then covers both positions of the #: before the word or after it.

The value is in capture group 1.

(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)

// No g flag: match() returns only the first hashtag found in each string.
const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/;
[
  "this is a simple #test1",
  "هذا #نص بسیط",
  "test #test2#",
  "test #test3#test3",
  "test ##test4",
  "test test5##",
].forEach(s => {
  const m = s.match(pattern);
  if (m) console.log(m[1]);
});
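
Since the pattern above has no g flag, match() stops at the first hashtag in each string. A minimal sketch of collecting every hashtag with the same pattern plus the g flag and String.prototype.matchAll follows; the helper name extractHashtags is made up for illustration and is not part of the answer:

```javascript
// Hypothetical helper: same pattern as above, with the g flag added
// so that matchAll() can walk every hashtag in the text.
function extractHashtags(text) {
  const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/g;
  // Capture group 1 holds the hashtag without the leading whitespace.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(extractHashtags("this is a simple #test1 and #test2")); // ["#test1", "#test2"]
console.log(extractHashtags("هذا #نص بسیط"));                       // one match: the Arabic hashtag
console.log(extractHashtags("test #test3#test3"));                  // []
```

The lookahead (?=$|\s) does not consume the whitespace after a hashtag, so the next match can still start at that whitespace and adjacent hashtags are both found.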

You may use the following regex alternation:

(?<!\S)#\S+|\S+#(?!\S)
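
A short sketch of how this pattern might be used in JavaScript, assuming an engine with lookbehind support (ES2018 or later). Because \S matches any non-whitespace character, trailing punctuation becomes part of the match, as the third input shows. The sample strings and variable names are made up for illustration:

```javascript
// g flag added so that match() returns every hashtag, not just the first.
const hashtagPattern = /(?<!\S)#\S+|\S+#(?!\S)/g;

const samples = [
  "this is a simple #text",
  "هذا #نص بسیط",
  "see #text, please", // the comma is non-whitespace, so it is matched too
];

const results = samples.map((s) => s.match(hashtagPattern));
results.forEach((r, i) => console.log(samples[i], "=>", r));
```

If punctuation should not be part of the hashtag, a narrower character class than \S is needed.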

Bearing in mind that a Unicode-aware \w can be represented with [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see What's the correct regex range for javascript's regexes to match all the non word characters in any script?), the direct Unicode equivalent of your pattern, with the (?:^|\W) prefix replaced by a negative lookbehind so that the preceding character is not consumed, is

const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})#(${uw}+)(?!${uw})`, "gu");
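
A quick check of the difference, as a sketch: JavaScript's built-in \w covers only [A-Za-z0-9_], even with the u flag, while the property-based class also accepts Arabic letters and the zero-width non-joiner (U+200C, a Join_Control character that is common in Persian text). The names asciiWord, unicodeWord, samples, and rows are illustrative:

```javascript
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

const asciiWord = /^\w+$/u;                      // built-in \w: ASCII letters, digits, _
const unicodeWord = new RegExp(`^${uw}+$`, "u"); // property-based class

// English, Arabic, and a Persian word containing U+200C.
const samples = ["text", "نص", "می\u200Cخواهم"];
const rows = samples.map((s) => ({
  ascii: asciiWord.test(s),
  unicode: unicodeWord.test(s),
}));
console.log(rows);
// [ { ascii: true, unicode: true },
//   { ascii: false, unicode: true },
//   { ascii: false, unicode: true } ]
```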

Now, to match both directions, you may use

const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
                                    ^___________^_______^

That is, a non-capturing group with the alternation operator | holds two alternatives: a # followed by Unicode word chars, or Unicode word chars followed by a #. Details:

  • (?<!${uw}) - a negative lookbehind that fails the match if there is a Unicode word char immediately on the left
  • (?:#(${uw}+)|${uw}+#) - a non-capturing group that matches either
    • #(${uw}+) - a # char followed with one or more Unicode word chars
    • | - or
    • ${uw}+# - one or more Unicode word chars followed with a # char
  • (?!${uw}) - a negative lookahead that fails the match if there is a Unicode word char immediately on the right.

The g flag ensures multiple matches and u enables the Unicode property classes support in the pattern.

A JavaScript demo:

const strings = ["this is a simple #text #text2", "هذا #نن*&ص بسیط","#نص2 هذا #نص بسیط"];
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
strings.forEach( string => console.log(string, '=>', string.match(regex)))
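
Note that in the pattern above, capture group 1 is filled only when the # comes first; for the word# form it stays undefined. If the tag text itself is needed for both forms, one possible variation (not part of the original answer) adds a second capture group and reads whichever one matched. The helper name hashtagNames is hypothetical:

```javascript
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
// Group 1: text after a leading #; group 2: text before a trailing #.
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|(${uw}+)#)(?!${uw})`, "gu");

// Hypothetical helper: returns the hashtag texts without the # sign.
function hashtagNames(text) {
  return Array.from(text.matchAll(regex), (m) => m[1] ?? m[2]);
}

console.log(hashtagNames("this is a simple #text #text2")); // ["text", "text2"]
console.log(hashtagNames("#نص2 هذا #نص بسیط"));             // two Arabic tag names
console.log(hashtagNames("tagged word# here"));             // ["word"]
```

matchAll works on a copy of the regex, so the shared g-flagged regex can be reused across calls without resetting lastIndex.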
