I use the following code to find words that start with hashtags:
var regex = /(?:^|\W)#(\w+)(?!\w)/g;
but it only matches English words and cannot match hashtags in other languages such as Arabic. So how can I find hashtags in text like this:
this is a simple #text
هذا #نص بسیط
asked Oct 10, 2020 at 9:37 by user6931342
3 Answers
If the value after the # should not contain a # itself, you could use a negated character class [^\s#], which matches any character except # or whitespace. To match the # on either side of the word, use an alternation |. The value is in capture group 1.
(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)
const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/;
[
"this is a simple #test1",
"هذا #نص بسیط",
"test #test2#",
"test #test3#test3",
"test ##test4",
"test test5##",
].forEach(s => {
  const m = s.match(pattern);
  if (m) console.log(m[1]);
});
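The demo above prints only the first hashtag per string because the pattern has no g flag. As a minimal sketch, assuming you want every hashtag in a string, the same pattern can be given the g flag and iterated with matchAll. The names globalPattern and findHashtags are hypothetical and not part of the answer.

```javascript
// Same pattern as above, with the g flag so matchAll can iterate over every match.
const globalPattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/g;

// findHashtags is a hypothetical helper name used only for this sketch.
function findHashtags(text) {
  // matchAll yields one match object per hit; capture group 1 holds the hashtag.
  return Array.from(text.matchAll(globalPattern), m => m[1]);
}

console.log(findHashtags("this is a simple #test1 and #test2"));
console.log(findHashtags("هذا #نص بسیط"));
```

Because the trailing boundary is a lookahead, the whitespace after one hashtag is not consumed and can still serve as the leading boundary of the next one.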
You may use the following regex alternation:
(?<!\S)#\S+|\S+#(?!\S)
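A short sketch of how this alternation might be used, assuming a JavaScript engine that supports lookbehind. The names whitespaceBounded and matchBounded are hypothetical.

```javascript
// The alternation from above, with the g flag to collect every match.
const whitespaceBounded = /(?<!\S)#\S+|\S+#(?!\S)/g;

// matchBounded is a hypothetical helper name used only for this sketch.
function matchBounded(text) {
  // With the g flag, match() returns every full match, or null when there is none.
  return text.match(whitespaceBounded) || [];
}

const samples = ["this is a simple #text", "هذا #نص بسیط", "tag at the end text#"];
for (const s of samples) {
  console.log(s, "=>", matchBounded(s));
}
```

Note that \S also matches punctuation and further # characters, so a token such as #a#b is returned as a single match.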
Bearing in mind that a Unicode-aware \w can be represented with [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see "What's the correct regex range for javascript's regexes to match all the non word characters in any script?"), the direct Unicode equivalent of your pattern is
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})#(${uw}+)(?!${uw})`, "gu");
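To illustrate the difference, here is a small sketch that runs the pattern from the question and the Unicode-aware version on the same text. The names unicodeWord, asciiHashtag, unicodeHashtag and tagNames are hypothetical and chosen only for this comparison.

```javascript
// Unicode-aware stand-in for \w, as defined in the answer above.
const unicodeWord = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

// Pattern from the question: \w only covers ASCII letters, digits and underscore.
const asciiHashtag = /(?:^|\W)#(\w+)(?!\w)/g;
// Unicode equivalent from the answer; the u flag enables \p{...} classes.
const unicodeHashtag = new RegExp(`(?<!${unicodeWord})#(${unicodeWord}+)(?!${unicodeWord})`, "gu");

// tagNames is a hypothetical helper: it returns capture group 1 of every match.
function tagNames(regex, text) {
  return Array.from(text.matchAll(regex), m => m[1]);
}

const sample = "this is a simple #text هذا #نص بسیط";
console.log(tagNames(asciiHashtag, sample));   // finds only the ASCII tag
console.log(tagNames(unicodeHashtag, sample)); // finds both tags
```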
Now, to match both directions, you may use
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
That is, a non-capturing group with an alternation | is used with two alternatives: one matches # followed by Unicode word chars, the other matches Unicode word chars followed by a #. Details:
(?<!${uw}) - a negative lookbehind that fails the match if there is a Unicode word char immediately on the left
(?:#(${uw}+)|${uw}+#) - a non-capturing group that matches either
  #(${uw}+) - a # char followed by one or more Unicode word chars
  | - or
  ${uw}+# - one or more Unicode word chars followed by a # char
(?!${uw}) - a negative lookahead that fails the match if there is a Unicode word char immediately on the right.
The g flag ensures multiple matches, and u enables support for Unicode property classes in the pattern.
A JavaScript demo:
const strings = ["this is a simple #text #text2", "هذا #نن*&ص بسیط","#نص2 هذا #نص بسیط"];
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
strings.forEach( string => console.log(string, '=>', string.match(regex)))
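With the g flag, string.match returns whole matches (including the #) and discards capture groups. As a minimal sketch, assuming you want only the tag text for both directions, named capture groups can be added and read through matchAll. The group names after and before, and the names uwClass, bothWays and extractTagText, are additions for this sketch and not part of the answer.

```javascript
const uwClass = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

// "after" captures the word following #, "before" captures the word preceding it.
const bothWays = new RegExp(
  `(?<!${uwClass})(?:#(?<after>${uwClass}+)|(?<before>${uwClass}+)#)(?!${uwClass})`,
  "gu"
);

// extractTagText is a hypothetical helper name used only for this sketch.
function extractTagText(text) {
  // Exactly one of the two named groups is defined for each match.
  return Array.from(text.matchAll(bothWays), m => m.groups.after || m.groups.before);
}

console.log(extractTagText("this is a simple #text #text2"));
console.log(extractTagText("#نص2 هذا #نص بسیط"));
console.log(extractTagText("trailing tag#"));
```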