I'm trying to clean YouTube video titles of unnecessary words such as "Official Video", "Audio", "Music Video", etc. I need help constructing a regex that I can use. What I have tried so far:
const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
As I understand it, this removes only the last occurrence of a keyword, so I run it in a loop like this:
function clearSearchTerm(title) {
  const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
  let newTitle;
  // Strip one trailing keyword per pass until the title stops changing.
  do {
    newTitle = title;
    title = title.replace(regex, "");
  } while (newTitle !== title);
  return title;
}
So far it works for me; I haven't found an example where it fails. It was pointed out in the comments that my previous regex would also remove keywords appearing in the middle of a title, which I believe this version solves. If you have any ideas on how to improve it, I'm all ears. Below are examples of what I need to remove.
The words I'm trying to remove are of this kind:
Audio
Video
Lyrics
Official
Remaster
2020 (or years in general)
...
All of those words (and maybe more) can appear between ( and ), between [ and ], or after -. They can also be combined, for example: "Some title - Official Video" should be cleaned to "Some title", etc.
2 Answers
With PCRE (typically in PHP), you can avoid repeating the words by declaring a sub-pattern and reusing it later in the main pattern. The x flag also lets you add comments and whitespace for readability:
/
(?(DEFINE)
(?<words_to_drop>
(?:
\s*
\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b
\s*
)+
)
)
# Finishing by - and words to remove (but not years).
\s+[-–]\s+\g<words_to_drop>$
| # or
# Words or years to remove between brackets or parenthesis.
\s*[[(](?:\g<words_to_drop>|\s*\d{4}\s*)+[\])]
/ix
See it in action with the explanation: https://regex101/r/kPeYzb/1
Note that the part of the regex for brackets and parentheses isn't 100% correct, as it would also match "(Official video]". I prefer keeping the regex short by avoiding a third reuse of the sub-pattern, and I don't think this matters much in your case.
If you have to stick to JavaScript's engine, you'll have to remove the spaces and comments and copy-paste the pattern for the words, which gives the same pattern in JavaScript flavour:
const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;
In action here: https://regex101/r/kPeYzb/2
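As a quick check outside regex101, the JavaScript pattern above can be applied to a few of the sample titles. This is only a sketch: the names answerPattern and stripNoise are hypothetical and not part of the original answer.

```javascript
// Pattern copied verbatim from the answer above (JavaScript flavour).
const answerPattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;

// Hypothetical helper: removes every match of the pattern from one title.
function stripNoise(title) {
  return title.replace(answerPattern, '');
}

console.log(stripNoise('Some title - Official Video'));               // "Some title"
console.log(stripNoise('Some title [Official Video]'));               // "Some title"
console.log(stripNoise('1979 (Remastered 2012)'));                    // "1979"
console.log(stripNoise('The Buggles - Video killed the Radio Star')); // unchanged
console.log(stripNoise('Pulp - Disco 2000'));                         // unchanged
```

The last two titles stay intact because the first alternative requires the keywords to run up to the end of the string, and the second only fires inside brackets or parentheses.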
Now, about your question of avoiding having this list of words entered twice in the regex pattern: it is possible to create the regex object from a string with the RegExp() constructor. This means that you could have an array of words (or word patterns) from a configuration:
// Original commented regular expression : https://regex101/r/kPeYzb/1
// We will build this regular expression from a custom list of words,
// for example taken from a configuration page.
const wordPatternsFromConfig = [
'Official',
'Video',
'Audio',
'Music',
'Lyrics?',
'Remaster(?:ed)?',
'HD',
'LP',
'HQ',
'4k',
'Full',
'Version',
// Uncomment this pattern with an error, for the demo.
//'Dumm(?y|ies)' // Instead of "Dumm(?:y|ies)"
];
// IMPORTANT: You should validate each word regex before saving the config.
// Example of how you could do this:
const validWordPatterns = [];
const invalidWordPatterns = [];
wordPatternsFromConfig.forEach((wordPattern) => {
  try {
    // Compiling the pattern throws a SyntaxError if it is invalid.
    new RegExp(wordPattern);
    validWordPatterns.push(wordPattern);
  } catch (e) {
    invalidWordPatterns.push(e.message);
  }
});
if (invalidWordPatterns.length > 0) {
  console.log('You have invalid word patterns! Check the following errors:', invalidWordPatterns);
}
// IMPORTANT: compared to the regex syntax, if we build a RegExp instance
// from a string, each backslash should be escaped.
// The regex to match multiple words from this list of words to remove.
const regexWordsToRemove = '(?:\\s*\\b(?:' + validWordPatterns.join('|') + ')\\b\\s*)+';
// The full regex pattern, for the first cleanup step.
const patternCleanup1 = '\\s+[-–]\\s+' + regexWordsToRemove + '$|\\s*[[(](?:' + regexWordsToRemove + '|\\s*\\d{4}\\s*)+[\\])]';
// Create the regex object from the pattern string.
const regexCleanup1 = new RegExp(patternCleanup1, 'gmi');
// Printing it should give the same result as the original regex we
// made here: https://regex101/r/kPeYzb/2
//console.log(regexCleanup1);
// A second regex to clean up some other undesired things at the end.
const regexCleanup2 = /\s*[-(\[|]*\s*$/gmi;
// When HTML is parsed and content loaded, add the JS logic.
document.addEventListener('DOMContentLoaded', () => {
  const input = document.getElementById('input');
  const output = document.getElementById('output');
  // Function to update the output, based on the input.
  function updateOutput() {
    output.value = input.value.replace(regexCleanup1, '').replace(regexCleanup2, '');
  }
  // When the input changes, update the output.
  input.addEventListener('input', updateOutput);
  // Update the output for the initial input value.
  updateOutput();
});
body {
font-family: Arial, sans-serif;
}
.two-cols {
display: grid;
grid-template-columns: 1fr 1fr;
grid-column-gap: .5em;
}
textarea {
/* Just because the snippet space is small. */
font-size: 0.8em;
/* Don't wrap the text, to make comparison easier. */
white-space: pre;
overflow-wrap: normal;
overflow-x: scroll;
box-sizing: border-box;
width: 100%;
}
textarea[readonly] {
color: #666;
background: #f8f8f8;
}
small {
font-size: 0.65em;
}
<form id="clean-up" class="two-cols" action="#">
<div>
<label for="input">Input:</label>
<textarea id="input" name="input"
placeholder="Put your text here"
rows="10">Some title - Official Video
Some title [Official Video]
Some title (Official Video)
The Buggles - Video killed the Radio Star
The Smashing Pumpkins - 1979 (Official Music Video)
Miki Jevremović - Prijatelji, ja vam pevam | [Official Music Audio]
1979 (Remastered 2012)
New Order – 1963 (Lyrics)
Paul Davis - '65 Love Affair (1981 LP Version HQ)
Pulp - Disco 2000</textarea>
</div>
<div>
<label for="output">Out: <small>auto-updated</small></label>
<textarea id="output" name="output"
placeholder="Modified text" readonly
rows="10"></textarea>
</div>
</form>
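The snippet above is tied to a page with input and output fields. The same two-step cleanup could also be wrapped in a standalone function, for example to run under Node. The sketch below reuses the pattern strings from the snippet; the name buildTitleCleaner and the choice to throw when no pattern is valid are assumptions made for illustration.

```javascript
// Hypothetical helper: builds a cleaning function from a list of word patterns.
function buildTitleCleaner(wordPatterns) {
  // Keep only the patterns that compile on their own.
  const valid = wordPatterns.filter((wordPattern) => {
    try {
      new RegExp(wordPattern);
      return true;
    } catch (e) {
      return false;
    }
  });
  if (valid.length === 0) {
    // Assumption: an empty word list is treated as a configuration error.
    throw new Error('No valid word patterns');
  }
  // Same pattern strings as in the snippet above (backslashes escaped).
  const words = '(?:\\s*\\b(?:' + valid.join('|') + ')\\b\\s*)+';
  const cleanup1 = new RegExp(
    '\\s+[-–]\\s+' + words + '$|\\s*[[(](?:' + words + '|\\s*\\d{4}\\s*)+[\\])]',
    'gmi'
  );
  // Second pass: drop separators left dangling at the end of a line.
  const cleanup2 = /\s*[-(\[|]*\s*$/gmi;
  return (text) => text.replace(cleanup1, '').replace(cleanup2, '');
}

const cleanTitle = buildTitleCleaner([
  'Official', 'Video', 'Audio', 'Music', 'Lyrics?', 'Remaster(?:ed)?',
  'HD', 'LP', 'HQ', '4k', 'Full', 'Version',
  'Dumm(?y|ies)', // invalid on purpose, silently skipped here
]);

console.log(cleanTitle('Miki Jevremović - Prijatelji, ja vam pevam | [Official Music Audio]'));
// "Miki Jevremović - Prijatelji, ja vam pevam"
console.log(cleanTitle("Paul Davis - '65 Love Affair (1981 LP Version HQ)"));
// "Paul Davis - '65 Love Affair"
```

Silently skipping invalid patterns keeps the sketch short; as the snippet's comments point out, validating them when the configuration is saved is the better place to report errors.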
This regex will match -, [ or (, followed by any number of literal spaces, followed by any of the words OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO or a four-digit number, followed by any number of whitespace characters, followed by a matching closing bracket (when applicable).
REGEX PATTERN (ECMAScript(JavaScript) flavor)(Flags: gmi):
(?:-|\((?:(?<=\()(?= *[^)\n]+ *\)))|\[(?:(?<=\[)(?= *[^\]\n]+ *\]))) *(?:OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4})\s*(?:\]|\))?(?= |\n|$)
Regex demo: https://regex101/r/Wy2I0w/8 (10 matches)
NOTES:
(|\[(?:(?<=\[)(?= *[^\]\n]* *\])))
(?:        Open non-capturing group (?:...) containing an alternation (...|...|...). Match one of the elements separated by the pipe (|).
-          Match literal dash - (1st option).
|          Alternation element delimiter. Followed by 2nd option.
\(         Match literal (.
(?:        Begin non-capturing group (?:...) (2nd option).
(?<=       Begin lookbehind (?<=...) to check for opening (.
\(         Match literal (. This character must precede this index point.
)          Close lookbehind.
(?=        Begin lookahead (?=...) to make sure there is a matching closing ). Will not consume characters.
" *"       Match 0 or more (*) literal spaces.
[^)\n]+    Negated character class [^...] matches any character that is not ) or newline \n, 1 or more times (+).
" *"       Match 0 or more (*) literal spaces.
\)         Match literal ).
)          Close lookahead.
)          Close non-capturing group (2nd option).
|          Alternation element delimiter. Followed by 3rd option.
\[         Match literal [.
(?:        Begin non-capturing group (?:...) (3rd option).
(?<=       Begin lookbehind (?<=...) to check for opening [.
\[         Match literal [.
)          Close lookbehind.
(?=        Begin lookahead to locate matching closing bracket ]. Will not consume characters.
" *"       Match 0 or more literal spaces.
[^\]\n]+   Negated character class. Match any character that is not ] or newline \n, one or more times (+).
" *"       Match 0 or more literal spaces.
\]         Match literal ].
)          Close lookahead.
)          Close non-capturing group.
)          Close alternation group.
" *"       Match 0 or more literal spaces.
(?:        Begin non-capturing group containing an alternation.
OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4}   Alternation matches one of the words listed or four digits \d{4} (year).
)          Close non-capturing group.
\s*        Match 0 or more whitespace characters \s.
(?:        Open non-capturing group containing alternation.
\]|\)      Match either a literal ] or a literal ).
)?         Close alternation group. Make it optional (?).
(?=        Begin lookahead, will not consume characters.
" |\n|$"   Matches a literal space character, a newline \n, or end of line $.
)          Close lookahead.
TEST STRING:
FIRST title - Official Video
SECOND title [Official VIDEO]
THIRD title (Lyrics)
FOURTH title - Remaster
FIFTH title - [ Audio ]
SIXTH title ( Lyrics )
SEVENTH title (2020)
EIGHT title (1999)
NINTH title (20)
TENTH title [ 2002 ]
ELEVENTH title [ 200 ]
TWELFTH title ( 1999 )
THIRTEENTH title ( Official Lyrics )
FOURTEENTH title ( Official VIDEO]
FOURTEENTH title ( Official VIDEO
FOURTEENTH title [Official VIDEO)
FOURTEENTH title Official VIDEO]
RESULT:
FIRST title
SECOND title
THIRD title
FOURTH title
FIFTH title -
SIXTH title
SEVENTH title
EIGHT title
NINTH title (20)
TENTH title
ELEVENTH title [ 200 ]
TWELFTH title
THIRTEENTH title ( Official Lyrics )
FOURTEENTH title ( Official VIDEO]
FOURTEENTH title ( Official VIDEO
FOURTEENTH title [Official VIDEO)
FOURTEENTH title Official VIDEO]
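To try this pattern outside regex101, it can be applied title by title in JavaScript. This is a sketch under a few assumptions: it needs an engine with lookbehind support (ES2018 or later), the names lookaroundPattern and applyPattern are hypothetical, and the trimEnd() call is an addition that drops the trailing space the match leaves behind.

```javascript
// Pattern and flags copied from the answer above.
const lookaroundPattern = /(?:-|\((?:(?<=\()(?= *[^)\n]+ *\)))|\[(?:(?<=\[)(?= *[^\]\n]+ *\]))) *(?:OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4})\s*(?:\]|\))?(?= |\n|$)/gmi;

// Hypothetical helper: cleans a single title (one line of text).
function applyPattern(title) {
  // trimEnd() is not part of the original answer; the regex itself leaves
  // the space that preceded the removed part.
  return title.replace(lookaroundPattern, '').trimEnd();
}

console.log(applyPattern('FIRST title - Official Video'));         // "FIRST title"
console.log(applyPattern('SECOND title [Official VIDEO]'));        // "SECOND title"
console.log(applyPattern('FIFTH title - [ Audio ]'));              // "FIFTH title -"
console.log(applyPattern('NINTH title (20)'));                     // unchanged
console.log(applyPattern('FOURTEENTH title ( Official VIDEO]'));   // unchanged
```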
Video in The Buggles - Video killed the Radio Star then? :-) – C3roe Commented Mar 6 at 14:48
(, -, [)? – Milos Stojanovic Commented Mar 6 at 15:06