javascript - Regex to remove substrings such as "Official Video", "Audio", "Music V

I'm trying to clean YouTube video titles of unnecessary words such as "Official Video", "Audio", "Music Video" etc. I need help constructing a regex that I can use. What I've tried so far:

const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;

As I understand it, this removes only the last occurrence of a keyword, so I use it in a loop like this:

function clearSearchTerm(title) {
    const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
    let newTitle;

    do {
        newTitle = title;
        title = title.replace(regex, "");
    } while (newTitle !== title);

    return title;
}

Right now it works for me: I haven't found an example where it fails. As was pointed out in the comments, my previous regex also removed keywords that appeared in the middle of a title, which I believe anchoring the match to the end of the string solves. If you have any idea how this can be improved, I'm all ears. Below are examples of what I need to remove.

The words I'm trying to remove are of this kind:

Audio
Video
Lyrics
Official
Remaster
2020 (or years in general)
...

All of these words (and maybe more) can appear between ( and ), between [ and ], or after -. They can also be combined; for example, Some title - Official Video should be cleaned to Some title.
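
To see how the loop behaves, here is a minimal sketch that runs the function above on a few sample titles. The sample titles are illustrative assumptions, and the results noted afterwards were worked out for this particular regex only, so treat this as a sanity check rather than a full test suite.

```javascript
// Same function as in the question, repeated so the sketch runs on its own.
function clearSearchTerm(title) {
    const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
    let newTitle;

    // Keep stripping one trailing keyword at a time until nothing changes.
    do {
        newTitle = title;
        title = title.replace(regex, "");
    } while (newTitle !== title);

    return title;
}

// Illustrative sample titles (assumptions, not an exhaustive list).
const samples = [
    "Some title - Official Video",
    "Some title (Official Video)",
    "The Buggles - Video killed the Radio Star",
    "Pulp - Disco 2000",
];

for (const sample of samples) {
    console.log(JSON.stringify(sample), "=>", JSON.stringify(clearSearchTerm(sample)));
}
```

The first two samples are reduced to "Some title", and a keyword in the middle of a title survives because of the trailing $. A title that legitimately ends in a four-digit number, such as the last sample, loses that number and comes out as "Pulp - Disco".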

Asked Mar 6 at 14:38 by Milos Stojanovic, edited Mar 6 at 15:40

Comments:
  • So what will happen to Video in The Buggles - Video killed the Radio Star then? :-) – C3roe Commented Mar 6 at 14:48
  • A necessary mistake :). Maybe it would be nice to only match at the end of the string. That might be too strict a constraint, but I guess those words are almost always at the end of the title. – Milos Stojanovic Commented Mar 6 at 14:50
  • There is no way to answer with 100% accuracy right now, so, something like this, maybe... Or maybe not. – Wiktor Stribiżew Commented Mar 6 at 14:59
  • @WiktorStribiżew A bit too repetitive, right? Why can't it be compressed into a shorter regex where the keywords aren't repeated for every prefix ((, -, [)? – Milos Stojanovic Commented Mar 6 at 15:06
  • It does not matter. You may use a variable. JS regex patterns are not as limited in length as PCRE ones. Or do you have thousands of these words? – Wiktor Stribiżew Commented Mar 6 at 15:08
2 Answers

With PCRE (typically in PHP), you can avoid repeating the words by declaring a sub-pattern and then reusing it later in the main pattern. The x flag also lets you add comments and spaces for readability:

/
(?(DEFINE)
  (?<words_to_drop>
    (?:
      \s*
      \b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b
      \s*
    )+
  )
)
# Finishing by - and words to remove (but not years).
\s+[-–]\s+\g<words_to_drop>$
| # or
# Words or years to remove between brackets or parenthesis.
\s*[[(](?:\g<words_to_drop>|\s*\d{4}\s*)+[\])]
/ix

See it in action with the explanation: https://regex101.com/r/kPeYzb/1

Notice that the part of the regex for brackets or parentheses isn't 100% correct, as it would also match "(Official video]". I prefer keeping the regex short by avoiding a third reuse of the sub-pattern, and I really don't think this matters much in your case.
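
If mismatched pairs such as "(Official video]" ever do matter, one option is to spell out each bracket type as its own alternative. The sketch below does this in JavaScript, assembling the pattern from strings with the RegExp() constructor so the word list is still written only once. The variable names are illustrative assumptions, and only the bracketed half of the pattern is covered.

```javascript
// Word list copied from the pattern above; backslashes are doubled
// because the pattern is assembled from strings.
const words = '(?:\\s*\\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\\b\\s*)+';
// Words or years, as allowed between brackets or parentheses.
const inner = '(?:' + words + '|\\s*\\d{4}\\s*)+';
// An opening ( must be closed by ) and an opening [ by ].
const strictBrackets = new RegExp('\\s*(?:\\(' + inner + '\\)|\\[' + inner + '\\])', 'gi');

console.log('Some title (Official video)'.replace(strictBrackets, ''));  // Some title
console.log('Some title [Remastered 2012]'.replace(strictBrackets, '')); // Some title
console.log('Some title (Official video]'.replace(strictBrackets, ''));  // left unchanged
```

The cost is that the inner part appears twice in the compiled pattern, which is the repetition the shorter version avoids.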

If you have to stick to JavaScript's engine, you'll have to remove the spaces and comments and copy-paste the pattern for the words, which gives the same pattern in JavaScript flavour:

const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;

In action here: https://regex101.com/r/kPeYzb/2
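
As a rough usage sketch, the pattern can be wrapped in a small function and tried on a few titles. The function name cleanTitle and the final trim() are assumptions added for illustration; they are not part of the pattern itself.

```javascript
// Pattern copied from above (JavaScript flavour, flags g and i).
const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;

// Hypothetical helper: removes every match, then trims leftover whitespace.
function cleanTitle(title) {
  return title.replace(pattern, '').trim();
}

console.log(cleanTitle('Some title - Official Video'));
// Some title
console.log(cleanTitle('The Buggles - Video killed the Radio Star'));
// The Buggles - Video killed the Radio Star
console.log(cleanTitle('The Smashing Pumpkins - 1979 (Official Music Video)'));
// The Smashing Pumpkins - 1979
console.log(cleanTitle("Paul Davis - '65 Love Affair (1981 LP Version HQ)"));
// Paul Davis - '65 Love Affair
console.log(cleanTitle('Pulp - Disco 2000'));
// Pulp - Disco 2000
```

Years are only removed inside brackets or parentheses, which is why 1979 and Disco 2000 survive.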

Now, about your question of avoiding having this list of words entered twice in the regex pattern, it is possible to create the regex object from a string, with the RegExp() constructor. This means that you could have an array of words (or word patterns) from a configuration:

// Original commented regular expression: https://regex101.com/r/kPeYzb/1

// We will build this regular expression from a custom list of words,
// for example taken from a configuration page.
const wordPatternsFromConfig = [
  'Official',
  'Video',
  'Audio',
  'Music',
  'Lyrics?',
  'Remaster(?:ed)?',
  'HD',
  'LP',
  'HQ',
  '4k',
  'Full',
  'Version',
  // Uncomment this pattern with an error, for the demo.
  //'Dumm(?y|ies)' // Instead of "Dumm(?:y|ies)"
];

// IMPORTANT: You should validate each word regex before saving the config.
// Example of how you could do this:
let validWordPatterns = [];
let invalidWordPatterns = [];
wordPatternsFromConfig.forEach((wordPattern) => {
  try {
    const wordRegex = new RegExp(wordPattern);
    validWordPatterns.push(wordPattern);
  } catch (e) {
    invalidWordPatterns.push(e.message);
  }
});
if (invalidWordPatterns.length > 0) {
  console.log('You have invalid word patterns! Check the following errors:', invalidWordPatterns);
}

// IMPORTANT: compared to the regex syntax, if we build a RegExp instance
//            from a string, each backslash should be escaped.
// The regex to match multiple words from this list of words to remove.
const regexWordsToRemove = '(?:\\s*\\b(?:' + validWordPatterns.join('|') + ')\\b\\s*)+';
// The full regex pattern, for the first cleanup step.
const patternCleanup1 = '\\s+[-–]\\s+' + regexWordsToRemove + '$|\\s*[[(](?:' + regexWordsToRemove + '|\\s*\\d{4}\\s*)+[\\])]';
// Create the regex object from the pattern string.
const regexCleanup1 = new RegExp(patternCleanup1, 'gmi');
// Printing it should give the same result as the original regex we
// made here: https://regex101.com/r/kPeYzb/2
//console.log(regexCleanup1);
// A second regex to clean up some other undesired things at the end.
const regexCleanup2 = /\s*[-(\[|]*\s*$/gmi;

// When HTML is parsed and content loaded, add the JS logic.
document.addEventListener('DOMContentLoaded', (loaded) => {
  const input = document.getElementById('input');
  const output = document.getElementById('output');

  // Function to update the output, based on the input.
  function updateOutput() {
    output.value = input.value.replace(regexCleanup1, '').replace(regexCleanup2, '');
  }

  // When the input changes, update the output.
  input.addEventListener('input', updateOutput);
  
  // Update the output for the initial input value.
  updateOutput();
});

/* CSS of the demo snippet. */
body {
  font-family: Arial, sans-serif;
}

.two-cols {
  display: grid;
  grid-template-columns: 1fr 1fr;
  grid-column-gap: .5em;
}

textarea {
  /* Just because the snippet space is small. */
  font-size: 0.8em;
  /* Don't wrap the text, to make comparison easier. */
  white-space: pre;
  overflow-wrap: normal;
  overflow-x: scroll;
  box-sizing: border-box;
  width: 100%;
}

textarea[readonly] {
  color: #666;
  background: #f8f8f8;
}

small {
  font-size: 0.65em;
}

<!-- HTML of the demo snippet. -->
<form id="clean-up" class="two-cols" action="#">

  <div>
    <label for="input">Input:</label>
    <textarea id="input" name="input"
              placeholder="Put your text here"
              rows="10">Some title - Official Video
Some title [Official Video]
Some title (Official Video)
The Buggles - Video killed the Radio Star
The Smashing Pumpkins - 1979 (Official Music Video)
Miki Jevremović - Prijatelji, ja vam pevam | [Official Music Audio]
1979 (Remastered 2012)
New Order – 1963 (Lyrics)
Paul Davis - '65 Love Affair (1981 LP Version HQ)
Pulp - Disco 2000</textarea>
  </div>
  
  <div>
    <label for="output">Out: <small>auto-updated</small></label>
    <textarea id="output" name="output"
              placeholder="Modified text" readonly
              rows="10"></textarea>
  </div>
  
</form>
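
The snippet above is tied to a browser page. As a sketch of the same idea without the DOM, the two cleanup steps can be wrapped in a function that takes the word patterns as an argument. The name buildTitleCleaner is hypothetical, and silently skipping patterns that fail to compile is an assumption made to keep the example short; a real configuration page would probably report them instead.

```javascript
// Hypothetical helper: builds a title-cleaning function from word patterns.
function buildTitleCleaner(wordPatterns) {
  // Keep only the patterns that compile on their own.
  const valid = wordPatterns.filter((wordPattern) => {
    try {
      new RegExp(wordPattern);
      return true;
    } catch (e) {
      return false;
    }
  });
  if (valid.length === 0) {
    throw new Error('No valid word patterns were given.');
  }

  // Same construction as in the snippet: backslashes are doubled
  // because the pattern is built from strings.
  const words = '(?:\\s*\\b(?:' + valid.join('|') + ')\\b\\s*)+';
  const cleanup1 = new RegExp(
    '\\s+[-–]\\s+' + words + '$|\\s*[[(](?:' + words + '|\\s*\\d{4}\\s*)+[\\])]',
    'gi'
  );
  // Second step: drop leftover separators at the end of the title.
  const cleanup2 = /\s*[-(\[|]*\s*$/g;

  return (title) => title.replace(cleanup1, '').replace(cleanup2, '');
}

// The last pattern is deliberately invalid and gets skipped.
const clean = buildTitleCleaner(['Official', 'Video', 'Audio', 'Music', 'Lyrics?', 'Dumm(?y|ies)']);
console.log(clean('Some title - Official Video'));
// Some title
console.log(clean('Miki Jevremović - Prijatelji, ja vam pevam | [Official Music Audio]'));
// Miki Jevremović - Prijatelji, ja vam pevam
```

The m flag from the snippet is dropped here because the function is meant to receive one title at a time rather than a multi-line textarea value.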

This regex matches -, [ or (, followed by any number of literal spaces, followed by one of OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO or a four-digit number, followed by any number of whitespace characters and then a matching closing bracket (when applicable).

REGEX PATTERN (ECMAScript/JavaScript flavor, flags: gmi):

(?:-|\((?:(?<=\()(?= *[^)\n]+ *\)))|\[(?:(?<=\[)(?= *[^\]\n]+ *\]))) *(?:OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4})\s*(?:\]|\))?(?= |\n|$)

Regex demo: https://regex101.com/r/Wy2I0w/8 (10 matches)

NOTES:

  • (?: Open non-capturing group (?:...) containing an alternation (...|...|...). Matches one of the elements of the alternation, which are separated by the pipe (|).
  • - Match literal dash - (1st option)
  • | Alternation element delimiter. Followed by 2nd option.
  • \( Match literal (
  • (?: Begin non-capturing group (?:...) (2nd option)
  • (?<= Begin lookbehind (?<=...) to check for opening (.
  • \( Match literal (. This character must precede this index point.
  • ) Close lookbehind.
  • (?= Begin lookahead (?=...) to make sure there is a matching closing ). Will not consume characters.
  • * Match 0 or more (*) literal spaces .
  • [^)\n]+ Negated character class [^...] matches any character that is not ) or newline \n, 1 or more times (+).
  • * Match 0 or more (*) literal spaces .
  • \) Match literal ).
  • ) Close lookahead.
  • ) Close non-capturing group (2nd option)
  • | Alternation element delimiter. Followed by 3rd option.
  • \[ Match literal [.
  • (?: Begin non-capturing group (?:...) (3rd option)
  • (?<= Begin lookbehind (?<=...) to check for opening [.
  • \[ Match literal [. This character must precede this index point.
  • ) Close lookbehind.
  • (?= Begin lookahead to locate matching closing bracket ]. Will not consume characters.
  • * Match 0 or more literal spaces .
  • [^\]\n]+ Negated character class Match any character that is not ] or newline \n, one or more times (+).
  • * Match literal space 0 or more times.
  • \] Match literal ].
  • ) Close lookahead.
  • ) Close non-capturing group.
  • ) Close alternation group.
  • * Match 0 or more literal spaces .
  • (?: Begin non-capturing group containing an alternation.
  • OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4} Alternation matches one of the words listed or four digits \d{4} (year).
  • ) Close non-capturing group.
  • \s* Match 0 or more whitespace characters \s.
  • (?: Open non-capturing group containing alternation.
  • \]|\) Match either a literal ] or a literal ).
  • )? Close alternation group. Make it optional (?).
  • (?= Begin lookahead, will not consume characters.
  •  |\n|$ Matches a literal space character (the element before the first |), a newline \n or end of line $.
  • ) Close lookahead.

TEST STRING:

FIRST title - Official Video 
SECOND title [Official VIDEO]
THIRD title (Lyrics) 
FOURTH title - Remaster
FIFTH title - [ Audio ]
SIXTH title ( Lyrics ) 
SEVENTH title (2020) 
EIGHT title (1999)
NINTH title (20)
TENTH title [ 2002 ]
ELEVENTH title [ 200 ]
TWELFTH  title ( 1999 )
THIRTEENTH  title ( Official Lyrics )
FOURTEENTH  title ( Official VIDEO]
FOURTEENTH  title ( Official VIDEO
FOURTEENTH  title [Official VIDEO)
FOURTEENTH  title Official VIDEO]

RESULT:

FIRST title 
SECOND title 
THIRD title  
FOURTH title 
FIFTH title - 
SIXTH title  
SEVENTH title  
EIGHT title 
NINTH title (20)
TENTH title 
ELEVENTH title [ 200 ]
TWELFTH  title 
THIRTEENTH  title ( Official Lyrics )
FOURTEENTH  title ( Official VIDEO]
FOURTEENTH  title ( Official VIDEO
FOURTEENTH  title [Official VIDEO)
FOURTEENTH  title Official VIDEO]
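
To try the pattern from JavaScript, it can be written as a regex literal with the same gmi flags and passed to replace(). This is only a sketch: the helper name stripTags and the trimEnd() call are assumptions added for illustration (the RESULT above keeps trailing spaces), and the lookbehind (?<=...) needs an engine that supports it.

```javascript
// Pattern copied from above, with the flags g, m and i.
const pattern = /(?:-|\((?:(?<=\()(?= *[^)\n]+ *\)))|\[(?:(?<=\[)(?= *[^\]\n]+ *\]))) *(?:OFFICIAL VIDEO|REMASTER|LYRICS|AUDIO|\d{4})\s*(?:\]|\))?(?= |\n|$)/gmi;

// Hypothetical helper: removes the matches and trims trailing whitespace.
function stripTags(title) {
  return title.replace(pattern, '').trimEnd();
}

console.log(stripTags('FIRST title - Official Video'));
// FIRST title
console.log(stripTags('FIFTH title - [ Audio ]'));
// FIFTH title -
console.log(stripTags('NINTH title (20)'));
// NINTH title (20)
console.log(stripTags('FOURTEENTH  title ( Official VIDEO]'));
// FOURTEENTH  title ( Official VIDEO]
```

As in the RESULT above, the dash in front of a bracketed keyword is left behind, so a second cleanup pass may still be needed for titles like the FIFTH one.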
