Remove unnecessary attributes from html tag using JavaScript RegEx - Stack Overflow

I'm new to regular expressions and am trying to filter HTML tags, keeping only the required attributes (src / href / style) with their values and removing all other attributes. While googling I found a regular expression that keeps only the "src" attribute; my modified expression is as follows:

<([a-z][a-z0-9]*)(?:[^>]*(\s(src|href|style)=['\"][^'\"]*['\"]))?[^>]*?(\/?)>

It works fine, but the only problem is that if one tag contains more than one required attribute, it keeps only the last matched attribute and discards the rest.

I'm trying to clean the following text:

<title>Hello World</title>
<div fadeout"="" style="margin:0px;" class="xyz">
    <img src="abc.jpg" alt="" />
    <p style="margin-bottom:10px;">
        The event is celebrating its 50th anniversary K&ouml;&nbsp;
        <a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
    </p>
    <p style="padding:0px;"></p>
    <p style="color:black;">
        <strong>A festival for art lovers</strong>
    </p>
</div>

Using the aforementioned expression at https://regex101.com/#javascript with <$1$2$4> as the substitution string, I get the following output:

<title>Hello World</title>
<div style="margin:0px;">
    <img src="abc.jpg"/>
    <p style="margin-bottom:10px;">
        The event is celebrating its 50th anniversary K&ouml;&nbsp;
        <a href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
    </p>
    <p style="padding:0px;"></p>
    <p style="color:black;">
        <strong>A festival for art lovers</strong>
    </p>
</div>

The problem is that the "style" attribute is discarded from the anchor tag. I have tried repeating the (\s(src|href|style)=['\"][^'\"]*['\"]) block using the * operator, the {3} quantifier and more, but in vain. Any suggestions?
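
The behaviour described above follows from the shape of the pattern: the allowed-attribute group sits inside a single optional group, so it can match at most once per tag, and the greedy [^>]* in front of it pushes that one match to the last allowed attribute. A minimal reproduction, assuming the pattern is run with the g and i flags (the flags are an assumption, they are not stated in the question):

```javascript
// The question's pattern as a regex literal; the "gi" flags are assumed.
const original = /<([a-z][a-z0-9]*)(?:[^>]*(\s(src|href|style)=['"][^'"]*['"]))?[^>]*?(\/?)>/gi;

// The anchor tag from the sample text carries two allowed attributes.
const tag = '<a style="margin:0px;" href="http://www.germany.travel/">';

// Group 2 can hold only one attribute, and greedy matching picks the last one.
const result = tag.replace(original, "<$1$2$4>");
console.log(result); // <a href="http://www.germany.travel/">
```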

asked Apr 8, 2016 at 8:16 by Ahmad Ahsan; edited Apr 8, 2016 at 8:51 by Ketan

  • I can suggest using RegexBuddy for testing expressions. It saved me a lot of time in the past. regexbuddy.com – Bozidar Sikanjic Commented Apr 8, 2016 at 8:24
  • For reference, OP's code can be found at regex101.com/r/mP0pX6/1 – Adrian Wragg Commented Apr 8, 2016 at 8:25
  • Why don't you use DOM manipulation instead of RegEx? – Salman Arshad Commented Apr 8, 2016 at 9:40
  • @SalmanA I'm trying to do the same using DOM manipulation, but jQuery 1.9.1 is failing. jQuery 2.0.0 fixes the issue, but the other libraries in my application are not compatible with it. Here is my fiddle test link: jsfiddle.net/vytu9duc/5. I'm facing the following error in the console: Uncaught InvalidCharacterError: Failed to execute 'setAttribute' on 'Element': 'fadeout"' is not a valid attribute name. Any suggestions? – Ahmad Ahsan Commented Apr 11, 2016 at 14:40
  • Related: Regex to remove HTML attribute from any HTML tag? – vsync Commented Nov 10, 2020 at 20:48

2 Answers

@AhmadAhsan, here is a demo that fixes your issue using DOM manipulation: https://jsfiddle.net/pu1hsdgn/

    <script src="https://code.jquery.com/jquery-1.9.1.js"></script>
    <script>
        // Attributes to keep; every other attribute is removed.
        var whitelist = ["src", "href", "style"];

        $(document).ready(function () {
            function foo(contents) {
                // Load the markup into a detached container element.
                var temp = $(document.createElement('div')).html(contents);

                temp.find('*').each(function () {
                    var attributes = this.attributes;
                    var i = attributes.length;
                    // Walk backwards: removing a node shrinks the live attribute list.
                    while (i--) {
                        var attr = attributes[i];
                        if ($.inArray(attr.name, whitelist) === -1) {
                            this.removeAttributeNode(attr);
                        }
                    }
                });
                return temp.html();
            }

            var raw = '<title>Hello World</title><div style="margin:0px;" fadeout"="" class="xyz"><img src="abc.jpg" alt="" /><p style="margin-bottom:10px;">The event is celebrating its 50th anniversary K&ouml;&nbsp;<a href="http://www.germany.travel/" style="margin:0px;">exhibition grounds in Cologne</a>.</p><p style="padding:0px;"></p><p style="color:black;"><strong>A festival for art lovers</strong></p></div>';
            alert(foo(raw));
        });
    </script>
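
The while( i-- ) loop above counts down because element.attributes is a live collection: removing an entry shifts the ones after it, so a forward loop could skip attributes. The sketch below isolates that loop without jQuery. The helper name stripAttributes and the plain-object stand-in for a DOM element are assumptions made for illustration, so that the loop can be run outside a browser.

```javascript
// Hypothetical helper, not part of the answer: keep only whitelisted attributes on one element.
function stripAttributes(element, whitelist) {
  // Count down so that removing an entry never shifts an index we have yet to visit.
  for (let i = element.attributes.length - 1; i >= 0; i--) {
    const attr = element.attributes[i];
    if (!whitelist.includes(attr.name)) {
      element.removeAttributeNode(attr);
    }
  }
  return element;
}

// Plain-object stand-in for a DOM element (an assumption, only for running the loop in isolation).
const fakeDiv = {
  attributes: [{ name: 'fadeout"' }, { name: "style" }, { name: "class" }],
  removeAttributeNode(attr) {
    this.attributes.splice(this.attributes.indexOf(attr), 1);
  },
};

stripAttributes(fakeDiv, ["src", "href", "style"]);
console.log(fakeDiv.attributes.map((a) => a.name)); // [ 'style' ]
```

In a browser the same function should accept a real element, since it only relies on the attributes list and removeAttributeNode.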

Here you go, based on your original regex:

<([a-z][a-z0-9]*?)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>

Group 1 is the tag name, group 2 holds the attributes, and group 3 is the / if there is one. I couldn't get it to work when non-allowed attributes are interleaved with allowed ones, e.g. <a href="foo" class="bar" src="baz" />. I don't think it can be done.

Edit: Per @AhmadAhsan's corrections below the regex should be:

var html = `<div fadeout"="" style="margin:0px;" class="xyz">
                <img src="abc.jpg" alt="" />
                <p style="margin-bottom:10px;">
                    The event is celebrating its 50th anniversary K&ouml;&nbsp;
                    <a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
                </p>
                <p style="padding:0px;"></p>
                <p style="color:black;">
                    <strong>A festival for art lovers</strong>
                </p>
            </div>`


// Rebuild each opening tag from its name, the captured attributes, and the optional "/".
console.log(
  html.replace(/<([a-z][a-z0-9]*)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>/g, '<$1$2$3>')
)
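
For the interleaved case mentioned above (<a href="foo" class="bar" src="baz" />), one possible workaround is to stop asking a single pattern to do everything. Match each opening tag first, then filter its attributes in a replacer callback. This is only a sketch under stated assumptions: the helper name keepAttributes is hypothetical, attribute values must be quoted, and a ">" inside an attribute value or a hyphenated tag name will confuse it, so DOM manipulation remains the more robust route.

```javascript
// Hypothetical helper, not taken from either answer.
const ALLOWED = ["src", "href", "style"];

function keepAttributes(html, allowed = ALLOWED) {
  // Pass 1: capture the tag name, the raw attribute text, and an optional trailing "/".
  return html.replace(/<([a-z][a-z0-9]*)([^>]*?)(\/?)>/gi, (match, tag, attrText, slash) => {
    // Pass 2: walk over every quoted name=value pair and keep the allowed names.
    const attrRe = /\s([^\s"'=<>\/]+)=("[^"]*"|'[^']*')/g;
    const kept = [];
    let m;
    while ((m = attrRe.exec(attrText)) !== null) {
      if (allowed.includes(m[1].toLowerCase())) {
        kept.push(` ${m[1]}=${m[2]}`);
      }
    }
    return `<${tag}${kept.join("")}${slash}>`;
  });
}

console.log(keepAttributes('<a href="foo" class="bar" src="baz" />'));
// <a href="foo" src="baz"/>
console.log(keepAttributes('<div fadeout"="" style="margin:0px;" class="xyz">'));
// <div style="margin:0px;">
```

Because the inner pattern consumes disallowed attributes as well, text inside their quoted values is skipped rather than mistaken for an attribute.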
    
