Strict HTML parsing in JavaScript - Stack Overflow

On Google Chrome (Canary), it seems no string can make the DOM parser fail. I'm trying to parse some HTML, and if the HTML isn't completely, 100% valid, I want it to display an error. I've tried the obvious:

var newElement = document.createElement('div');
newElement.innerHTML = someMarkup; // Might fail on IE, never on Chrome.

I've also tried the method in this question. Doesn't fail for invalid markup, even the most invalid markup I can produce.

So, is there some way to parse HTML "strictly" in Google Chrome at least? I don't want to resort to tokenizing it myself or using an external validation utility. If there's no other alternative, a strict XML parser is fine, but certain elements don't require closing tags in HTML, and preferably those shouldn't fail.

asked Feb 19, 2012 at 22:13 by Ry- (edited May 23, 2017 at 12:20)
  • "strict" in JavaScript has a specific meaning, so I've edited the title of your question – T.J. Crowder Commented Feb 19, 2012 at 22:20
  • "...certain elements don't require closing tags in HTML..." Some elements don't require opening tags, either. – T.J. Crowder Commented Feb 19, 2012 at 22:21
  • tried it with HTML doctype strict? – powtac Commented Feb 19, 2012 at 22:21
  • @powtac: I'm trying to parse HTML fragments - no DTD. – Ry- Commented Feb 19, 2012 at 22:23
  • @T.J.Crowder: Okay - but the question remains :) – Ry- Commented Feb 19, 2012 at 22:24

1 Answer

Use the DOMParser to check a document in two steps:

  1. Validate whether the document is XML-conforming, by parsing it as XML.
  2. Parse the string as HTML. This requires a modification of the DOMParser so that it accepts text/html.
    Loop through each element, and check whether the DOM element is an instance of HTMLUnknownElement. For this purpose, getElementsByTagName('*') fits well.
    (If you want to parse the document strictly, you have to loop through the elements recursively and track whether each element is allowed at its location, e.g. <area> only inside <map>.)
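
The recursive placement check mentioned in the parenthetical could look roughly like the sketch below. The REQUIRED_ANCESTORS table and the findMisplaced function are hypothetical names introduced for illustration, and the table covers only a handful of rules. It tests a necessary condition (a required ancestor exists), not the full content models of the HTML specification. The walker only reads localName and children, so it should accept DOM elements as well as the plain objects used here to keep the example runnable outside a browser.

```javascript
// Hypothetical, deliberately tiny rule table: tag -> at least one of these
// tags must appear among its ancestors. Not a complete set of HTML rules.
const REQUIRED_ANCESTORS = {
  area: ['map'],
  li: ['ul', 'ol', 'menu'],
  td: ['tr'],
  th: ['tr'],
};

// Depth-first walk that remembers the chain of ancestor tag names and
// collects the names of elements whose required ancestor is missing.
function findMisplaced(node, ancestors = [], problems = []) {
  const name = node.localName.toLowerCase();
  const required = REQUIRED_ANCESTORS[name];
  if (required && !required.some((tag) => ancestors.includes(tag))) {
    problems.push(name);
  }
  const nextAncestors = ancestors.concat(name);
  for (const child of Array.from(node.children)) {
    findMisplaced(child, nextAncestors, problems);
  }
  return problems;
}

// Plain-object stand-in for: <div><area><map><area></map></div>
const tree = {
  localName: 'div',
  children: [
    { localName: 'area', children: [] },
    { localName: 'map', children: [{ localName: 'area', children: [] }] },
  ],
};

console.log(findMisplaced(tree)); // only the <area> outside <map> is reported
```

In a browser, passing the documentElement of the parsed document instead of the plain object should work the same way, under the assumption that the rule table is extended to whatever subset of HTML you care about.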

Demo: http://jsfiddle.net/q66Ep/1/

/* DOM parser for text/html, see https://stackoverflow.com/a/9251106/938089 */
;(function(DOMParser) {"use strict";var DOMParser_proto=DOMParser.prototype,real_parseFromString=DOMParser_proto.parseFromString;try{if((new DOMParser).parseFromString("", "text/html"))return;}catch(e){}DOMParser_proto.parseFromString=function(markup,type){if(/^\s*text\/html\s*(;|$)/i.test(type)){var doc=document.implementation.createHTMLDocument(""),doc_elt=doc.documentElement,first_elt;doc_elt.innerHTML=markup;first_elt=doc_elt.firstElementChild;if (doc_elt.childElementCount===1&&first_elt.localName.toLowerCase()==="html")doc.replaceChild(first_elt,doc_elt);return doc;}else{return real_parseFromString.apply(this, arguments);}};}(DOMParser));

/*
 * @description              Validate an HTML string
 * @param       String html  The HTML string to be validated
 * @returns            null  If the string is not well-formed XML
 *                    false  If the string contains an unknown element
 *                     true  If the string satisfies both conditions
 */
function validateHTML(html) {
    var parser = new DOMParser()
      , d = parser.parseFromString('<?xml version="1.0"?>'+html,'text/xml')
      , allnodes;
    if (d.querySelector('parsererror')) {
        console.log('Not well-formed HTML (XML)!');
        return null;
    } else {
        /* To use text/html, see https://stackoverflow.com/a/9251106/938089 */
        d = parser.parseFromString(html, 'text/html');
        allnodes = d.getElementsByTagName('*');
        for (var i=allnodes.length-1; i>=0; i--) {
            if (allnodes[i] instanceof HTMLUnknownElement) return false;
        }
    }
    return true; /* The document is syntactically correct, all tags are closed */
}

console.log(validateHTML('<div>'));  //  null, because of the missing close tag
console.log(validateHTML('<x></x>'));// false, because it's not an HTML element
console.log(validateHTML('<a></a>'));//  true, because the tag is closed,
                                     //       and the element is an HTML element
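
Because validateHTML has three possible results (null, false, true), a plain truthiness test such as if (!validateHTML(s)) cannot tell the two failure modes apart. Below is a minimal sketch of turning the result into a message to display, as the question asks for. The describeValidation name and the message wording are illustrative assumptions, not part of the answer's code.

```javascript
// Maps the tri-state result of validateHTML to a human-readable message.
// Strict equality is used so that null and false are not conflated.
function describeValidation(result) {
  if (result === null) {
    return 'Markup is not well-formed XML (for example, a tag is left unclosed).';
  }
  if (result === false) {
    return 'Markup contains an element the browser does not recognise.';
  }
  return 'Markup passed both checks.';
}

console.log(describeValidation(null));
console.log(describeValidation(false));
console.log(describeValidation(true));
```

In the browser this would be called as describeValidation(validateHTML(someMarkup)).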

See revision 1 of this answer for an alternative to XML validation without the DOMParser.

Considerations

  • The current method completely ignores the doctype during validation.
  • This method returns null for <input type="text"> because the tag is not closed, even though that is valid HTML5.
  • Conformance is not checked.
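
One possible way to soften the <input type="text"> limitation is to self-close HTML void elements before the XML pass, so that tags which never take a closing tag do not trip the well-formedness check. The sketch below is an assumption layered on top of the answer, not part of it: selfCloseVoidElements is a hypothetical helper, and it relies on a regular expression, so it will misfire on attribute values containing < or > and on tag-like text inside comments or scripts. Other HTML-only syntax, such as valueless attributes like disabled, would still fail the XML check.

```javascript
// The void elements defined by the HTML specification.
const VOID_ELEMENTS = [
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
];

// Matches an opening void tag, with or without a trailing slash.
// The lookahead stops "col" from matching "colgroup" and similar names.
const VOID_TAG = new RegExp(
  '<(' + VOID_ELEMENTS.join('|') + ')(?=[\\s/>])([^<>]*?)\\s*/?>',
  'gi'
);

// Rewrites <br>, <br/> and <br /> alike to the XML-friendly form <br/>.
function selfCloseVoidElements(html) {
  return html.replace(VOID_TAG, '<$1$2/>');
}

console.log(selfCloseVoidElements('<input type="text">')); // <input type="text"/>
console.log(selfCloseVoidElements('<p>a<br>b</p>'));       // <p>a<br/>b</p>
```

The result would be passed to the text/xml parse only. The text/html parse should still receive the original string.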
