I need to extract all the text content from a web page. I have used 'document.body.textContent', but I get the JavaScript content as well. How do I ensure that I get only the readable text content?
function myFunction() {
  var str = document.body.textContent;
  alert(str);
}
<html>
<title>Test Page for Text extraction</title>
<head>I hope this works</head>
<script src="https://ajax.googleapis./ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<body>
<p>Test on this content to change the 5th word to a link
<p>
<button onclick="myFunction()">Try it</button>
</body>
</html>
asked Sep 28, 2015 at 14:49
vjravi
2 Answers
Just remove the tags you don't want read before reading document.body.textContent.
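function myFunction() {
  // Remove every <script> element inside <body> so its source is not included.
  var bodyScripts = document.querySelectorAll("body script");
  for (var i = 0; i < bodyScripts.length; i++) {
    bodyScripts[i].remove();
  }
  // Read the remaining text, then show it in place of the page (this overwrites the body).
  var str = document.body.textContent;
  document.body.innerHTML = '<pre>' + str + '</pre>';
}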
<html>
<title>Test Page for Text extraction</title>
<head>I hope this works</head>
<script src="https://ajax.googleapis./ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<body>
<p>Test on this content to change the 5th word to a link
<p>
<button onclick="myFunction()">Try it</button>
</body>
</html>
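If you would rather not remove nodes from the live page, one possible variant is to walk the tree and collect only the text nodes that are not inside tags you want to skip. This is a rough sketch, not the answer's original method: the function name collectReadableText and the SKIPPED_TAGS list are made up for illustration, and it does not account for elements hidden with CSS.

```javascript
// Tags whose text is not meant for readers (an assumed list; adjust as needed).
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Hypothetical helper: returns the concatenated text below `node`,
// leaving the document itself untouched.
function collectReadableText(node) {
  // nodeType 3 is a text node: return its content as-is.
  if (node.nodeType === 3) {
    return node.nodeValue;
  }
  // Anything that is not an element (nodeType 1), such as a comment, adds nothing.
  if (node.nodeType !== 1) {
    return "";
  }
  // Skip the whole subtree of script/style-like elements.
  if (SKIPPED_TAGS.has(node.tagName.toUpperCase())) {
    return "";
  }
  let text = "";
  for (const child of node.childNodes) {
    text += collectReadableText(child);
  }
  return text;
}

// Browser-only usage; skipped where there is no DOM (for example in Node.js).
if (typeof document !== "undefined") {
  console.log(collectReadableText(document.body));
}
```

Because nothing is removed, any scripts on the page keep working after the text has been read.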
Try document.body.innerText.
This MDN article describes the differences between textContent and innerText:
Don't get confused by the differences between Node.textContent and HTMLElement.innerText. Although the names seem similar, there are important differences:
textContent gets the content of all elements, including <script> and <style> elements. In contrast, innerText only shows "human-readable" elements.
textContent returns every element in the node. In contrast, innerText is aware of styling and won't return the text of "hidden" elements.
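As a small sketch of how the two properties could be combined, the helper below prefers innerText and falls back to textContent when innerText is not available (for example on nodes that are not HTML elements). The name getVisibleText is hypothetical and only used for illustration here.

```javascript
// Hypothetical helper: prefer the styling-aware innerText,
// fall back to textContent when innerText is not a string.
function getVisibleText(element) {
  if (typeof element.innerText === "string") {
    return element.innerText;
  }
  // textContent can be null on some node types, so default to an empty string.
  return element.textContent || "";
}

// Browser-only usage; skipped where there is no DOM (for example in Node.js).
if (typeof document !== "undefined") {
  console.log(getVisibleText(document.body));
}
```

Keep in mind that innerText depends on the element being rendered. For an element that is not being rendered, such as one detached from the document, it returns the same value as textContent.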