-
Notifications
You must be signed in to change notification settings - Fork 616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
strip out navigational and other superfluous elements #862
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @woutervanwijk , thanks for the patch. Did you see this broke the linter, and probably tests?
The current code unfortunately cannot assume that querySelectorAll
or similar are available on the DOM implementation, so using class/ID selectors for _clean
probably doesn't quite work.
It would also be really helpful to have some examples where this change produces some improvement (apart from the likely required changes to existing testcases), to understand the motivation behind some of the changes.
var fullArticleText = this._doc.body.innerText; | ||
if (fullArticleText.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is breaking all the tests as innerText
is not defined.
@@ -1546,7 +1582,7 @@ Readability.prototype = { | |||
|
|||
// get article published time | |||
metadata.publishedTime = jsonld.datePublished || | |||
values["article:published_time"] || null; | |||
values["article:published_time"] || null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a mistake as this is a continuation line vs. the line before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the linting - but we really need some better tests, and ideally we'd make these changes as individual ones with a test per change, so that it's more obvious what changes in behaviour each of the suggested modifications has.
//scale down h2-h5 because it's too large most of the time (intro's in h2, etc) | ||
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h5"]), "h6"); | ||
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h4"]), "h5"); | ||
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h3"]), "h4"); | ||
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h2"]), "h3"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment doesn't seem like a great reason to change the semantics of the headers - the "right" solution would probably to revert intro paragraphs into, well, paragraphs, instead of headers, if they meet some threshold / algorithm.
UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog" ], | ||
UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog", "nav" ], | ||
|
||
NODES_TO_CLEAN_FIRST: ["object", "embed", "footer", "link", "aside", "nav", ".icons", ".byline", ".sub-nav", ".identity", ".logo", ".video-player", ".jw-player", ".jw-wrapper", ".video", ".byline", ".author", ".dateline", ".credentials", ".writtenby", ".p-author", ".article-author", ".navigation", ".hidden-xs", ".hidden-sm", ".brand", ".modalContent", ".noPrint", ".noprint", ".screenonly", ".breadcrumb", ".breadcrumbs", "amp-iframe", "amp-img", "amp-ad", ".advert", ".ads", ".brand", ".search", ".nav", ".user", ".users", "#onetrust-consent-sdk", "#branding", "#branding-content" ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these aren't nodenames, so _clean
won't actually remove them, I think?
It would also be really helpful to better understand what the source of this list of elements is.
unlikelyCandidates: /-ad-|ai2html|banner|combx|comment|community|cover-wrap|credentials|date|hide|hidden|disqus|extra|footer|gdpr|legends|nav|paywall|meta|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|share|sharing|yom-remote|byline|topbar|article-meta|brand|tooltip/i, | ||
okMaybeItsACandidate: /and|article|body|column|content|main|shadow|header|summary/i, | ||
|
||
positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i, | ||
negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i, | ||
extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i, | ||
byline: /byline|author|dateline|writtenby|p-author/i, | ||
positive: /article|body|content|entry|header|hentry|h-entry|intro|intro|intro|intro|main|main-article|main-content|page|lead|leading|pagination|primary|post|text|blog|story|summary|strapline/i, | ||
negative: /-ad-|affiliate|credentials|controls|date|desktop|hidden|nav|^hid$| hid$| hid |^hid |hide|banner|login|gate|combx|comment|com-|contact|foot|footer|footnote|gdpr|icon|^icon|icons$|icons|masthead|media|meta|paywall|nav|outbrain|promo|related|scroll|share|sharing|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|tooltip|widget|video-player|video|jw-player|jw-aspect|modal|carousel|overlay|byline|brand|disclosure|nav|logo|account|cart|dock/i, | ||
extraneous: /print|affiliate|archive|button|comment|controls|discuss|e[\-]?mail|meta|icons|share|reply|all|login|sign|single|utility|icons|nav|video-player|jw-player|modal|video|paidcontent|carousel|overlay|social|topbar|article-meta|onetrust-consent-sdk|logo|account|cart|hamburger|traffic|weather|search/i, | ||
byline: /byline|author|dateline|credentials|writtenby|p-author|article-author/i, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are some really significant changes, and we'd need to have some testcases to help explain why we're making these changes.
Some of the changes also seem like mistakes, e.g. nav
is in there several times, and icon
will always match when ^icon
and icons$
and icons
match.
Small tweaks to strip out navigational elements and other common superfluous elements. Works really nicely