Removed engine limitations in package.json since they are no longer useful.
Updated cheerio dependencies
Add support for extracting softTitle, date, copyright, author, and publisher, thanks to @philgooch. See #49.
Add support for pulling the page description out of og:description tags
Fix a hidden bug where unrelated words were joined together when counting the number of words in a block of text
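A minimal sketch of how this kind of miscount can arise. It is not the library's actual code: the `fragments` array and the `countWords` helper are hypothetical, and the assumption is that text pieces from neighbouring elements were concatenated without a separator before counting.

```javascript
// Hypothetical text fragments, as might come from adjacent elements on a page.
const fragments = ["Breaking", "news", "from the city"];

// Count whitespace-separated words in a string.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Joining with no separator fuses neighbouring words, so the count is too low.
const fused = countWords(fragments.join(""));

// Joining with a space keeps unrelated words apart.
const separated = countWords(fragments.join(" "));

console.log(fused, separated);
```

With these fragments the fused string contains only three whitespace-separated tokens, while the space-joined string contains five.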
Fixed an issue where page tags were returning line breaks in the tag names for some pages
Fix issue where an SVG image embedded in the page would have its title concatenated with the page title
Updated Portuguese stopwords file
Fix an issue with junk being left on the page when parsing USA Today news story pages.
Bulleted lists in a webpage are now retained in the output.
Prefer <meta> og:title tag to <title> element when parsing title of document (Thanks to bradvogel)
Added extractor.lazy() function for lazy access to document properties (Thanks to franza)
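The idea behind lazy access can be sketched as follows. This illustrates the general pattern only, not the library's implementation: `makeLazy` and the toy regex-based `title` extractor are hypothetical names introduced for the example.

```javascript
// Build an object whose properties are computed on first call and then cached.
// `extractors` maps a property name to a function of the raw HTML (hypothetical).
function makeLazy(html, extractors) {
  const cache = new Map();
  const lazy = {};
  for (const [name, extract] of Object.entries(extractors)) {
    lazy[name] = () => {
      if (!cache.has(name)) {
        cache.set(name, extract(html)); // do the work only once
      }
      return cache.get(name);
    };
  }
  return lazy;
}

// Toy extractor: a regex stands in for real HTML parsing.
let titleCalls = 0;
const doc = makeLazy("<title>Hello</title><p>Body text</p>", {
  title: (html) => {
    titleCalls += 1;
    const match = html.match(/<title>(.*?)<\/title>/i);
    return match ? match[1] : "";
  },
});

console.log(doc.title()); // computed on this first call
console.log(doc.title()); // served from the cache
console.log(titleCalls);
```

Properties that are never requested are never computed, which is the point of the lazy variant when only one or two fields of a document are needed.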
Added Thai stopwords (Thanks to thangman22)
If you specify a language that isn't supported, fall back to English and warn the user (Thanks to mhuebert for #12)
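A rough sketch of this fallback behaviour, under stated assumptions: the `SUPPORTED` set, the language codes in it, and the `resolveLanguage` helper are hypothetical, and the library's real list of languages and its warning text may differ.

```javascript
// Hypothetical set of language codes that have a stopword list.
const SUPPORTED = new Set(["en", "pt", "th", "tr"]);

// Return the requested language if supported; otherwise warn and use English.
// `warn` is injectable so callers (and tests) can capture the message.
function resolveLanguage(requested, warn = console.warn) {
  if (SUPPORTED.has(requested)) {
    return requested;
  }
  warn(`Language "${requested}" is not supported; falling back to "en".`);
  return "en";
}

console.log(resolveLanguage("tr"));
console.log(resolveLanguage("xx"));
```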
Added Turkish stopwords (Thanks to ayhankuru)
Handle pages with code blocks better (like GitHub pages)
Fix case where text will get dropped accidentally. See #9.
Better handle html with random line breaks. See #6.
Added ability to extract an image from articles. See #4.
Added ability to extract embedded videos from articles. See #2.