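I first read the article "Using puppeteer on the front end to crawl 《React.js 小书》, generate PDFs, and merge them", and found that the final PDF had no bookmarks, which makes it awkward to navigate. This post builds on that article and mainly adds bookmark support.
The example site crawled is React.js 小书; this is for learning and exchange only.
Using puppeteer to crawl a web page and generate a PDF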
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser and open a new tab.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until the network is mostly idle before rendering.
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  // Save the page as an A4 PDF.
  await page.pdf({path: 'hn.pdf', format: 'A4'});
  await browser.close();
})();
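The same calls extend to several pages, which is what producing one PDF per chapter needs. The following is only a minimal sketch, not the original article's code: `savePagesAsPdf` and the shape of the `jobs` array are hypothetical, and `browser` is assumed to be the object returned by `puppeteer.launch()`.

```javascript
// Hypothetical helper: render each { url, path } job to its own PDF file.
// `browser` is assumed to be the object returned by puppeteer.launch().
async function savePagesAsPdf(browser, jobs) {
  const written = [];
  for (const { url, path } of jobs) {
    const page = await browser.newPage();
    try {
      // Wait until the network is mostly idle so the content has rendered.
      await page.goto(url, { waitUntil: 'networkidle2' });
      await page.pdf({ path, format: 'A4' });
      written.push(path);
    } finally {
      // Close the tab even if navigation or rendering fails.
      await page.close();
    }
  }
  return written;
}
```

The function returns the files in the order they were written, which is also the order in which the chapters should later be merged.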
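pdf-merge: merges PDFs
It depends on pdftk
pdftk: a tool for manipulating PDFs
- After installing it, add its bin directory to the PATH environment variable
Use update_info_utf8 to add bookmarks to a PDF: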
pdftk 'd:\OpenSource\My\genpfdforrsb\React 小书(无书签).pdf' update_info_utf8 'd:\OpenSource\My\genpfdforrsb\bookmarks.txt' output 'd:\OpenSource\My\genpfdforrsb\React 小书.pdf'
The info file passed to update_info_utf8 here is bookmarks.txt.
Bookmark format:
BookmarkBegin
BookmarkTitle: PDF Reference (Version 1.5)
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 2
BookmarkPageNumber: 3
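A file in this format can be generated from a plain list of entries. The sketch below is illustrative only: `formatBookmarks` and the `{ title, level, page }` entry shape are hypothetical names, and it writes `\n` line endings, whereas the author's own script uses `\r\n`.

```javascript
// Hypothetical helper: turn bookmark entries into pdftk's info-file text.
// Each entry is assumed to carry a title, a nesting level (1 = top level)
// and the 1-based page number the bookmark points to.
function formatBookmarks(entries) {
  return entries
    .map(({ title, level, page }) =>
      [
        'BookmarkBegin',
        `BookmarkTitle: ${title}`,
        `BookmarkLevel: ${level}`,
        `BookmarkPageNumber: ${page}`,
      ].join('\n')
    )
    .join('\n') + '\n';
}

// Reproduces the two example bookmarks shown above.
const text = formatBookmarks([
  { title: 'PDF Reference (Version 1.5)', level: 1, page: 1 },
  { title: 'Contents', level: 2, page: 3 },
]);
console.log(text);
```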
pdfjs-dist: gets the page count of each individual PDF, which is used to set the page numbers in bookmarks.txt
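// result (the loaded PDF documents) and titleArr (the chapter titles)
// are assumed to be built earlier in the script.
const pageArr = result.map(c => c.numPages);
let txt = ''
// Assumption: the original snippet does not show where pageIndex is declared;
// it starts at 1, the first page of the merged PDF.
let pageIndex = 1
for (let index = 0; index < pageArr.length; index++) {
  let temp = `BookmarkBegin\r\nBookmarkTitle: ${titleArr[index]}\r\nBookmarkLevel: 1\r\nBookmarkPageNumber: ${pageIndex}\r\n`
  txt += temp
  // The next chapter starts right after this chapter's pages.
  pageIndex += pageArr[index]
}
fs.writeFileSync('bookmarks.txt', txt);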
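The page arithmetic in that loop is a running sum: each chapter's bookmark points at the page right after all the preceding chapters. As a small isolated sketch (the name `startPages` is hypothetical and not part of the original script):

```javascript
// Hypothetical helper: given the page count of each PDF in merge order,
// return the 1-based page on which each PDF starts in the merged file.
function startPages(pageCounts) {
  const starts = [];
  let next = 1; // the merged PDF's first page is page 1
  for (const count of pageCounts) {
    starts.push(next);
    next += count;
  }
  return starts;
}

// Three chapters of 3, 5 and 2 pages start on pages 1, 4 and 9.
console.log(startPages([3, 5, 2])); // [ 1, 4, 9 ]
```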
Following the pdf-merge source code, add runshell.js, which runs pdftk commands from Node.
runshell.js is as follows:
'use strict';
const child = require('child_process');
const Promise = require('bluebird');
// Promisified child_process.exec: resolves with stdout, rejects on error.
const exec = Promise.promisify(child.exec);
// exec already returns a promise, so no extra Promise wrapper is needed.
module.exports = (scripts) => exec(scripts);
Run pdftk update_info_utf8
const nobkname = 'React 小书(无书签).pdf'
const hasbkname = 'React 小书.pdf'
// Merge the chapter PDFs first, then add the bookmarks with pdftk.
mergepdf(nobkname).then(buffer => {
  console.log('starting add bookmarks!')
  runshell(`pdftk "${__dirname}/${nobkname}" update_info_utf8 "${__dirname}/bookmarks.txt" output "${__dirname}/${hasbkname}"`).then(() => {
    console.log('completed add bookmarks!')
    // Remove the intermediate files once the bookmarked PDF exists.
    fs.unlinkSync(`${__dirname}/${nobkname}`);
    fs.unlinkSync(`${__dirname}/bookmarks.txt`);
    console.log('all completed!')
  })
})
- File paths must be wrapped in double quotes
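One way to sidestep the quoting requirement, offered as an alternative sketch rather than the author's approach, is Node's built-in `child_process.execFile`, which passes each argument to the program directly instead of through a shell command line. The helper names `buildUpdateInfoArgs` and `addBookmarks` are hypothetical, and the code assumes `pdftk` is on the PATH.

```javascript
const { execFile } = require('child_process');
const path = require('path');

// Hypothetical helper: build the argument list for `pdftk ... update_info_utf8`.
// Each path is its own array element, so spaces need no extra quoting.
function buildUpdateInfoArgs(dir, input, info, output) {
  return [
    path.join(dir, input),
    'update_info_utf8',
    path.join(dir, info),
    'output',
    path.join(dir, output),
  ];
}

// Run pdftk without a shell; assumes the `pdftk` executable is on the PATH.
function addBookmarks(dir, input, info, output) {
  return new Promise((resolve, reject) => {
    const args = buildUpdateInfoArgs(dir, input, info, output);
    execFile('pdftk', args, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}
```

With this shape, `addBookmarks(__dirname, nobkname, 'bookmarks.txt', hasbkname)` would stand in for the `runshell(...)` call above.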
Source code: genpfdforrsb
The page numbers in the merged PDF are not continuous; they are still the page numbers of the individual PDFs.