Working with PDF Files
Reading and manipulating PDFs with PyPDF2
pip install PyPDF2
Extracting text
import PyPDF2

reader = PyPDF2.PdfReader('test.pdf')
for page in reader.pages:
    text = page.extract_text()
    print(text)
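Because `extract_text()` can come back empty for pages that have no text layer, it may help to record which pages produced nothing. The sketch below assumes only that each page object exposes an `extract_text()` method, as PyPDF2 pages do; `collect_text` is a hypothetical helper name, not part of PyPDF2.

```python
def collect_text(pages):
    """Return (full_text, empty_page_numbers) for an iterable of page objects.

    Assumes each page has an extract_text() method, as PyPDF2 pages do.
    Page numbers in the result are 1-based.
    """
    chunks = []
    empty = []
    for number, page in enumerate(pages, start=1):
        # Treat None or whitespace-only results as "no text found".
        text = (page.extract_text() or "").strip()
        if text:
            chunks.append(text)
        else:
            empty.append(number)
    return "\n\n".join(chunks), empty
```

With PyPDF2 this would be called as `text, empty = collect_text(PyPDF2.PdfReader('test.pdf').pages)`.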
Rotating pages
import PyPDF2

reader = PyPDF2.PdfReader('test.pdf')
writer = PyPDF2.PdfWriter()
for page in reader.pages:
    rotated = page.rotate(90)  # rotate 90 degrees clockwise
    # page.rotate(-90) would rotate counter-clockwise instead
    writer.add_page(rotated)
with open('rotated.pdf', 'wb') as f:
    writer.write(f)
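Often only some pages need turning, for example landscape scans in an otherwise portrait document. The following is a minimal sketch, assuming `page.rotate(angle)` rotates clockwise and returns the page as in PyPDF2 3.x; `rotate_selected` is a hypothetical helper name.

```python
def rotate_selected(pages, angle, indices):
    """Rotate the pages whose 0-based index is in `indices`; yield every page.

    Assumes page.rotate(angle) rotates clockwise and returns the page,
    as in PyPDF2 3.x. Negative angles rotate counter-clockwise.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    wanted = set(indices)
    for index, page in enumerate(pages):
        # Untouched pages are passed through so the output keeps every page.
        yield page.rotate(angle) if index in wanted else page
```

A loop such as `for page in rotate_selected(reader.pages, -90, [0, 2]): writer.add_page(page)` would then turn the first and third pages counter-clockwise and copy the rest unchanged.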
Encrypting a PDF
import PyPDF2

writer = PyPDF2.PdfWriter()
for page in PyPDF2.PdfReader('test.pdf').pages:
    writer.add_page(page)
writer.encrypt('password123')
with open('encrypted.pdf', 'wb') as f:
    writer.write(f)
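Reading the encrypted file back requires supplying the password first. The sketch below assumes the PyPDF2 3.x reader interface, namely an `is_encrypted` attribute and a `decrypt(password)` method whose result is falsy when the password is rejected; `unlock` is a hypothetical helper name.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if it is encrypted; return True on success.

    Assumes the PyPDF2 3.x reader interface: an `is_encrypted` attribute and
    a `decrypt(password)` method whose result is 0 (falsy) when the password
    is rejected.
    """
    if not reader.is_encrypted:
        return True  # nothing to unlock
    return bool(reader.decrypt(password))
```

For the file written above, `unlock(PyPDF2.PdfReader('encrypted.pdf'), 'password123')` should return `True`, after which the reader's pages can be accessed as usual.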
Merging pages (watermark)
import PyPDF2

reader1 = PyPDF2.PdfReader('test.pdf')
reader2 = PyPDF2.PdfReader('watermark.pdf')  # PDF containing the watermark page
writer = PyPDF2.PdfWriter()
watermark = reader2.pages[0]
for page in reader1.pages:
    page.merge_page(watermark)  # overlay the watermark on this page
    writer.add_page(page)
with open('watermarked.pdf', 'wb') as f:
    writer.write(f)
Generating PDFs with reportlab
pip install reportlab
from reportlab.lib.pagesizes import A4
from reportlab.lib.utils import ImageReader
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

c = canvas.Canvas('demo.pdf', pagesize=A4)
w, h = A4
# Draw an image (x, y, width, height)
img = ImageReader('photo.jpg')
c.drawImage(img, 20, h - 200, 150, 180)
# Finish the current page and start a new one
c.showPage()
# Register a TrueType font (needed for Chinese text)
pdfmetrics.registerFont(TTFont('MyFont', 'chinese_font.ttf'))
# Write text
c.setFont('MyFont', 40)
c.setFillColorRGB(0.9, 0.5, 0.3)
c.drawString(w//2 - 80, h//2, '你好,PDF!')
c.save()
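The `h - 200` above reflects reportlab's coordinate system: the canvas measures in points (1/72 inch) from the bottom-left corner of the page, and `drawImage` places the bottom-left corner of the image at the given position. A small sketch of converting a layout measured in millimetres from the top-left corner follows; `MM`, `A4_MM` and `from_top_left` are hypothetical names introduced for illustration (reportlab ships its own `mm` constant in `reportlab.lib.units`).

```python
MM = 72 / 25.4  # points per millimetre (1 point = 1/72 inch)

A4_MM = (210, 297)  # A4 width and height in millimetres


def from_top_left(x_mm, y_mm, height_mm, page_height_mm=A4_MM[1]):
    """Convert a box positioned from the page's top-left corner (in mm)
    into the (x, y) of its bottom-left corner in points, which is what
    canvas drawing calls such as drawImage expect."""
    x_pt = x_mm * MM
    y_pt = (page_height_mm - y_mm - height_mm) * MM
    return x_pt, y_pt
```

An image 50 mm tall placed 10 mm from the left edge and 20 mm from the top could then be drawn with `x, y = from_top_left(10, 20, 50)` followed by `c.drawImage(img, x, y, 40 * MM, 50 * MM)`.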
Other tools
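| Library | Purpose |
| --- | --- |
| pdfminer.six | Text extraction (command line: pdf2txt.py file.pdf) |
| pdfplumber | Table extraction |
| pypdf | Actively maintained successor to PyPDF2; prefer it over PyPDF2 |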
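
Extracted tables usually need a little cleanup before use. The sketch below assumes the table arrives as a list of rows, each a list of cell strings with `None` for empty cells, which is the shape pdfplumber's `page.extract_table()` is understood to return; `table_to_records` is a hypothetical helper name, and treating the first row as the header is an assumption about the data.

```python
def table_to_records(table):
    """Turn a list-of-rows table into a list of dicts keyed by the header row.

    Assumes `table` is a list of rows, each a list of cell strings or None,
    with the first row holding the column names.
    """
    if not table:
        return []  # no table found, or an empty one
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells become "" so every record has the same string-valued keys.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records
```

With pdfplumber installed, the input could come from `pdfplumber.open('test.pdf')` and `pdf.pages[0].extract_table()`.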
Summary
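- PyPDF2/pypdf: extract text, rotate pages, encrypt, and merge pages (watermarks)
- reportlab: generate PDFs from scratch, with support for drawing and text
- Generating PDFs that contain Chinese text requires registering a Chinese font
- Text extraction quality depends on whether the file carries real text (embedded fonts and a text layer); scanned documents are just images, so nothing can be extracted from them without OCR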