Does anybody know a way to vectorize the text in a PDF document? That is, I want each letter to be a shape/outline, without any textual content. I'm on a Linux system, so an open-source or at least non-Windows solution would be preferred. The context: I'm trying to edit some old PDFs for which I no longer have the fonts. I'd like to do it in Inkscape, but Inkscape replaces all the fonts with generic ones, and the result is barely readable. I've also tried converting back and forth with pdf2ps and ps2pdf, but the font information stays in the file, so when I load it into Inkscape it still looks awful. Any ideas? Thanks.
Asked Nov 10, 2014 at 23:37 by Adam Smith

Comment (Apr 22, 2022 at 15:19): PDF-TEXT-To-Outlines with an adblocker seems to work well for one-off, privacy-insensitive documents.

To achieve this, you will have to:

1. Split the PDF into individual pages
2. Convert the PDF pages to an editable format (SVG)
3. Edit the pages
4. Convert the pages back to PDF and reassemble them
This answer will omit step 3, since that's not programmable.
If you don't want a programmatic way to split documents, the modern way would be to use stapler. In your favorite shell:
stapler burst file.pdf
This generates one PDF per page, named after the input file with the page number (1 to N) appended. Stapler itself uses PyPDF2, and the code for splitting a PDF file is not that complex. The following function, copied shamelessly from stapler's commands.py file, splits a file and saves the individual pages in the current directory.

    import os

    # PdfFileReader/PdfFileWriter are the legacy PyPDF2 (1.x/2.x) class names
    from PyPDF2 import PdfFileWriter, PdfFileReader

    def split(filename):
        # PDFs are binary files, so open the input in 'rb' mode
        with open(filename, 'rb') as inputfp:
            inputpdf = PdfFileReader(inputfp)
            numpages = inputpdf.getNumPages()

            base, ext = os.path.splitext(os.path.basename(filename))

            # Prefix the output template with zeros so that ordering is preserved
            # (page 10 after page 09). The width is the digit count of the last
            # page number; ceil(log10(n)) is one digit short when n is 10, 100, ...
            output_template = ''.join([
                base, '_', '%0', str(len(str(numpages))), 'd', ext
            ])

            for page in range(numpages):
                outputpdf = PdfFileWriter()
                outputpdf.addPage(inputpdf.getPage(page))

                outputname = output_template % (page + 1)
                with open(outputname, 'wb') as fp:
                    outputpdf.write(fp)
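The zero-padded naming scheme can be checked on its own, without any PDF library. The following is only a sketch: `page_filenames` is a hypothetical helper, not part of stapler or PyPDF2. It sizes the padding with `len(str(n))`, which also holds when the page count is an exact power of ten.

```python
def page_filenames(base, ext, num_pages):
    """Return one zero-padded file name per page, in page order."""
    # Width of the largest page number, e.g. 2 for a 10-page document
    width = len(str(num_pages))
    return [f"{base}_{page:0{width}d}{ext}"
            for page in range(1, num_pages + 1)]


names = page_filenames('file', '.pdf', 10)
print(names[0], names[-1])    # file_01.pdf file_10.pdf
print(sorted(names) == names)  # True: alphabetical order matches page order
```

With this padding, a plain alphabetical sort (or a shell glob such as `file_*.pdf`) returns the pages in the right order.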
Now to convert the PDFs to editable files, I'd probably use pdf2svg.
pdf2svg input.pdf output.svg
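Since the PDF was split into one file per page, the command has to run once per page file. A minimal sketch of driving it from Python, assuming `pdf2svg` is installed and on the PATH; the helper names `pdf2svg_command` and `convert_pages` are made up for this illustration:

```python
import shutil
import subprocess
from pathlib import Path


def pdf2svg_command(pdf_path):
    """Build the argument list for converting one single-page PDF to SVG."""
    pdf_path = Path(pdf_path)
    # Same name as the input, with the extension swapped to .svg
    return ['pdf2svg', str(pdf_path), str(pdf_path.with_suffix('.svg'))]


def convert_pages(pdf_paths):
    """Run pdf2svg once per page file, stopping at the first failure."""
    if shutil.which('pdf2svg') is None:
        raise RuntimeError('pdf2svg was not found on PATH')
    for pdf_path in pdf_paths:
        subprocess.run(pdf2svg_command(pdf_path), check=True)


print(pdf2svg_command('file_01.pdf'))
```

A call such as `convert_pages(sorted(Path('.').glob('file_*.pdf')))` would then convert every split page in the current directory.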
If we take a look at the pdf2svg.c file, we can see that the code is, in principle, not that complex. A minimal working example in Python follows, with the input file name in the inputname variable and the output file name in the outputname variable. It requires the pycairo and pypoppler libraries:

    import os

    import cairo
    import poppler

    def convert(inputname, outputname):
        # Convert the input file name to an URI to please poppler
        uri = 'file://' + os.path.abspath(inputname)

        pdffile = poppler.document_new_from_file(uri, None)

        # We only have one page, since we split prior to converting. Get the page
        page = pdffile.get_page(0)

        # Get the page dimensions
        width, height = page.get_size()

        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()

        # Flush the SVG to disk and release the surface
        surface.finish()
At this point you should have an SVG in which all text has been converted to paths, and you will be able to edit it in Inkscape without rendering issues.
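One way to confirm that the text really was outlined is to look for leftover text elements in the SVG, since a fully outlined file should contain none. The sketch below uses only the standard library; `count_text_elements` is a hypothetical helper, and the two inline SVG documents are made-up samples rather than real converter output.

```python
import io
import xml.etree.ElementTree as ET

SVG_NS = '{http://www.w3.org/2000/svg}'


def count_text_elements(svg_file):
    """Count <text> and <tspan> elements in an SVG (file name or file object)."""
    tree = ET.parse(svg_file)
    text_tags = (SVG_NS + 'text', SVG_NS + 'tspan')
    return sum(1 for element in tree.iter() if element.tag in text_tags)


outlined = b'<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0L1 1"/></svg>'
with_text = b'<svg xmlns="http://www.w3.org/2000/svg"><text>Hi</text></svg>'

print(count_text_elements(io.BytesIO(outlined)))   # 0
print(count_text_elements(io.BytesIO(with_text)))  # 1
```

A non-zero count means some text still depends on a font and may be substituted when the file is opened.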
To convert every page, you can call pdf2svg in a for loop, but you would need to know the number of pages beforehand. The code below works out the number of pages and does the conversion in a single step, without splitting first. It requires only pycairo and pypoppler:

    import os

    import cairo
    import poppler

    def convert(inputname, base=None):
        '''Converts a multi-page PDF to multiple SVG files.

        :param inputname: Name of the PDF to be converted
        :param base: Base name for the SVG files (optional)
        '''
        if base is None:
            base, ext = os.path.splitext(os.path.basename(inputname))

        # Convert the input file name to an URI to please poppler
        uri = 'file://' + os.path.abspath(inputname)

        pdffile = poppler.document_new_from_file(uri, None)

        pages = pdffile.get_n_pages()

        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09). The width is the digit count of the last
        # page number; ceil(log10(n)) is one digit short when n is 10, 100, ...
        output_template = ''.join([
            base, '_', '%0', str(len(str(pages))), 'd', '.svg'
        ])

        # Iterate over all pages
        for nthpage in range(pages):
            page = pdffile.get_page(nthpage)

            # Output file name based on template
            outputname = output_template % (nthpage + 1)

            # Get the page dimensions
            width, height = page.get_size()

            # Open the SVG file to write on
            surface = cairo.SVGSurface(outputname, width, height)
            context = cairo.Context(surface)

            # Now we finally can render the PDF to SVG
            page.render_for_printing(context)
            context.show_page()

            # Free some memory
            surface.finish()
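Building the URI by prepending 'file://' to the absolute path works for simple names, but a path containing spaces or other reserved characters is not percent-encoded that way. A small sketch of an alternative using the standard library; `file_uri` is a hypothetical helper, and whether a given poppler binding needs the encoded form is an assumption worth testing on your own files:

```python
from pathlib import Path


def file_uri(filename):
    """Return an absolute, percent-encoded file:// URI for a local path."""
    # resolve() makes the path absolute; as_uri() escapes spaces and the like
    return Path(filename).resolve().as_uri()


print(file_uri('my scan.pdf'))
```

The result could be passed to `poppler.document_new_from_file` in place of the hand-built string.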
To reassemble, you can use inkscape and stapler together to convert and merge the files manually, but it is not hard to write code that does this. The code below uses rsvg and cairo to convert each SVG and merge everything into a single PDF:

    import rsvg
    import cairo

    def convert_merge(inputfiles, outputname):
        # We have to create a PDF surface and inform a size. The size is
        # irrelevant, though, as we will define the sizes of each page
        # individually.
        outputsurface = cairo.PDFSurface(outputname, 1, 1)
        outputcontext = cairo.Context(outputsurface)

        for inputfile in inputfiles:
            # Open the SVG
            svg = rsvg.Handle(file=inputfile)

            # Set the size of the page itself
            outputsurface.set_size(svg.props.width, svg.props.height)

            # Draw on the PDF
            svg.render_cairo(outputcontext)

            # Finish the page and start a new one
            outputcontext.show_page()

        # Free some memory
        outputsurface.finish()
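The pages end up in the merged PDF in the order in which the input files are listed, so the list needs to be in page order. If the file names are not zero-padded, a plain sort puts page 10 before page 2. A sketch of a natural sort key that avoids this; `natural_key` is a hypothetical helper and the file names are made up:

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so page 2 sorts before page 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', name)]


pages = ['file_10.svg', 'file_2.svg', 'file_1.svg']
print(sorted(pages))                   # ['file_1.svg', 'file_10.svg', 'file_2.svg']
print(sorted(pages, key=natural_key))  # ['file_1.svg', 'file_2.svg', 'file_10.svg']
```

The sorted list can then be handed to the merge function as its list of input files.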
PS: It should be possible to use the pdftocairo command instead, but it doesn't seem to call render_for_printing(), so the output SVG keeps the font information.