The pdftotext Utility
The pdftotext
command-line utility provides capabilities for converting PDF fils into TXT files for further processing.
References:
Installation
First see if Git is already installed (it may come pre-installed):
If these commands produce a version-looking output and a filepath-looking output, respectively, then the utility is already installed and you can skip down to the "Usage" section. Otherwise, follow the OS-specific sections below to install it.
Installation on Mac
Mac users can install pdftotext
via homebrew:
After installing, restart your terminal application, where you should now be able to execute pdftotext
commands (like pdftotext --version
).
Installation on Windows
Windows users can install pdftotext
by visiting https://www.xpdfreader.com/download.html and clicking "Download the Xpdf tools". After downloading onto your local computer, locate the zip file and unzip / extract it, then move the unzipped folder to a location like the Desktop or the Programs directory. Inside the unzipped folder, observe the absolute filepath location of the executable file called "bin64/pdftotext.exe". When you need to use the pdftotext
utility in the future, either reference it from this location (e.g. /path/to/pdftotext --version
) or add an alias to that location via your "~/.bash_profile", or move a copy of that file into any project repository you'd like to reference it from.
Usage
Download a PDF file onto your Desktop or some other location (e.g. "/path/to/my_document.pdf"), then navigate there from the command-line. Then process the PDF into a new TXT file (e.g. "/path/to/my_document.txt"):
Then examine the contents of the TXT file in your text editor to see how well it was able to parse the original PDF document:
Last updated