The pdftotext Utility
The pdftotext command-line utility converts PDF files into TXT files for further processing.
Installation
First check whether pdftotext is already installed (it may come pre-installed):
# Mac Terminal:
pdftotext --help #> pdftotext version 0.65.0 ...
which pdftotext #> /usr/local/bin/pdftotext
# Windows Command Prompt, Anaconda Prompt, or Git Bash:
# first navigate to the folder where you downloaded the "pdftotext.exe" file, then...
pdftotext --help #> pdftotext version 0.65.0 ...
where pdftotext #> /path/to/pdftotext.exe
If these commands produce a version-looking output and a filepath-looking output, respectively, then the utility is already installed and you can skip down to the "Usage" section. Otherwise, follow the OS-specific sections below to install it.
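The manual check above can also be scripted. The following is a minimal sketch for a Bash-like shell (Mac Terminal or Git Bash); the function name check_installed is hypothetical and used only for illustration. It relies on the shell builtin "command -v", which prints the location of a command if it is found on the PATH.

```shell
# Report whether a given command is available on the PATH.
# "check_installed" is an illustrative name, not part of pdftotext.
check_installed() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 is installed at: $(command -v "$1")"
  else
    echo "$1 is not installed"
  fi
}

check_installed pdftotext
```

If the output reports that pdftotext is not installed, continue with the OS-specific sections below.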
Installation on Mac
Mac users can install pdftotext via Homebrew:
brew install pkg-config poppler
After installing, restart your terminal application, where you should now be able to execute pdftotext commands (like pdftotext --version).
Installation on Windows
Windows users can install pdftotext by visiting https://www.xpdfreader.com/download.html and clicking "Download the Xpdf tools". After downloading the zip file onto your local computer, unzip / extract it, then move the unzipped folder to a location like the Desktop or the Programs directory. Inside the unzipped folder, note the absolute filepath of the executable file called "bin64/pdftotext.exe". When you need to use the pdftotext utility in the future, either reference it from this location (e.g. /path/to/pdftotext --version), add an alias to that location via your "~/.bash_profile", or move a copy of that file into any project repository you'd like to reference it from.
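As a rough sketch of the alias approach for Git Bash, a line like the following could be added to "~/.bash_profile". The path shown is a placeholder, not a real location; substitute the absolute filepath of your own "bin64/pdftotext.exe" file.

```shell
# Placeholder path: replace with the actual location of your unzipped Xpdf tools folder.
# The quotes keep the alias working if the path contains spaces.
alias pdftotext="/path/to/bin64/pdftotext.exe"
```

After saving the file, restart Git Bash (or run "source ~/.bash_profile") for the alias to take effect.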
Usage
Download a PDF file onto your Desktop or some other location (e.g. "/path/to/my_document.pdf"), then navigate there from the command line. Then convert the PDF into a new TXT file (e.g. "/path/to/my_document.txt"):
pdftotext /path/to/my_document.pdf /path/to/my_document.txt
Then examine the contents of the TXT file in your text editor to see how well it was able to parse the original PDF document:
code /path/to/my_document.txt
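To convert several PDF files at once, the single command above can be wrapped in a loop. The following is a minimal sketch for a Bash-like shell, assuming pdftotext is available on the PATH. The function names txt_path_for and convert_all are hypothetical helpers, not part of pdftotext. The optional -layout flag asks pdftotext to preserve the original physical layout of the text as far as it can, and can be omitted.

```shell
# Derive the output filepath by swapping a trailing ".pdf" for ".txt".
txt_path_for() {
  printf '%s\n' "${1%.pdf}.txt"
}

# Convert every PDF file in a directory (defaults to the current directory).
convert_all() {
  dir="${1:-.}"
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # skip if the folder contains no PDF files
    pdftotext -layout "$pdf" "$(txt_path_for "$pdf")"
  done
}
```

For example, running "convert_all /path/to" would write "/path/to/my_document.txt" next to "/path/to/my_document.pdf", and likewise for any other PDF files in that folder.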