PyPDF2: an introduction

Posted by Afsal on 13-Oct-2023

Hi Pythonistas!

Today we will look at PyPDF2, a library that can read the contents of PDF files, merge two PDF files, rotate pages, and more. In this post we will learn how to extract text from a PDF file.

Let us dive into the code.

Installation

pip install PyPDF2

Extracting text from PDF

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")
pages = reader.pages
for page in pages:
    text = page.extract_text()
    print(text)

Explanation

PdfReader("sample.pdf") - Opens the PDF file named sample.pdf for reading

reader.pages - Gives all the pages of the document as an iterable of page objects

page.extract_text() - Extracts the text from the page and returns it as a string

Output

 A Simple PDF File

 This is a small demonstration .pdf file -

 just for use in the Virtual Mechanics tutorials. More text. And more

 text. And more text. And more text. And more text.

 And more text. And more text. And more text. And more text. And more

 text. And more text. Boring, zzzzz. And more text. And more text. And

 more text. And more text. And more text. And more text. And more text.

 And more text. And more text.

 And more text. And more text. And more text. And more text. And more

 text. And more text. And more text. Even more. Continued on page 2 ...

 Simple PDF File 2

 ...continued from page 1. Yet more text. And more text. And more text.

 And more text. And more text. And more text. And more text. And more

 text. Oh, how boring typing this stuff. But not as boring as watching

 paint dry. And more text. And more text. And more text. And more text.

 Boring.  More, a little more text. The end, and just as well.
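
The output above runs both pages together, so it is hard to tell where page 1 ends and page 2 begins. Here is a minimal sketch of one way to keep them apart, by collecting the text of each page under its page number. The helper names text_by_page and read_pdf_by_page are our own, chosen for illustration, and are not part of PyPDF2; only PdfReader, reader.pages and page.extract_text() come from the library.

```python
def text_by_page(pages):
    """Map 1-based page numbers to the text extracted from each page.

    `pages` is any iterable of objects with an extract_text() method,
    such as reader.pages. (Hypothetical helper, not a PyPDF2 API.)
    """
    result = {}
    for number, page in enumerate(pages, start=1):
        # Fall back to an empty string if a page yields no text,
        # for example a page that contains only a scanned image.
        result[number] = page.extract_text() or ""
    return result


def read_pdf_by_page(path):
    """Open the PDF at `path` and return its text keyed by page number.

    (Hypothetical helper, not a PyPDF2 API.)
    """
    # Imported here so text_by_page can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return text_by_page(reader.pages)
```

Calling read_pdf_by_page("sample.pdf") would return a dictionary with one entry per page, so we can print a header such as "Page 1" before each page's text, or pick out a single page by its number.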

Reading a PDF is as simple as this with PyPDF2. In upcoming posts we will learn how to merge two PDFs using this package.

I hope you have learned something new from this post. Please share your valuable suggestions with afsal@parseltongue.co.in