John Dalbey

My local public library uses the OverDrive platform to allow patrons to access ebooks and other digital resources. I prefer to read ebooks on a Kindle, and happily many books are available in Kindle format. However, some items are available only in EPUB format with DRM protection, or as OverDrive Read books. The DRM-protected EPUB books can only be read using Adobe Digital Editions, which is not available for Linux, my preferred OS. So my only option is to view the OverDrive Read book in a web browser.

This simple Python program converts the ebook in the browser into a plain text file.

The mechanics work like this:
1. Take a screenshot of a page of the book as displayed in the browser.
2. Use Optical Character Recognition (OCR) to convert the screenshot into plain text (using Tesseract).
3. Save the text to a local file.
4. Advance to the next page by simulating a key press.
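
The four steps above can be sketched as a small loop. This is a minimal illustration, not the actual source of ebook_scraper.py: the names scrape_pages and make_real_backends are hypothetical, and the choice of the right-arrow key, the one-second delay, and the screen region are assumptions. The capture, OCR, and page-advance steps are passed in as functions so the loop itself can be tried without a display.

```python
from typing import Any, Callable


def scrape_pages(
    num_pages: int,
    capture: Callable[[], Any],
    ocr: Callable[[Any], str],
    advance: Callable[[], None],
    out_path: str,
) -> int:
    """Run the capture -> OCR -> save -> advance loop for num_pages pages."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for page in range(num_pages):
            image = capture()                  # step 1: screenshot of the page
            text = ocr(image)                  # step 2: image to plain text
            out.write(text.rstrip() + "\n\n")  # step 3: save to the local file
            written += 1
            if page < num_pages - 1:
                advance()                      # step 4: simulated key press
    return written


def make_real_backends(region, next_key="right", delay=1.0):
    """Wire the loop to pyautogui and pytesseract.

    Assumptions: region is (left, top, width, height) in screen pixels,
    the reader advances on the right-arrow key, and one second is long
    enough for the next page to render.
    """
    import time

    import pyautogui    # imported lazily: needs a graphical session
    import pytesseract  # needs the tesseract-ocr binary installed

    def capture():
        return pyautogui.screenshot(region=region)

    def advance():
        pyautogui.press(next_key)
        time.sleep(delay)

    return capture, pytesseract.image_to_string, advance
```

With the dependencies installed, a run might look like `capture, ocr, advance = make_real_backends(region=(100, 150, 800, 550))` followed by `scrape_pages(20, capture, ocr, advance, "ebook_content.txt")`. The region and page count here are placeholder values.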

When completed, the resulting text file can be converted to Kindle format using Calibre or another conversion tool.
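
If Calibre is the conversion tool, the step can be scripted with its ebook-convert command-line program, which picks the output format from the output file's extension. The sketch below is one way to do this, not part of the scraper itself: the function names are hypothetical and AZW3 as the target Kindle format is an assumption.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(text_file: str, target_ext: str = ".azw3") -> List[str]:
    """Build the ebook-convert command for a scraped text file.

    The output file sits next to the input with the extension swapped.
    """
    src = Path(text_file)
    return ["ebook-convert", str(src), str(src.with_suffix(target_ext))]


def convert(text_file: str, target_ext: str = ".azw3") -> None:
    """Run the conversion, failing early if Calibre is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("Calibre's ebook-convert was not found on PATH")
    subprocess.run(build_convert_command(text_file, target_ext), check=True)
```

Calling `convert("ebook_content.txt")` would then produce ebook_content.azw3 in the same directory.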

I wrote this utility for my own use and it is clearly not production-quality software. It has only minimal error handling. Like many automation tools that rely on simulating human input, it is somewhat brittle and easily confused by errant user actions. The OCR is not perfect and doesn't perform well on multiple columns, sidebars, footnotes, tables, etc. Despite these limitations, it has worked really well for me.

The source code is pretty straightforward and has some explanatory comments so it should be easy to modify or enhance.
Use the ticket system to submit defect reports or enhancement requests.

Installation

Download the source code:
curl "https://sourceforge.net/p/ebook-scraper/code/HEAD/tree/trunk/ebook_scraper.py?format=raw" > ebook_scraper.py
Install dependencies:
sudo apt-get install tesseract-ocr
Create a virtual environment and activate it:
python3 -m venv ebook-env
source ebook-env/bin/activate
Install required Python modules:
pip install PySimpleGUI pyautogui pytesseract
Run the application:
python ebook_scraper.py

Usage

Open a web browser with the OverDrive book you want to read, then launch the Ebook Scraper and position its opening dialog side by side with the browser.
1. In the browser, advance to the page where you want to start scraping.
2. On the top banner, click the column control and select single column.
3. Click near the bottom of the page to show the progress bar.
4. Enter the desired start and end page numbers in the dialog form.
5. Click on the text to hide the progress bar.
6. Click OK in the dialog.

Next you will be prompted to locate the area of the screen that contains the text to be scraped. Once the boundaries are located, scraping starts automatically. Sit back and watch as the browser pages flip by. Don't touch the mouse or keyboard until the scraping is complete.
The output file, ebook_content.txt, is written to the same directory as the program.
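
The page does not say how the tool records the boundaries of the text area. As an illustration only, assume the user marks two opposite corners with the mouse (for example, read with pyautogui.position()). The hypothetical helper below turns those two points into the (left, top, width, height) tuple that pyautogui.screenshot accepts as its region argument, regardless of which corner was picked first.

```python
from typing import Tuple

Point = Tuple[int, int]


def region_from_corners(a: Point, b: Point) -> Tuple[int, int, int, int]:
    """Convert two opposite corners into (left, top, width, height)."""
    left, right = sorted((a[0], b[0]))
    top, bottom = sorted((a[1], b[1]))
    width, height = right - left, bottom - top
    if width == 0 or height == 0:
        raise ValueError("corners must span a non-empty rectangle")
    return left, top, width, height


# Corners may be given in either order.
print(region_from_corners((900, 700), (100, 150)))  # (100, 150, 800, 550)
```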


I'm not a lawyer, nor have I consulted one on this issue. My claim is that using this tool for personal use is legal, and I support that claim with several arguments.

  1. I obtained the original material, the ebook, by checking it out from the library, the lawful way to proceed. So I am allowed to view the material assuming I do so within the time period for which I have checked out the book. This tool simply facilitates my viewing the ebook on a different device (a Kindle reader) instead of a web browser.

  2. This tool does not break or hack the DRM features of the ebook.

  3. An analogy could be made with recording a song on the radio or using a DVR to record a TV show. In the U.S. the "Audio Home Recording Act of 1992" permits recording audio from the radio, and the Betamax ruling of 1984 considered it a fair use to record programs off the television to be watched at a later date.
