HomeThe ClassicsFarai's Codelab

A Really Bad Way To Extract PDFs

Published: Updated:

Note 30 August 2023: It’s been a while since I’ve looked at this. As I’m looking back at all of my old work, I modified the post slightly to fix the errors I had missed. The source code is also online now.


Zimbabwe had elections in 2018 and I noticed how badly the election results were presented and decided fix that. It would essentially be a data visualization project where I would turn the ugly mess of PDF files and transform it into a beautiful front end. Before I start doing all that however, I had to turn the mess of PDF files into a workable format.

The Easy Parts

Presidential Results

The Presidential results were placed into .xlsx files which meant that I just had to copy-paste them into a single spreadsheet. After a quick check and cleaning up the headers, I got a spreadsheet with over 10,000 rows.

Disability Senators

There were onlytwo, I wrote them out manually.

The Council of Chiefs

Same as the disability senators—I wrote them out manually though there are 16 instead of 2.

The Hard Parts

Collecting the Polling Stations

The polling stations could be found in a 100+-page PDF file. Since the data was in a single table across those pages, it was a bit easier to do than the other scraping tasks. I wanted to find a library that could scrape PDFs but I couldn’t find any in Python. I then noticed how PDFs are rendered in the browser (Firefox) and discovered that PDFs are rendered into HTML. I then thought to myself “why not view source and copy all the code”.

Well, that didn’t work.

I then realized that the rendering was JavaScript based so I’d have to use the inspector tool and grab the HTML for the markup responsible for the PDF’s contents.

That didn’t work either.

Turns out that not all the pages are rendered due to performance reasons. Fine, I’d scroll to the end of the PDF and then grab the HTML. Again, for performance reasons the pages I’ve scrolled past were unloaded. Instead of using the PDF.js library in node to render the document, I did something truly ridiculous—I zoomed all the way out to force every page to render at once. This the point at which I realized that performance concerns are very real.

The 100 PDF pages loading in Firefox

You thought that I was joking?

These became even more real once I tried to open the file in Visual Studio Code and it struggled. It eventually opened and it was time to devise a strategy to extract the PDFs.

Studying the HTML, I discovered a structure.

  1. All the PDF’s pages were rendered in a div with a class of textLayer.
  2. Each of the cells in the table were enclosed in a span tag and positioned to look like a table.
  3. While all the cells were spans, cells under a particular column had the same left offset.
  4. Along those lines, cells within a row had the same top offset.
  5. A row had 11 columns in it.

Realizing 5, I made a script parse_polling_stations.py that would

  1. Remove the header (containing the title) and the footer (page number) since it would be part of the table if I didn’t
  2. Extract all the pages containing the table data.
  3. Go through each page and their respective cells. To do this, I used the enumerate function to get a page number and the page contents.
  4. Loop through each row, adding each of the cells contents.
  5. Write it to a CSV file.

The first time I ran this I felt so smart. THat was until I realized that the last bunch of rows were misalligned. After combing through 6000 odd items, I discovered that a cell (from a Bikita district) didn’t have any data in it. I was wise enough to take note of that cell so I didnt have to do all that again just to tell you.

Eventually I got a massive CSV file. Not as big as the original HTML, but still big.

National Assembly

This one needed a whole lot of copy pasting. At first I tried to use a regular expression and I got something that did suprisingly well. Sadly there were a whole load of edge cases that couldn’t be parsed with Python’s dialect of regular expression. I then discovered another approach—copy-paste.

After hours of painstakingly copying data from every single presinct into their own TXT files (I set up in a clever way to make getting the names easier), I made a script to look over every file and grab the election results from each race and dump it into a csv file. This was a whole lot simpler and could be simpler if the respective political parties were coded in a regular way. Sadly I had to make a huge script to replace every single deviation in the ancronym/abbreviations of the parties. Not pretty but it worked.

Womens Assembly

Using what I learned from the National Assembly parsing, I did the same thing here.

What’s Next

I started working on this in February, but I dropped it to focus on other things. I picked it up again in the past few weeks. I decided to write up my process in hopes that someone might understand it. Here’s what I’m hoping to do next with the project.

  1. Add additional file formats (SQLite, XLSX, etc)
  2. Make a web portal for the project.
  3. Refactor the code and the project structure.
  4. Add infomration on ???