So recently I came across a web page called How I manage my ebooks by a fellow named Aleksandar Todorović. He is a developer who wanted to track his reading on his web page. He introduced me to a Calibre project called Calibre-Web, which is basically a web interface for Calibre with a few extra bells and whistles. Reading through his explanation, it seemed pretty simple to implement except for this statement:

As a final step in the chain, I have created a script that allow me to publish the list of books I’ve read on my website. Since Calibre-Web doesn’t have an API, I ended up scraping my own server using Python Requests and BeautifulSoup. After about one hundred lines of spaghetti code gets executed, I end up with two files:

  • books-read.md, which goes straight to my CMS, allowing me to publicly share the list of books I have read, sorted by the year in which I’ve finished reading them.

The Process

So I set about to try and implement my own version of Aleksandar’s project. In my typical trial and error fashion it took a couple of days of steady work and I learned a ton along the way.

Calibre-Web

I went ahead and downloaded Calibre-Web and wrestled with getting it running on my test server (my old mac-mini). It is a Python script, and I am still a bit fuzzy about the proper way to actually run it. I ended up writing a shell script to run the command "nohup python /Applications/calibre-web-master/cps.py" and then made it executable from my desktop. I still have some work to do there to finalize that solution.

I have to say I really like the interface of Calibre-Web much more than the desktop Calibre and although there are a few quirks, I will likely be using the web version much more than the desktop from now on.

Then I made a few shelves with the books I had read in 2019 and 2020 and was good to go. Now I just needed to get those Shelves onto my website somehow.

Web Scraping

Now I had never heard of the term web scraping, but the concept was familiar and it turns out it is quite the thing.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites
Web scraping, Wikipedia

The theory being that since all the info is accessible in the underlying code of the Calibre-Web pages, all I needed to do was extract and format it, then repost it to this site. So I did. Voila: My Books Read page.

I guess I skipped the tough part…

Starting out I understood Python was a programming language, but had no idea what Python Requests or BeautifulSoup were. Turns out that Python Requests is a library for fetching web pages (it handles the HTTP side of things) and BeautifulSoup is a library for picking apart the long strings of HTML you get back and turning them into useful data.
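To make the division of labor concrete, here is a tiny sketch of the BeautifulSoup half. The HTML snippet is made up for illustration; in a real run, Requests would supply it via something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup  # the package for BeautifulSoup 4 is "bs4"

# A made-up HTML fragment standing in for a fetched page
html = """
<div class="book">
  <h2 class="title">The Dispossessed</h2>
  <p class="author">Ursula K. Le Guin</p>
</div>
"""

# BeautifulSoup turns the raw HTML string into a searchable tree
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h2", class_="title").get_text(strip=True)
author = soup.find("p", class_="author").get_text(strip=True)
print(f"{title} by {author}")  # The Dispossessed by Ursula K. Le Guin
```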

Start with Google

I started with a quick search and found a few likely examples to follow along with.

https://medium.com/the-andela-way/introduction-to-web-scraping-87edf94ac692
https://medium.com/the-andela-way/learn-how-to-scrape-the-web-2a7cc488e017
https://www.dataquest.io/blog/web-scraping-beautifulsoup/

These were helpful in explaining the structure and giving me some basic coding ideas, but I mostly relied on https://realpython.com/beautiful-soup-web-scraper-python/ to base my own code on.

Step one

I got everything running (this included sorting out the mess that is Python on my computer, but that is another story) and tried to get a basic Python script to talk to my Calibre-Web installation. Turns out that even though my web browser was logged into Calibre-Web, my script wasn’t. Some more googling found me this video (Website login using request library in Python), which did the trick for writing the login portion of my script.
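The login step can be sketched roughly like this. The idea is that a `requests.Session` holds on to the authentication cookie, so any pages fetched through the same session afterwards stay logged in. The port, path, form field names, and credentials below are all placeholders; check what your own Calibre-Web login form actually posts.

```python
import requests

BASE_URL = "http://localhost:8083"  # placeholder; point this at your own server

def make_login_request(username, password):
    """Build the POST that the login form would submit (field names assumed)."""
    return requests.Request(
        "POST",
        BASE_URL + "/login",
        data={"username": username, "password": password},
    ).prepare()

# A Session keeps the auth cookie so later page fetches are authenticated
session = requests.Session()
req = make_login_request("admin", "admin123")  # placeholder credentials
# In a real run you would send it and then fetch shelf pages:
#   resp = session.send(req); resp.raise_for_status()
#   shelf_html = session.get(BASE_URL + "/shelf/1").text
```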

Step two

Then I wrote a basic script that extracted data (much more on this later) and saved it to a markdown file on the webserver. I figured markdown was easier to implement than html and knew WordPress could handle it.
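The extract-and-save step looks roughly like this. The HTML fragment and its class names are invented stand-ins for whatever markup your Calibre-Web shelf page actually uses (in practice `shelf_html` would come from the logged-in session), but the shape of the loop is the same: find each book entry, pull out the fields, and write a markdown line per book.

```python
from bs4 import BeautifulSoup

# Stand-in for a shelf page fetched with the logged-in session;
# the real class names will depend on your Calibre-Web templates
shelf_html = """
<div class="book"><p class="title">Dune</p><p class="author">Frank Herbert</p></div>
<div class="book"><p class="title">Hyperion</p><p class="author">Dan Simmons</p></div>
"""

soup = BeautifulSoup(shelf_html, "html.parser")
lines = ["# Books Read", ""]
for book in soup.find_all("div", class_="book"):
    title = book.find("p", class_="title").get_text(strip=True)
    author = book.find("p", class_="author").get_text(strip=True)
    lines.append(f"- *{title}* by {author}")

markdown = "\n".join(lines) + "\n"
with open("books-read.md", "w") as f:  # write wherever your webserver expects it
    f.write(markdown)
```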

Or could it? Turns out the Jetpack implementation was choking on my markdown file for some reason. I fought with it for a while and eventually decided to see if I could find a different WordPress plugin to do the job. Turned out I could kill two birds with one stone using Mytory Markdown, which would actually load (and reload) a correctly formatted remote .md file to a page every time someone visited.

Step three

After I got a sample page loaded on the website, I realized that it was missing the pub date and series name which, if you have ever visited one of my annual books read posts (Last Books of the decade: 2019, Books 2018—Is this the last year?, etc.), is essential information. So I had to go into the Calibre-Web code and add those particular pieces of info to the shelf page so I would be able to scrape it all at the same time. I ended up adding this:

{% if entry.series|length > 0 %}
    <p class="series">
        {{_('Book')}} {{entry.series_index}} {{_('of')}} <a href="{{url_for('web.books_list', data='series',sort='abc', book_id=entry.series[0].id)}}">{{entry.series[0].name}}</a>
    </p>
{% endif %}


{% if entry.pubdate[:10] != '0101-01-01' %}
    <p class="publishing-date">{{entry.pubdate|formatdate}} </p>
{% endif %}

…to shelf.html in the /templates folder of the Calibre-Web install. I added it around line 45 (just after the {% endif %} for the author section). It took a bit of fussing to look good, but it worked out great.

Step four

Now all I have to do is figure out how to run my scrape.py script. For now I will leave it as a manual process and just run it after I update my Calibre-Web shelves, but making that automatic is on the list for “What’s Next…”

Ta-da

So between this post and Aleksandar’s, I hope you have a basic idea of what you need to do in order to try and implement this solution. More importantly, when future me comes back and tries to figure out what the hell all this gobbledygook mess is, I can rebuild the system based on these sketchy notes. I will end this here and continue in a new post on the actual Python/BeautifulSoup code I came up with to get the web scraping done.