In my previous post I explained that I was looking for a way to use web scraping to extract data from my Calibre-Web shelves and automatically post it to my Books Read page here on my site. In this post I will step through my final Python script to explain to my future self what I did and why.

Warning: Security

A heads up: I make no guarantees that this code is secure enough to use in a production environment. In fact I would guess it isn’t. But my Calibre-Web server is local to my home network, and I trust that my hosted server (macblaze.ca) is secure enough. Since the script passes passwords and the like back and forth, though, I wouldn’t count on any of this being secure without a lot more effort than I am willing to put in.
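
If I ever want to harden it even a little, one low-effort first step would be keeping the password out of the script itself, for example in an environment variable. A minimal sketch (CALIBRE_PASS is a made-up name):

import os

# hypothetical: keep the password out of the script itself
password = os.environ['CALIBRE_PASS']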

The code in bits

# import various libraries

import requests
from bs4 import BeautifulSoup
import re

This loads the various libraries the script uses. Requests is an HTTP library that lets you send requests to websites, BeautifulSoup is a library for pulling data out of HTML, and re is Python’s regular-expression library, used here for custom pattern searches.

# set variables

# set header to avoid being labeled a bot
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

# set base url
urlpath='http://urlpath'

# website login data
login_data = {
    'next': '/',
    'username': 'username',
    'password': 'password',
    'remember_me': 'on',
}

# set path to export as markdown file 
path_folder="/Volumes/www/books/"
file = open(path_folder+"filename.md","w")

This sets up the various variables used for login, including a header to try to avoid being labeled a bot, the base URL of the Calibre-Web installation and the login data, and it specifies a location and name for the resulting markdown file. The open command uses the ‘w’ mode to indicate the script will write a new file every time it is executed, overwriting the old one.
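
If I ever rework this, a with block would be the more idiomatic way to open the file, since it closes the file automatically even if the script dies partway through. Roughly:

# a with block closes the file automatically, even if the script errors out
with open(path_folder + "filename.md", "w") as file:
    file.write("# Books Read\n")
    # ...the rest of the script, indented under the with block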

# log in and open http session

with requests.Session() as sess:
    url = urlpath+'/login'
    res = sess.get(url, headers=headers)
    res = sess.post(url, data=login_data)

Then, using Requests, the script opens a session on the webserver and logs in using the variables set above.
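
The script just assumes the login worked. A rough sanity check (my addition, not part of the script) could look at where the POST ended up, since a failed Calibre-Web login typically lands back on the login page:

# rough check that the login worked; a failed login typically
# ends up back on the login page
if 'login' in res.url:
    raise SystemExit("Login failed: check the username and password")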

Writing the File

Note: The code has matching file.write() and print() statements throughout. The print() statements just echo to the terminal so I can see what file.write() is putting into the actual file. They are completely unnecessary.

# Set Title

file.write("# Books Read\n")
print("# Books Read\n")

Pretty basic: write the words Books Read followed by a newline, tagged with a # to indicate it is an h1 heading. This will become the actual page name.

# find list of shelves

shelfhtml = sess.get(urlpath)
soup = BeautifulSoup(shelfhtml.text, "html.parser")
shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
print (shelflist)

So now we set the variable shelfhtml to the response from fetching the base URL over the session we opened earlier. Using BeautifulSoup we parse the HTML and search for all a links whose href matches the regular expression /shelf/[1-9]. (Hopefully I won’t have more than 9 shelves or I will have to redo this bit.) The shelflist variable now contains a list of all the links that match that pattern and looks like this:

[<a href="/shelf/3"><span class="glyphicon glyphicon-list private_shelf"></span>2018</a>, <a href="/shelf/2"><span class="glyphicon glyphicon-list private_shelf"></span>2019</a>, <a href="/shelf/1"><span class="glyphicon glyphicon-list private_shelf"></span>2020</a>]

This, as you can see, contains the links to all three of my current Year shelves, displayed in ascending numerical order.
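
For future me: if the shelf count ever climbs past nine, swapping the regex for one that matches any run of digits should cover it. A one-line variation:

# match any number of digits in the shelf id, not just a single 1-9
shelflist = soup.find_all('a', href=re.compile(r'/shelf/\d+'))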
 

# reverse order of shelflist

dateshelflist = list(shelflist)
dateshelflist.reverse()
print (dateshelflist)

I wanted to display my book lists from newest to oldest, so I copied the shelf list and used Python’s reverse() to flip its order.
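
Reversing works because Calibre-Web happens to list the shelves in ascending order. A slightly sturdier variation (not what the script does) would be to sort on the visible shelf name instead:

# sort shelves newest-first by their visible label (e.g. "2020")
# rather than trusting the order they appear on the page
dateshelflist = sorted(shelflist, key=lambda a: a.text, reverse=True)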

First loop: the shelves

The first loop runs through all the shelves (in this case three of them) and starts the process of building a book list for each.

# loop through sorted shelves

for shelf in dateshelflist:
    #set shelf page url
    res = sess.get(urlpath+shelf.get('href'))
    soup = BeautifulSoup(res.text, "html.parser")

    # find year from shelflist and format
    shelfyear = soup.find('h2')
    year = re.search("([0-9]{4})", shelfyear.text)
    file.write("### {}\n".format(year.group()))
    print("### {}\n".format(year.group()))

In the first iteration of the loop, the script goes to the actual shelf page by appending an href, extracted from the list item with the .get('href') method, to the base URL, and then parses the HTML from the resulting webpage. Then the script finds the year info, which is in an H2, extracts the 4-digit year with the regex ([0-9]{4}) and writes it to the file, formatted as an H3 header and followed by a line break.
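
As a quick illustration of what that regex does, here it is run against a made-up shelf heading:

# quick demo of the year-extraction regex on made-up H2 text
sample = "2020: Books Read"   # hypothetical shelf heading
match = re.search("([0-9]{4})", sample)
print(match.group())   # prints 2020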

# find all books

books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')

Using BeautifulSoup we extract the list of books from the page, knowing they are all marked with a div of class col-sm-3 col-lg-2 col-xs-6 book.
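
Worth remembering: BeautifulSoup matches class_ against each of an element’s CSS classes separately, so if a Calibre-Web update ever changes the Bootstrap column classes, matching on just the book class should be more resilient:

# class_ matches each of an element's classes individually, so this
# keeps working even if the Bootstrap column classes change
books = soup.find_all('div', class_='book')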

Second loop: the books

#loop through books. Each book is its own BeautifulSoup Tag object.

for book in books:
        title = book.find('p', class_='title')
        author = book.find('a', class_='author-name')
        seriesname = book.find('p', class_='series')
        pubdate = book.find('p', class_='publishing-date')
        coverlink = book.find('div', class_='cover')
        if None in (title, author, coverlink, seriesname, pubdate):
            continue
        # extract year from pubdate
        pubyear = re.search("([0-9]{4})", pubdate.text)

This is the beginning of the second loop. For each book we use soup to extract the title, author, series, pubdate and cover (which I don’t end up using). Each search is based on the class assigned to it in the original HTML code. Because I only want the pub year and not the full pub date, I again use a regex to extract the 4-digit year. The if None… statement is there in case one of the fields is missing: it skips that book and prevents the script from crashing when it tries to read .text from a result that doesn’t exist.
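
One side effect of that continue is that a book missing only its series entry disappears from the list entirely. If that ever bothers me, a fallback value would keep the book in; roughly like this (seriestext is a made-up name):

# alternative to skipping: substitute placeholder text when the
# series (the usual missing field) isn't there
seriestext = seriesname.text if seriesname else "standalone"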

# construct line using markdown

newstring = "* ***{}*** — {} ({})\\\n{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
file.write(newstring)
print (newstring)

Next we construct the book entry based on how we want it to appear on the web page. In my case I want each entry to be a list item that ends up looking like this:

  • The Cloud Roads — Martha Wells (2011)
    Book 1.0 of Raksura – ebook

Python’s format() method fills each {} placeholder in order with the variables listed at the end of the statement, which makes for easier formatting. The backslash at the end of the first line (the \\ followed by \n in the string) is markdown’s hard line break, and it is what pushes the series info onto its own line. The script then writes the line to the open markdown file and heads back to the beginning of the loop to grab the next book.
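
For what it’s worth, on Python 3.6 or newer the same line can be written as an f-string, which puts each variable right where it appears in the text:

# the same line as an f-string (Python 3.6+)
newstring = f"* ***{title.text}*** — {author.text} ({pubyear.group()})\\\n{seriesname.text} – ebook\n"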

More loops

That’s pretty much it. The script loops through the books until it runs out, then heads back to the first loop to see if there is another shelf to process. After it processes all the shelves it drops to the last line of the script:

file.close()

which closes the file and that is that—c’est tout. It will now be accessed the next time someone visits the Books Read page on my site.

In Conclusion

Hopefully this is clear enough so that when I forget every scrap of Python in the years to come I can still recreate this after the inevitable big crash. The script, called scrape.py in my case, is executed in the terminal by going to the enclosing folder, typing python3 scrape.py and hitting enter. Automating that is something I will ponder if this book list thing becomes my ultimate methodology for recording books read. Its big failing is that it only records ebooks in my Calibre library. I might have to redo the entire thing for something like LibraryThing where I can record all my books…lol. Hmmm… maybe…
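
If I ever do automate it, a cron job is the obvious route. A hypothetical crontab entry (the path is a placeholder) that regenerates the list nightly might look like:

# hypothetical crontab entry: rerun the scrape every night at 2 a.m.
0 2 * * * cd /path/to/script && python3 scrape.py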

The Final Code

Here is the final script in its entirety.

# import various libraries
import requests
from bs4 import BeautifulSoup
import re

# set header to avoid being labeled a bot
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

# set base url
urlpath='http://urlpath'

# website login data
login_data = {
    'next': '/',
    'username': 'username',
    'password': 'password',
    'remember_me': 'on',
}

# set path to export as markdown file
path_folder="/Volumes/www/home/books/"
file = open(path_folder+"filename.md","w")

with requests.Session() as sess:
    url = urlpath+'/login'
    res = sess.get(url, headers=headers)
    res = sess.post(url, data=login_data)

# Note: print() commands are purely for terminal output and unnecessary

# Set Title
file.write("# Books Read\n")
print("# Books Read\n")

# find list of shelves
shelfhtml = sess.get(urlpath)
soup = BeautifulSoup(shelfhtml.text, "html.parser")
shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
# print (shelflist)

#reverse order of shelflist
dateshelflist = list(shelflist)
dateshelflist.reverse()
# print (dateshelflist)

# loop through sorted shelves
for shelf in dateshelflist:

    #set shelf page url
    res = sess.get(urlpath+shelf.get('href'))
    soup = BeautifulSoup(res.text, "html.parser")

    # find year and format
    shelfyear = soup.find('h2')
    year = re.search("([0-9]{4})", shelfyear.text)
    file.write("### {}\n".format(year.group()))
    print("### {}\n".format(year.group()))

    # find all books
    books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')

    #loop through books. Each book is its own BeautifulSoup Tag object.
    for book in books:
        title = book.find('p', class_='title')
        author = book.find('a', class_='author-name')
        seriesname = book.find('p', class_='series')
        pubdate = book.find('p', class_='publishing-date')
        coverlink = book.find('div', class_='cover')
        if None in (title, author, coverlink, seriesname, pubdate):
            continue
        # extract year from pubdate
        pubyear = re.search("([0-9]{4})", pubdate.text)
        # construct line using markdown
        newstring = "* ***{}*** — {} ({})\\\n{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
        file.write(newstring)
        print (newstring)

file.close()

Note 12/2021

There has been an update to the Calibre-Web code, so I had to make some changes to the Python script.