Hugo again

A quick update

Having (mostly) successfully updated my professional site using Hugo, I decided to take a swing at L’s. Her site is more of a blog format, so it gave me some insight into how that kind of functionality could be used.

I also learned I was a dinosaur who still used floats and ended up updating both it and my own site to use flexbox. Learn something new every day! That in itself is worth a post or two.

Netlify

Under the category of learning and things I need to go into greater detail about later, I also switched the hosting of readingwithapencil.com from wordpress.com to Netlify. As a result I have a more flexible site without actually having to pay for anything (WordPress charged for the use of a custom domain).

The workflow is built around GitHub, which I have been using more and more with the Standard Ebooks project, so it is pretty smooth. All in all it is really worthy of a post of its own, but there are so many videos out there it might just serve you (the reader) better to go watch a couple of them.

But I will eventually jot down my thoughts here…at least so I can figure out what I did when I inevitably break it—that being, originally, the whole purpose of this site.

Hugo!

While I was looking into linux, I came across a vlog that recommended using Hugo and Netlify as a way to maintain a free web presence. I’ve found a lot of these sorts of things (“free”) and even went so far as to set up a small site using the free parts of Google Cloud to get my Python project up and running.

But what struck me about Hugo was that it produces a static website, and is therefore faster and more secure than the typical WordPress install. And it was an intriguing concept that you could mimic the flexibility of a dynamic site using static pages. So I decided to give it a go.

I decided to leave Netlify as an experiment for a future project and set about rebuilding the site using Hugo. As a result my old, much-ignored portfolio site astart.ca is now refreshed and way more speedy, even though I didn’t change the content or the host.

So what is it?

I will get into it more in a later post. But basically it’s a framework that lets you build the website using templates and pseudo-dynamic techniques, and when you are ready to go you just “publish” the project and it exports the whole website as static pages. Like WordPress, it supports a ton of themes, although again I decided to build my own from scratch.

Pages are built using Markdown. It’s a versatile markup language and one I keep trying to use, so one of the side benefits of this is that I have become much more facile with it. The gist is that now that the site is built, all you have to do is open a text file, add content using Markdown formatting, link to accompanying pictures and then just run a short “deploy” script to automatically rebuild the site and upload it to your host. Simple.
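The deploy script itself is nothing fancy; here is a minimal sketch of the idea in Python, just to show the shape of it (the real thing is probably better done as a couple of shell commands). The public/ folder is Hugo’s default output directory, and the host and remote path below are placeholders, not my real setup:

# deploy.py -- rough sketch of a "rebuild and upload" step for a Hugo site.
# Assumes hugo and rsync are installed; the host and remote path are placeholders.
import subprocess

def deploy():
    # regenerate the static site; Hugo writes the pages into ./public by default
    subprocess.run(["hugo"], check=True)
    # copy the generated files up to the web host (placeholder destination)
    subprocess.run(
        ["rsync", "-avz", "--delete", "public/", "user@example.com:/var/www/site/"],
        check=True,
    )

if __name__ == "__main__":
    deploy()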

It’s perfect for a site like the portfolio that doesn’t change much, but now I am going to try it on a more blog-oriented site to see if it will stand up to more frequent posting. I will let you know.

As for astart.ca, well, it’s up and running and has a fresh new design. Check it out. Now all I have to do is dig up more current material to post. And that’s the hard part 😉

Books Read & Calibre Web Update

Previously (Making A “Books Read” Page) I posted how I modified the default Calibre-Web templates to add the series and pub date information so I could scrape it. Well, it’s gotten a bit more complex since then. Someone submitted a similar mod to the GitHub repository, which has not yet been incorporated. They didn’t add the pub dates, but did add the series to a few more pages, so I thought I would restate my changes here for future reference.

/templates/shelf.html

Add both series info and pub date for the python web scraping program to access:

{% if entry.series|length > 0 %}
    <p class="series">
        {{_('Book')}} {{entry.series_index}} {{_('of')}} <a href="{{url_for('web.books_list', data='series',sort='abc', book_id=entry.series[0].id)}}">{{entry.series[0].name}}</a>
    </p>
{% endif %}


{% if entry.pubdate[:10] != '0101-01-01' %}
    <p class="publishing-date">{{entry.pubdate|formatdate}} </p>
{% endif %}

Added around line 45 (just after the {% endif  %} for the author section). 

/templates/index.html

Add only series info just for aesthetics (note the code is from the proposed mod and is slightly different):

        {% if entry.series.__len__() > 0 %}
        <p class="series">
          <a href="{{url_for('web.books_list', data='series', sort='new', book_id=entry.series[0].id )}}">
            {{entry.series[0].name}}
          </a> 
          ({{entry.series_index}})
        </p>
        {% endif %}

Added around line 111 (just after the {% endif %} for the author section); it was previously around line 193. (This might now be 36… there seems to have been a change…)

/templates/discover.html

        {% if entry.series.__len__() > 0 %}
        <p class="series">
          <a href="{{url_for('web.books_list', data='series', sort='new', book_id=entry.series[0].id )}}">
            {{entry.series[0].name}}
          </a> 
          ({{entry.series_index}})
        </p>
        {% endif %}

Added around line 36 (just after the {% endif  %} for the author section).


I am trying to figure out a way to automate the mods in case the main repository doesn’t incorporate the changes, but so far an elegant solution eludes me.
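One brute-force idea, purely a sketch and not something I actually run: a little Python script that checks each template for the series block and splices it in after a marker line. The install path, the template names and the marker below are assumptions that would need checking against a real Calibre-Web install:

# patch_templates.py -- rough sketch for re-applying the template mods after a
# Calibre-Web update. Paths, template names and the marker are assumptions.
from pathlib import Path

SERIES_SNIPPET = """\
{% if entry.series|length > 0 %}
    <p class="series">
        {{_('Book')}} {{entry.series_index}} {{_('of')}} <a href="{{url_for('web.books_list', data='series',sort='abc', book_id=entry.series[0].id)}}">{{entry.series[0].name}}</a>
    </p>
{% endif %}
"""

def patch(template_path, marker):
    text = template_path.read_text()
    if 'class="series"' in text:
        print(f"{template_path}: series block already present, skipping")
        return
    # splice the snippet in right after the first occurrence of the marker
    idx = text.index(marker) + len(marker)
    template_path.write_text(text[:idx] + "\n" + SERIES_SNIPPET + text[idx:])
    print(f"{template_path}: patched")

if __name__ == "__main__":
    # assumed location of the Calibre-Web templates folder
    templates = Path("/Applications/calibre-web-master/cps/templates")
    for name in ("shelf.html", "index.html", "discover.html"):
        # hypothetical marker: in reality it should be the {% endif %} that
        # closes the author section, not just the first one in the file
        patch(templates / name, "{% endif %}")

It would still need to handle the slightly different snippets for index.html and discover.html, which is part of why “elegant” is doing a lot of work in that last sentence.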

Flask Part Deux

A continuation of The Great Flask Adventure.

The structure

When last we left our heroes we had posted a groovy Python script: Mark III. This was saved as yacht_app.py in a folder. The rest of the files were built and also stored there. The structure of the folders is thus:

[searchyachtworld]
—yacht_app.py
—[output]
——boatlist.json (a file generated by the app)
—[static]
——[css]
———style.css
——[images]
———artboard.png
—[templates]
——index.html
——results.html
——template.html

Back to the app

The app/Python file consists of several parts, mostly mini routes that render results to a specific template. The simplest is:

@app.route("/")
def home():
    return render_template("index.html")

This simply displays the “index.html” file, which is a basic form. The next is:

@app.route('/results')
def results():
    data = []
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
    return render_template("results.html", boatlist=data['boats'],predata=data['fileinfo'])


This renders “results.html,” opening the boatlist.json file and passing it in as the page’s data.

The next one handles “index.html” after the search button is clicked. It uses a form POST request to gather the input data and executes the rest of the Python script using that data. I am not going to get into that as it’s just a variation of the Book Page scraping.

I did add a bit at the end that reopens the output json file and uses the submitted search parameter to reorder it before moving on to the results page.

@app.route("/", methods=['POST'])
def echo():
    #get index form data
    if request.method == "POST":
        inputcurr=request.form["inputcurr"]
        minprice=request.form["minprice"]
        maxprice=request.form["maxprice"]
        minlength=request.form["minlength"]
...
    data = []
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
        data['boats'].sort(key=keyparam)
    return render_template('results.html', boatlist=data['boats'],predata=data['fileinfo'])

Back to the HTML

Flask uses the template.html file to set all the default elements (header, navbar, style sheets, etc.).

I won’t bother with the code for the index page, but here is the results page, which is pretty simple: basically extracting the header information from the “predata” section of the JSON and then looping through the “boatlist” to display each boat.

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Yachtworld Results</title>
</head>

<body>
{% extends "template.html" %}
{% set active_page = "results" %}
{% block content %}
<div class="page-header">
    <h2 class="orange">YachtWorld Results</h2>
    <div id="preface">
        {% for pb in predata %}
    <div>
        <p>{{pb.Text}}<br/>Updated: {{pb.Date}}
        <br/><a href="{{pb.Creator}}">created by {{pb.Creator}}</a></p>
        <p>Price range : <strong>${{pb.Low}} </strong> and <strong>${{pb.High}}</strong> (${{pb.Currency}})<br/>
        Boat length: <strong>{{pb.Short}}'</strong> – <strong>{{pb.Long}}'</strong></p>
    </div>
    {% endfor %}
    </div>
    {% for boat in boatlist %}
    <div class="col-xs-6" style="min-height:170px;">
        <div class="col-md-5 text-right ">
            <img src="{{boat.Thumb}}" alt="" width="150px">
        </div>
        <div class="col-md-7">
            <h3><a href="{{boat.URL}}">{{boat.Name}}</a></h3>
            <p><strong>${{boat.Price}} </strong> / {{boat.Size}}<br/>
            {{boat.Location}}</p>
        </div>
    </div>
    {% endfor %}

{% endblock %}
</body>
</html>

Pretty simple really…lol.

In conclusion

Anyway, I don’t suspect anyone will actually understand/get much out of all this, and it’s here mostly for posterity. There are plenty of resources online to help dig into the code.

I am still playing with it and it will continue to evolve. I did post it on GitHub if anyone is interested in the latest version (I have already added some bits to handle price errors). I am still searching for a host to make it publicly available, but anyone can download it from GitHub if they want to run it on their own server.

The Great Flask Adventure

I just published a blog post over on neverforever.ca about trying to build a web app to scrape YachtWorld. I thought I would record the details here so I can remember what I have done. The complete (and updated) repository is on github if anyone is interested.

Why?

Some time in the recent past YachtWorld decided to redo their website. One of the outcomes is that you can no longer search for boats in multiple regions at the same time, so I now had to perform three separate searches with no way to “save” a previous search and compare. I figured I could adapt my newfound Python skills to scrape the site and deliver the output to the website.

Mark I

I copied my previous efforts and produced a Python script that generated a markdown file to view in a web browser.

Mark II

I decided to output a JSON file instead and then build a PHP page to read it using jQuery and JavaScript. The JSON format had two lists of dicts, one for general info and one for boat listings:

{
    "fileinfo": [
        {
            "Date": "April 03, 2020 08:46",
            "Text": "Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.",
            "Currency": "CAD",
            "Low": "30000",
            "High": "120000",
            "Short": "34",
            "Long": "48",
            "Creator": "http://neverforever.ca"
        }
    ],
    "boats": [
        {
            "URL": "https://www.yachtworld.com/boats/1980/cheoy-lee-clipper-42-ketch-3577567/",
            "Name": "Cheoy Lee Clipper 42 Ketch",
            "Price": "80,000",
            "Size": "42 ft / 1980",
            "Location": "Vancouver, British Columbia, Canada",
            "Thumb": "https://newimages.yachtworld.com/resize/1/16/77/7191677_20190822081237806_1_LARGE.jpg?f=/1/16/77/7191677_20190822081237806_1_LARGE.jpg&w=520&h=346&t=1566486758"
        }
    ]
}

Then I used JavaScript to retrieve the data and loop through “boats” to display the HTML code.


/*Retrieve Listings*/
var data;

jQuery.get("boatlist.json", function(d) {
    data = d;

    /*numeric (price) sort
    var datab = data.boats.sort(function(a, b) {return parseFloat(a.Price.replace(/,/g, '')) - parseFloat(b.Price.replace(/,/g, ''))});
    */

    /*text (length) sort*/
    var datab = data.boats.sort(function(a, b) {
        var x = a.Size.toLowerCase();
        var y = b.Size.toLowerCase();
        if (x < y) {return -1;}
        if (x > y) {return 1;}
        return 0;
    });

    // loop through all boats
    datab.forEach(function(bb) {
        // now put each boat in a <div>
        $("#boats").append(`
            <div class="col-xs-6" style="min-height:170px;">
                <div class="col-md-5 text-right ">
                    <img src="${bb.Thumb}" alt="" width="150px">
                </div>
                <div class="col-md-7">
                    <h3><a href="${bb.URL}">${bb.Name}</a></h3>
                    <p><strong>\$${bb.Price} </strong> \/ ${bb.Size}<br/>
                    ${bb.Location}</p>
                </div>
            </div>
        `);
    });
});

It worked pretty well but relied on me running the Python script each time. After a bit of investigation I decided to turn to Flask to see if I could host it all on a website. Since the Calibre-Web site that I was scraping for my Books Read project ran on Flask, I knew it could be done.

Mark III

So here is the script I finally ended up with


from flask import Flask, render_template, request, jsonify
import json

app = Flask(__name__)

@app.route("/")
def home():
    return render_template("index.html")

@app.route('/results')
def results():
    data = []
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
    return render_template("results.html", boatlist=data['boats'],predata=data['fileinfo'])

@app.route("/", methods=['POST'])
def echo():
    #get index form data
    if request.method == "POST":
        inputcurr=request.form["inputcurr"]
        minprice=request.form["minprice"]
        maxprice=request.form["maxprice"]
        minlength=request.form["minlength"]
        maxlength=request.form["maxlength"]
        texta= minlength + "–" + maxlength +"ft\n" + inputcurr +": $" +minprice + "-" + maxprice
        textb= minlength + "–" + maxlength +"ft<br/>" + inputcurr +": $" +minprice + "-" + maxprice
        # build sort param ie data['boats'].sort(key=lambda s: s['Location'])
        sortparam=request.form["inputsearch"]
        if sortparam == 'Location':
            keyparam = lambda s: s['Location']
        elif sortparam == 'Price':
            keyparam = lambda s: int(s['Price'].replace(',', ''))
        elif sortparam == 'Size':
            keyparam = lambda s: s['Size']

    # import various libraries
    import requests
    from bs4 import BeautifulSoup
    import re
    #enable math.ceil
    import math
    # enable sys.exit()
    import sys
    import csv
    import json
    from datetime import datetime
    import os

    # set header to avoid being labeled a bot
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # set base url
    baseurl='https://www.yachtworld.com/boats-for-sale/type-sail/region-northamerica/'
    # input low number
    if minprice == '':
        minpricenum = '30000'
    else:
        minpricenum = minprice
    print(minpricenum)
    # input high number
    if maxprice == '':
        maxpricenum = '120000'
    else:
        maxpricenum = maxprice
    print(maxpricenum)
    # input currency
    if inputcurr == '':
        curr = 'CAD'
    else:
        curr = inputcurr
    print(curr)
    # input low length
    if minlength == '':
        lowlen = '34'
    else:
        lowlen = minlength
    print(lowlen)
    # input high length
    if maxlength == '':
        highlen = '48'
    else:
        highlen = maxlength
    print(highlen)

    # set variables
    pricerange = '&price=' + minpricenum + '-' + maxpricenum
    wash = 'country-united-states/state-washington/'
    oreg = 'country-united-states/state-oregon/'
    bc = 'country-canada/province-british-columbia/'
    currlen = '?currency=' + curr + '&length=' + lowlen + '-' + highlen
    # create list of url variables
    urllist=[bc,wash,oreg]

    #check to see if external drive is mounted and mount it
    #if os.path.ismount("/Volumes/www/") == False:
    #    print ("False monkey")
    #    os.system("open smb://admin:Sally1@Mini%20Media%20Server._smb._tcp.local/www")

    # set path to export as file
    path_folder="output/"
    # set date and time
    now = datetime.now()
    dt_string = now.strftime("%B %d, %Y %H:%M")
    # create empty list
    arrayjson = []

    #loop though pages in urllist
    for page in urllist:
        # get url
        urlpath = baseurl+page+currlen+pricerange
        page = requests.get(urlpath, timeout=5)
        boatpg = BeautifulSoup(page.content, "html.parser")
        # find boat listings section
        boatlist = boatpg.find('div', class_="search-right-col")
        #find single boat listing
        boatlisting = boatlist.find_all('a')
        #loop though listing and append to list
        for listname in boatlisting:
            nameurl = listname['href']
            thumb = listname.find("meta", property="image")
            #add https and find content of meta and substring url to remove first two characters
            thumburl="https://" + thumb["content"][2:]
            name = listname.find('div', property="name")
            priceraw = listname.find('div', class_="price")
            #remove extra info from front and back
            price = re.search("\$.*? (?= *)",priceraw.text)
            cost = price.group()[1:-1]
            sizeyear = listname.find('div', class_="listing-card-length-year")
            location = listname.find('div', class_="listing-card-location")
            #write to json format
            writejson = {
                "URL": nameurl,
                "Name": name.text,
                "Price": cost,
                "Size": sizeyear.text,
                "Location":location.text,
                "Thumb": thumburl
            }
            # append to list
            arrayjson.append(writejson)

    #add Preface list (array)
    arraypreface = []
    preface = {
        'Date': dt_string,
        'Text': 'Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.',
        'Currency': curr,
        'Low': minpricenum,
        'High': maxpricenum,
        'Short':lowlen,
        'Long': highlen,
        'Creator': 'http://neverforever.ca'
    }
    #append to list
    arraypreface.append(preface)

    # open json file with path
    with open(path_folder+'boatlist.json', 'w') as outfile:
        #dump two lists with dict names and add formatting (default=str solves date issue)
        json.dump({'fileinfo': arraypreface, 'boats': arrayjson}, outfile, indent=4, default=str)

    data = []
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
        data['boats'].sort(key=keyparam)
    return render_template('results.html', boatlist=data['boats'],predata=data['fileinfo'])
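One thing the script doesn’t show is how the app actually gets launched. A minimal sketch of one way to do it (an assumption on my part, not necessarily how I run it) is to add the standard Flask entry point at the bottom of yacht_app.py:

# One common way to make yacht_app.py directly runnable for local testing;
# "FLASK_APP=yacht_app.py flask run" is the other usual route.
if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000, debug=True)

With that in place, python3 yacht_app.py starts the development server on localhost:5000.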

Continued: Flask Part Deux…


Web scraping Python code

In my previous post I explained that I was looking for a way to use web scraping to extract data from my Calibre-Web shelves and automatically post them to my Books Read page here on my site. In this post I will step through my final Python script to explain to my future self what I did and why.

Warning: Security

A heads up. I have no guarantees that this code is secure enough to use in a production environment. In fact I would guess it isn’t. But my Calibre-Web server is local to my home network and I trust that my hosted server (macblaze.ca) is secure enough. But since you are passing passwords etc. back and forth, I wouldn’t count on any of this to be secure without a lot more effort than I am willing to put in.

The code in bits

# import various libraries

import requests
from bs4 import BeautifulSoup
import re

This loads the various libraries the script uses. Requests is an HTTP library that allows you to send requests to websites, BeautifulSoup is a library to pull data out of HTML, and re is a regex library that allows you to do custom searches.

# set variables

# set header to avoid being labeled a bot
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

# set base url
urlpath='http://urlpath'

# website login data
login_data = {
    'next': '/',
    'username': 'username',
    'password': 'password',
    'remember_me': 'on',
}

# set path to export as markdown file 
path_folder="/Volumes/www/books/"
file = open(path_folder+"filename.md","w")

This sets up the various variables used for login, including a header to try and avoid being labeled a bot, the base URL of the Calibre-Web installation and the login data, and specifies a location and name for the resulting markdown file. The open command is given the ‘w’ switch to indicate the script will write a new file every time it is executed, overwriting the old one.

# log in and open http session

with requests.Session() as sess:
    url = urlpath+'/login'
    res = sess.get(url, headers=headers)
    res = sess.post(url, data=login_data)

Then, using Requests, I open a session on the webserver and log in using those variables.

Writing the File

Note: The code has matching file.write() and print() statements throughout. The print() statements just write to the terminal app and allow me to see what is being written to the actual file using file.write(). They are completely unnecessary.

# Set Title

file.write("# Books Read\n")
print("# Books Read\n")

Pretty basic: write the words Books Read followed by a carriage return, tagged with a # to indicate it is an h1 head. This will become the actual page name.

# find list of shelves

shelfhtml = sess.get(urlpath)
soup = BeautifulSoup(shelfhtml.text, "html.parser")
shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
print (shelflist)

So now we use the session we opened earlier to fetch the main page and store the response in shelfhtml. Using BeautifulSoup we grab all the HTML and search for all the <a> links whose href contains the regex expression ‘/shelf/[1-9]’. (Hopefully I won’t have more than 9 shelves or I will have to redo this bit.) The variable now contains a list of all the links that match that pattern and looks like this:

[<a href="/shelf/3"><span class="glyphicon glyphicon-list private_shelf"></span>2018</a>, <a href="/shelf/2"><span class="glyphicon glyphicon-list private_shelf"></span>2019</a>, <a href="/shelf/1"><span class="glyphicon glyphicon-list private_shelf"></span>2020</a>]

This, as you can see, contains the links to all three of my current year shelves, displayed in ascending numerical order.
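If I ever do end up with more than nine shelves, the fix should just be a more general regex. A quick self-contained sketch (the toy HTML is made up, but the find_all call is the same one used above):

import re
from bs4 import BeautifulSoup

# toy HTML standing in for the Calibre-Web front page
html = '<a href="/shelf/3">2018</a> <a href="/shelf/12">2021</a>'
soup = BeautifulSoup(html, "html.parser")

# \d+ matches one or more digits, so shelf ids of 10 and up get picked up too
shelflist = soup.find_all('a', href=re.compile(r'/shelf/\d+'))
print(shelflist)  # both links, including /shelf/12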
 

#reverse order of urllist

# reuse the shelf list gathered above and reverse it (newest shelf first)
dateshelflist = shelflist
dateshelflist.reverse()
print (dateshelflist)

I wanted to display my book lists from newest to oldest so I used python to reverse the items in the list.

First loop: the shelves

The first loop loops through all the shelves (in this case 3 of them) and starts the process of building a book list for each.

# loop through sorted shelves

for shelf in dateshelflist:
    #set shelf page url
    res = sess.get(urlpath+shelf.get('href'))
    soup = BeautifulSoup(res.text, "html.parser")

    # find year from shelflist and format
    shelfyear = soup.find('h2')
    year = re.search("([0-9]{4})", shelfyear.text)
    year.group()
    file.write("### {}\n".format(year.group()))
    print("### {}\n".format(year.group()))

In the first iteration of the loop, the script builds the actual shelf page URL from the base URL plus an href extracted from the list with a get command, and then grabs the HTML from the resulting webpage. Then the script finds the year info, which is an H2, extracts the 4-digit year with the regex ([0-9]{4}) and writes it to the file, formatted as an H3 header and followed by a line break.

# find all books

books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')

Using BeautifulSoup we extract the list of books from the page knowing they are all marked with a div in the class col-sm-3 col-lg-2 col-xs-6 book.

Second loop: the books

#loop though books. Each book is a new BeautifulSoup object.

for book in books:
        title = book.find('p', class_='title')
        author = book.find('a', class_='author-name')
        seriesname = book.find('p', class_='series')
        pubdate = book.find('p', class_='publishing-date')
        coverlink = book.find('div', class_='cover')
        if None in (title, author, coverlink, seriesname, pubdate):
            continue
        # extract year from pubdate
        pubyear = re.search("([0-9]{4})", pubdate.text)
        pubyear.group()

This is the beginning of the second loop. For each book we use soup to extract the title, author, series, pubdate and cover (which I don’t end up using). Each search is based on the class assigned to it in the original HTML code. Because I only want the pub year and not the full pub date, I again use a regex to extract the 4-digit year. The if None… statement is there just in case one of the fields is missing; it skips that book rather than letting the script crash.

# construct line using markdown

newstring = "* ***{}*** — {} ({})\{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
file.write(newstring)
print (newstring)

Next we construct the book entry based on how we want it to appear on the web page. In my case I want each entry to be an li and end up looking like this:

  • The Cloud Roads — Martha Wells (2011)
    Book 1.0 of Raksura – ebook

Python’s .format() lets you just list the variables at the end of the statement and fills in the {} placeholders in order, which makes for easier formatting. The script then writes the line to the open markdown file and heads back to the beginning of the loop to grab the next book.

More loops

That’s pretty much it. It loops through the books until it runs out and heads back to the first loop to see if there is another shelf to process. After it processes all the shelves it drops to the last line of the script:

file.close()

which closes the file and that is that—c’est tout. It will now be accessed the next time someone visits the Books Read page on my site.

In Conclusion

Hopefully this is clear enough so that when I forget every scrap of Python in the years to come I can still recreate this after the inevitable big crash. The script, called scrape.py in my case, is executed in terminal by going to the enclosing folder and typing python3 scrape.py then hitting enter. Automating that is something I will ponder if this book list thing becomes my ultimate methodology for recording books read. Its big failing is that it only records ebooks in my Calibre library. I might have to redo the entire thing for something like LibraryThing where I can record all my books…lol. Hmmm… maybe…

The Final Code

Here is the final script in its entirety.

# import various libraries
import requests
from bs4 import BeautifulSoup
import re

# set header to avoid being labeled a bot
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

# set base url
urlpath='http://urlpath'

# website login data
login_data = {
    'next': '/',
    'username': 'username',
    'password': 'password',
    'remember_me': 'on',
}

# set path to export as markdown file
path_folder="/Volumes/www/home/books/"
file = open(path_folder+"filename.md","w")

with requests.Session() as sess:
    url = urlpath+'/login'
    res = sess.get(url, headers=headers)
    res = sess.post(url, data=login_data)

# Note: print() commands are purely for terminal output and unnecessary

# Set Title
file.write("# Books Read\n")
print("# Books Read\n")

# find list of shelves
shelfhtml = sess.get(urlpath)
soup = BeautifulSoup(shelfhtml.text, "html.parser")
shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
# print (shelflist)

#reverse order of urllist
dateshelflist = shelflist   # reuse the shelf list found above
dateshelflist.reverse()
# print (dateshelflist)

# loop through sorted shelves
for shelf in dateshelflist:

    #set shelf page url
    res = sess.get(urlpath+shelf.get('href'))
    soup = BeautifulSoup(res.text, "html.parser")

    # find year and format
    shelfyear = soup.find('h2')
    year = re.search("([0-9]{4})", shelfyear.text)
    year.group()
    file.write("### {}\n".format(year.group()))
    print("### {}\n".format(year.group()))

    # find all books
    books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')

    #loop though books. Each book is a new BeautifulSoup object.
    for book in books:
        title = book.find('p', class_='title')
        author = book.find('a', class_='author-name')
        seriesname = book.find('p', class_='series')
        pubdate = book.find('p', class_='publishing-date')
        coverlink = book.find('div', class_='cover')
        if None in (title, author, coverlink, seriesname, pubdate):
            continue
        # extract year from pubdate
        pubyear = re.search("([0-9]{4})", pubdate.text)
        pubyear.group()
        # construct line using markdown
        newstring = "* ***{}*** — {} ({})\{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
        file.write(newstring)
        print (newstring)

file.close()

Making a “Books Read” page

So recently I came across a web page called How I manage my ebooks by a fellow named Aleksandar Todorović. He is a developer who wanted to track his reading on his webpage. He introduced me to a Calibre project called Calibre-Web, which is basically a web interface for Calibre with a few extra bells and whistles. Reading through his explanation it seemed pretty simple to implement except for this statement:

As a final step in the chain, I have created a script that allow me to publish the list of books I’ve read on my website. Since Calibre-Web doesn’t have an API, I ended up scraping my own server using Python Requests and BeautifulSoup. After about one hundred lines of spaghetti code gets executed, I end up with two files:

  • books-read.md, which goes straight to my CMS, allowing me to publicly share the list of books I have read, sorted by the year in which I’ve finished reading them.

The Process

So I set about to try and implement my own version of Aleksandar’s project. In my typical trial and error fashion it took a couple of days of steady work and I learned a ton along the way.

Calibre-Web

I went ahead and downloaded Calibre-Web and wrestled with getting it running on my test server (my old Mac Mini). It is a Python script, and I am still a bit fuzzy about the proper way to actually deploy it. I ended up writing a shell script to run the command "nohup python /Applications/calibre-web-master/cps.py" and then made it executable from my desktop. I still have some work to do there to finalize that solution.

I have to say I really like the interface of Calibre-Web much more than the desktop Calibre and although there are a few quirks, I will likely be using the web version much more than the desktop from now on.

Then I made a few shelves with the books I had read in 2019 and 2020 and was good to go. Now I just needed to get those Shelves onto my website somehow.

Web Scraping

Now, I had never heard of the term web scraping, but the concept was familiar and it turns out it is quite the thing.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites
Web scraping, Wikipedia

The theory being that since all the info is accessible in the basic code of the Calibre-Web pages, all I needed to do was extract and format it, then repost it to this site. So I did. Voila: my Books Read page.

I guess I skipped the tough part…

Starting out I understood Python was a programming language, but had no idea what Python Requests or BeautifulSoup were. Turns out that Python Requests is essentially an HTTP library that fetches a page’s HTML as text, and BeautifulSoup is a program (library?…I am still a bit vague on the terminology) to extract and format long strings of code into useful data.
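In miniature, the combination works something like this (a toy example, not my actual script; example.com stands in for the real Calibre-Web address):

# Toy example of the Requests + BeautifulSoup combo: fetch a page, then pick
# pieces out of the HTML. example.com is a placeholder, not my real server.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")
soup = BeautifulSoup(page.text, "html.parser")

print(soup.title.text)             # the page's <title>
for link in soup.find_all("a"):    # every link on the page
    print(link.get("href"))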

Start with Google

I started by a quick search and found a few likely examples to follow along with.

https://medium.com/the-andela-way/introduction-to-web-scraping-87edf94ac692
https://medium.com/the-andela-way/learn-how-to-scrape-the-web-2a7cc488e017
https://www.dataquest.io/blog/web-scraping-beautifulsoup/

These were helpful in explaining the structure and giving me some basic coding ideas, but I mostly relied on https://realpython.com/beautiful-soup-web-scraper-python/ to base my own code on.

Step one

I got everything running (this included sorting out the mess that is Python on my computer, but that is another story) and tried to get a basic Python script to talk to my Calibre installation. Turns out that even though my web browser was logged into Calibre-Web, my script wasn’t. Some more googling found me this video (Website login using request library in Python) and it did the trick for writing the login portion of my script.

Step two

Then I wrote a basic script that extracted data (much more on this later) and saved it to a markdown file on the webserver. I figured markdown was easier to implement than html and knew WordPress could handle it.

Or could it? Turns out the Jetpack implementation was choking on my markdown file for some reason. I fought with it for a while and eventually decided to see if I could find a different WordPress plugin to do the job. Turned out I could kill two birds with one stone using Mytory Markdown which would actually load (and reload) a correctly formatted remote .md file to a page every time someone visited.

Step three

After I got a sample page loaded on the website I realized that it was missing the pub date and series name which, if you have ever visited one of my annual books read posts (Last Books of the decade: 2019, Books 2018—Is this the last year? etc.), is essential information. So I had to go into the Calibre-Web code and add those particular pieces of info to the shelf page so I would be able to scrape it all at the same time. I ended up adding this:

{% if entry.series|length > 0 %}
    <p class="series">
        {{_('Book')}} {{entry.series_index}} {{_('of')}} <a href="{{url_for('web.books_list', data='series',sort='abc', book_id=entry.series[0].id)}}">{{entry.series[0].name}}</a>
    </p>
{% endif %}


{% if entry.pubdate[:10] != '0101-01-01' %}
    <p class="publishing-date">{{entry.pubdate|formatdate}} </p>
{% endif %}

…to shelf.html in the /templates folder of the Calibre-Web install. I added it around line 45 (just after the {% endif %} for the author section). It took a bit of fussing to look good but it worked out great.

Step four

Now all I have to do is figure out how to run my scrape.py script. For now I will leave it a manual process and just run it after I update my Calibre-Web shelves, but making that automatic is on the list for “What’s Next…”

Ta-da

So between this post and Aleksandar’s I hope you have a basic idea of what you need to do in order to try and implement this solution. More importantly when future me comes back and tries to figure out what the hell all this gobbledey-gook mess is I can rebuild the system based on these sketchy notes. I will end this here and continue in a new post on the actual python/beautifulsoup code I came up with to get the web scraping done.

Computer Doings

I had recently discovered that my old copy of FileMaker Pro 12 wouldn’t run on my main computer (MacBook Pro, OSX 10.14 Mojave) since my latest round of upgrades, and it kept crashing on my old Mac Mini server (10.13 High Sierra). I don’t use FileMaker for much anymore, but I do have two databases that get used about once a year: a boating log and a Blackboard quiz maker. The quiz maker wasn’t important, but my entire sailing resume was stored in that FileMaker database. So it looked like I had a bit of a problem.

But every problem is just a new opportunity. I decided to break down and actually learn how to use MySQL and PHP. I’ve dabbled. This website (and most of my other sites) runs on WordPress, which is based on MySQL/PHP—so any modifications I’ve done to them have definitely danced around the edges.

MySQL is an open-source relational database built on the industry-standard SQL language. Sort of a Linux-to-Unix relationship, although not really. PHP is a scripting language for web development. Since MySQL has no interface of its own, PHP is used to bridge the gap and provide a web-based GUI to access the database. My Mac Mini has a testing version of MySQL/PHP I use for my own WordPress work, so that was set. All I had to download was a GUI to help me see what it was doing. I settled on MySQL Workbench, which is the “official” MySQL GUI.

I started here: tutorialrepublic.com/php-tutorial. Eventually I got to the section about CRUD (an acronym for Create, Read, Update, and Delete) and then dumped all my work to date to build my own CRUD app, using it to develop the Blackboard quiz maker as a test project. Then it was a week of learning, trial and error and discovery. I even had to delve into JavaScript, which was a whole other learning curve.

Javascript… Oh MY!

At the end of the day I got it 90% there. There are still some issues with error checking and security. It really isn’t ready for use in the wild, but then again I doubt any of my Filemaker DBs ever were either. But it works and we are setting up for a live test to see if it’s actually practical for L to use.

Features

Some things I managed to add to the crud tutorial:

  • changing the placeholder text in a field based on the question-type dropdown menu
  • changing input types (box, radio buttons etc.) with defaults based on question type
  • export to tab delimited file, with ability to choose file name
  • javascript loops through variable names so I didn’t have to repeat the same code for all 12 possible answers
  • ability to delete all records with error checking
  • and a few other tweaks

I also learned about Bootstrap. Which was so cool. Wikipedia says:

Bootstrap is a free and open-source CSS framework directed at responsive, mobile-first front-end web development. It contains CSS- and (optionally) JavaScript-based design templates for typography, forms, buttons, navigation and other interface components.

Bootstrap, originally named Twitter Blueprint, was developed by Mark Otto and Jacob Thornton at Twitter as a framework to encourage consistency across internal tools. Before Bootstrap, various libraries were used for interface development, which led to inconsistencies and a high maintenance burden. 

It was renamed from Twitter Blueprint to Bootstrap, and released as an open source project on August 19, 2011.

I certainly appreciate the way it makes for a beautiful looking website without much effort. And it’s something more to learn…

What’s Next

As a result of all this I am going to attempt to recreate my much more complex Boating Log DB. It has also brought the realization that I need some “official” training rather than just the hit-and-miss, trial-and-error, flavoured-with-random-googling method that characterizes so much of my computer work. So the next step is to find some courses…

ebook Update

As a side project in late 2018 I started to produce ebooks for Standard Ebooks. Details about that can be found here on this older post.

My ebooks so far…

So here is the complete list of the books I have worked on so far, including the last 3 that have yet to be approved, but will likely be posted in a week or so…

[Gallery of ebook covers]

And so…

The Mack Reynolds text is my first collection: a great read of most of his short stories and novellas by an under-appreciated sci-fi author. I did very little of the textual work on the William Carlos Williams book as poetry is not my jam, but I convinced Dr. L to collaborate, so I did the code work and she stepped in as editor. Those two and the Shakespeares have proven to be great learning experiences, both in ebook coding and in how texts have been handled over the centuries; seriously, after almost finishing a Masters degree in renaissance drama, I am astounded at how much minutiae I didn’t know about source texts and four centuries of editing practice. Hope you give some of them a try. I also try to keep a current list of books over at astart.ca/coding/ebooks.