I just published a blog post over on neverforever.ca about trying to build a web app to scrape YachtWorld. I thought I would record the details here so I can remember what I have done. The complete (and updated) repository is on GitHub if anyone is interested.

Why?

Some time in the recent past YachtWorld decided to redo their website. One outcome is that you can no longer search for boats in multiple places at the same time, so I now had to perform three separate searches, with no way to “save” a previous search and compare the results. I figured I could apply my newfound Python skills to scrape the site and deliver the output to a web page.

Mark I

I copied my previous efforts and wrote a Python script that produced a Markdown file to view in a web browser.

Mark II

I decided to output a JSON file instead and then build a PHP page to read it using jQuery and JavaScript. The JSON had two top-level keys, each holding a list: one for general info and one for boat listings:

{
    "fileinfo": [
        {
            "Date": "April 03, 2020 08:46",
            "Text": "Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.",
            "Currency": "CAD",
            "Low": "30000",
            "High": "120000",
            "Short": "34",
            "Long": "48",
            "Creator": "http://neverforever.ca"
        }
    ],
    "boats": [
        {
            "URL": "https://www.yachtworld.com/boats/1980/cheoy-lee-clipper-42-ketch-3577567/",
            "Name": "Cheoy Lee Clipper 42 Ketch",
            "Price": "80,000",
            "Size": "42 ft / 1980",
            "Location": "Vancouver, British Columbia, Canada",
            "Thumb": "https://newimages.yachtworld.com/resize/1/16/77/7191677_20190822081237806_1_LARGE.jpg?f=/1/16/77/7191677_20190822081237806_1_LARGE.jpg&w=520&h=346&t=1566486758"
        }
    ]
}
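
To make the layout concrete, here is a minimal Python sketch that parses a trimmed copy of the sample and pulls out the two pieces. The inline sample string and the variable names are assumptions made for illustration; they are not part of the original script.

```python
import json

# Trimmed, illustrative copy of the sample (the real file has more fields and boats).
sample = """
{
    "fileinfo": [
        {"Date": "April 03, 2020 08:46", "Currency": "CAD", "Low": "30000", "High": "120000"}
    ],
    "boats": [
        {"Name": "Cheoy Lee Clipper 42 Ketch", "Price": "80,000", "Size": "42 ft / 1980"}
    ]
}
"""

data = json.loads(sample)

# "fileinfo" is a list holding a single dict of search settings.
info = data["fileinfo"][0]

# "boats" is a list with one dict per listing.
boats = data["boats"]

summary = f'{len(boats)} boat(s), {info["Currency"]} {info["Low"]}-{info["High"]}'
print(summary)
```

Note that every value, including the prices and lengths, is stored as a string, so anything numeric has to be converted before it can be compared.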

Then I used JavaScript to retrieve the data and loop through “boats” to build the HTML for each listing.


/* Retrieve listings */
var data;

jQuery.get("boatlist.json", function(d) {
    data = d;

    /* numeric (price) sort
    var datab = data.boats.sort(function(a, b) {return parseFloat(a.Price.replace(/,/g, '')) - parseFloat(b.Price.replace(/,/g, ''))});
    */

    /* text (length) sort */
    var datab = data.boats.sort(function(a, b) {
        var x = a.Size.toLowerCase();
        var y = b.Size.toLowerCase();
        if (x < y) {return -1;}
        if (x > y) {return 1;}
        return 0;
    });

    // loop through all boats
    datab.forEach(function(bb) {
        // now put each boat in a <div>
        $("#boats").append(`
            <div class="col-xs-6" style="min-height:170px;">
                <div class="col-md-5 text-right">
                    <img src="${bb.Thumb}" alt="" width="150">
                </div>
                <div class="col-md-7">
                    <h3><a href="${bb.URL}">${bb.Name}</a></h3>
                    <p><strong>\$${bb.Price}</strong> / ${bb.Size}<br>
                    ${bb.Location}</p>
                </div>
            </div>
        `);
    });
});
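
The two sorts above compare different things: the price sort strips the commas and compares numbers, while the length sort compares the Size strings as text. With a 34 to 48 ft search every length has two digits, so the text sort happens to come out in the right order, but it would not if the lengths had different numbers of digits. As a rough illustration in Python, numeric keys for both could look like the sketch below. The helper names price_key and length_key and the three sample boats are made up for this example, and the "42 ft / 1980" layout is assumed from the sample data.

```python
import re


def price_key(boat):
    """Numeric price: '80,000' -> 80000.0."""
    return float(boat["Price"].replace(",", ""))


def length_key(boat):
    """Leading number of the Size string: '42 ft / 1980' -> 42.0."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", boat["Size"])
    if match is None:
        raise ValueError(f"no length found in {boat['Size']!r}")
    return float(match.group(1))


# Hypothetical listings, chosen so the text and numeric orders differ.
boats = [
    {"Name": "A", "Price": "80,000", "Size": "42 ft / 1980"},
    {"Name": "B", "Price": "115,000", "Size": "9 ft / 2001"},
    {"Name": "C", "Price": "45,500", "Size": "100 ft / 1975"},
]

by_price = [b["Name"] for b in sorted(boats, key=price_key)]
by_length = [b["Name"] for b in sorted(boats, key=length_key)]
# Same comparison as the JavaScript text sort: lower-cased strings.
by_text = [b["Name"] for b in sorted(boats, key=lambda b: b["Size"].lower())]

print(by_price, by_length, by_text)
```

Here by_length puts the 9 ft boat first, while by_text puts the 100 ft boat first because "1" sorts before "4" and "9" as a character.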

It worked pretty well but relied on me running the Python script each time. After a bit of investigation I decided to turn to Flask to see if I could host it all on a website. Since the Calibre-Web site that I was scraping for my Books Read project ran on Flask, I knew it could be done.

Mark III

So here is the script I finally ended up with:


# import various libraries
import csv
import json
#enable math.ceil
import math
import os
import re
# enable sys.exit()
import sys
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from flask import Flask, render_template, request, jsonify

app = Flask(__name__)


@app.route("/")
def home():
    return render_template("index.html")


@app.route('/results')
def results():
    # show the results of the most recent scrape
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
    return render_template("results.html", boatlist=data['boats'], predata=data['fileinfo'])


@app.route("/", methods=['POST'])
def echo():
    # get index form data (this route only accepts POST, so the check always passes)
    if request.method == "POST":
        inputcurr = request.form["inputcurr"]
        minprice = request.form["minprice"]
        maxprice = request.form["maxprice"]
        minlength = request.form["minlength"]
        maxlength = request.form["maxlength"]
        texta = minlength + "–" + maxlength + "ft\n" + inputcurr + ": $" + minprice + "-" + maxprice
        textb = minlength + "–" + maxlength + "ft<br/>" + inputcurr + ": $" + minprice + "-" + maxprice
        # build sort param ie data['boats'].sort(key=lambda s: s['Location'])
        sortparam = request.form["inputsearch"]
        if sortparam == 'Location':
            keyparam = lambda s: s['Location']
        elif sortparam == 'Price':
            keyparam = lambda s: int(s['Price'].replace(',', ''))
        elif sortparam == 'Size':
            keyparam = lambda s: s['Size']
        else:
            # unknown sort option: keep the scraped order
            keyparam = None

    # set header to avoid being labeled a bot
    # (note: defined here but never passed to requests.get below)
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # set base url
    baseurl = 'https://www.yachtworld.com/boats-for-sale/type-sail/region-northamerica/'
    # input low number
    if minprice == '':
        minpricenum = '30000'
    else:
        minpricenum = minprice
    print(minpricenum)
    # input high number
    if maxprice == '':
        maxpricenum = '120000'
    else:
        maxpricenum = maxprice
    print(maxpricenum)
    # input currency
    if inputcurr == '':
        curr = 'CAD'
    else:
        curr = inputcurr
    print(curr)
    # input low length
    if minlength == '':
        lowlen = '34'
    else:
        lowlen = minlength
    print(lowlen)
    # input high length
    if maxlength == '':
        highlen = '48'
    else:
        highlen = maxlength
    print(highlen)
    # set variables
    pricerange = '&price=' + minpricenum + '-' + maxpricenum
    wash = 'country-united-states/state-washington/'
    oreg = 'country-united-states/state-oregon/'
    bc = 'country-canada/province-british-columbia/'
    currlen = '?currency=' + curr + '&length=' + lowlen + '-' + highlen
    # create list of url variables
    urllist = [bc, wash, oreg]
    #check to see if external drive is mounted and mount it
    #if os.path.ismount("/Volumes/www/") == False:
    #    print ("False monkey")
    #    os.system("open smb://admin:Sally1@Mini%20Media%20Server._smb._tcp.local/www")
    # set path to export as file
    path_folder = "output/"
    # set date and time
    now = datetime.now()
    dt_string = now.strftime("%B %d, %Y %H:%M")
    # create empty list
    arrayjson = []
    # loop through pages in urllist
    for page in urllist:
        # get url
        urlpath = baseurl + page + currlen + pricerange
        response = requests.get(urlpath, timeout=5)
        boatpg = BeautifulSoup(response.content, "html.parser")
        # find boat listings section
        boatlist = boatpg.find('div', class_="search-right-col")
        # find single boat listing
        boatlisting = boatlist.find_all('a')
        # loop through listing and append to list
        for listname in boatlisting:
            nameurl = listname['href']
            thumb = listname.find("meta", property="image")
            # add https and find content of meta and substring url to remove first two characters
            thumburl = "https://" + thumb["content"][2:]
            name = listname.find('div', property="name")
            priceraw = listname.find('div', class_="price")
            # remove extra info from front and back
            # (raw string so that \$ is passed to the regex engine unchanged)
            price = re.search(r"\$.*? (?= *)", priceraw.text)
            cost = price.group()[1:-1]
            sizeyear = listname.find('div', class_="listing-card-length-year")
            location = listname.find('div', class_="listing-card-location")
            # write to json format
            writejson = {
                "URL": nameurl,
                "Name": name.text,
                "Price": cost,
                "Size": sizeyear.text,
                "Location": location.text,
                "Thumb": thumburl
            }
            # append to list
            arrayjson.append(writejson)

    # add Preface list (array)
    arraypreface = []
    preface = {
        'Date': dt_string,
        'Text': 'Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.',
        'Currency': curr,
        'Low': minpricenum,
        'High': maxpricenum,
        'Short': lowlen,
        'Long': highlen,
        'Creator': 'http://neverforever.ca'
    }
    # append to list
    arraypreface.append(preface)
    # open json file with path
    with open(path_folder + 'boatlist.json', 'w') as outfile:
        # dump two lists with dict names and add formatting (default=str solves date issue)
        json.dump({'fileinfo': arraypreface, 'boats': arrayjson}, outfile, indent=4, default=str)

    # read the file back and sort it before rendering
    with open("output/boatlist.json", "r") as jdata:
        data = json.load(jdata)
    if keyparam is not None:
        data['boats'].sort(key=keyparam)
    return render_template('results.html', boatlist=data['boats'], predata=data['fileinfo'])
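
Most of the middle of that script fills in defaults for blank form fields and then glues together one search URL per region. A more compact way to express the same idea might look like the sketch below. This is only an illustration, not what the script does: the names DEFAULTS, REGIONS, fill_defaults and build_urls are my own inventions, and a plain dict stands in for Flask's request.form.

```python
BASE_URL = "https://www.yachtworld.com/boats-for-sale/type-sail/region-northamerica/"

# Region path fragments, in the same order as urllist in the script.
REGIONS = [
    "country-canada/province-british-columbia/",
    "country-united-states/state-washington/",
    "country-united-states/state-oregon/",
]

# Fallback values used when a form field is missing or left blank.
DEFAULTS = {
    "inputcurr": "CAD",
    "minprice": "30000",
    "maxprice": "120000",
    "minlength": "34",
    "maxlength": "48",
}


def fill_defaults(form):
    """Return the five search values, with blanks replaced by DEFAULTS."""
    return {key: (form.get(key) or default) for key, default in DEFAULTS.items()}


def build_urls(form):
    """Build one search URL per region from (possibly blank) form values."""
    p = fill_defaults(form)
    query = (
        f"?currency={p['inputcurr']}"
        f"&length={p['minlength']}-{p['maxlength']}"
        f"&price={p['minprice']}-{p['maxprice']}"
    )
    return [BASE_URL + region + query for region in REGIONS]


# Example: only the minimum price was filled in on the form.
urls = build_urls({"minprice": "50000", "maxprice": "", "inputcurr": ""})
for url in urls:
    print(url)
```

Keeping the URL building in a plain function like this means it can be checked without starting Flask or hitting YachtWorld.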

Continued: Flask Part Deux…