{"id":13665,"date":"2020-04-04T13:56:24","date_gmt":"2020-04-04T19:56:24","guid":{"rendered":"http:\/\/macblaze.ca\/?p=13665"},"modified":"2020-04-08T12:22:45","modified_gmt":"2020-04-08T18:22:45","slug":"the-great-flask-adventure","status":"publish","type":"post","link":"https:\/\/macblaze.ca\/?p=13665","title":{"rendered":"The Great Flask Adventure"},"content":{"rendered":"<p>I just published a blog post over on <a href=\"https:\/\/neverforever.ca\/2020\/04\/if-you-cant-boat-dream\/\">neverforever.ca<\/a> about trying to build a web app to scrape YachtWorld. I thought I would record the details here so I can remember what I have done. The complete (and updated) repository is on <a href=\"https:\/\/github.com\/b-t-k\/searchboats.github.io\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a> if anyone is interested.<\/p>\n<h3>Why?<\/h3>\n<p>Some time in the recent past <a href=\"https:\/\/www.yachtworld.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">YachtWorld<\/a> decided to redo their website. One outcome is that you can no longer search for boats in multiple places at the same time, so I now had to perform three separate searches with no way to \u201csave\u201d a previous search and compare the results. I figured I could use my newfound Python skills to scrape the site and deliver the output to a web page.<\/p>\n<h4>Mark I<\/h4>\n<p>I copied my previous efforts and wrote a Python script that produced a Markdown file to view in a web browser.<\/p>\n<h4>Mark II<\/h4>\n<p>I decided to output a JSON file instead and then build a PHP page to read it using jQuery and JavaScript. 
The JSON file had two top-level keys, one for general info and one for the boat listings:<\/p>\n<pre><code>{\n    \"fileinfo\": [\n        {\n            \"Date\": \"April 03, 2020 08:46\",\n            \"Text\": \"Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.\",\n            \"Currency\": \"CAD\",\n            \"Low\": \"30000\",\n            \"High\": \"120000\",\n            \"Short\": \"34\",\n            \"Long\": \"48\",\n            \"Creator\": \"http:\/\/neverforever.ca\"\n        }\n    ],\n    \"boats\": [\n        {\n            \"URL\": \"https:\/\/www.yachtworld.com\/boats\/1980\/cheoy-lee-clipper-42-ketch-3577567\/\",\n            \"Name\": \"Cheoy Lee Clipper 42 Ketch\",\n            \"Price\": \"80,000\",\n            \"Size\": \"42 ft \/ 1980\",\n            \"Location\": \"Vancouver, British Columbia, Canada\",\n            \"Thumb\": \"https:\/\/newimages.yachtworld.com\/resize\/1\/16\/77\/7191677_20190822081237806_1_LARGE.jpg?f=\/1\/16\/77\/7191677_20190822081237806_1_LARGE.jpg&amp;w=520&amp;h=346&amp;t=1566486758\"\n        }\n    ]\n}<\/code><\/pre>\n<p>Then I used JavaScript to retrieve the data and loop through &#8220;boats&#8221; to build the HTML for each listing.<\/p>\n<pre><code>\n\/*Retrieve Listings*\/\nvar data;\n\njQuery.get(\"boatlist.json\", function(d) {\n    data = d;\n\n    \/*numeric (price) sort\n    var datab = data.boats.sort(function(a, b) {return parseFloat(a.Price.replace(\/,\/g, '')) - parseFloat(b.Price.replace(\/,\/g, ''))});\n    *\/\n\n    \/*text (length) sort*\/\n    var datab = data.boats.sort(function(a, b){\n        var x = a.Size.toLowerCase();\n        var y = b.Size.toLowerCase();\n        if (x &lt; y) {return -1;}\n        if (x &gt; y) {return 1;}\n        return 0;\n    });\n\n    \/\/ loop through all boats\n    datab.forEach(function(bb) {\n        \/\/ now put each boat in a &lt;div&gt;\n        $(\"#boats\").append(`\n&lt;div class=\"col-xs-6\" style=\"min-height:170px;\"&gt;\n&lt;div class=\"col-md-5 text-right \"&gt;\n&lt;img src=\"${bb.Thumb}\" alt=\"\" width=\"150px\"&gt;\n&lt;\/div&gt;\n&lt;div class=\"col-md-7\"&gt;\n&lt;h3&gt;&lt;a href=\"${bb.URL}\"&gt;${bb.Name}&lt;\/a&gt;&lt;\/h3&gt;\n&lt;p&gt;&lt;strong&gt;\\$${bb.Price} &lt;\/strong&gt; \\\/ 
${bb.Size}&lt;br&gt;\n${bb.Location}&lt;\/p&gt;\n&lt;\/div&gt;\n&lt;\/div&gt;\n`);\n    });\n});\n<\/code><\/pre>\n<p>It worked pretty well but relied on me running the Python script each time. After a bit of investigation I decided to turn to <a href=\"https:\/\/flask.palletsprojects.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Flask<\/a> to see if I could host it all on a website. Since the Calibre-Web site that I was scraping for my <a href=\"http:\/\/macblaze.ca\/?p=13488\">Books Read project<\/a> ran on Flask, I knew it could be done.<\/p>\n<h4>Mark III<\/h4>\n<p>So here is the script I finally ended up with:<\/p>\n<pre><code>\nfrom flask import Flask, render_template, request, jsonify\nimport json\napp = Flask(__name__)\n@app.route(\"\/\")\ndef home():\n    return render_template(\"index.html\")\n@app.route('\/results')\ndef results():\n    data = []\n    with open(\"output\/boatlist.json\", \"r\") as jdata:\n        data = json.load(jdata)\n    return render_template(\"results.html\", boatlist=data['boats'],predata=data['fileinfo'])\n@app.route(\"\/\", methods=['POST'])\ndef echo():\n    #get index form data\n    if request.method == \"POST\":\n        inputcurr=request.form[\"inputcurr\"]\n        minprice=request.form[\"minprice\"]\n        maxprice=request.form[\"maxprice\"]\n        minlength=request.form[\"minlength\"]\n        maxlength=request.form[\"maxlength\"]\n        texta= minlength + \"\u2013\" + maxlength +\"ft\\n\" + inputcurr +\": $\" +minprice + \"-\" + maxprice\n        textb= minlength + \"\u2013\" + maxlength +\"ft&lt;br\/&gt;\" + inputcurr +\": $\" +minprice + \"-\" + maxprice\n        # build sort param ie data['boats'].sort(key=lambda s: s['Location'])\n        sortparam=request.form[\"inputsearch\"]\n        if sortparam == 'Location':\n            keyparam = lambda s: s['Location']\n        elif sortparam == 'Price':\n            keyparam = lambda s: int(s['Price'].replace(',', ''))\n        elif sortparam == 'Size':\n            keyparam = lambda s: s['Size']\n<\/code><\/pre>\n<pre><code>\n    # import various libraries\n    import requests\n    from bs4 import BeautifulSoup\n    import re\n    #enable 
math.ceil\n    import math\n    # enable sys.exit()\n    import sys\n    import csv\n    import json\n    from datetime import datetime\n    import os\n<\/code><\/pre>\n<pre><code>    # set header to avoid being labeled a bot\n    headers = {\n        'user-agent': 'Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/67.0.3396.99 Safari\/537.36'\n    }\n    # set base url\n    baseurl='https:\/\/www.yachtworld.com\/boats-for-sale\/type-sail\/region-northamerica\/'\n    # input low number\n    if minprice == '':\n        minpricenum = '30000'\n    else:\n        minpricenum = minprice\n    print(minpricenum)\n<\/code><\/pre>\n<pre><code>    # input high number\n    if maxprice == '':\n        maxpricenum = '120000'\n    else:\n        maxpricenum = maxprice\n    print(maxpricenum)\n<\/code><\/pre>\n<pre><code>    # input currency\n    if inputcurr == '':\n        curr = 'CAD'\n    else:\n        curr = inputcurr\n    print(curr)\n<\/code><\/pre>\n<pre><code>    # input low length\n    if minlength == '':\n        lowlen = '34'\n    else:\n        lowlen = minlength\n    print(lowlen)\n<\/code><\/pre>\n<pre><code>    # input high length\n    if maxlength == '':\n        highlen = '48'\n    else:\n        highlen = maxlength\n    print(highlen)\n<\/code><\/pre>\n<pre><code>    # set variables\n    pricerange = '&amp;price=' + minpricenum + '-' + maxpricenum\n    wash = 'country-united-states\/state-washington\/'\n    oreg = 'country-united-states\/state-oregon\/'\n    bc = 'country-canada\/province-british-columbia\/'\n    currlen = '?currency=' + curr + '&amp;length=' + lowlen + '-' + highlen\n<\/code><\/pre>\n<pre><code>    # create list of url variables\n    urllist=[bc,wash,oreg]\n    #check to see if external drive is mounted and mount it\n    #if os.path.ismount(\"\/Volumes\/www\/\") == False:\n    # print (\"False monkey\")\n    # os.system(\"open smb:\/\/admin:Sally1@Mini%20Media%20Server._smb._tcp.local\/www\")\n    # set path to export as file\n    path_folder=\"output\/\"\n    # set date and time\n    now = datetime.now()\n    dt_string = now.strftime(\"%B %d, %Y %H:%M\")\n<\/code><\/pre>\n<pre><code>    # create empty list\n    arrayjson = []\n    #loop through pages in urllist\n    for page in urllist:\n        # get url\n        urlpath 
= baseurl+page+currlen+pricerange\n        page = requests.get(urlpath, headers=headers, timeout=5)\n        boatpg = BeautifulSoup(page.content, \"html.parser\")\n<\/code><\/pre>\n<pre><code>        # find boat listings section\n        boatlist = boatpg.find('div', class_=\"search-right-col\")\n        #find single boat listing\n        boatlisting = boatlist.find_all('a')\n<\/code><\/pre>\n<pre><code>        #loop through listing and append to list\n        for listname in boatlisting:\n            nameurl = listname['href']\n            thumb = listname.find(\"meta\", property=\"image\")\n            #add https and find content of meta and substring url to remove first two characters\n            thumburl=\"https:\/\/\" + thumb[\"content\"][2:]\n            name = listname.find('div', property=\"name\")\n            priceraw = listname.find('div', class_=\"price\")\n            #remove extra info from front and back\n            price = re.search(r\"\\$.*? (?= *)\",priceraw.text)\n            cost = price.group()[1:-1]\n            sizeyear = listname.find('div', class_=\"listing-card-length-year\")\n            location = listname.find('div', class_=\"listing-card-location\")\n            #write to json format\n            writejson = {\n                \"URL\": nameurl,\n                \"Name\": name.text,\n                \"Price\": cost,\n                \"Size\": sizeyear.text,\n                \"Location\":location.text,\n                \"Thumb\": thumburl\n            }\n            # append to list\n            arrayjson.append(writejson)\n<\/code><\/pre>\n<pre><code>    #add Preface list (array)\n    arraypreface = []\n    preface = {\n        'Date': dt_string,\n        'Text': 'Results are a Yachtworld search of sailboats in Washington, Oregon and B.C.',\n        'Currency': curr,\n        'Low': minpricenum,\n        'High': maxpricenum,\n        'Short':lowlen,\n        'Long': highlen,\n        'Creator': 'http:\/\/neverforever.ca'\n    }\n    #append to list\n    arraypreface.append(preface)\n<\/code><\/pre>\n<pre><code>    # open json file with path\n    with open(path_folder+'boatlist.json', 'w') as outfile:\n        #dump two lists with dict names and add formatting (default=str solves date issue)\n        json.dump({'fileinfo': arraypreface, 'boats': arrayjson}, outfile, indent=4, default=str)\n    data = []\n    with open(\"output\/boatlist.json\", \"r\") as jdata:\n        data = 
json.load(jdata)\n    data['boats'].sort(key=keyparam)\n    return render_template('results.html', boatlist=data['boats'],predata=data['fileinfo'])\n<\/code><\/pre>\n<h4><a href=\"http:\/\/macblaze.ca\/?p=13680\">Continued: Flask Part Deux&#8230;<\/a><\/h4>\n","protected":false},"excerpt":{"rendered":"<p>I just published a blog post over on neverforever.ca about trying to build a web app to scrape YachtWorld. I thought I would record the details here so I can remember what I have done. The complete (and updated) repository is on GitHub if anyone is interested. Why? Some time in the recent past YachtWorld decided [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13665"}],"collection":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13665"}],"version-history":[{"count":7,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13665\/revisions"}],"predecessor-version":[{"id":13685,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13665\/revisions\/13685"}],"wp:attachment":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13665"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13665"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2
Fv2%2Ftags&post=13665"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}