{"id":13517,"date":"2020-01-26T15:07:22","date_gmt":"2020-01-26T22:07:22","guid":{"rendered":"http:\/\/macblaze.ca\/?p=13517"},"modified":"2021-12-10T12:15:29","modified_gmt":"2021-12-10T19:15:29","slug":"web-scraping-python-code","status":"publish","type":"post","link":"https:\/\/macblaze.ca\/?p=13517","title":{"rendered":"Web scraping Python code"},"content":{"rendered":"<p>In <a href=\"http:\/\/macblaze.ca\/?p=13488\">my previous post<\/a> I explained that I was looking for a way to use web scraping to extract data from my Calibre-Web shelves and automatically post them to my <a href=\"http:\/\/macblaze.ca\/?page_id=13497\">Books Read<\/a> page here on my site. In this post I will step through my final Python script to explain to my future self what I did and why.<\/p>\n<h3>Warning: Security<\/h3>\n<p>A heads-up: I have no guarantee that this code is secure enough to use in a production environment. In fact I would guess it isn&#8217;t. My Calibre-Web webserver, however, is local to my home network, and I trust that my hosted server (macblaze.ca) is secure enough. Since passwords and the like are being passed back and forth, I wouldn&#8217;t count on any of this being secure without a lot more effort than I am willing to put in.<\/p>\n<h2>The code in bits<\/h2>\n<h3># import various libraries<\/h3>\n<pre><code>import requests\nfrom bs4 import BeautifulSoup\nimport re<\/code><\/pre>\n<p>This loads the various libraries the script uses. 
<em>Requests<\/em> is an HTTP library that allows you to send requests to websites, <em>BeautifulSoup<\/em> is a library for pulling data out of HTML and <em>re<\/em> is a regular-expression library that allows you to do custom pattern searches.<\/p>\n<h3># set variables<\/h3>\n<pre><code># set header to avoid being labeled a bot\nheaders = {\n    'user-agent': 'Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/67.0.3396.99 Safari\/537.36'\n}\n\n# set base url\nurlpath='http:\/\/urlpath'\n\n# website login data\nlogin_data = {\n    'next': '\/',\n    'username': 'username',\n    'password': 'password',\n    'remember_me': 'on',\n}\n\n# set path to export as markdown file \npath_folder=\"\/Volumes\/www\/books\/\"\nfile = open(path_folder+\"filename.md\",\"w\")<\/code><\/pre>\n<p>This sets up the various variables used for login, including a header to try to avoid being labeled a bot, the base url of the Calibre-Web installation and the login data, and it specifies a location and name for the resulting <a href=\"http:\/\/macblaze.ca\/?p=7837\">markdown<\/a> file. The open command uses the &#8216;w&#8217; mode to indicate the script will write a new file every time it is executed, overwriting the old one.<\/p>\n<h3># log in and open http session<\/h3>\n<pre><code>with requests.Session() as sess:\n    url = urlpath+'\/login'\n    res = sess.get(url, headers=headers)\n    res = sess.post(url, data=login_data)<\/code><\/pre>\n<p>Then, using <em>Requests<\/em>, I open a session on the webserver and log in using the variables set above.<\/p>\n<h3>Writing the File<\/h3>\n<p><strong>Note:<\/strong> The code has matching <code>file.write()<\/code> and <code>print()<\/code> statements throughout. The <code>print()<\/code> statements just write to the terminal app and allow me to see what is being written to the actual file using <code>file.write()<\/code>. 
They are completely unnecessary.<\/p>\n<h3># Set Title<\/h3>\n<pre><code>file.write(\"# Books Read\\n\")\nprint(\"# Books Read\\n\")<\/code><\/pre>\n<p>Pretty basic: write the words <em>Books Read<\/em> followed by a newline, tagged with a # to indicate it is an h1 head. This will become the actual page name.<\/p>\n<h3># find list of shelves<\/h3>\n<pre><code>shelfhtml = sess.get(urlpath)\nsoup = BeautifulSoup(shelfhtml.text, \"html.parser\")\nshelflist = soup.find_all('a', href=re.compile('\/shelf\/[1-9]'))\nprint (shelflist)<\/code><\/pre>\n<p>Now we use the session we opened earlier to fetch the main page into the variable <em>shelfhtml<\/em>. Using BeautifulSoup we parse the html code and search for all <strong>a<\/strong> links whose <strong>href<\/strong> matches the regex expression &#8216;\/shelf\/[1-9]&#8217;. (Hopefully I won&#8217;t have more than 9 shelves or I will have to redo this bit.) The variable now contains a list of all the links that match that pattern and looks like this:<\/p>\n<pre><code>[&lt;a href=\"\/shelf\/3\"&gt;&lt;span class=\"glyphicon glyphicon-list private_shelf\"&gt;&lt;\/span&gt;2018&lt;\/a&gt;, &lt;a href=\"\/shelf\/2\"&gt;&lt;span class=\"glyphicon glyphicon-list private_shelf\"&gt;&lt;\/span&gt;2019&lt;\/a&gt;, &lt;a href=\"\/shelf\/1\"&gt;&lt;span class=\"glyphicon glyphicon-list private_shelf\"&gt;&lt;\/span&gt;2020&lt;\/a&gt;]<\/code><\/pre>\n<p>This, as you can see, contains the links to all three of my current Year shelves, displayed in ascending numerical order.<\/p>\n<h3>#reverse order of urllist<\/h3>\n<pre><code>dateshelflist = shelflist\ndateshelflist.reverse()\nprint (dateshelflist)<\/code><\/pre>\n<p>I wanted to display my book lists from newest to oldest so I used Python to reverse the items in the list.<\/p>\n<h2>First loop: the shelves<\/h2>\n<p>The first loop loops through all the shelves (in this case 3 of them) and starts the process of building a book list for each.<\/p>\n<h3># loop 
through sorted shelves<\/h3>\n<pre><code>for shelf in dateshelflist:\n    #set shelf page url\n    res = sess.get(urlpath+shelf.get('href'))\n    soup = BeautifulSoup(res.text, \"html.parser\")\n\n    # find year from shelflist and format\n    shelfyear = soup.find('h2')\n    year = re.search(\"([0-9]{4})\", shelfyear.text)\n    file.write(\"### {}\\n\".format(year.group()))\n    print(\"### {}\\n\".format(year.group()))<\/code><\/pre>\n<p>In each iteration of the loop, the script goes to the actual shelf page by adding an <strong>href<\/strong> extracted from the list (using a <em>get<\/em> command) to the base url, and then parses the <em>html<\/em> of the resulting webpage. The script then finds the year info, which is an <strong>H2<\/strong>, extracts the 4-digit year with the regex <em>([0-9]{4})<\/em> and writes it to the file, formatted as an <strong>H3<\/strong> header and followed by a line break.<\/p>\n<h3># find all books<\/h3>\n<pre><code>books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')<\/code><\/pre>\n<p>Using BeautifulSoup we extract the list of books from the page, knowing they are all marked with a <strong>div<\/strong> in the class <strong>col-sm-3 col-lg-2 col-xs-6 book<\/strong>.<\/p>\n<h2>Second loop: the books<\/h2>\n<h3>#loop through books. Each book is a new BeautifulSoup object.<\/h3>\n<pre><code>for book in books:\n        title = book.find('p', class_='title')\n        author = book.find('a', class_='author-name')\n        seriesname = book.find('p', class_='series')\n        pubdate = book.find('p', class_='publishing-date')\n        coverlink = book.find('div', class_='cover')\n        if None in (title, author, coverlink, seriesname, pubdate):\n            continue\n        # extract year from pubdate\n        pubyear = re.search(\"([0-9]{4})\", pubdate.text)<\/code><\/pre>\n<p>This is the beginning of the second loop. 
For each book we use soup to extract the title, author, series, pubdate and cover (which I don&#8217;t end up using). Each search is based on the class assigned to it in the original html code. Because I only want the pub year and not the full pub date, I again use a regex to extract the 4-digit year. The <em>if None&#8230;<\/em> statement is there in case one of the fields is empty; it skips that book and prevents the script from crashing when it tries to read a missing field.<\/p>\n<h3># construct line using markdown<\/h3>\n<pre><code>newstring = \"* ***{}*** \u2014 {} ({})\\{} \u2013 ebook\\n\".format(title.text, author.text, pubyear.group(), seriesname.text)\nfile.write(newstring)\nprint (newstring)<\/code><\/pre>\n<p>Next we construct the book entry based on how we want it to appear on the web page. In my case I want each entry to be an <strong>li<\/strong> and end up looking like this:<\/p>\n<ul>\n<li><strong><em>The Cloud Roads<\/em><\/strong> \u2014 Martha Wells (2011)<br \/>\nBook 1.0 of Raksura \u2013 ebook<\/li>\n<\/ul>\n<p>Python&#8217;s <em>format()<\/em> fills each {} placeholder with the corresponding variable listed at the end of the statement, which makes for easier formatting. The script then writes the line to the open markdown file and heads back to the beginning of the loop to grab the next book.<\/p>\n<h3>More loops<\/h3>\n<p>That&#8217;s pretty much it. It loops through the books until it runs out and heads back to the first loop to see if there is another shelf to process. After it processes all the shelves it drops to the last line of the script:<\/p>\n<pre><code>file.close()<\/code><\/pre>\n<p>which closes the file and that is that\u2014c&#8217;est tout. It will now be accessed the next time someone visits the <em>Books Read<\/em> page on my site.<\/p>\n<h2>In Conclusion<\/h2>\n<p>Hopefully this is clear enough so that when I forget every scrap of Python in the years to come I can still recreate this after the inevitable big crash. 
The script, called scrape.py in my case, is executed in terminal by going to the enclosing folder and typing <strong><em>python3 scrape.py<\/em><\/strong> then hitting enter. Automating that is something I will ponder if this book list thing becomes my ultimate methodology for recording books read. Its big failing is that it only records ebooks in my Calibre library. I might have to redo the entire thing for something like <a href=\"https:\/\/www.librarything.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">LibraryThing<\/a> where I can record all my books&#8230;lol. Hmmm&#8230; maybe&#8230;<\/p>\n<h2>The Final Code<\/h2>\n<p>Here is the final script in its entirety.<\/p>\n<pre><code># import various libraries\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\n\n# set header to avoid being labeled a bot\nheaders = {\n    'user-agent': 'Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/67.0.3396.99 Safari\/537.36'\n}\n\n# set base url\nurlpath='http:\/\/urlpath'\n\n# website login data\nlogin_data = {\n    'next': '\/',\n    'username': 'username',\n    'password': 'password',\n    'remember_me': 'on',\n}\n\n# set path to export as markdown file\npath_folder=\"\/Volumes\/www\/home\/books\/\"\nfile = open(path_folder+\"filename.md\",\"w\")\n\nwith requests.Session() as sess:\n    url = urlpath+'\/login'\n    res = sess.get(url, headers=headers)\n    res = sess.post(url, data=login_data)\n\n# Note: print() commands are purely for terminal output and unnecessary\n\n# Set Title\nfile.write(\"# Books Read\\n\")\nprint(\"# Books Read\\n\")\n\n# find list of shelves\nshelfhtml = sess.get(urlpath)\nsoup = BeautifulSoup(shelfhtml.text, \"html.parser\")\nshelflist = soup.find_all('a', href=re.compile('\/shelf\/[1-9]'))\n# print (shelflist)\n\n#reverse order of urllist\ndateshelflist = shelflist\ndateshelflist.reverse()\n# print (dateshelflist)\n\n# loop through sorted shelves\nfor shelf in 
dateshelflist:\n\n    #set shelf page url\n    res = sess.get(urlpath+shelf.get('href'))\n    soup = BeautifulSoup(res.text, \"html.parser\")\n\n    # find year and format\n    shelfyear = soup.find('h2')\n    year = re.search(\"([0-9]{4})\", shelfyear.text)\n    file.write(\"### {}\\n\".format(year.group()))\n    print(\"### {}\\n\".format(year.group()))\n\n    # find all books\n    books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')\n\n    #loop through books. Each book is a new BeautifulSoup object.\n    for book in books:\n        title = book.find('p', class_='title')\n        author = book.find('a', class_='author-name')\n        seriesname = book.find('p', class_='series')\n        pubdate = book.find('p', class_='publishing-date')\n        coverlink = book.find('div', class_='cover')\n        if None in (title, author, coverlink, seriesname, pubdate):\n            continue\n        # extract year from pubdate\n        pubyear = re.search(\"([0-9]{4})\", pubdate.text)\n        # construct line using markdown\n        newstring = \"* ***{}*** \u2014 {} ({})\\{} \u2013 ebook\\n\".format(title.text, author.text, pubyear.group(), seriesname.text)\n        file.write(newstring)\n        print (newstring)\n\nfile.close()<\/code><\/pre>\n<h2>Note 12\/2021<\/h2>\n<p>There has been an update to the Calibre-Web code so I <a href=\"https:\/\/macblaze.ca\/?p=14799\">had to make some changes<\/a> to the Python script.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post I explained that I was looking for a way to use web scraping to extract data from my Calibre-Web shelves and automatically post them to my Books Read page here on my site. 
In this post I will step through my final Python script to explain to my future self what [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13517"}],"collection":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13517"}],"version-history":[{"count":31,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13517\/revisions"}],"predecessor-version":[{"id":14801,"href":"https:\/\/macblaze.ca\/index.php?rest_route=\/wp\/v2\/posts\/13517\/revisions\/14801"}],"wp:attachment":[{"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/macblaze.ca\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}