“Stop learning and start thinking” – dabbling in web scraping with Python and BeautifulSoup

In the midst of my efforts learning, and re-learning, various software development minutiae, I ran across this video detailing the (somewhat over-hyped) story of Jacob Barnett, a kid who was diagnosed with autism as a toddler, leading to all sorts of anxiety on the part of his parents.  Eventually, his autism – which is still noticeable, but probably more in line with Asperger syndrome – was augmented with a diagnosis of genius (his IQ is supposedly a stunning 170).  He is currently pursuing a PhD in Theoretical Physics.

In his talk, Barnett repeatedly makes a plea for others to “stop learning and start thinking and creating.”  At first, I didn’t understand what he was talking about (hey, cut me some slack, my IQ isn’t 170…) but as I thought more about what he was saying, I think I finally understood.  When striving for competency, and then mastery, in any subject, it is critical to take a break from aggregating new knowledge and to put your existing knowledge to work.  To put it in Jacob’s words, we need to stop learning that new language or library, and start thinking about how this new knowledge applies to the problems we need to solve.  And finally, we need to actually apply what we’ve learned by creating solutions to those problems.

With those thoughts dominating my thinking for a day or two, I decided to step away from my learning and dive into a project I’d been thinking about for several months (more than a year, actually).  Through that experience I was able to see clearly the importance of what Barnett was saying.

The project I chose to work on required some web scraping, and I’d been advised to check out the BeautifulSoup Python library, which parses HTML and XML files and provides easy-to-use methods to search for specific elements and extract data from them.  What it doesn’t do is actually connect to web servers and request the HTML content you want to search, so I turned to another user-friendly Python library – requests.  requests made it easy to make an HTTP GET request, and coupled with BeautifulSoup, made parsing data from an HTML file almost too easy.
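
A minimal sketch of that pairing might look like the following. It is not the script from this project: the function names, the timeout value, and the `td class="name"` markup are all assumptions made up for illustration, since the real site and its HTML aren't shown here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page with an HTTP GET and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_names(html):
    """Return the text of every <td class="name"> cell (hypothetical markup)."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", class_="name")
    return [cell.get_text(strip=True) for cell in cells]
```

Keeping the download and the parsing in separate functions means the parsing can be tried against a saved HTML string without touching the network, for example `extract_names(fetch_html(some_url))` for a live page or `extract_names(saved_html)` for a local copy.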

My first challenge was identifying which elements needed to be captured. This could have been done with static HTML captured from a manual search on the site. But it was easier to use the Python command line, BeautifulSoup, and requests to perform queries and parse HTML until I was getting the data I needed. The Python command line environment was a real boon in this situation.

At this point, I’d created a skeleton of a script that performed all the basic functionality – pulled a page from a website, parsed that page for the desired data, and even saved that data to a file.  This was about as far as any of the “learning” I’d done so far had taken me – implement basic functionality, see how things fit together, then move on.

But the problem I was trying to solve wasn’t to simply make one query and parse one HTML page and save one bit of data.  It was to query and parse thousands of pages and record thousands of records of data.   So, once I had this shell up and running, I set up the script to run against the full set of data and let it rip.

Almost immediately, my script screeched to a halt, with an obscure exception being thrown. With a little bit of debugging (again, using the Python command line) I was able to identify the problem – a page which was missing all of the data I was looking for, and thus was also missing the key HTML elements required for my BeautifulSoup code to work correctly.  Eventually, I was able to add some exception handling that would both allow me to skip the offending record, and also record some data about it in the log for later inspection.
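
To illustrate the skip-and-log idea, here is a rough sketch under some assumptions. The post doesn't say which exception was thrown; a common one in this situation is `AttributeError`, because BeautifulSoup's `find()` returns `None` when an element is absent and the next method call on that `None` fails. The `results` table id, the `name` cell class, and the function names are hypothetical.

```python
import logging

from bs4 import BeautifulSoup

logger = logging.getLogger("scraper")


def parse_record(html):
    """Extract one record from a page (hypothetical markup)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="results")
    # find() returns None when the element is missing, so on an empty
    # page the next line raises AttributeError.
    name = table.find("td", class_="name").get_text(strip=True)
    return {"name": name}


def parse_all(pages):
    """Parse (record_id, html) pairs, skipping and logging pages that fail.

    Returns a tuple of (parsed records, ids of skipped pages).
    """
    records, skipped = [], []
    for record_id, html in pages:
        try:
            records.append(parse_record(html))
        except AttributeError as exc:
            # Note the offending record for later inspection and move on.
            logger.warning("Skipping %s: %s", record_id, exc)
            skipped.append(record_id)
    return records, skipped
```

Catching only the specific exception, rather than everything, keeps genuinely unexpected failures visible instead of silently swallowing them.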

This same pattern happened several more times.  A situation would arise, be it missing data, or unexpected data, or some other anomaly with the pages I was examining, and each time there was some triage required, and a solution to be found and implemented.

Eventually, I was able to resolve all of these issues, and had my script running without a hitch – pages were parsed, data was stored, and errors were logged and then moved past.  As I was about to start the maiden voyage of my newly minted data scraping script, I decided first to remove some old comments, written back at the beginning of the project.  Some of them were simple pseudocode describing the expected functionality of my main loop.  Before deleting them, I quickly scanned them and verified that my code was performing the expected functionality.  To my surprise, I had missed a subtle, but huge, issue – the existence of a Next link.

The queries I was making against this public database could return much more data than the developers felt fit comfortably on a single page.  Instead of presenting the user with an almost endless list of data, they paged it, adding Next and Back buttons to the bottom of each page, as appropriate.  I had known about these early on, but the “final script” I had created only processed the first page of data.

Resolved to complete this script, I added the code to check for the Next button, and worked out how to process multiple pages instead of one.  The scheme I worked out used recursion, and required very minimal changes to the script.  Finally, I was ready to test my new script.  Sure enough, the changes worked, and it wasn’t long before I began a multiple-day scraping run.
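
The recursive scheme can be sketched roughly as follows. This is an assumption about its general shape, not the actual code: `load_page` is a hypothetical callable that fetches and parses one page and returns its records along with the URL behind the Next link, or `None` on the last page.

```python
def scrape_from(url, load_page, results=None):
    """Collect records from `url` and from every page reachable via Next.

    `load_page(url)` is a hypothetical helper returning a tuple of
    (records, next_url), where next_url is None when there is no Next link.
    """
    if results is None:
        results = []
    records, next_url = load_page(url)
    results.extend(records)
    if next_url is not None:
        # Recurse into the following page, accumulating into the same list.
        scrape_from(next_url, load_page, results)
    return results
```

The appeal of this approach is that the single-page logic stays untouched and the paging is layered on top of it. One caveat: Python limits recursion depth (about 1000 frames by default), so a query that spans a very large number of pages would be safer with a plain `while next_url is not None:` loop.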

By the time I was finished I had lived the truth of what Jacob Barnett was preaching in his TED Talk.  When we set aside the active learning of new things, start thinking about, and with, the new knowledge we have, and start using it to create new things, a whole new level of understanding is opened up.  And more than understanding, we will further the process of developing our craft.  Active thinking and creating is the only way to move our craft forward.

What am I doing? Let’s try to get specific.

As I declared in the post on the primary page of this blog, I’m working on resurrecting my career as a software engineer.  The first task of this endeavor is to determine just what sort of software I want to work on.

There are three areas I’m interested in focusing on:

  1. Web development
  2. Mobile development
  3. Mobile video game development

I’ve worked professionally on all three types of projects – unfortunately, not all as a software engineer.  My short-term goal is to familiarize myself with each of these enough that a company will be willing to give me a job.  Here’s what I’m planning on focusing on in each area.

Web Development

As I’ve turned my eye once again to the software job market, I’ve seen mostly a sea of web development jobs.  Whether it be internal business applications, corporate websites, or more commonly, Software as a Service (SaaS) providers, web development seems to top the list of most sought-after development skills.  The skills involved include:

  • HTML5, CSS3, JavaScript (AngularJS, Bootstrap, etc.)
  • Some server-side framework
    • Server-side Java
    • Ruby on Rails
    • Python/Django
    • NodeJS
    • PHP
    • These aren’t frameworks, but I need to learn more about them:
      • JSON
      • REST-based API development
  • Databases
    • Oracle/MySQL/PostgreSQL
    • NoSQL/MongoDB/etc.
  • Linux/Unix
    • You gotta run your servers on something.  Almost everyone runs some variant of Linux.

Mobile Development

There are really two primary development platforms here: iOS and Android.  Windows Phone exists, but it is almost irrelevant.  Some of what I need to learn here is what development tools are currently in vogue.  I know the following:

  • iOS development
    • Xcode IDE
    • Swift
    • App Store submissions
  • Android development
    • Java development for Android
    • Investigate other development tools/languages for Android

This list seems way too small.  A big part of this learning will be discovering what I need to be learning.

Mobile Video Game Development

My goal here is to become familiar with Unity, a development platform for interactive applications, primarily 2D and 3D games.  I’ve worked in the video game industry, and while Unity isn’t the end-all of game development solutions, it is a great solution for mobile games, and for indie game development (specifically because it’s free.)

That is actually a pretty big list of stuff.  Not sure how far along the list I need to get before I’ll be ready for work, but it’s time to get started.