In the midst of my efforts to learn, and re-learn, various software development minutiae, I ran across this video detailing the (somewhat over-hyped) story of Jacob Barnett, a kid who was diagnosed with autism as a toddler, leading to all sorts of anxiety on the part of his parents. Eventually, his autism – which is still noticeable, but probably more in line with Asperger syndrome – was augmented with a diagnosis of genius (his IQ is supposedly a stunning 170). He is currently pursuing a PhD in Theoretical Physics.
In his talk, Barnett repeatedly makes a plea for others to, “Stop learning and start thinking and creating.” At first, I didn’t understand what he was talking about (hey, cut me some slack, my IQ isn’t 170…) but as I thought more about what he was saying, I think I finally understood. When striving for competency, and then mastery, in any subject, it is critical to take a break from aggregating new knowledge and to put your existing knowledge to work. To put it in Jacob’s words, we need to stop learning that new language or library, and start thinking about how this new knowledge applies to the problems we need to solve. And finally, we need to actually apply what we’ve learned by creating solutions to those problems.
With those thoughts dominating my thinking for a day or two, I decided to step away from my learning and dive into a project I’d been mulling over for several months (more than a year, actually). Through that experience I was able to see clearly the importance of what Barnett was saying.
The project I chose to work on required some web scraping, and I’d been advised to check out the BeautifulSoup Python library, which parses HTML and XML files and provides easy-to-use methods to search for specific elements and extract data from them. What it doesn’t do is actually connect to web servers and request the HTML content you want to search, so I turned to another user-friendly Python library – requests. requests made it easy to make an HTTP GET request, and coupled with BeautifulSoup, made parsing data from an HTML file almost too easy.
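The pages I was scraping aren’t reproduced here, so this is just a sketch of how the two libraries fit together – the HTML snippet, tag names, and class names below are made-up stand-ins, and in the real script the markup came from a `requests.get()` call rather than a string literal:

```python
from bs4 import BeautifulSoup

# In the real script the HTML came from an HTTP GET, roughly:
#   html = requests.get("https://example.com/search?q=...").text
# A static snippet stands in here so the parsing step is self-contained.
html = """
<table>
  <tr class="record"><td class="name">Alpha</td><td class="value">1</td></tr>
  <tr class="record"><td class="name">Beta</td><td class="value">2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching element; find() returns the first match.
records = []
for row in soup.find_all("tr", class_="record"):
    name = row.find("td", class_="name").get_text(strip=True)
    value = row.find("td", class_="value").get_text(strip=True)
    records.append((name, value))

print(records)  # [('Alpha', '1'), ('Beta', '2')]
```

The `class_` keyword (with the trailing underscore, since `class` is a Python reserved word) is how BeautifulSoup filters elements by CSS class.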
My first challenge was identifying which elements needed to be captured. This could have been done with static HTML captured from a manual search on the site, but it was easier to use the Python command line, BeautifulSoup, and requests to perform queries and parse HTML until I was getting the data I needed. The Python command line environment was a real boon in this situation.
At this point, I’d created a skeleton of a script that performed all the basic functionality – pulled a page from a website, parsed that page for the desired data, and even saved that data to a file. This was about as far as any of the “learning” I’d done so far had taken me – implement basic functionality, see how things fit together, then move on.
But the problem I was trying to solve wasn’t to make one query, parse one HTML page, and save one bit of data. It was to query and parse thousands of pages and save thousands of records. So, once I had this shell up and running, I set the script loose against the full set of data and let it rip.
Almost immediately, my script screeched to a halt, with an obscure exception being thrown. With a little bit of debugging (again, using the Python command line) I was able to identify the problem – a page which was missing all of the data I was looking for, and thus was also missing the key HTML elements required for my BeautifulSoup code to work correctly. Eventually, I was able to add some exception handling that would both allow me to skip the offending record, and also record some data about it in the log for later inspection.
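The pattern of the fix can be sketched like this – the page contents, element names, and logger setup below are invented for illustration (the real script logged enough detail to revisit the skipped records later):

```python
import logging
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")

# Two hypothetical pages: one well-formed, one missing the expected element.
pages = {
    "page-1": '<div class="data">42</div>',
    "page-2": "<p>No results found.</p>",
}

results = {}
for page_id, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    try:
        # find() returns None when the element is absent, so chaining
        # .text onto a missing element raises AttributeError.
        results[page_id] = soup.find("div", class_="data").text
    except AttributeError:
        # Skip the offending record, but note it for later inspection.
        log.warning("page %s is missing the expected data element; skipped", page_id)

print(results)  # {'page-1': '42'}
```

This is the general shape of the pattern: let the well-formed pages flow through, and turn each malformed page into a log entry instead of a crash.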
This same pattern happened several more times – a situation would arise, be it missing data, or unexpected data, or some other anomaly with the pages I was examining – and each time there was some triage required, and a solution to be found and implemented.
Eventually, I was able to resolve all of these issues, and had my script running without a hitch – pages were parsed, data was stored, and errors were logged and then moved past. As I was about to start the maiden voyage of my newly minted data scraping script, I decided first to remove some old comments, written back at the beginning of the project. Some of them were simple pseudocode describing the expected functionality of my main loop. Before deleting them, I quickly scanned them and verified that my code performed as expected. To my surprise, I had missed a subtle, but huge, issue – the existence of a Next link.
The queries I was making against this public database could return much more data than the developers felt fit comfortably on a single page. Instead of presenting the user with an almost endless list of data, they paged it, adding Next and Back buttons to the bottom of each page, as appropriate. I had known about these early on, but the “final script” I had created only processed the first page of data.
Resolved to complete this script, I added the code to check for the Next button, and worked out how to process multiple pages instead of one. The scheme I worked out used recursion, and required very minimal changes to the script. Finally, I was ready to test my new script. Sure enough, the changes worked, and it wasn’t long before I began a multiple-day scraping run.
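A recursive scheme along those lines might look like the sketch below. A small in-memory dict stands in for the live site, and the URLs, class names, and `fetch` helper are all invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical "site": each page holds some records plus an optional Next link.
PAGES = {
    "/results?page=1": '<tr class="record"><td>A</td></tr>'
                       '<a class="next" href="/results?page=2">Next</a>',
    "/results?page=2": '<tr class="record"><td>B</td></tr>'
                       '<a class="next" href="/results?page=3">Next</a>',
    "/results?page=3": '<tr class="record"><td>C</td></tr>',  # last page: no Next
}

def fetch(url):
    # Stand-in for requests.get(url).text
    return PAGES[url]

def scrape(url, collected):
    """Parse one page of results, then recurse if a Next link is present."""
    soup = BeautifulSoup(fetch(url), "html.parser")
    for row in soup.find_all("tr", class_="record"):
        collected.append(row.get_text(strip=True))
    next_link = soup.find("a", class_="next")
    if next_link is not None:
        scrape(next_link["href"], collected)
    return collected

print(scrape("/results?page=1", []))  # ['A', 'B', 'C']
```

One caveat with the recursive approach: Python’s default recursion limit (around 1,000 frames) caps how many pages a single query can span, so a plain `while` loop that follows the Next link iteratively would be the safer choice for very long result sets.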
By the time I was finished, I had lived the truth of what Jacob Barnett was preaching in his TED Talk. When we set aside the active learning of new things, start thinking – about and with – the knowledge we have, and start using it to create new things, a whole new level of understanding is opened up. More than deepening our understanding, we further the development of our craft – and active thinking and creating is the only way to move it forward.