Tag Archives: lxml

GCHQ Quiz Grinch: 1

***Contains spoilers, obviously.***

The GCHQ (Government Communications Headquarters) released a hilariously infuriating Christmas card, which consisted of a logic puzzle, which once solved, led to more puzzles, which led to more puzzles, and so on.

I solved the first puzzle no problem. That revealed Part 2, a series of 6 multiple choice questions, all of which required a correct answer to move on. I got two of them, but was left at a loss for the other four. I gave it an honest try, but really wanted to see what Part 3 had in store. So, I turned to cheating. And, here’s how.

Clicking through the answers, you may notice that the URL contains the previous answer letter. For example, if your answers were A, B, C, D, E, and F, the final webpage (telling you that you were wrong) would have the URL: http://s3-eu-west-1.amazonaws.com/puzzleinabucket/ABCDEF.html

If you had enough time, you could type in each combination and see if that led to a “correct” URL. Obviously, that would be fairly time-consuming. Each of 6 questions had 6 possible answers, which means 6 ^ 6 = 46656 possibilities. You could save some time by including known correct answers. In my case, I had two answers of which I was confident. That left me with 6 ^ 4 = 1296 possibilities, still too many to inspect manually.

However, a simple Python script could load the pages fairly quickly (about once per second, or a maximum of 21.6 minutes), inspect the resulting webpage, and determine if it contains a message indicating that the combination was incorrect. If so, move onto next combination. If not, stop and print the correct combination.

Here’s the script:

import requests, time, sys
from lxml import html

opts = ['A','B','C','D','E','F'] # possible answers
for i in opts:
    for j in opts:
        for k in opts:
            for l in opts:
                url = i + j + k + 'A' + l + 'E' # 4 unknown, 2 known
                time.sleep(1) # provide rest in between calls
                r = requests.get('http://s3-eu-west-1.amazonaws.com/puzzleinabucket/' + url + '.html') # make call
                tree = html.fromstring(r.text) # store page html
                if tree.xpath('//div[contains(., "Sorry - you did not get all the questions correct.")]/p//text()'):
                    print "SUCCESS " + url
                    sys.exit() # quit script once correct combination is found


And the final answer: DEFACE

The quiz is for charity, so don’t forget to donate here.



GeoNet Prime Answering Times


ESRI runs a biannual contest to encourage participation on their help forum, GeoNet. Prizes go to the top ten point-getters. I found out during the last contest that it takes a great deal of time and persistence to keep up with the top of the pack (I ended up 5th). There are several tips for optimizing your effort (some of which I outlined, somewhat sarcastically, here).

One such tip is: get on GeoNet during times when there are the greatest number of fresh, unanswered questions. If you’ve spent any time on GeoNet, you have likely noticed that questions are generally asked during North American working hours, when people are struggling to get through their work-related GIS tasks. I wanted to put some better numbers to this idea, so I set about gathering the data myself.

All the information is there: each post has the date/time it was asked written right there in the posting. You could click on each post and record that date/time into an Excel and be done with it, but that would be awfully tedious. This is where screen scraping comes in. Screen scraping is the direct equivalent of having your computer control your web browser: click here, find this part of the HTML code, read it, and do something with it.Luckily, your computer doesn’t care if it has to spend all day doing the same thing over and over and over…

I chose to use Python, but you can do this in other languages, as well. Useful libraries to download are Requests and lxml. I use Requests for making the, you guessed it, “requests”, which are similar to typing a URL in the address bar of your browser. I use lxml for parsing and traversing the returned HTML code, which you can look at on any web page by pressing Ctrl+u (at least, in Chrome).

from lxml import html
import requests, time, csv

with open('C:/junk/geonet.csv', 'w') as csvfile: # create and/or open a CSV file
  csvWriter = csv.writer(csvfile, delimiter=" ", quoting=csv.QUOTE_MINIMAL) # writer
  dateList = []
  baseUrl = 'https://geonet.esri.com/content' # store the URL prefix

  for i in range(10): # loop through the first 10 'Content' pages
    page = requests.get(baseUrl + '?start=' + str(i*20)) # navigate to page
    tree = html.fromstring(page.text) # retrieve the HTML
    linkList = tree.iterlinks() # find all the links on the page
    threads = []

    for link in linkList: # loop through the links
      if link[2].startswith('/thread/'): # find those starting with "thread"
        threads.append(link[2]) # add the link to the list

    threadBase = 'https://geonet.esri.com' # store the URL prefix
    for thread in threads: # loop through the threads listed on the 'Content' page
      page = requests.get(threadBase + thread) # navigate to the correct thread page
      tree = html.fromstring(page.text) # retrieve the HTML
      dates = tree.find_class('j-post-author') # retrieve the date
      dateList.append(dates[0].text_content().strip()) # write to list
      csvWriter.writerow(dates[0].text_content().strip()) # write to CSV
      time.sleep(5) # wait 5s to give server a chance to handle someone else's requests

Anyhow, the graph at the start of this post shows pretty much what I expected: people on the East Coast get confused, then people on the West Coast get confused, then everyone goes home.