I received a comment on my recent post about searching that referenced a piece of technology new to me: Gnugol. It is a command-line interface to the most common search engines. There is work on an Emacs interface for it, but I think I have another application idea.
I’ve contemplated, for a while now, how one might actually get search results in exactly the form one wants. As far as I can tell, Gnugol does not yet support regular expressions as a search parameter (though it would be trivial to apply one to the results, which is why I’m now sharing this idea), but it does provide a compiled C command-line utility to fetch results from Google. I’ll wager this proves remarkably faster than having Python or PHP stream the data directly (it certainly seems faster than my browser).
So, here is the proposal. Have Python, specifically Django, perform the search using this library. Have it then spawn a thread for each result (this way you can open a large number of streams simultaneously) and actually retrieve the data from each website. Apply a regex to the results. If a page does not match the regex, dump that result and query a new result set until the target count is reached (so you can still show, say, 10 results per page). Obviously some keywords would need to be supplied first. There isn’t a way to “grep the web” (yet), but this would make my dreams a reality.
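The fetch-and-filter step might look something like this minimal sketch. Everything here is hypothetical (the `filter_results` function and the injected `fetch` callable are my own illustrative names, not part of Gnugol); in practice `fetch` could wrap `urllib.request.urlopen` and the URLs would come from Gnugol’s output:

```python
import re
import threading

def filter_results(urls, pattern, fetch):
    """Fetch each URL in its own thread and keep only the URLs
    whose page body matches the given regex pattern."""
    compiled = re.compile(pattern)
    matches = []
    lock = threading.Lock()  # guard shared list across threads

    def worker(url):
        body = fetch(url)  # retrieve the page contents
        if compiled.search(body):
            with lock:
                matches.append(url)

    # One thread per result, so all pages download concurrently.
    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return matches
```

Passing `fetch` in as a parameter also makes the logic easy to test with canned pages instead of live HTTP requests. The caller would keep requesting further result sets from Gnugol until `filter_results` has returned enough matches to fill a page.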
Python, so you know, is my language of preference here because it does not require any special compile-time options to get the threading library working. All you need is:
import threading, time
t = threading.Thread(target=time.sleep, args=(1,))
t.start()
The threading module is already part of Python’s standard build. PHP, on the other hand, requires the pcntl extension to be compiled in, which isn’t an option in most cases (and mine in particular). There are ways to simulate it, but those are hacks.