Screen snarfs

 

The Web is a wonderful place, with treasures of information hanging like ripe apples just inches from your grasp. You just need a browser. Or do you?

How hard is it to grab a particular nugget of data off a Web page without going to the trouble of launching a browser, typing in the URL, and clicking down a layer or two? Surprisingly, it's quite simple.
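To make that concrete, here is a minimal sketch in Python of fetching a page and pulling one nugget out of it with no browser involved. The helper names `fetch` and `grab`, the User-Agent string, and the title pattern in the usage note are all illustrative assumptions, not part of any particular site's interface.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "snarf-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def grab(html, pattern):
    """Return the first captured group of pattern in html, or None if absent."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

Something like `grab(fetch(some_url), r"<title>(.*?)</title>")` would then hand back the page title. A regex is fine for one well-known nugget; anything more structured probably deserves a real HTML parser.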

An example: Network Solutions maintains a whois server for querying the InterNIC database of domain names and owners (http://www.networksolutions.com/cgi-bin/whois/whois). It's a great tool. In fact, it's a better tool than many people know. Let's see if we can streamline the way it works and deliver an informative, useful report.

First of all, using the InterNIC whois to see whether www.coolbabes.com is taken (it is) barely scratches the surface of a powerful tool. Why not use it to see all the domains registered to General Motors? The list you get back is large, and each domain in it is hyperlinked, so you can click through to each one's entire database entry.

Or you can write a screen snarf that asks for a list of all GM's registered domain names and then uses the list to collect all GM's InterNIC database records.
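The two-step idea (get the list, then follow every link in it) might be sketched as below. This is only an outline under assumptions: I'm guessing that the record links can be recognized by a marker substring such as `whois` in their URL, and the names `LinkCollector`, `record_links`, and `snarf` are made up for illustration. The `fetch` argument is any function that takes a URL and returns the page text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def record_links(list_html, base_url, marker="whois"):
    """Return absolute, de-duplicated links whose URL contains marker."""
    parser = LinkCollector()
    parser.feed(list_html)
    seen, urls = set(), []
    for href in parser.links:
        url = urljoin(base_url, href)  # resolve relative links against the list page
        if marker in url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def snarf(list_url, fetch, marker="whois"):
    """Fetch the list page, then fetch every record it links to."""
    list_html = fetch(list_url)
    return {url: fetch(url) for url in record_links(list_html, list_url, marker)}
```

Passing `fetch` in as a parameter keeps the link-following logic separate from the network code, so the same function can be exercised against canned pages before it is pointed at a live server.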

You could kick such a scraper off through a CGI script run from a Web form. Or you could run it as a daily cron job. You might even implement it as a Java application your secretary runs on his Windows desktop. I'll show you how to do all three. I'll even show you how to do it from behind a proxy firewall.
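As a taste of the proxy case, here is a hedged sketch using Python's standard `urllib.request`. The proxy address you pass in is whatever your firewall administrator gives you; the function names `proxy_opener` and `fetch_via` are hypothetical, and this is not the approach the rest of this series necessarily takes.

```python
import urllib.request


def proxy_opener(proxy_url):
    """Build an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via(opener, url, timeout=10):
    """Fetch url through the given opener and return the body as text."""
    with opener.open(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Building the opener once and handing it to the fetch function means the scraping code does not care whether it sits behind a firewall.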


Copyright © 1999, 2001, 2002 by John H. Byrd
All rights reserved.