--#--
Strip tease
--#--
When an HTML file is returned by a Web server, whether to your
browser or to your robot user agent, a good portion of the
text is HTML mark-up code. The browser finds this essential to
rendering the page, but a user agent finds it downright annoying.
If you just want to strip all the HTML tags out of the file, you
might be interested in the Perl module HTML::Parser.
We want something less generic: we want just the list of General Motors'
domains. So, let's pull just those out.
To start, capture the WHOIS output by redirecting your Perl script's
output to a file. If you open the file in a browser, you'll see the list
you're looking for. GM has so many domains registered that NSI's WHOIS
server just quits when it gets to 50.
Open the same file in a text editor and find those same domain listings.
Notice that the entire listing is encased in a <pre> </pre> tag pair? We can
quickly cut down on the amount of text we have to parse by simply cutting
away what's outside those tags.
First, grab the script's output into another Perl variable, rather than
printing it to stdout. Then use a Perl regular expression to strip away
the unwanted fluff. (Note the 's' modifier on the end of the Perl
substitution. Because $list contains multiple lines, each ending in a
newline character, you need to tell the substitution to let '.' match
newlines too, which it doesn't do by default.)
$list = $response->content;
$list =~ s#.*<pre>(.*)</pre>.*#$1#sg;
Now, if you add print "$list\n";
to your code you'll see the list of 50 domains. But it's still
choked with HTML.
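
To see what that 's' modifier buys you, here's a minimal sketch you can run
without touching the network. The $html string is a made-up stand-in for the
content of the response (the real NSI page will look different), and
pre_block is a hypothetical helper name. It uses a capturing match rather
than a substitution, which is simply another way to get at the same text.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page; not real WHOIS output.
my $html = join "\n",
    '<html><body><h1>Whois results</h1>',
    '<pre>',
    '<a href="/whois?HYPO1-DOM">EXAMPLE-ONE.COM</a>',
    '<a href="/whois?HYPO2-DOM">EXAMPLE-TWO.NET</a>',
    '</pre>',
    '</body></html>';

# Return only the text between <pre> and </pre>, or '' if there is none.
# The trailing "s" lets "." match newlines, so the pattern can span lines.
sub pre_block {
    my ($text) = @_;
    return $text =~ m#<pre>(.*?)</pre>#s ? $1 : '';
}

# The same pattern without "s" cannot cross the newline after <pre>.
my $without_s = ($html =~ m#<pre>(.*?)</pre>#) ? 'match' : 'no match';
print "without s: $without_s\n";

my $inner = pre_block($html);
print "with s:$inner\n";
```

The text you get back still carries the anchor tags around each name, which
is exactly the mess the next step cleans up.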
Another Perl regex can easily pick out any string composed of word
characters, a dot and three more characters: /\b([\w-]*\.\w{3,3})/. The
parentheses let us grab whatever was matched inside them.
Since we start with a multi-line string, let's break it into individual
lines and process each line:
@lines = split(/\n/, $list);
foreach (@lines) {
    if ( /\s*([\w-]*\.\w{3,3})/ ) {
        print "$1\n";
    }
}
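
If you'd rather keep the names around than print them straight away, the
same loop can push them onto an array. This is only a sketch: the sample
text is invented, and extract_domains is a hypothetical name, not something
from the script so far.

```perl
use strict;
use warnings;

# Collect the first name.tld token found on each line of a multi-line string.
sub extract_domains {
    my ($text) = @_;
    my @domains;
    foreach my $line (split /\n/, $text) {
        # Same pattern as the loop above: word characters or hyphens,
        # a dot, then exactly three word characters.
        if ($line =~ /\s*([\w-]*\.\w{3,3})/) {
            push @domains, $1;
        }
    }
    return @domains;
}

# Invented sample lines standing in for the contents of $list.
my $sample_list = join "\n",
    '',
    '<a href="/whois?HYPO1-DOM">EXAMPLE-ONE.COM</a>',
    '  <a href="/whois?HYPO2-DOM">EXAMPLE-TWO.NET</a>',
    '<b>no domain on this line</b>',
    '';

my @domains = extract_domains($sample_list);
print "$_\n" for @domains;
```

Bear in mind that the pattern stops after three characters and takes the
first dot it meets, so a name like www.example.com would come back as
www.exa. That is fine for a list of bare .com, .net and .org names, but
it's worth knowing before you reuse it elsewhere.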
So now we have 26 lines of Perl code that automagically queries NSI's
WHOIS server for the domain names that GM holds, or at least the first
50 of them.
But I promised more than that, didn't I? I promised you'd get the
InterNIC records on all the domains. And I'll deliver just that.