I have a list of a million URLs. I need to extract the TLD for each URL and create a separate file per TLD. For instance, collect all URLs with .com as the TLD and dump them into one file, another for the .edu TLD, and so on. Further, within each file I have to sort alphabetically by domain name and then by subdomain.
Can anybody give me a jump start on implementing this in Perl?
- Use URI to parse the URL,
- Use its host method to get the host,
- Use Domain::PublicSuffix's get_root_domain to parse the host name.
- Use the suffix method to get the real TLD or the pseudo-TLD.
    use feature qw( say );

    use Domain::PublicSuffix qw( );
    use URI                  qw( );

    my $dps = Domain::PublicSuffix->new();

    for (qw(
       http://www.google.com/
       http://www.google.co.uk/
    )) {
       my $url = $_;

       # Treat scheme-less URLs as absolute URLs with a missing http://.
       $url = "http://$url" if $url !~ /^\w+:/;

       my $host = URI->new($url)->host();
       $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

       $dps->get_root_domain($host)
          or die $dps->error();

       say $dps->tld();     # com     uk
       say $dps->suffix();  # com     co.uk
    }
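The snippet above only extracts the TLD. To get from there to the per-TLD files with the sorting you describe, here is a minimal sketch, not from the original answer: it assumes URLs arrive one per line on STDIN, that naming the output files "$suffix.txt" is acceptable, and that sorting each bucket by the host's labels in reverse order (so "www.google.com" sorts under "com.google.www") gives the domain-then-subdomain order you want. It holds everything in memory, which should be workable for a million lines but not for much more.

    use strict;
    use warnings;
    use feature qw( say );

    use Domain::PublicSuffix qw( );
    use URI                  qw( );

    my $dps = Domain::PublicSuffix->new();
    my %bucket;  # suffix => [ [ sort_key, url ], ... ]

    while (my $url = <STDIN>) {
        chomp $url;
        $url = "http://$url" if $url !~ /^\w+:/;

        my $host = URI->new($url)->host();
        $host =~ s/\.\z//;

        # Skip unparsable hosts instead of dying on one bad input line.
        defined( $dps->get_root_domain($host) )
            or next;

        # Reversed labels: "www.google.com" => "com.google.www", so a
        # plain string sort orders by domain first, then by subdomain.
        my $key = join '.', reverse split /\./, $host;
        push @{ $bucket{ $dps->suffix() } }, [ $key, $url ];
    }

    for my $suffix (keys %bucket) {
        open my $fh, '>', "$suffix.txt"
            or die "Can't write $suffix.txt: $!";
        say {$fh} $_->[1]
            for sort { $a->[0] cmp $b->[0] } @{ $bucket{$suffix} };
        close $fh;
    }

If the input grows beyond what fits in memory, you could instead append each URL to its "$suffix.txt" file as you go and sort each file afterwards in a second pass.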