I am looping over a number of URLs and want to clean them up.

I currently have the following code:

require 'uri'

# Parse the link to drop the scheme (http) and the path
o_url = URI.parse(node.attributes['href'])

# Remove www
new_url = o_url.host.gsub('www.', '').strip

How do I extend this to also remove the subdomains that appear in some URLs?

This is a tricky problem. Some top-level domains don't accept registrations at the second level.

Compare example.com and example.co.uk. If you simply strip everything except the last two labels, you end up with example.com and co.uk, which cannot be the intention.

Opera solves this by blocking by effective top-level domain, and they maintain a list of all of these domains. More details at publicsuffix.org.

You can use this list to remove everything except the label directly next to the effective TLD. I don't know of a Ruby library that does this, but it would be a good idea to release one!

Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
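As a rough illustration of the "keep only the label next to the effective TLD" idea, here is a minimal sketch. The suffix set is a hypothetical, hand-picked sample and not the real Public Suffix List, and the method name registrable_domain is made up for this example. It also ignores the wildcard and exception rules that the real list contains.

```ruby
require 'uri'
require 'set'

# Hypothetical, hand-picked sample of suffixes, for illustration only.
# A real implementation would load the full list from publicsuffix.org.
SAMPLE_SUFFIXES = Set.new(%w[com net nl uk co.uk org.uk]).freeze

# Returns the label directly left of the longest known suffix, plus that suffix.
def registrable_domain(host, suffixes = SAMPLE_SUFFIXES)
  labels = host.downcase.split('.')
  # Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
  # Starting at index 1 guarantees one label stays in front of the suffix.
  1.upto(labels.length - 1) do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if suffixes.include?(suffix)
  end
  host # no known suffix: leave the host untouched
end

host = URI.parse('http://foo.bar.example.co.uk/path').host
puts registrable_domain(host)              # example.co.uk
puts registrable_domain('www.example.com') # example.com
```

Checking the longest candidate first is what keeps example.co.uk from collapsing to co.uk.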

I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix

require 'rubygems'
require 'domainatrix'

url = Domainatrix.parse("http://www.pauldix.net")
url.tld       # => "net"
url.domain    # => "pauldix"
url.canonical # => "net.pauldix"

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.tld       # => "co.uk"
url.domain    # => "pauldix"
url.subdomain # => "foo.bar"
url.path      # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"

Something like:

def remove_subdomain(host)
  # Not complete. Add all root domains to the regexp
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, '\1')
end

puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl

You still need to add all the domains you consider to be root domains. So '.uk' may be the root domain, but you probably want to keep the host just before the '.co.uk' part.
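If the list of root domains grows, it may be easier to keep it in an array and build the pattern from it. A minimal sketch, assuming a hypothetical ROOT_SUFFIXES list and a made-up method name remove_subdomain_from_list:

```ruby
# Hypothetical list of suffixes treated as "root" domains; extend as needed.
ROOT_SUFFIXES = %w[com co.uk uk nl].freeze

# Regexp.union escapes each string, so the dots match literally.
# The capture group holds one label plus the matching suffix at the end of the host.
SUFFIX_RE = /([^.]+\.#{Regexp.union(ROOT_SUFFIXES)})\z/

def remove_subdomain_from_list(host)
  # String#[] with a regexp and a group index returns that group, or nil on no match.
  host[SUFFIX_RE, 1] || host
end

puts remove_subdomain_from_list('www.example.com')    # example.com
puts remove_subdomain_from_list('www.company.co.uk')  # company.co.uk
puts remove_subdomain_from_list('www.sub.domain.nl')  # domain.nl
```

Hosts whose suffix is not in the list are returned unchanged, which makes missing entries easy to spot.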

Detecting the subdomain of a URL is non-trivial in the general case. It is easy if you only consider the basic ones, but once you get into international territory it becomes tricky.

Edit: Consider things like http://mylocalschool.k12.oh.us et al.

The regular expression you need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) could have multiple parts (e.g. www.baz.co.uk).

Ready for a complex regular expression? :)

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

new_url = o_url.host.gsub(re, '\1').strip

Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters, each followed by a dot. The +? quantifier is lazy, so it consumes only as many leading labels as the rest of the pattern requires. The empty alternation is needed for the case of no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ collects the actual hostname and the TLD. It allows for a one-part TLD (like .info, .com or .museum), or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).

I tested this expression on the following samples:

foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk

Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
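To make that limitation concrete, here is a small check using the k12.oh.us hostname mentioned earlier. It assumes the pattern has escaped dots, an empty alternative, and a two-letter second TLD part, as the explanation describes; the helper name strip_subdomains is made up for this example.

```ruby
# Pattern as described above: dots escaped, an empty alternative for "no subdomain",
# and an optional two-letter second TLD part.
RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

def strip_subdomains(host)
  # Replace the whole match with the captured "hostname + TLD" group.
  host.strip.sub(RE, '\1')
end

puts strip_subdomains('a.b.c.d.e.foo.co.uk')     # foo.co.uk
puts strip_subdomains('mylocalschool.k12.oh.us') # k12.oh.us
```

In the second case the school's own label is dropped, because the pattern treats oh.us as the whole TLD and k12 as the hostname.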