I am looping over a number of URLs and want to clean them up.
I currently have the following code:
    # Parse the link to drop the scheme (http) and the path
    o_url = URI.parse(node.attributes['href'])
    # Remove www
    new_url = o_url.host.gsub('www.', '').strip
How do I extend this to remove the subdomains that appear in some of the URLs?
This is a tricky problem. Some top-level domains don't accept registrations at the second level. Take example.co.uk: if you simply strip everything except the last two labels, you end up with co.uk, which is clearly not the intention.
You can use a list of effective TLDs (for example the Public Suffix List) to remove everything except the label directly to the left of the effective TLD. I'm not aware of a Ruby library that does this, but it would be a good idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
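To make the effective-TLD idea concrete, here is a minimal Ruby sketch. The suffix set is a deliberately tiny, hand-written stand-in for a real list, and the names `EFFECTIVE_TLDS` and `registrable_domain` are made up for this illustration. A real suffix list also contains wildcard and exception rules, which this sketch ignores.

```ruby
require 'set'

# Hypothetical, hand-picked stand-in for a real effective-TLD list.
EFFECTIVE_TLDS = Set.new(%w[com net nl uk co.uk org.uk us oh.us k12.oh.us]).freeze

# Returns the longest known suffix plus the one label to its left,
# or nil when the host is itself a suffix or no suffix is known.
def registrable_domain(host)
  labels = host.downcase.split('.')
  # A host that is itself a suffix has no registrable part.
  return nil if EFFECTIVE_TLDS.include?(labels.join('.'))

  # Walk from the longest candidate suffix to the shortest.
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if EFFECTIVE_TLDS.include?(suffix)
  end
  nil
end

puts registrable_domain("www.example.co.uk")   # example.co.uk
puts registrable_domain("foo.bar.pauldix.net") # pauldix.net
```

Checking the longest suffix first is what keeps example.co.uk from collapsing to co.uk, even though uk is also in the set.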
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
    require 'rubygems'
    require 'domainatrix'

    url = Domainatrix.parse("http://www.pauldix.net")
    url.tld       # => "net"
    url.domain    # => "pauldix"
    url.canonical # => "net.pauldix"

    url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
    url.tld       # => "co.uk"
    url.domain    # => "pauldix"
    url.subdomain # => "foo.bar"
    url.path      # => "/asdf.html?q=arg"
    url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
Something like:
    def remove_subdomain(host)
      # Not complete. Add all root domains to the regexp.
      host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, '\1')
    end

    puts remove_subdomain("www.example.com")   # -> example.com
    puts remove_subdomain("www.company.co.uk") # -> company.co.uk
    puts remove_subdomain("www.sub.domain.nl") # -> domain.nl
You still have to add every domain that you consider a root domain. So '.uk' may be the root domain, but you probably want to keep the host just before the '.co.uk' part.
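If the list of root domains grows, it may be easier to keep it in an array and build the pattern from it. The following is only a sketch: `ROOT_DOMAINS`, `ROOT_PATTERN` and `strip_subdomains` are names invented here, and the list is as incomplete as the one above.

```ruby
# Hypothetical list of suffixes to treat as root domains; extend as needed.
ROOT_DOMAINS = %w[com co.uk uk nl].freeze

# Regexp.union escapes each string, so the dots are matched literally.
ROOT_PATTERN = /([^.]+\.#{Regexp.union(ROOT_DOMAINS)})\z/

# Returns the label before the root domain plus the root domain itself.
# Hosts whose suffix is not in the list are returned unchanged.
def strip_subdomains(host)
  match = ROOT_PATTERN.match(host)
  match ? match[1] : host
end

puts strip_subdomains("www.company.co.uk") # company.co.uk
puts strip_subdomains("www.sub.domain.nl") # domain.nl
```

Returning the host unchanged for unknown suffixes is a design choice for this sketch; raising an error or returning nil would be just as reasonable.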
Detecting the subdomain of a URL is non-trivial in the general case. It is easy if you only consider the basic TLDs, but once you get into international territory it becomes tricky.
Edit: Consider things like http://mylocalschool.k12.oh.us et al.
The regular expression you need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) could have multiple parts (e.g. www.baz.co.uk).
Ready for a complex regular expression? :)
    re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
    new_url = o_url.host.gsub(re, '\1').strip
Let's break this into two sections.
^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters, each followed by a dot. The quantifier is lazy, so it takes only as many labels as the rest of the pattern needs in order to match. The empty alternation is required in the case of no subdomain (for example foo.com).
([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ collects the actual hostname and the TLD. It allows for either a one-part TLD (like .info, .com or .museum), or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).
I tested this expression on the following samples:
    foo.com => foo.com
    www.foo.com => foo.com
    bar.foo.com => foo.com
    www.foo.ca => foo.ca
    www.foo.co.uk => foo.co.uk
    a.b.c.d.e.foo.com => foo.com
    a.b.c.d.e.foo.co.uk => foo.co.uk
Note that this regex will not correctly match hostnames that have more than two "parts" to the TLD!
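A quick way to see both the behaviour and the limitation is to run the pattern over a few hosts. This sketch assumes the expression is written with the empty alternative and the two-letter second TLD part described above. `HOST_RE` and `base_host` are names chosen for the example only.

```ruby
# Assumed form of the pattern discussed above: an empty alternative for the
# no-subdomain case, and an optional two-letter second TLD part.
HOST_RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Replaces the whole host with the captured "name plus TLD" part.
def base_host(host)
  host.strip.sub(HOST_RE, '\1')
end

hosts = %w[foo.com www.foo.com www.foo.co.uk a.b.c.d.e.foo.co.uk mylocalschool.k12.oh.us]
hosts.each do |host|
  puts "#{host} => #{base_host(host)}"
end
```

The last host has a three-part suffix, so it comes out as k12.oh.us and the school's own label is lost. That is the limitation described in the note above.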