How do I extract the domain name from a URL, excluding any subdomains?
My first simplistic attempt was:
This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs or country codes (because they change)?
    from __future__ import with_statement
    from urlparse import urlparse

    # load tlds, ignore comments and empty lines:
    with open("effective_tld_names.dat.txt") as tldFile:
        tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

    def getDomain(url, tlds):
        # take the host part of the URL and split it into labels
        urlElements = urlparse(url).netloc.split('.')
        # urlElements = ["abcde","co","uk"]

        for i in range(-len(urlElements), 0):
            lastIElements = urlElements[i:]
            #    i=-3: ["abcde","co","uk"]
            #    i=-2: ["co","uk"]
            #    i=-1: ["uk"] etc

            candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
            wildcardCandidate = ".".join(["*"] + lastIElements[1:])  # *.co.uk, *.uk, *
            exceptionCandidate = "!" + candidate

            # match tlds:
            if (exceptionCandidate in tlds):
                return ".".join(urlElements[i:])
            if (candidate in tlds or wildcardCandidate in tlds):
                return ".".join(urlElements[i-1:])
                # returns "abcde.co.uk"

        raise ValueError("Domain not in global list of TLDs")

    print getDomain("http://abcde.co.uk", tlds)
I'd be grateful if someone could tell me which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the lastIElements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
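One possible tidier version, as a rough sketch in Python 3 rather than a definitive answer: it walks the labels by start index instead of negative offsets and keeps the rules in a set so membership tests are cheap. The names `get_domain` and `SAMPLE_RULES` are made up for illustration, and the handful of rules is a hand-picked stand-in for the full effective-TLD file.

```python
from urllib.parse import urlparse

# Hypothetical sample of rules in the effective-TLD file format;
# a real run would load the complete file instead.
SAMPLE_RULES = frozenset({"com", "uk", "co.uk", "*.ck", "!www.ck"})


def get_domain(url, rules=SAMPLE_RULES):
    """Return the registrable domain of url according to rules."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if start == 0:
                break  # the whole host is a suffix, nothing to return
            # Registrable domain is the suffix plus one more label.
            return ".".join(labels[start - 1:])
    raise ValueError("no registrable domain found in %r" % url)


print(get_domain("http://abcde.co.uk"))
print(get_domain("http://www.foo.com"))
```

ValueError seems a reasonable choice here, since the argument has the right type but a value the function cannot handle; a custom subclass of ValueError would also work if callers need to tell this failure apart from others.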
No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk is not (because the UK's registrar DOESN'T sell domains such as co.uk, only ones like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's; there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
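As a small illustration of what such an auxiliary table might look like in practice, here is a sketch in Python 3 that loads rules from text in the public-suffix file format (lines starting with `//` are comments, and each rule is read up to the first whitespace). The function name `load_rules` and the `SAMPLE` excerpt are assumptions made up for this example, not real registry data.

```python
def load_rules(lines):
    """Collect suffix rules, skipping blank lines and // comments."""
    rules = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        # Only the text up to the first whitespace is the rule.
        rules.add(line.split()[0])
    return rules


# Hypothetical excerpt, hand-written to mirror the example above.
SAMPLE = """\
// illustrative excerpt only
it
uk
co.uk

com.au
"""

rules = load_rules(SAMPLE.splitlines())
print("co.uk" in rules)  # True
print("co.it" in rules)  # False
```

With this table, zap.co.uk and zap.co.it look identical as strings, yet only the first ends in a listed two-label suffix. That difference comes entirely from the table, which is the point made above.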
There are many, many TLDs. Here's the list:
Here's another list
Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract