How do you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:


This works for a simple http://www. address, but not for one under a multi-label suffix such as co.uk. Is there a way of doing this properly without needing special knowledge about valid TLDs or country codes (because they change)?
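A naive approach of the kind described, assuming it simply keeps the last two labels of the host name, might look like the sketch below. The function name `naive_domain` and the example.com / example.co.uk host names are illustrative only and are not taken from the question.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Keep only the last two dot-separated labels of the host."""
    host = urlparse(url).netloc
    return ".".join(host.split(".")[-2:])

print(naive_domain("http://www.example.com"))    # example.com
print(naive_domain("http://www.example.co.uk"))  # co.uk, not example.co.uk
```

The second call shows the failure: the string alone gives no hint that co.uk is a suffix rather than a registered name.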


Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf-8") as tldFile:
    tlds = set(line.strip() for line in tldFile if line[0] not in "/\n")

def getDomain(url, tlds):
    urlElements = urlparse(url)[1].split('.')
    # urlElements = ["abcde","co","uk"]

    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

        # match tlds:
        if exceptionCandidate in tlds:
            # an exception rule means the candidate itself is the domain
            return ".".join(urlElements[i:])
        if candidate in tlds or wildcardCandidate in tlds:
            # the candidate is a suffix, so keep one more label to its left
            return ".".join(urlElements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(getDomain("http://abcde.co.uk", tlds))

results in: abcde.co.uk

I'd be grateful if someone could tell me which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the lastIElements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?
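One possible tidier shape for the same logic is sketched below. It is not a definitive rewrite: the names `get_domain` and `SAMPLE_RULES` are made up for illustration, the tiny rule set stands in for the full list file, and raising LookupError rather than ValueError is only one option (ValueError is also defensible).

```python
from urllib.parse import urlparse

# Tiny illustrative rule set; a real one would be loaded from the full list.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

def get_domain(url, rules):
    """Return the registrable domain of url according to rules."""
    labels = urlparse(url).netloc.split(".")
    # Walk the suffixes from longest to shortest: a.b.c, then b.c, then c.
    for start in range(len(labels)):
        suffix = labels[start:]
        candidate = ".".join(suffix)
        wildcard = ".".join(["*"] + suffix[1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            # Candidate is a suffix: keep one more label to its left.
            return ".".join(labels[max(start - 1, 0):])
    raise LookupError("no matching suffix rule for %r" % url)

print(get_domain("http://www.abcde.co.uk", SAMPLE_RULES))  # abcde.co.uk
```

Iterating over a non-negative start index avoids the negative-range bookkeeping, and keeping the rules in a set makes each membership test cheap.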

No, there is no "intrinsic" way of knowing that (for example) a three-label name ending in .it is a subdomain (because Italy's registrar DOES sell the two-label domain it sits under), while a three-label name ending in co.uk is not (because the UK's registrar does NOT sell domains such as co.uk, only names registered beneath it).

You'll have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There is no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
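As a rough sketch of what such an auxiliary table could look like in code, the rules can be read into a set, skipping blank lines and `//` comments. The sample text and the name `load_rules` are assumptions for illustration; a real table would come from the full list file or an online source.

```python
import io

# Small stand-in for the real list; contents are illustrative only.
SAMPLE_LIST = """\
// comment lines start with two slashes
uk
co.uk

*.ck
!www.ck
"""

def load_rules(lines):
    """Collect suffix rules, skipping blank lines and // comments."""
    rules = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules

rules = load_rules(io.StringIO(SAMPLE_LIST))
print("co.uk" in rules)  # True
```

Because `load_rules` accepts any iterable of lines, the same function works on an open file object, so refreshing the table only means re-reading the source.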

Here is a great Python module someone wrote to solve this problem after seeing this question: