Does anybody know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example,
docs.google.com doesn't get matched, but
google.com does. However, this gets complicated with things like
.co.uk CCTLDs. Does anyone know a solution? Thanks in advance.
EDIT: I've realized I also have to deal with multiple subdomains, like
john.doe.google.co.uk. I need a solution for this now too :P.
It sounds like you are looking for the information available through the Public Suffix List project.
A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.
There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the Public Suffix List, or find an existing library that already does so.
I'd probably solve this by getting a complete list of TLDs and using it to build the regex. For example (in Ruby, sorry, I'm not a Pythonista yet):
tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?(#{tld_alternation})$/i
I don't think it's possible to properly distinguish between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e. you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
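Since the question was about Python, here is a rough Python equivalent of the same idea. It is only a sketch: the tlds list is a tiny illustrative subset that I made up for the example, and a real list would have to come from somewhere like the Public Suffix List.

```python
import re

# Illustrative subset only; a real list would come from the Public Suffix List.
tlds = ["com", "co.uk", "eu", "org"]

# Escape each suffix so "." is matched literally, and try longer suffixes first.
tld_alternation = "|".join(
    re.escape(t) for t in sorted(tlds, key=len, reverse=True)
)

# One label (no dots allowed in it), a dot, then one of the known suffixes.
domain_re = re.compile(
    r"^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:" + tld_alternation + r")$",
    re.IGNORECASE,
)

for name in ["google.com", "docs.google.com", "google.co.uk"]:
    print(name, bool(domain_re.match(name)))
```

Because the label part cannot contain a dot, docs.google.com is rejected while google.com and google.co.uk are accepted. The caveat above still applies: if the list contained both "uk" and "co.uk", then co.uk itself would match as the label "co" in front of "uk".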
Based on your comment above, I'm going to reinterpret the question: instead of building a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include base domains, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the Public Suffix List is a great place to start. Let's assume you've parsed the list into a Python array called
suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
suffixes = parse_suffix_list("suffix_list.txt")
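In case it helps, here is one way parse_suffix_list could look. This is a sketch under a few assumptions: you have saved a local copy of the list (the file name is whatever you chose), the file uses the list's usual layout of one rule per line with "//" comments, and wildcard ("*.") and exception ("!") rules are simply skipped to keep things short. The helper parse_suffix_lines is a name I made up, and it returns a set rather than an array so lookups are cheap.

```python
def parse_suffix_lines(lines):
    """Collect plain suffix rules from lines in Public Suffix List format."""
    suffixes = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and "//" comments.
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace.
        rule = line.split()[0].lower()
        # Simplification: ignore wildcard ("*.") and exception ("!") rules.
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        suffixes.add(rule)
    return suffixes


def parse_suffix_list(path):
    """Read a local copy of the list from the given path."""
    with open(path, encoding="utf-8") as f:
        return parse_suffix_lines(f)
```

Skipping the wildcard and exception rules means a handful of suffixes will be handled incorrectly, so treat this as a starting point rather than a complete parser.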
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
def is_domain(d):
    for suffix in suffixes:
        # Suffixes are assumed to be stored without a leading dot, e.g. "co.uk"
        if d.endswith("." + suffix):
            # Get the base domain name without the suffix
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain
            if base_name and "." not in base_name:
                return True
    # If we get here, no matches were found
    return False
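One thing to watch for: if the list contains both "uk" and "co.uk", the loop above will accept co.uk itself, because "co" looks like a name in front of "uk". A possible variation, sketched below, is to look for the longest matching suffix first. The names registrable_domain and is_base_domain are my own, and the suffix set is a small illustrative subset. As a bonus, registrable_domain also covers the EDIT in the question, since it reduces john.doe.google.co.uk to google.co.uk.

```python
def registrable_domain(name, suffixes):
    """Return the longest matching suffix plus one label, or None."""
    labels = name.lower().rstrip(".").split(".")
    # Start from the whole name so the longest suffix is found first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the name is itself a public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def is_base_domain(name, suffixes):
    """True when the name is exactly one label in front of a known suffix."""
    return registrable_domain(name, suffixes) == name.lower().rstrip(".")


suffixes = {"com", "uk", "co.uk"}  # illustrative subset
names = ["google.com", "docs.google.com", "google.co.uk",
         "john.doe.google.co.uk", "co.uk"]
print([n for n in names if is_base_domain(n, suffixes)])
# prints ['google.com', 'google.co.uk']
```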