I'm trying to use the following regular expression to extract the domain name from some text, but it produces nothing. What's wrong with it? I'm not sure whether it's appropriate to ask this kind of "fix my code" question; maybe I should read more first, but I'd like to save some time. Thanks.

import re

pat_url = re.compile(r'''
            (?:https?://)*
            (?:[\w]+[\-\w]+[.])*
            (?P<domain>[\w\-]*[\w.](com|net)([.](cn|jp|us))*[/]*)
            ''')

print re.findall(pat_url,"http://www.google.com/abcde")

I want the output to be google.com

Don't use a regex for this. Use the urlparse standard library module instead. It's far more straightforward and easier to read and maintain.

http://docs.python.org/library/urlparse.html
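As a rough sketch of that suggestion: in Python 3 the same parser lives in urllib.parse (in Python 2 it is the urlparse module). The helper name host_without_www is made up for illustration, and dropping a leading "www." is a naive assumption about what "domain" means here; it does not handle other subdomains or public suffixes.

```python
# Python 3 import; on Python 2 use: from urlparse import urlparse
from urllib.parse import urlparse


def host_without_www(url):
    """Return the host part of a URL, minus a leading 'www.' (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # Naive assumption: only a literal "www." prefix is stripped.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(host_without_www("http://www.google.com/abcde"))  # google.com
```

Note that urlparse only recognises the host when the URL has a scheme (or at least a leading "//"), so bare text such as "www.google.com/abcde" would need a scheme prepended first.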

There are three problems. The first is that you're missing the re.VERBOSE flag in the call to re.compile(); without it, the whitespace and newlines inside the pattern are matched literally. The second is that you should use the methods on the returned object. The third is that you're using a regular expression where a suitable parser already exists in the stdlib.
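If you do want to keep the regex, the first two points might look something like the sketch below. It assumes the pattern from the question with two changes of my own: the inner groups are made non-capturing and the trailing [/]* is moved outside the domain group, so the captured text does not end in a slash. The function name extract_domain is hypothetical.

```python
import re

# re.VERBOSE makes the regex engine ignore the layout whitespace and
# allows comments inside the pattern.
pat_url = re.compile(r'''
    (?:https?://)*              # optional scheme
    (?:[\w]+[\-\w]+[.])*        # leading labels such as "www."
    (?P<domain>
        [\w\-]*[\w.]
        (?:com|net)             # only these two endings, as in the question
        (?:[.](?:cn|jp|us))*    # optional country suffix
    )
    [/]*                        # trailing slash kept out of the group
    ''', re.VERBOSE)


def extract_domain(text):
    """Return the first domain found in text, or None (hypothetical helper)."""
    # Call the method on the compiled pattern object itself.
    match = pat_url.search(text)
    return match.group('domain') if match else None


print(extract_domain("http://www.google.com/abcde"))  # google.com
```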

This is the only correct way to parse a URL with a regex:

It's in C++, but you'll find it trivial to convert to Python by removing the additional \ (the escaped backslash). There is an enum for the captures.

Also see RFC 3986 as the original source for the regexp.

static const char* const url_regex[] = {
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?[^#]*)?(#.*)?",
};

enum {
    SCHEME_CLN = 1,
    SCHEME = 2,
    DSLASH_AUTH = 3,
    AUTHORITY = 4,
    PATH    = 5,
    QUERY   = 6,
    FRAGMENT = 7
};
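For reference, a Python version might look roughly like this. It is a sketch that assumes the pattern is the seven-group variant of the expression in RFC 3986, Appendix B (the RFC's own version has nine groups, with the query and fragment captured separately from their "?" and "#"). The names URL_RE and split_url are made up for illustration; the group constants mirror the enum.

```python
import re

# Seven-group variant of the RFC 3986 Appendix B expression (assumed).
# In a Python raw string the "?" needs only a single backslash.
URL_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?')

# Group numbers, mirroring the enum.
SCHEME_CLN, SCHEME, DSLASH_AUTH, AUTHORITY, PATH, QUERY, FRAGMENT = range(1, 8)


def split_url(url):
    """Split a URL into its components by group number (hypothetical helper)."""
    match = URL_RE.match(url)
    return {
        'scheme': match.group(SCHEME),
        'authority': match.group(AUTHORITY),
        'path': match.group(PATH),
        'query': match.group(QUERY),        # includes the leading "?"
        'fragment': match.group(FRAGMENT),  # includes the leading "#"
    }


parts = split_url("http://www.google.com/abcde?x=1#top")
print(parts['authority'])  # www.google.com
```

Every part of the pattern is optional, so the match always succeeds; components that are absent come back as None (or an empty string for the path).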



I don't think this is really about "regression", is it? It's about regular expressions, which is a completely different thing. Perhaps someone should fix the tagging.