I need help building a regular expression that can correctly match a URL inside free text.

  • scheme
    • One of the following: ftp, http, https (is ftps a protocol?)
  • optional user (and optional pass)
  • host (with support for IDNs)
    • support for www and sub-domain(s) (with support for IDNs)
    • basic filtering of TLDs ([a-zA-Z] is sufficient, I believe)
  • optional port number
  • path (optional, with support for Unicode chars)
  • query (optional, with support for Unicode chars)
  • fragment (optional, with support for Unicode chars)
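For illustration, the components listed above could be sketched as one verbose Python pattern with a named group per part. This is only a rough sketch under several assumptions: the name `URL_RE`, the helper `parse_first_url`, the group names, the `{2,}` length bound on the TLD, and the choice to treat any Unicode letter or digit as a valid host character are all mine, not something taken from a specification.

```python
import re

# Hypothetical sketch: one named group per URL component from the list above.
# Python 3 str patterns are Unicode-aware, so \w and [^\W_] cover IDN letters.
URL_RE = re.compile(
    r"""
    (?P<scheme>https?|ftps?)://                                   # scheme
    (?:(?P<user>[^\s:@/]+)(?::(?P<password>[^\s@/]*))?@)?         # optional user[:pass]@
    (?P<host>
        (?:[^\W_](?:(?:[^\W_]|-)*[^\W_])?\.)+                     # www / sub-domain labels
        [a-zA-Z]{2,}                                              # TLD (length bound is an assumption)
    )
    (?::(?P<port>\d{1,5}))?                                       # optional port
    (?P<path>/[^\s?\#]*)?                                         # optional path
    (?:\?(?P<query>[^\s\#]*))?                                    # optional query
    (?:\#(?P<fragment>\S*))?                                      # optional fragment
    """,
    re.VERBOSE,
)


def parse_first_url(text):
    """Return a dict of URL components for the first match in text, or None."""
    match = URL_RE.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_first_url("see http://user:pw@en.wikipedia.org:8080/wiki/Ünïcode?q=1#top now"))
```

Note that this sketch does not trim trailing punctuation (a sentence-ending period right after a path would be captured as part of the path), so it would still need refinement against real test data.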

Here's what I could find out about subdomains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org is a subdomain of the org domain, and en.wikipedia.org is a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.
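Those limits are awkward to express inside a single regex, so one option is to check them separately after a candidate host has been matched. Below is a minimal sketch, assuming the limits quoted above (127 levels, 63 characters per label, 255 characters in total) and assuming they should be applied to the ASCII (Punycode) form of an IDN. The function name `hostname_within_dns_limits` and the constants are hypothetical, and it relies on Python's built-in `idna` codec.

```python
MAX_LABEL_LENGTH = 63   # characters per DNS label, as quoted above
MAX_NAME_LENGTH = 255   # total length, as quoted above (some sources cite 253 for the textual form)
MAX_LEVELS = 127        # maximum number of labels, as quoted above


def hostname_within_dns_limits(hostname: str) -> bool:
    """Check a host name against the length limits quoted above."""
    # A single trailing dot (fully qualified form) is not counted as a label.
    if hostname.endswith("."):
        hostname = hostname[:-1]
    try:
        # Convert IDN labels to their ASCII form; the codec rejects
        # empty or over-long labels by raising UnicodeError.
        ascii_name = hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    labels = ascii_name.split(".")
    if len(ascii_name) > MAX_NAME_LENGTH or len(labels) > MAX_LEVELS:
        return False
    return all(1 <= len(label) <= MAX_LABEL_LENGTH for label in labels)
```

For example, `hostname_within_dns_limits("en.wikipedia.org")` passes, while a name with a 64-character label does not.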

Regarding the domain name itself, I could not find any reliable source, but I think the regular expression for non-IDNs (I am not sure how to write an IDN-compatible version) is something like:


Can someone help me with this regular expression or point me in a good direction?

John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

Which apparently does OK with Unicode-containing URLs, too. You'd need to make a slight modification to it to get the rest of what you are looking for (the scheme, username, password, etc.). Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
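If you want to try Gruber's pattern in Python, note that Python's `re` module does not support the POSIX `[:punct:]` class, so it has to be swapped for something else. Here is a rough adaptation that substitutes the ASCII punctuation from `string.punctuation`. That substitution is my assumption and is narrower than a Unicode-aware punctuation class; `GRUBER_RE` and `find_urls` are names made up for the example.

```python
import re
import string

# ASCII punctuation, escaped so it is safe inside a character class.
# This stands in for POSIX [:punct:], which Python's re does not support.
PUNCT = re.escape(string.punctuation)

GRUBER_RE = re.compile(
    r"\b(([\w-]+://?|www[.])"                          # scheme:// (or scheme:/) or a bare www.
    r"[^\s()<>]+"                                      # run of non-space, non-bracket characters
    r"(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"        # end on (balanced parens), non-punctuation, or /
)


def find_urls(text):
    """Return every substring of text that the adapted pattern matches."""
    return [match.group(0) for match in GRUBER_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Visit http://en.wikipedia.org/wiki/Regex_(disambiguation), or www.example.com."
    print(find_urls(sample))
```

The last alternation is what keeps a trailing comma or sentence-ending period out of the match while still allowing a URL to end in a parenthesized word.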

If you require a protocol and aren't too worried about false positives, by far the easiest thing to do is match all non-whitespace characters around ://
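A minimal sketch of that idea in Python, where `LOOSE_URL_RE` and `find_loose_urls` are names invented for the example:

```python
import re

# One or more non-whitespace characters on each side of "://".
LOOSE_URL_RE = re.compile(r"\S+://\S+")


def find_loose_urls(text):
    """Return every whitespace-delimited token in text that contains '://'."""
    return LOOSE_URL_RE.findall(text)


if __name__ == "__main__":
    print(find_loose_urls("docs at https://example.com/a?b=1#c and ftp://host/file ok"))
```

As expected for something this loose, surrounding punctuation is swallowed too: `(http://x.y).` is returned whole, brackets and period included.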

This will get you most of the way there. If you need it more refined, please provide test data.