I am trying to write (or just find an existing) PHP function that can take a URL and extract the domain. The trick is, it must hold up under the weight of odd-looking domains like:

www.champa.kku.ac.th

Looking at that one myself with human eyes, I still guessed it wrong: I thought the domain would be kku.ac.th, but that gives a DNS error when visiting.

So does anyone know of a good way to reliably extract the domain from a URL:

http://site.com/hello.php

http://site.com.uk/hello.php

http://subdomain.site.com/hello.php

http://subdomain.site.com.uk/hello.php

http://www.champa.kku.ac.th/hello.php // and the main one I could not work out

Maybe the parse_url function could help here?


In your situation, with those URLs, the following piece of code:

echo parse_url('http://site.com/hello.php', PHP_URL_HOST) . '<br />';

echo parse_url('http://site.com.uk/hello.php', PHP_URL_HOST) . '<br />';

echo parse_url('http://subdomain.site.com/hello.php', PHP_URL_HOST) . '<br />';

echo parse_url('http://subdomain.site.com.uk/hello.php', PHP_URL_HOST) . '<br />';

echo parse_url('http://www.champa.kku.ac.th/hello.php', PHP_URL_HOST) . '<br />';

Gives this output:

site.com

site.com.uk

subdomain.site.com

subdomain.site.com.uk

www.champa.kku.ac.th

PHP has the parse_url() function, which can help you do the basic splitting into protocol, host, port, and so on.
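As a rough sketch of that basic splitting, the snippet below calls parse_url() without a component constant, so it returns an associative array. The URL is made up for illustration (the port, query string and fragment are assumptions, not part of the question).

```php
<?php
// Illustrative URL only: the port, query and fragment are invented here
// so that every component shows up in the result.
$url = 'http://www.champa.kku.ac.th:8080/dir1/file.php?option1=a#top';

// Without a second argument, parse_url() returns an associative array.
// Components that do not occur in the URL are simply absent from it.
$parts = parse_url($url);

echo $parts['scheme'], "\n";   // http
echo $parts['host'], "\n";     // www.champa.kku.ac.th
echo $parts['port'], "\n";     // 8080 (returned as an integer)
echo $parts['path'], "\n";     // /dir1/file.php
echo $parts['query'], "\n";    // option1=a
echo $parts['fragment'], "\n"; // top
```

Note that the host comes back whole: parse_url() does not know where the registered domain ends and the subdomains begin.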

As for extracting the "right" domain in ambiguous cases, this is very hard to tell, because "two-part TLDs" are sometimes defined by the TLD authority (e.g. in the UK) and sometimes run by private businesses (e.g. .uk.com). I think you won't get around maintaining a list of top-level domains that have two parts, like

  • .co.uk
  • .ac.uk
  • .ac.th

These would then be treated like TLDs (top-level domains), absorbing the second part.

This is the best way of reliably telling apart "two-part TLDs" like .co.uk, as in server1.ibm.co.uk (where the two-part .co.uk must be removed to find the domain itself), from regular subdomains like server1.ibm.com (where only .com must be removed).
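The list-based approach could be sketched roughly as follows. The function name extract_domain() and the three-entry list are hypothetical and only for illustration; a real list would need many more entries, and the sketch assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper (not a built-in): returns the domain of a URL,
// treating every entry of $twoPartSuffixes as if it were a single TLD.
function extract_domain(string $url, array $twoPartSuffixes): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no host component could be parsed
    }

    $labels = explode('.', strtolower($host));
    $count  = count($labels);
    if ($count < 2) {
        return $labels[0]; // single-label host such as "localhost"
    }

    // Does the host end in one of the known two-part suffixes?
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $keep    = in_array($lastTwo, $twoPartSuffixes, true) ? 3 : 2;

    // Keep the suffix plus one label to its left.
    return implode('.', array_slice($labels, -$keep));
}

// Deliberately incomplete example list, stored without the leading dot.
$suffixes = ['co.uk', 'ac.uk', 'ac.th'];

echo extract_domain('http://server1.ibm.co.uk/hello.php', $suffixes), "\n";    // ibm.co.uk
echo extract_domain('http://server1.ibm.com/hello.php', $suffixes), "\n";      // ibm.com
echo extract_domain('http://www.champa.kku.ac.th/hello.php', $suffixes), "\n"; // kku.ac.th
```

The result is only as good as the list: a suffix missing from it is silently treated as an ordinary one-part TLD. Also, the extracted domain does not have to resolve in DNS on its own, which may be why visiting kku.ac.th directly gave an error.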

A good starting point for getting a list of many important "two-part TLDs" may be the domain search at speednames.com (choose "all" under countries).

With Ruby you can use the Domainatrix library / gem

http://www.pauldix.net/2009/12/parse-domains-from-urls-easily-with-domainatrix.html


require 'rubygems'

require 'domainatrix'

s = 'http://www.champa.kku.ac.th/dir1/dir2/file?option1&option2'

url = Domainatrix.parse(s)

url.domain

=> "kku"

Handy tool! :-)