I'm looking for a way to automatically determine the natural language used by a web page, given its URL.
In Python, a function like:
def LanguageUsed (url): #stuff
Which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.)
Summary of Results: I have a reasonable solution working in Python using the oice.langdet package from PyPI. It does a decent job of discriminating English vs. non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also, oice.langdet is GPL-licensed.
For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.
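Since the page has to be fetched and reduced to text before any detector can look at it, here is a minimal sketch of that step using only the standard library. It assumes Python 3 (urllib.request rather than the older urllib module); the names html_to_text and fetch_text are made up for illustration and are not part of oice.langdet.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url, timeout=10):
    """Download a page and return its visible text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

The string returned by fetch_text(url) is what you would then hand to whichever language detector you settle on.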
Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.
This is usually accomplished by using character n-gram models. You can find here a state-of-the-art language identifier for Java. If you need help converting it to Python, just ask. Hope it helps.
There's nothing about the URL itself that will indicate language.
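To make the n-gram idea concrete, here is a toy sketch in Python. It is not a port of the Java identifier mentioned above: it builds character trigram counts from a few hand-written sample sentences (far too little data for real use) and picks the language whose profile has the highest cosine similarity to the input. The sample texts and the names trigram_profile and guess_by_trigrams are assumptions made for illustration.

```python
import math
from collections import Counter

# Tiny illustrative training texts; a real model needs much larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest with all of the other animals",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis "
          "il court dans la foret avec tous les autres animaux",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er mit allen anderen tieren durch den wald",
}


def trigram_profile(text):
    """Count overlapping character trigrams of the whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def guess_by_trigrams(text):
    """Return the best-matching language code, or None if nothing matches."""
    profile = trigram_profile(text)
    scores = {lang: cosine(profile, ref) for lang, ref in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With realistic training corpora the same structure scales to many languages; only SAMPLES would need to change.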
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents send an Accept-Language header with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
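To see how the language header affects what you get back, here is a small sketch that sets Accept-Language explicitly when fetching a page. It assumes Python 3's urllib.request; build_request and fetch_declared_language are hypothetical helper names, and reading the Content-Language response header is an extra assumption on my part (many servers do not send it).

```python
from urllib.request import Request, urlopen


def build_request(url, accept_language):
    """Create a request that asks the server for a specific language."""
    return Request(url, headers={"Accept-Language": accept_language})


def fetch_declared_language(url, accept_language="en", timeout=10):
    """Return (Content-Language header or None, raw body bytes)."""
    request = build_request(url, accept_language)
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Language"), response.read()
```

Calling fetch_declared_language with two different accept_language values and comparing the bodies is one way to check whether a given site negotiates content by language.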
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
You might want to try n-gram based detection.
Edit: The TextCat competitors page provides some interesting links too.
Edit2: I wonder if making a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...
nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language sufficiently well for your purposes). I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it may in fact have it), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, wordset, etc., according to the rules for each language.
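As a rough illustration of the "wordset" check, here is a sketch that counts how many words of the text appear in a small list of very common words per language. It deliberately does not use NLTK: the word lists below are tiny, hand-picked assumptions, and guess_by_stopwords is a made-up name. A fuller version could swap in larger stopword lists from any source.

```python
import re

# Hand-picked common words per language; illustrative only, not exhaustive.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "with", "for"},
    "fr": {"le", "la", "les", "et", "de", "des", "un", "une", "est", "dans"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "den"},
    "es": {"el", "la", "los", "y", "de", "que", "en", "es", "un", "una"},
}


def guess_by_stopwords(text):
    """Return the language whose common words occur most often, or None."""
    # Keep runs of letters only (Unicode-aware), dropping digits and underscores.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in common for word in words)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

This is fast but crude: short texts, or languages sharing many function words, will be misclassified, so it works best as a first pass before a heavier method.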