Evolutionary Tree of languages

This language evolutionary tree is generated fully automatically on the basis of 18 basic vocabulary items and a simplified sound correspondence point system. The farther left two branches are linked together, the more remotely related the languages are.
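To give a feel for what a point system over a short word list can look like, here is a toy sketch. It is not the scoring used for this tree, whose rules are not spelled out here: the sound classes, the point values (2 for identical consonants, 1 for the same class), the 0-100 scaling and the two invented word lists are all assumptions made for illustration.

```python
# Toy sketch only: classes, point values and scaling are assumptions,
# not the rules actually used to build the tree.
SOUND_CLASS = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K", "h": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "l": "L", "r": "R",
}


def consonant_points(c1, c2):
    """2 points for identical consonants, 1 for the same sound class, else 0."""
    if c1 == c2:
        return 2
    cls = SOUND_CLASS.get(c1)
    if cls is not None and cls == SOUND_CLASS.get(c2):
        return 1
    return 0


def word_points(skel_a, skel_b):
    """Compare two consonant skeletons position by position."""
    return sum(consonant_points(a, b) for a, b in zip(skel_a, skel_b))


def distance(words_a, words_b):
    """Scale the points earned over the shared meanings to a 0-100 distance."""
    earned = 0
    possible = 0
    for meaning in words_a.keys() & words_b.keys():
        a, b = words_a[meaning], words_b[meaning]
        earned += word_points(a, b)
        possible += 2 * min(len(a), len(b))
    if possible == 0:
        raise ValueError("no comparable words")
    return 100.0 * (1 - earned / possible)


# Hypothetical consonant skeletons for three meanings in two invented languages.
lang_x = {"two": "tv", "name": "nm", "night": "nkt"}
lang_y = {"two": "dv", "name": "nm", "night": "nt"}

print(distance(lang_x, lang_y))  # 25.0
```

Running this over every pair of languages would give the kind of distance matrix that a tree-building algorithm takes as input.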
 
Important note: All families and subfamilies with names in dark blue (= top-level families) and light blue (= subfamilies) are recognized by all scholars, so in these cases our automatic classification matches the mainstream view. The three macro families with names in green, and their branches (also green), are long-range relationships inferred by our system. The names in green refer to existing (but not unanimously recognized) hypotheses: Eurasiatic, Austro-Tai and Northern Caucasian. Another interesting result is the internal classification of the subfamilies within Indo-European: the order in which the subfamilies split is still disputed today, and our results match most existing hypotheses.

Language evolutionary tree

The topology of the tree reproduces the language families and sub-families which are known from other, more complex methods, which is a strong validation of our methodology. Perhaps most important, the system also infers long-range relationships between established macro families which are not yet recognized by all specialists but for which hypotheses already exist. These "super families" have been proposed by numerous scholars: Eurasiatic for the link between Indo-European, Uralic, Turkic and Mongolic, Austro-Tai for the link between Austronesian and Tai-Kadai, and North Caucasian for the link between Northwest and Northeast Caucasian.

Interactive language trees with diachronic interpolation: Indo-European language tree, Afro-Asiatic language tree


Notes from the tree:
 
(1) Since the system is designed to react to signals that link languages over thousands of years, the sub-classifications within the sub-families are sometimes not perfect, especially if the languages are very close to each other. In a few cases, languages get classified in the right sub-family but at the wrong place within it. Tree models are not perfect and cannot always represent the gradual evolution of an expanding cluster of dialects. Alternative models like the wave model may deliver better results in the future.
(2) We have excluded all ancient languages which have living descendants, like Sanskrit, Avestan, Latin and Ancient Greek, as well as languages from the Middle Ages, as they bias the results in the tree: their evolution stops at a certain date, but they are compared with other languages as if they were contemporary. One effect of this is that Sanskrit, which can in some ways be regarded as a near proto-language for the Indo-Aryan family, would get classified within the Iranian family, in the same area as Avestan, itself one of the oldest Iranian languages. This is because Sanskrit is closer to its relative of thousands of years ago than to its descendants (Avestan to Vedic Sanskrit: distance = 37.6; Hindi to Vedic Sanskrit: distance = 54.4). This is a limitation of the tree representation.
(3) The Uralic branch has strong links both to the Turkic/Mongolic and to the Indo-European ones, but with a low statistical confidence. The "Uralo-Altaic" hypothesis used to be widely accepted. Today, many scholars reject this classification, while others argue for a broader common origin shared by Uralic, Altaic, Indo-European and other families (see source). In eLinguistics, the relatedness signals between the Uralic and the Altaic languages are as strong as they are between Uralic and Indo-European. Pairwise comparisons like Finnish to Kazakh and Hungarian to Mongolian show very strong relatedness signals with a very low p-value (a low probability that the results are due to chance). The eLinguistics results alone cannot confirm the Eurasiatic hypothesis, but they bring interesting facts to the discussion of this issue.
The Tungusic family is isolated (the branch linking it to the Altaic family is too far left, in the "statistical noise area"), although quite strong relationships appear in queries for the genetic proximity between Oroqen and both Kalmyk and Mongolian.
(4) The Altaic family - if considered as linking only the Mongolic and Turkic subfamilies - is reflected here with a very strong statistical significance.
(5) The Afroasiatic classification reflected in eLinguistics.net matches the widely accepted views. All branches (with the exception of Omotic) are very stable. The Cushitic and Omotic sub-families are linked to the Semitic, Egyptian (ancient Egyptian and Coptic), Berber and Chadic ones (Afroasiatic macro family). However, this connection is not completely stable. Two Omotic languages (Wolaytta and Gamo) do not connect to any macro-family, although they clearly have the Omotic-Dizoid languages (Dizi, Nayi and Sheko) as their nearest neighbours. There is no consensus among linguists regarding the broader status of the Omotic family, and Glottolog does not classify it as a single group but as 4 separate ones. Our system identifies only the Dizoid branch of Omotic as having a clear genetic relationship with the Cushitic languages.
(6) In the Austronesian family, 4 languages do not get classified by our system. They are all Southern Oceanic languages from New Caledonia: Ajie, Drehu, Nengone and Paicî. 4 other Austronesian languages get classified at the right place in the tree, but their position is unstable: it changes when other languages are removed or added, and when the UPGMA phylogenetic algorithm is used instead of Neighbour Joining. These languages are Nauruan, Palauan, Chamorro (Guam) and Tetum. Analysing their nearest neighbours in the distance values confirms their position in the present tree. Otherwise, the Austronesian classification inferred by our system perfectly matches current mainstream views.
(7) In the Niger-Congo macro-family, 2 languages are identified as members but do not classify in the right sub-family in our system: Serer and Temne. We do not represent them in the tree. The Kwa languages (Baoule and Twi) and Mossi should classify further down within the Volta-Congo subfamily.
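The Sanskrit case in note (2) can be checked with the two distances quoted there. The sketch below only looks up the nearest neighbour. The dictionary layout and the helper name are assumptions, and the family labels simply restate what the note says about Avestan and Hindi.

```python
# Distances to Vedic Sanskrit as quoted in note (2).
distances_to_sanskrit = {"Avestan": 37.6, "Hindi": 54.4}

# Family labels as described in the note (Hindi as an Indo-Aryan descendant).
family = {"Avestan": "Iranian", "Hindi": "Indo-Aryan"}


def nearest_neighbour(distances):
    """Return the language with the smallest distance (hypothetical helper)."""
    return min(distances, key=distances.get)


nearest = nearest_neighbour(distances_to_sanskrit)
gap = distances_to_sanskrit["Hindi"] - distances_to_sanskrit["Avestan"]

print(nearest, family[nearest], round(gap, 1))  # Avestan Iranian 16.8
```

Because a tree attaches each language next to its closest relatives, a gap of this size is enough to pull Sanskrit towards the Iranian branch rather than towards its own descendants.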
 
Other remarks:

10 languages included in our system do not get classified within a macro-family: Ainu, Basque, Burushaski, Elamite, Georgian, Japanese, Kanuri, Ket, Korean and Sumerian. These languages are isolated and are not represented in the tree. For the same reason, we do not include the languages of the Americas in this tree list: our system does not infer more than what is already known, namely a large number of small, isolated groups of languages. eLinguistics.net identifies the established smaller groups, but representing them in the tree list would make it much longer and less clear.

Languages with too few available words (Etruscan, Hurrian, Mycenaean, Phrygian and Urartian) are also excluded from the tree, as they bias the results with pairwise comparisons based on only 1 to 5 words. For Oscan only about a dozen words are available, but it ranks well in the tree without any negative impact on branch stability. Lycian and Umbrian also have too few words available (8), but they have clear nearest neighbours in the list (Lycian->Luwian (Anatolian) and Umbrian->Oscan (Sabellic)), so we place them manually in the tree.

The farther left the branches are linked, the less reliable the classification is.

The clustering technique used to generate the tree from the distance matrix is Neighbor-Joining. Another tree technique, UPGMA (Unweighted Pair Group Method with Arithmetic Mean), gives very similar results.
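As an illustration of how a tree is derived from a distance matrix, here is a minimal sketch of UPGMA, the simpler of the two techniques. It is written in Python for readability and is not the code used to produce the tree. The four placeholder languages A to D and their distances are hypothetical, and the function name and data layout are assumptions.

```python
from itertools import combinations


def upgma(dist):
    """Build a UPGMA tree from {(a, b): distance}.

    Returns the tree as nested tuples and the list of (cluster, height) merges.
    """
    leaves = sorted({name for pair in dist for name in pair})
    size = {leaf: 1 for leaf in leaves}  # current clusters -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(size) > 1:
        # Join the two clusters that are currently closest to each other.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        merges.append((merged, d[frozenset((a, b))] / 2))
        # Distance from the new cluster to every other one is the
        # size-weighted mean of the two old distances.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)), merges


# Hypothetical distances between four placeholder languages.
example = {
    ("A", "B"): 20, ("A", "C"): 60, ("A", "D"): 70,
    ("B", "C"): 60, ("B", "D"): 70, ("C", "D"): 30,
}

tree, merges = upgma(example)
print(tree)                    # (('A', 'B'), ('C', 'D'))
print([h for _, h in merges])  # [10.0, 15.0, 32.5]
```

Neighbor-Joining works on the same input but chooses which pair to join with a criterion that corrects for each cluster's average distance to all the others, so it does not assume that all languages change at the same rate. On clearly structured data the two methods tend to agree.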

The evolutionary trees are generated from the distance matrix in R, using the SigTree and ape packages.