Evolutionary Tree of Languages

This language evolutionary tree is generated fully automatically on the basis of 18 basic vocabulary items and a simplified sound correspondence point system.

Interactive language trees with diachronic interpolation are available for two families: Indo-European language tree, Afro-Asiatic language tree.


The computer-generated tree (raw data) covering all languages is described below.
The farther left two branches are joined, the more remote the languages are from each other. Distances on the tree have to be counted in both directions, so branches joined at level 35 on the scale correspond to a genetic distance of 70. The topology of this language evolutionary tree reproduces the language families and sub-families known from other, more complex methods, which is a strong sign that the methodology is sound.

[Figure: Language evolutionary tree]

Notes on the tree (observations about languages that the system does not place correctly):
 
(1) Since the system is designed to react to signals that link languages over thousands of years, the sub-classifications within the sub-families are sometimes not exact, especially when the languages are very close to each other. Examples on the tree are Croatian/Serbian, Russian and Tahitian. These languages are classified in the right sub-family, but at the wrong place within it.
(2) Sanskrit can to some extent be regarded as close to the proto-language of the Indo-Aryan family. It is classified within the Iranian family, in the same area as Avestan, itself one of the oldest Iranian languages. This is because Sanskrit is closer to its relative of thousands of years ago than to its own descendants (Avestan to Vedic Sanskrit: distance = 37.6; Hindi to Vedic Sanskrit: distance = 54.4). This is a limitation of a tree representation.
(3) The Uralic branch has strong links both to the Altaic and to the Indo-European branches, but the statistical significance of these links is not yet strong enough. The "Uralo-Altaic" hypothesis was widely accepted in the past. Today, many scholars reject this classification, while others argue for a broader grouping in which Uralic, Altaic, Indo-European and other families share a common origin (see source). In eLinguistics, the relatedness signals between the Uralic and the Altaic languages are quite strong: pairwise comparisons such as Finnish to Kazakh and Hungarian to Mongolian show very strong relatedness signals with a very low p-value (a low probability that the results are due to chance).
The eLinguistics results alone will not restore the Uralo-Altaic hypothesis, but they contribute interesting facts to the discussion of this issue.
The Tungusic family is isolated (the branch linking it to the Altaic family is too far left, in the "statistical noise area"), although quite strong relationships appear in queries for the genetic proximity of Oroqen to Kalmyk and to Mongolian.
(4) The Altaic family - if considered as linking only the Mongolic and Turkic subfamilies - is reflected here with a very strong statistical significance.
(5) The Cushitic and Omotic sub-families are linked to the Semitic, Egyptian (ancient Egyptian and Coptic), Berber and Chadic sub-families (the Afroasiatic macro-family). The Afroasiatic classification reflected in eLinguistics.net is very close to the classification by Christopher Ehret (1995), Reconstructing Proto-Afroasiatic (Proto-Afrasian): Vowels, Tone, Consonants, and Vocabulary. University of California Press. ISBN 0-520-09799-8.
 
Other remarks:
 
- Latin, as the proto-language of the Romance sub-family, should be linked more centrally in the Romance sub-tree.
- Sub-classifications within the sub-families are sometimes not exact when languages are very close to each other. Examples on the tree are Slovene and Russian.
- Armenian is classified at the very edge of the Indo-European family in this system. Armenian stands as one of the most isolated IE languages, which is also reflected here.
- Khowar is another language that does not find its right place in this system. It should be classified in the Dardic branch of Indo-Aryan.

Seven languages of the study are not classified within a macro-family: Ainu, Burushaski, Elamite, Georgian, Japanese, Korean and Sumerian. These languages are not represented in the tree. Creoles (French Creoles and Sranan) and languages with too few available words (Etruscan, Hurrian, Lycian, Mycenaean, Nenets, Phrygian, Umbrian and Urartian) are also excluded from the tree as they bias the results, with pairwise comparisons based on only 1 to 5 words. For better readability, languages like Provencal, Romansch, Ladin, Bavarian... are not shown in the tree.

Oscan is represented as the only member of the Sabellic family, although only a dozen words are available. It is nevertheless placed at its correct position in the tree.

The farther left two branches are joined, the less reliable the classification is. A value of 40 on the tree corresponds to a genetic distance of 80 in the comparison query. In most cases, pairwise comparisons with a genetic distance of 80 have a p-value of 0.1 or more, a level at which chance dominates the results.
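The relation between a level on the tree scale and the pairwise distance can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the threshold of 80 is taken from the rule of thumb above, not from a formal significance test.

```python
# Rule of thumb from the text: a pairwise distance of about 80 usually
# goes with a p-value of 0.1 or more (chance dominates).
NOISE_DISTANCE = 80.0  # assumed cut-off, for illustration only


def level_to_distance(level: float) -> float:
    """Convert a level on the tree scale to a pairwise genetic distance.

    Distances are counted in both directions, so the distance is twice
    the level at which two branches join.
    """
    return 2.0 * level


def in_noise_area(level: float, threshold: float = NOISE_DISTANCE) -> bool:
    """Return True when a join at this level falls in the 'noise area'."""
    return level_to_distance(level) >= threshold


print(level_to_distance(35))  # branches joined at level 35
print(in_noise_area(40))      # join at level 40, i.e. distance 80
```

With these definitions, a join at level 35 maps to a distance of 70 and stays below the cut-off, while a join at level 40 reaches it.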

All classifications are done automatically, without manual intervention. The position of a few languages inside the sub-families on the language evolutionary tree can be subject to discussion, but all languages get classified in the right sub-family.

The clustering technique used to generate the tree from the distance matrix is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The neighbor-joining technique gives very similar results.
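For readers who want to experiment with UPGMA on a distance matrix, here is a minimal sketch using SciPy's average-linkage clustering, which is the UPGMA method. It is not the procedure used for the tree on this page. The four languages A to D and their distances are invented for illustration, and the helper name upgma_merges is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Invented pairwise distances on a 0-100 scale (NOT eLinguistics data).
labels = ["A", "B", "C", "D"]
D = np.array([
    [0.0, 20.0, 60.0, 80.0],
    [20.0, 0.0, 60.0, 80.0],
    [60.0, 60.0, 0.0, 80.0],
    [80.0, 80.0, 80.0, 0.0],
])


def upgma_merges(dist_matrix, names):
    """Run UPGMA and list each merge as (members, distance, tree level).

    The tree level is half the inter-cluster distance, because distances
    on the tree are counted in both directions.
    """
    condensed = squareform(dist_matrix)       # square matrix -> condensed vector
    Z = linkage(condensed, method="average")  # "average" linkage is UPGMA
    n = len(names)
    clusters = {i: (name,) for i, name in enumerate(names)}
    merges = []
    for step, (a, b, dist, _size) in enumerate(Z):
        members = clusters[int(a)] + clusters[int(b)]
        clusters[n + step] = members          # SciPy numbers new clusters n, n+1, ...
        merges.append((tuple(sorted(members)), float(dist), float(dist) / 2.0))
    return merges


for members, dist, level in upgma_merges(D, labels):
    print(members, dist, level)
```

On this toy matrix, A and B join first at distance 20 (level 10), C joins them at distance 60 (level 30), and D joins last at distance 80 (level 40).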

The program used to generate evolutionary trees from the distance matrix is MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for bigger datasets (submitted). Kumar S, Stecher G, and Tamura K (2015) - www.megasoftware.net