Language evolution: a computerized model

Language evolution

Basic principles - "Linguistics & math, using methodologies like in biology...":
  -> Model of "language identity markers" to detect and quantify common origin.
  -> Right balance between signals' strength and resistance to bias by chance.

The idea is to make a representation that is accessible and easily understandable by anyone, so that a broader public can query language comparisons, navigate language trees and thus better understand the evolution of languages. The system uses the following software modules:

  1. The module for single comparisons of pairs of languages (available here for interactive "Language comparisons")
  2. The module for mass comparison of languages and their evolution. It is based on the software module in point 1, which is called thousands of times in a mass comparison. The output of the mass comparison is a distance matrix, written in a format (.mts) that allows it to be analysed by standard software used in biology and genetics for the representation of evolutionary trees.
  3. The standard software for the construction of the evolutionary trees itself.
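
As a rough illustration of how module 2 can build on module 1, the sketch below calls a pairwise comparison function once per unordered pair of languages and fills a symmetric distance matrix. The names `build_distance_matrix` and `format_lower_triangle`, the language labels and the distance values are all hypothetical, and the text layout is a generic lower-triangular one, not the actual .mts format.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def build_distance_matrix(
    languages: Sequence[str],
    compare: Callable[[str, str], float],
) -> List[List[float]]:
    """Call the pairwise comparison once per unordered pair and mirror the result."""
    index = {name: i for i, name in enumerate(languages)}
    n = len(languages)
    matrix = [[0.0] * n for _ in range(n)]  # distance of a language to itself is 0
    for a, b in combinations(languages, 2):
        d = compare(a, b)
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d
    return matrix


def format_lower_triangle(languages: Sequence[str], matrix: List[List[float]]) -> str:
    """Render a generic lower-triangular text layout (assumption: NOT the real .mts format)."""
    lines = [str(len(languages))]
    for i, name in enumerate(languages):
        row = " ".join(f"{matrix[i][j]:.1f}" for j in range(i))
        lines.append(f"{name} {row}".rstrip())
    return "\n".join(lines)


# Placeholder distances for three unnamed languages; the values are arbitrary.
TOY = {
    frozenset(("A", "B")): 30.0,
    frozenset(("A", "C")): 80.0,
    frozenset(("B", "C")): 75.0,
}
NAMES = ["A", "B", "C"]
MATRIX = build_distance_matrix(NAMES, lambda a, b: TOY[frozenset((a, b))])
print(format_lower_triangle(NAMES, MATRIX))
```

Because the comparison is symmetric, only one call per pair is needed; the second half of the matrix is filled by mirroring.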

The major steps in the methodology are:

  1. Choosing and encoding the language material for the comparisons. The challenge is to encode this material so that a computer can process it. The approach in this project is purely lexical: the material consists of words, whose consonants are encoded in a form the program can process. More details about this language material here.
  2. Determining a set of rules used to identify cognates. These rules are consonant relationships as known from sound change. More details here. The scoring system applies step by step: first from consonant to consonant, then from word to word, and finally for the whole language comparison. At each step the score ranges from 0 to 100 (100 being the highest possible number of points) and is averaged from one level to the next. The result is then inverted (100 minus "Result") to express a distance from 0 (same language) to 100 (completely different).
  3. Calculating the statistical context of all results. Along with cognate scoring, a statistical expected value and its standard deviation are calculated. Exposure to chance is a major issue in language evolution analysis: above certain distance values (approx. 72-75), the "cognates" detected by the system are due more to chance than to relatedness. Every comparison result is therefore checked against the statistical expectation: to count as a relatedness signal, it has to be equal to or lower than that expected value.
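
The level-by-level averaging in step 2 and the check against the expectation in step 3 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the per-consonant scores are assumed to be given already (the real system derives them from the consonant relationship rules), and the margin parameter `k` is an added assumption that defaults to 0.

```python
from statistics import mean
from typing import Sequence


def word_score(consonant_scores: Sequence[float]) -> float:
    """Average the consonant-level scores (each 0..100) into one word-level score."""
    return mean(consonant_scores)


def language_score(word_scores: Sequence[float]) -> float:
    """Average the word-level scores into one score for the whole comparison."""
    return mean(word_scores)


def distance(consonant_scores_per_word: Sequence[Sequence[float]]) -> float:
    """Invert the averaged score: 0 = same language, 100 = completely different."""
    words = [word_score(scores) for scores in consonant_scores_per_word]
    return 100.0 - language_score(words)


def at_or_below_expectation(dist: float, expected: float, std_dev: float, k: float = 0.0) -> bool:
    """True if the distance is at or below the chance expectation.

    `k` is a hypothetical safety margin in standard deviations (not from the source).
    """
    return dist <= expected - k * std_dev


# Two words: the first scores 100 and 50 on its consonants, the second 50 and 0.
example = [[100, 50], [50, 0]]
print(distance(example))  # word scores 75 and 25 -> language score 50 -> distance 50.0
```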

A key consideration for understanding the system, its strengths and its weaknesses is that it runs on a purely probabilistic basis: each pair of consonants in a comparison between words is assessed for a resemblance match according to the rules. With regard to language relatedness, some of these assessments are true positives (a relatedness signal is identified between related languages) or true negatives (no relatedness signal is identified between unrelated languages). But in many cases the system also wrongly identifies relatedness signals between unrelated languages (false positives -> chance resemblance) or fails to identify signals that do in fact point to relatedness (false negatives). In the millions of pairwise comparisons between syllables, the true positives and true negatives prevail when languages are related. The high degree of agreement of the inferred classification with existing "classical" ones shows that the system is strong when it comes to clustering mass results. On the other hand, the variance can turn out to be quite high for single pairwise results, which is the main weakness of the system.
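
One common way to see how much resemblance chance alone produces is a permutation baseline: shuffle one word list so that meanings no longer line up, score again, and repeat. The sketch below uses this swapped-in technique for illustration only; it is not the statistical calculation the system itself performs. The scorer `toy_word_score` (same first letter = match) is a deliberately crude placeholder, and all function names are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import Sequence, Tuple


def toy_word_score(a: str, b: str) -> float:
    """Placeholder scorer: 100 if both words start with the same letter, else 0."""
    return 100.0 if a[:1] == b[:1] else 0.0


def list_score(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Average the word-level scores over meaning-aligned word lists."""
    return mean(toy_word_score(a, b) for a, b in zip(words_a, words_b))


def chance_baseline(
    words_a: Sequence[str],
    words_b: Sequence[str],
    trials: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate the mean and standard deviation of the score when meanings are misaligned."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(words_b)
    samples = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        samples.append(list_score(words_a, shuffled))
    return mean(samples), pstdev(samples)


GERMAN = ["hand", "finger", "wasser", "nacht", "sonne", "mutter"]
ENGLISH = ["hand", "finger", "water", "night", "sun", "mother"]

observed = list_score(GERMAN, ENGLISH)
expected, spread = chance_baseline(GERMAN, ENGLISH)
print(observed, expected, spread)
```

With only six words, the shuffled scores vary strongly from trial to trial, which mirrors the point above: single comparisons on little material are noisy, while averages over many comparisons are more stable.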

Since the input material as described in 1) and 2) does not depend on the software itself, any other material and hypothesis can be processed by the system. Some blog visitors have already contributed suggestions and hypotheses of their own - see the discussion area for more details.

In this blog, you can query comparisons between 286 languages. All calculations, representations and detail sheets are generated interactively, directly from the raw data. The distance matrix is processed in a desktop application: the computer tries out the different analyses and performs millions of comparison tasks within seconds, without mistakes and without bias or lack of objectivity. Presently, the matrix is generated from 286 languages and holds the values of 40,755 comparisons of pairs of languages (286 × 285 / 2). To produce it, the system performs approximately 5.4 million consonant comparisons.
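
The number of language pairs in a full matrix follows from simple counting: n languages give n(n-1)/2 unordered pairs. A small sketch with hypothetical helper names:

```python
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


def pair_count(n_languages: int) -> int:
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return comb(n_languages, 2)


def language_pairs(languages: Sequence[str]) -> List[Tuple[str, str]]:
    """Enumerate every unordered pair exactly once."""
    return list(combinations(languages, 2))


print(pair_count(286))                          # 40755
print(language_pairs(["A", "B", "C", "D"]))     # 6 pairs for 4 languages
```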

The sample detail page gives you a clear idea of how the value for the genetic distance between two languages is calculated. It shows the details of the calculation for German and English, reflecting their evolution.

Here is the desktop application's interface screenshot:

The technology used is .NET, with C# as the programming language. The data is not stored in a database but in XML files, which makes it easy to port to any PC or web server. The data handling is written with LINQ; the advantage is that the data is encapsulated in objects (after mapping from the classical, relational representation stored in XML).
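
To illustrate the idea of mapping XML records to objects and then querying them, here is a minimal sketch. It is written in Python rather than C#/LINQ, purely as an analogy, and the XML schema (`languages`, `language`, `word`, `meaning`, `form`) and the class names are assumptions, not the project's real data layout.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical schema, invented for this sketch.
SAMPLE_XML = """
<languages>
  <language name="German">
    <word meaning="hand" form="Hand"/>
    <word meaning="water" form="Wasser"/>
  </language>
  <language name="English">
    <word meaning="hand" form="hand"/>
    <word meaning="water" form="water"/>
  </language>
</languages>
"""


@dataclass
class Word:
    meaning: str
    form: str


@dataclass
class Language:
    name: str
    words: List[Word]


def load_languages(xml_text: str) -> List[Language]:
    """Map each <language> element and its <word> children to plain objects."""
    root = ET.fromstring(xml_text.strip())
    return [
        Language(
            name=lang.get("name"),
            words=[Word(w.get("meaning"), w.get("form")) for w in lang.findall("word")],
        )
        for lang in root.findall("language")
    ]


languages = load_languages(SAMPLE_XML)
# Query the objects, similar in spirit to a LINQ "where ... select" expression.
german_forms = [w.form for lang in languages if lang.name == "German" for w in lang.words]
print(german_forms)  # ['Hand', 'Wasser']
```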

Further to lexicostatistics - basis words    Further to consonant relationship    Further to sample detail (result) page