SPLITS AND TREES – OR WAVES AND WEBS?
CASE STUDIES IN THE ANDES, ACCENTS OF ENGLISH, AND INDO-EUROPEAN
April McMahon, Paul Heggarty, Warren Maguire, Rob McMahon – Colin Renfrew, Paul Heggarty
ARE PHYLOGENETIC ANALYSIS PROGRAMMES REALLY SUITABLE FOR LANGUAGE DIVERGENCE?
IF SO, WHICH?
Much has been made in recent years of applying to language data ‘tree-drawing’ phylogenetic analysis techniques
originally developed for the biological sciences. But are such models necessarily appropriate for analysing
language divergence in the first place? This question arises out of research that combines new techniques
from both linguistics and phylogenetics to push our methods to the greater precision required at the level
of dialects and indeed accents (regional and/or sociolinguistic) of a single language. It transpires, though, that these
techniques hold lessons valid much more widely, for all levels of language divergence. In particular, the latest network-type
phylogenetic analysis programmes, which seek to accommodate more complex cross-cutting signals in the
relationships between languages, duly emerge as much more faithful to the reality of language divergence than
‘tree-only’ methods.
NEW LINGUISTIC TECHNIQUES FOR MEASURING LANGUAGE DIFFERENCE
To provide more reliable linguistic inputs to these phylogenetic analyses we have developed new linguistic
methods to produce more precise measures of difference between languages (particularly important for the finer-grained differences at the accent and dialect levels). One of these measures difference in lexical semantics, but
deliberately makes a number of radical departures from the traditional lexicostatistical methodology, to be more
sensitive to the complexities in meaning-to-word relationships in real languages, which cannot always be analysed
well in all-or-nothing binary terms. A second method seeks to extend quantification to phonetics, to quantify the
degree of difference between pronunciations of cognate forms in different accents or languages – to be precise, it
measures net divergence since their common ancestor. For more details and a demonstration of all methods
referred to here, ask Paul Heggarty, who is attending the Languages and Genes conference.
THE LATEST PHYLOGENETIC TECHNIQUES: WEBS NOT TREES
The quantification results from these methods were then used as linguistic input to one of the latest techniques
developed on the phylogenetic side too, NeighborNet by Bryant and Moulton (2002). What distinguishes this
method is that it is not limited to producing a graphical output only in the form of a branching phylogeny, but can
instead draw a more complex ‘web’, if this is a more faithful representation of the relationships between the
languages in the data set than a simple tree-only structure would be. As such, NeighborNet is certainly well suited
to analysing language variation at the level of dialects, since they tend naturally to be in relatively complex webs of
interrelationships with each other, within a dialect continuum. This combination of new methods has been applied
to two case studies, each of which offers some striking results.
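One way to see whether a set of pairwise distances can be drawn as a tree at all is the four-point condition: for every four varieties, the two largest of the three pairwise sums must be equal. The sketch below is illustrative only and is not how NeighborNet itself works; the function names and both toy matrices are invented for the example and are not our data.

```python
from itertools import combinations


def four_point_gap(d, a, b, c, e):
    """Gap between the two largest of the three pairwise sums for one quartet.

    For a perfectly tree-like (additive) distance matrix the gap is 0;
    a larger gap means conflicting signal that a single tree cannot show.
    """
    sums = sorted([
        d[a][b] + d[c][e],
        d[a][c] + d[b][e],
        d[a][e] + d[b][c],
    ])
    return sums[2] - sums[1]


def max_four_point_gap(d):
    """Largest four-point gap over all quartets of varieties."""
    return max(four_point_gap(d, *q) for q in combinations(range(len(d)), 4))


# Invented toy data: two clear pairs, (0,1) and (2,3), as on a tree.
tree_like = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Invented toy data: four varieties in a ring, each close to two neighbours,
# as in a dialect continuum that closes back on itself.
ring_like = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(max_four_point_gap(tree_like))  # 0: a tree fits exactly
print(max_four_point_gap(ring_like))  # 2: no tree fits; a web is needed
```

A non-zero gap does not say which web to draw, only that forcing the data into a tree must distort some of the distances.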
HOW FINE CAN YOU GO?: PHYLOGENETIC ANALYSIS OF ACCENT-LEVEL DIFFERENCES IN ENGLISH
Our NeighborNets of accents of English within the British Isles reproduce graphical reflexes of the principal
isoglosses separating them into groups (e.g. rhoticity, Scottish Vowel Length Rule). They also make for a
surprisingly neat correlation with geography, at least at first glance – see Figure 1. We intend to follow this up with
an exploration of whether there is any correlation also with genetic signals for the regional populations.
Significantly, the phenograms produced from the same data by the SplitsTree 4 phylogenetic analysis programme
are much less stable than the webs: removing an individual variety at random can immediately restructure the tree
(even in other branches distant from the variety removed), but has much less impact on the web.
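The kind of stability check described here can be sketched as a leave-one-out test: build a tree, remove one variety, rebuild, and count how many groupings differ. The sketch below uses simple average-linkage clustering as a stand-in tree method, which is an assumption for illustration and not the SplitsTree procedure; the variety names A to D and their distances are invented. It shows only the tree side of the comparison, not the behaviour of the web.

```python
def average_linkage_clades(names, dist):
    """Cluster by average linkage; return the set of clades as frozensets."""
    clusters = [frozenset([n]) for n in names]
    clades = set()

    def avg(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by position).
        _, i, j = min(
            (avg(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        clades.add(merged)
    return clades


def leave_one_out_changes(names, dist):
    """For each variety, count clades that differ once it is removed."""
    full = average_linkage_clades(names, dist)
    report = {}
    for removed in names:
        rest = [n for n in names if n != removed]
        # Clades of the full tree, pruned of the removed variety.
        expected = {c - {removed} for c in full}
        expected = {c for c in expected if len(c) > 1}
        actual = average_linkage_clades(rest, dist)
        report[removed] = len(expected ^ actual)
    return report


# Invented toy distances for four varieties strung along a continuum.
pairs = [
    ("A", "B", 1.0), ("B", "C", 0.9), ("C", "D", 1.1),
    ("A", "C", 1.9), ("B", "D", 2.0), ("A", "D", 3.0),
]
chain = {frozenset((x, y)): v for x, y, v in pairs}

print(leave_one_out_changes(["A", "B", "C", "D"], chain))
```

In this toy continuum, removing B makes C regroup with D rather than with A, so the rebuilt tree disagrees with the pruned original even though no distance between the remaining varieties has changed.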
SOLVING A RELATEDNESS RIDDLE IN THE ANDES
A second study looked into the two main surviving indigenous language families of the Central Andes, Quechua
and Aymara. These represent a particularly difficult case for traditional methods in trying to establish whether the
two are or are not genealogically related to each other. Thanks to a number of methodological innovations in our
approach to measuring difference in lexical semantics, we were able to combine results for the two families without
making an a priori judgement on the issue. The corresponding NeighborNets make for a new type of evidence that
argues strongly against relatedness: the families are much closer to each other for the less stable and more easily
borrowable subset of our 150 sample word meanings (the NeighborNet on the left), much more distant for the more
stable ones (on the right).
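The logic of this comparison can be sketched in a few lines: compute the distance between the two families separately for each sub-list of meanings, then compare. The sketch below uses the mean pairwise distance across families as a simple proxy, which is an assumption (the figures in our NeighborNets are root-to-root distances instead); the variety labels and all values are invented for illustration.

```python
def mean_cross_family_distance(dist, family_a, family_b):
    """Average distance over all pairs with one variety from each family."""
    pairs = [(a, b) for a in family_a for b in family_b]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


# Hypothetical variety labels.
quechua = ["Q_south", "Q_central"]
aymara = ["A_south", "A_central"]

# Invented distances (0-100) for the more stable sub-list of meanings.
more_stable = {
    frozenset(("Q_south", "A_south")): 80,
    frozenset(("Q_south", "A_central")): 90,
    frozenset(("Q_central", "A_south")): 85,
    frozenset(("Q_central", "A_central")): 85,
}

# Invented distances (0-100) for the less stable, more borrowable sub-list.
less_stable = {
    frozenset(("Q_south", "A_south")): 50,
    frozenset(("Q_south", "A_central")): 60,
    frozenset(("Q_central", "A_south")): 55,
    frozenset(("Q_central", "A_central")): 55,
}

d_stable = mean_cross_family_distance(more_stable, quechua, aymara)
d_unstable = mean_cross_family_distance(less_stable, quechua, aymara)

# Families closer on borrowable meanings than on stable ones points to
# contact rather than common origin.
print(d_stable, d_unstable, d_stable > d_unstable)
```

If the two families shared a common ancestor, the more stable meanings would be expected to show at least as much similarity as the less stable ones, not less.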
REDEFINING TRADITIONAL CLASSIFICATIONS
The detailed dialect results for Quechua alone were no less significant. For decades the traditional classification has
represented this family as the familiar neatly branching tree. Yet as soon as we applied a phylogenetic analysis that
is not restricted to necessarily drawing a tree in the first place, for Quechua it duly drew no tree at all, but a web of
dialect continuum relationships, as in the NeighborNet diagrams of Quechua and Aymara. This is by no means an automatic artefact of the
programme: with the Aymara family, for which the surviving data are compatible with a clear branch, NeighborNet
does indeed duly draw a tree.
Far from being a helpful idealisation, the traditional tree model appears in the case of Quechua to
be an oversimplification that is dangerously misleading as to the true history of the family.
A METHODOLOGICAL WAKE-UP CALL: WEBS BETTER THAN TREES?
This lesson has repercussions far more widely than just for Quechua – a general methodological wake-up call for
the field. Is it not high time that we stopped paying nothing but lip-service to the wave model, before merrily
pressing on with applying tree-only analyses to language divergence data? Trees are tempting, certainly: they are
neater, simpler to get our heads around, perhaps even more intellectually satisfying. Yet none of that makes them
necessarily the most faithful way of representing the actual relationships between real languages. Dialect continua
are just as natural a form of language divergence as branching trees, even if the extinction of intermediate dialects
may often make it look otherwise on the surface. (Indeed, this appears to be so even for Aymara: it
was probably more of a continuum originally, of which only the poles have survived.)
WHAT OF THE LANGUAGE FAMILIES OF EUROPE: TREES OR WAVES?
What does all this mean for Indo-European? It transpires on closer inspection that here too, the search for a perfect
phylogeny invariably comes up against difficulties. Ringe et al. (2002) came across them particularly in the
relationships between Germanic and the other main families of Europe; these are also precisely the nodes that have
some of the lowest confidence values in Gray & Atkinson’s (2003) tree too.
Linguists have long identified two main processes of language divergence – splits leading to branching trees, and
the wave model leading to dialect continua. Applications of NeighborNet suggest that we need to ensure that our
phylogenetic methods accommodate the latter model in a much more balanced way alongside the former, if we
want to apply those methods to language divergence.
Rather than singling out certain language data as unhelpfully ‘recalcitrant’ to fitting into a perfect tree structure,
perhaps the problem lies rather in an approach that assumes that such a structure is necessarily suitable in the first
place for representing all or even most relationships between natural languages. Our latest research project will be
looking into precisely this issue for the main language families of Europe.
How well supported are the nodes separating the four major European sub-families in Gray & Atkinson’s (2003)
tree of Indo-European languages? If we zoom in and look at the values attached to those nodes, highlighted in the
red rings we’ve added, we can see how low they are, i.e. how weakly supported those branches are:
Slavic vs. Celtic/Romance/Germanic = 44; Celtic vs. Romance/Germanic = 67; Romance vs. Germanic = 46
[Figure: detail of Gray & Atkinson’s (2003) tree, with branches labelled Celtic, Romance, Germanic and Balto-Slavic.]
WHOSE RESEARCH IS THIS? WHERE TO FIND OUT MORE
The work reported on here is that of a multidisciplinary cluster of researchers in the UK, involving the linguists
April McMahon, Paul Heggarty and Warren Maguire, the geneticist Rob McMahon, and the archaeologist Colin
Renfrew, in three separate research projects.
• Quantitative Methods in Language Classification. English Language and Linguistics, University of Sheffield, June
2001-May 2004. April McMahon, Paul Heggarty, Rob McMahon, Natalia Slaska.
• Sound Comparisons: Dialect and Language Comparison and Classification by Phonetic Similarity. Linguistics and
English Language, University of Edinburgh, Oct.2005-Sept.2007. April McMahon, Paul Heggarty, Warren
Maguire. www.soundcomparisons.com
(Listen online to our accent database recordings.)
• Languages and Origins in Europe. McDonald Institute for Archaeological Research, University of Cambridge, June
2006-Sept.2009. Colin Renfrew, Paul Heggarty. www.languagesandpeoples.com
Colin Renfrew and Paul Heggarty are both attending the Languages and Genes conference in Santa Barbara.
NEIGHBORNET REPRESENTATION OF THE RELATIONSHIPS BETWEEN REGIONAL VARIETIES OF QUECHUA AND AYMARA
As calculated from quantifications of their similarity in lexical semantics in Heggarty (2005), for different sub-lists of meanings.
The two numbers indicate the distances between the ‘roots’ of the two families.
[Figure: two NeighborNet panels, MORE STABLE MEANINGS and LESS STABLE MEANINGS. Labels shown in each panel: AYMARA (Southern, Central); QUECHUA (Southern, Central, Intermediate, North Peru, Ecuador); S~C. Distances shown between the family roots: 56.6 and 26.5.]
NEIGHBORNET OF 18 TRADITIONAL REGIONAL ACCENTS OF ENGLISH FROM BRITAIN AND IRELAND
based on phonetic difference ratings, and showing a fairly close match with geography
[Figure: NeighborNet of the 18 varieties: Shetland, Buckie, Glasgow, Berwick, Renfrew, Coldstream, Holy Island, Cornhill, Hawick, Morpeth, Antrim, Tyneside, Tyrone, Fermanagh, Liverpool, Dublin, Sheffield, London.]
English accents can be classed into three main groups by their pronunciation of /r/.
The boxes and arrows show the corresponding dividing lines between them (R~PR, PR~NR, R~NR).
R = Rhotic; PR = Partially-Rhotic; NR = Non-Rhotic