Figure 1. Distribution of population size in the sample of language entries (figures in
thousands, logarithmic scale).
One small departure from a log-normal distribution is observable in Figure 1 and should
also be noted: the somewhat elevated tail on the left. For a true log-normal distribution,
we should expect this tail to taper off to zero, as it does on the
right. There are two possible explanations for this. The first is that population sizes are
truncated at 1; populations smaller than that can only represent languages that are extinct,
which are not shown here.[3] This could prevent the left tail from dropping to zero
normally. The second is that the elevated tail may represent a tendency of the Ethnologue
to retain speakers for small languages even when they are no longer spoken. This has
already been suggested in a review of the Ethnologue by Hammarström (2005), in which
it was pointed out that a number of Australian languages recorded as already extinct
by another source were listed as extant in the Ethnologue. Hence, it might be profitable to
systematically examine smaller entries to ascertain whether more current data will show
there to be speakers for them or not.

3. One might expect that properly counting extinct languages could improve the statistical
profile of the left tail. However, the number of extinct languages in all of human history
is very large, and it is not clear which ones would be relevant. Extinct languages in the
Ethnologue are regarded as recently so, i.e. all were reported as living at some earlier
point. Since the Ethnologue covers a 50-year time span, and there are no indications as to
when a language became extinct, it is not possible to decide which of these entries should
be considered relevant.

Having established the general distributional nature of the population statistics,
we can now proceed to ask if there are systematic distributional effects, be they biases or
interpretable differences, according to the other factors we have already observed: the
type of source cited for population estimates, the date of the source, and geographic
region. These observations are presented as box-and-whisker plots in Figures 2-4. In each
of these plots, the vertical axis is the base ten logarithm of population size (3 corresponds
to 1,000; 4 corresponds to 10,000, etc.). Each box has a center bar indicating its median
value, with a notch around the bar indicating a 95% confidence interval for that value.
The box represents the range occupied by the central 50% of the data for that category,
and the whiskers extending above and below the box approximate a range enclosing 95%
of the data. Outliers are indicated as individual data points outside these latter ranges. By
comparing the central bars and the overlap among the notches of the different categories,
one can get a sense of the differences in population size across the set of categories.
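As a rough illustration of how the quantities drawn in such a plot can be computed, the following is a minimal Python sketch. It is not the procedure used to produce Figures 2-4: the notch half-width of 1.57 × IQR / √n and the 1.5 × IQR outlier fences are common plotting conventions assumed here, the function name `box_summary` is hypothetical, and the speaker counts are invented for the example.

```python
import math
import statistics


def box_summary(populations):
    """Summarize log10 population sizes in the manner of a notched box plot.

    Assumptions (not stated in the text): the notch is
    median +/- 1.57 * IQR / sqrt(n), and outliers lie beyond 1.5 * IQR
    from the box edges. Needs at least two positive population values.
    """
    # Work on the base-ten logarithm, as on the vertical axis of the plots.
    logs = sorted(math.log10(p) for p in populations if p > 0)
    n = len(logs)

    # Quartiles: the box spans the central 50% of the data.
    q1, median, q3 = statistics.quantiles(logs, n=4, method="inclusive")
    iqr = q3 - q1

    # Approximate 95% confidence interval for the median (the notch).
    half_notch = 1.57 * iqr / math.sqrt(n)

    # Fences beyond which points are reported individually as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    return {
        "median": median,
        "box": (q1, q3),
        "notch": (median - half_notch, median + half_notch),
        "outliers": [x for x in logs if x < low_fence or x > high_fence],
    }


# Hypothetical speaker counts, not taken from the Ethnologue.
sample = [10, 100, 1_000, 10_000, 100_000]
summary = box_summary(sample)
```

When the notches of two categories do not overlap, their medians are usually read as differing; this is a visual heuristic rather than a formal test.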
Figure 2. Log10 population size by geographic region among language entries.
Figure 2 indicates clearly that different regions have somewhat different typical
language population sizes. Africa, East Asia, Europe, South and Central Asia and
Western Asia appear to have somewhat larger population sizes than North America,
Oceania, and South America and the Caribbean. Southeast Asia has language population
sizes intermediate between these two sets. This confirms the observation of Grimes
(1986) that different geographic regions have different language size norms. The
observations also comport with our prior knowledge about the languages of the regions.
North America, where shift from the indigenous languages to English is all but complete,
has a relatively small median size at almost exactly 100 individuals. Oceania has a
median size around 1,000 individuals, a value widely reported for the countries of the
region such as Papua New Guinea. Some regions with larger median sizes, such as
Africa, nonetheless have a substantial number of smaller language groups, as indicated by
the small language group outliers. Note that Africa, East Asia, Europe, Oceania, South
America, and Southeast Asia all have small outlying groups; these would be good
candidates for endangered languages. Since different regions appear to have different
typical language sizes, the population cutoff below which a language is likely to be
endangered probably differs from region to region.
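One way to make the idea of a region-specific cutoff concrete is sketched below in Python. This is only an illustration under stated assumptions, not a method proposed in the text: the cutoff is taken to be the lower box-plot fence (Q1 minus 1.5 × IQR) of log10 population within each region, the function names `regional_cutoffs` and `flag_small` are hypothetical, and the regions and speaker counts are invented.

```python
import math
import statistics
from collections import defaultdict


def regional_cutoffs(entries):
    """Return a lower log10-population fence (Q1 - 1.5 * IQR) per region.

    `entries` is a list of (region, population) pairs. The fence rule is
    an assumption made for illustration; each region needs at least two
    positive population values.
    """
    by_region = defaultdict(list)
    for region, population in entries:
        if population > 0:
            by_region[region].append(math.log10(population))

    cutoffs = {}
    for region, logs in by_region.items():
        q1, _, q3 = statistics.quantiles(logs, n=4, method="inclusive")
        cutoffs[region] = q1 - 1.5 * (q3 - q1)
    return cutoffs


def flag_small(entries):
    """List entries whose log10 population falls below their region's fence."""
    cutoffs = regional_cutoffs(entries)
    return [
        (region, population)
        for region, population in entries
        if population > 0 and math.log10(population) < cutoffs[region]
    ]


# Hypothetical data: region "A" has large typical sizes, region "B" small ones.
entries = [
    ("A", 5), ("A", 20_000), ("A", 30_000), ("A", 40_000),
    ("A", 50_000), ("A", 60_000), ("A", 80_000),
    ("B", 50), ("B", 80), ("B", 100), ("B", 120), ("B", 200),
]
cutoffs = regional_cutoffs(entries)
flagged = flag_small(entries)
```

In this invented data set the cutoff for region "A" is higher than that for region "B", so a population of 50 is unremarkable in "B" while a population of 5 stands out in "A".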
Figure 3. Log10 population size by source among language entries.
In Figure 3, we consider the contribution of different sources to population groups
of different sizes. Again we can see that there appear to be significant differences among
the different sources used. SIL and Academic sources tend to be cited for somewhat smaller
groups than Missionary, WCD, Government and other sources. This we might partly
expect, given the tendency for different sources to report on different regions, and the
different size trends observed for different regions in Figure 2. Populations for which no
source is given also tend to be smaller, while those that have a year but no source
indicated tend to be larger. This suggests that the two types of figures represent different
kinds of information entirely. Given that both reflect some uncertainty about the language
population data, and given that they account for about 20% of the language entries in our
sample, entries with such fragmentary citations on population data need to be thoroughly
checked before we can fully rely on them. Again, this effect is probably distributed
unevenly across regions, so focusing on particular regions as suggested earlier may help
to address these issues as well.