ENGLISH FREQUENCY, SYLLABLES

The efs.cd file lists phonetic syllable frequencies calculated from a
combination of the orthographic wordform frequencies and the
syllabified phonemic wordform transcriptions in the CELEX database. It
represents a new feature, which has NOT been documented in the CELEX
User Guide, and therefore cannot be found in the eug.ps file included
on this disc.


The file contains the following fields:

1.  Syllable
2.  SylPos
3.  SylInlMln
4.  SylTotInlMln
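
As a rough illustration of how one record with these four fields
might be read, the following is a minimal Python sketch. It assumes
that efs.cd separates fields with a backslash and stores exactly the
four fields above in the order listed; neither point is stated in
this file, so check both against the data. The names SyllableRecord
and parse_efs_line are hypothetical, and the sample line is built
from the 'ban' figures quoted in this document rather than copied
from efs.cd.

```python
from typing import NamedTuple


class SyllableRecord(NamedTuple):
    """One record, with the four fields in the order listed above."""
    syllable: str            # Syllable: DISC transcription, e.g. 'b{n'
    syl_pos: int             # SylPos: position of the syllable in the word
    syl_inl_mln: float       # SylInlMln: per-million frequency in this position
    syl_tot_inl_mln: float   # SylTotInlMln: per-million frequency, all positions


def parse_efs_line(line: str, sep: str = "\\") -> SyllableRecord:
    """Split one line into its four fields.

    The backslash separator is an ASSUMPTION, not something this
    document specifies; pass a different `sep` if the file differs.
    """
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    syllable, pos, freq_pos, freq_total = fields
    return SyllableRecord(syllable, int(pos), float(freq_pos), float(freq_total))


# Illustrative line only, assembled from the 'ban' example in this document.
record = parse_efs_line("b{n\\1\\31\\95")
print(record)
```

Frequencies are parsed as floats so that the sketch does not depend
on whether the file stores whole numbers or decimals.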


Explanation of the fields' contents:

1. Syllable (in specified position) 
     The phonemic transcription of the syllable, using the DISC
     (one-character-per-phoneme) character set. No difference is
     made between stressed and unstressed syllables. 
2. Syllable position 
     The position of the syllable within the full phonemic
     transcription of the word (first, second, and so on). 
3. Syllable frequency 1m, pos. 
     The frequency of the syllable in that particular position only,
     calculated per 1 million tokens. Thus the syllable 'ban'
     (transcribed as 'b{n') has a frequency of 31 in first position
     (in words such as 'ban' and 'ban-dit') and a frequency of 64 in
     second position (e.g. 'a-ban-don'). 
4. Syllable frequency 1m, total 
     The frequency of the syllable, regardless of position in the
     word, calculated per 1 million tokens. Since 'ban' in third or
     later position occurs only in extremely infrequent words not
     found in our corpus, the total frequency of 'ban' is
     31 + 64 = 95. 
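
The relation between the positional and total figures can be
sketched in a few lines of Python. This is only an illustration of
the arithmetic described above, not the procedure CELEX used; the
function name total_by_syllable is hypothetical, and the input rows
are the two 'ban' figures quoted in this document.

```python
from collections import defaultdict


def total_by_syllable(rows):
    """Sum the positional per-million frequencies for each syllable.

    `rows` is an iterable of (syllable, position, positional_frequency)
    tuples. The result maps each syllable to the sum over all positions
    present in `rows`.
    """
    totals = defaultdict(int)
    for syllable, _position, freq in rows:
        totals[syllable] += freq
    return dict(totals)


# The two positions of 'ban' ('b{n' in DISC) mentioned in this document.
rows = [
    ("b{n", 1, 31),  # first position, e.g. 'ban', 'ban-dit'
    ("b{n", 2, 64),  # second position, e.g. 'a-ban-don'
]
totals = total_by_syllable(rows)
print(totals["b{n"])  # 95
```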


Cautionary note:

Please note that the English corpus used by CELEX for deriving these
frequencies contains only 7.3% spoken material. This means there is a
rather tenuous relationship between the full word frequency figures,
which are based on written forms, and the syllable frequencies, which
merely refer to phonemic conversions of these written forms. Of
course it could be argued that frequencies of syllables, as lexical
sub-units, are less liable to be skewed by differences in medium than
those of full words, but it must be kept in mind that NO FIRM
EVIDENCE ABOUT SPOKEN FREQUENCIES can be derived from these data.
