The Tong Corpus

[Images: Tong at 1;0, 2;5, 3;3, and 4;5]

Tong: 还 要 汽车 呀. (1;7.18)

Hai yao qiche ya

still want car sfp

‘(I) want (toy) cars.’

Tong: 我 来 试 一 试 我 的 鞋 布 不 布. (2;6.13)

Wo lai shi yi shi wo de xie bu-bu-bu

I come try one try I poss shoe cloth-not-cloth

Intended meaning: ‘Let me try my shoe to see if it is made of cloth or not.’

Tong: 坏 人 来 了 就 用 这 个 把 坏 人 打 走. (3;4.9)

Huai ren lai-le jiu yong zhe-ge ba huai ren da zou

bad person come-Perf just use this-CL BA bad person hit walk

‘If a bad man comes, (I will) just use this to drive him off.’





The Tong corpus contains naturalistic interactions between a Mandarin-speaking child, Tong, and his caregivers, recorded mainly in the home setting from age 1;0 to 4;5.

  • Linguistic background

Tong was raised in Shenzhen, China, where Mandarin is the language of the community. All family members speak Mandarin to the child. He is also exposed to some input in the Rudong dialect, a variety of Jianghuai Mandarin spoken in Jiangsu province, China, as his father and grandmother sometimes speak the dialect to each other. From 2;5 on, the child also received some Cantonese and English input in his kindergarten. However, the influence of these languages is minimal, and the child speaks only Mandarin at home. From 3;3, he attended a kindergarten in Hong Kong for three hours a day, where Cantonese was the medium of instruction. The child has since become a Mandarin-Cantonese bilingual.

  • Recording and transcription

Spontaneous play sessions were audiotaped for one hour each week from 1;0, and video recording started at 2;3, yielding more than 250 hours of audio and more than 90 hours of video. To improve on published child Mandarin corpora, we constructed an audio- or video-linked longitudinal corpus with denser sampling and documented naturalistic adult-to-child input over the entire period of study. In the initial phase, 22 one-hour recordings at one-month intervals from 1;7 to 3;4 have been released in CHILDES. They were transcribed and checked by native speakers of Mandarin with linguistic training. The corpus provides a morphological tier, which facilitates grammatical analysis of the data.
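Ages in this description follow the years;months.days notation (for example, 2;6.13). As a minimal sketch, and not part of the corpus tooling, the Python below parses that notation and counts monthly sampling points between two ages. The function names are hypothetical, and the sketch assumes the day component is optional and that one session falls in each calendar month of age, endpoints included.

```python
import re

# Assumed shape of an age label: years;months with an optional .days part.
AGE_PATTERN = re.compile(r"^(\d+);(\d{1,2})(?:\.(\d{1,2}))?$")


def parse_age(age):
    """Split a 'years;months.days' label into (years, months, days)."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"not a years;months.days age: {age!r}")
    years, months, days = match.groups()
    # A missing day component is treated as day 0 (an assumption).
    return int(years), int(months), int(days or 0)


def age_in_months(age):
    """Return the age in whole months, ignoring the day component."""
    years, months, _ = parse_age(age)
    return years * 12 + months


def monthly_sessions(start, end):
    """Count one session per month of age from start to end, inclusive."""
    return age_in_months(end) - age_in_months(start) + 1


print(parse_age("2;6.13"))             # (2, 6, 13)
print(monthly_sessions("1;7", "3;4"))  # 22
```

Under these assumptions, the span from 1;7 to 3;4 covers 22 monthly points, which agrees with the 22 released recordings.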

  • Diary

The child’s mother has kept a diary of the child’s language development to supplement the corpus. She noted down Tong’s new developments and unusual utterances as fully as possible, with the necessary contextual information. The diary data from the child’s birth to 3;4 have now been digitized; they include one or more utterances or babblings on each of 528 days.

We would like to express our gratitude to Brian MacWhinney, Director of CHILDES, for his expertise, advice and technical support in constructing the Tong corpus. Our special thanks go to the transcribers of the Tong corpus: Zhong Jing, Lam Ho Ching, Xie Shanrong, Zhou Jiangling, Lu Yaqiao, Lyu Lu, Yao Yao, Au Chui Yee, Zhishu Yu, and Yuqi Wu. We gratefully acknowledge the support of our lab members, especially Stephen Matthews and Mai Ziyin. The research was supported by a start-up grant to set up the Bilingualism and Language Disorders Laboratory at the CUHK-Shenzhen Research Institute, CUHK funding for the CUHK-Peking University-University System of Taiwan Joint Research Centre for Language and Human Complexity, a General Research Fund from the Hong Kong Research Grants Council (Project no. 14413514), and the Stella and Leanne Lu Fund.


Deng, X. & Yip, V. (in press). A multimedia corpus of child Mandarin: the Tong corpus. Journal of Chinese Linguistics.

Deng, X. & Yip, V. (2015). A corpus study of the acquisition of ba and bei constructions in Mandarin. Paper presented at The International Symposium on Psycholinguistics of Second Language Acquisition and Bilingualism, Chinese University of Hong Kong.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd edition). Mahwah, N.J.: Lawrence Erlbaum.

Table 1. Major parts of speech used in the CHILDES database

A transcript, with linked video, showing the child producing the “ba” construction at the age of 2;4.18




The child is producing a non-target “ba” sentence at the age of 2;7.19