


Corpus for AI

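① World's largest scale of 【collected volume】
② World's largest scale of 【number of languages】
③ 【Comprehensive coverage】 of Japan's regional dialects
④ 【All generations】, from young children to the elderly
⑤ 【Multiple formats】, from monologue to multi-party conversation
⑥ Independently developed 【disfluency tags】
⑦ Proprietary 【method for eliciting spontaneous speech】
⑧ Diverse recorded speech data, including read speech, video of speech, and recordings with simultaneous interpretation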

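This project is carried out in part with support received through its selection [No. 2097: Time Hill Inc. (Kyoto Prefecture)] under the "FY2020 (Reiwa 2) Third Supplementary Budget: Subsidy for Sustaining Small Businesses <Low Infection Risk Business Category / Round 1> (Small and Medium Enterprise Agency)".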


Monologue (one speaker)

Dialogue (two speakers)


(Spontaneous speech in English, with detailed disfluency annotation, comparable to the simulated public speaking in "The Corpus of Spontaneous Japanese")

■The Corpus of Oral Presentations in English (COPE)


Provider: Michiko Watanabe (NINJAL, Waseda Univ.) and Time Hill Inc.


<Listen to a sample>

Corpus Title: 【COPE】
Serial Number: 【COPE-001】

The most memorable experience in my life

Annotated transcript
Transcript without disfluency and annotation
Get for free
License Agreement


The "Dai-Onsen (大音泉)", or the "Corpus of Spontaneous Speech in Japanese" series (CMSJ/CDSJ/CCSJ/CLSJ/CKSJ/CPSJ), is a database containing a large collection of Japanese speech data with transcriptions, originally developed by Timehill Inc. (2016-2022, Japan).
It is the world's largest speech corpus in the field of free (spontaneous) speech and has been used for both academic and commercial purposes, mainly in the US, the UK, and the EU.
The corpus is also intended for a wide range of research in linguistics, phonetics, pragmatics, semantics, lexicology, cognitive science, psychology, sociology, cultural anthropology, dialect studies, and Japanese language studies, as well as for use as material for speech therapists and translators and as a source of creative ideas for producing dramas and films.
The price of the corpus varies depending on the purpose, period, and region of use, the quantity, and other factors. If you are interested in this corpus, please contact us at:
After a short exchange of emails, we will send you a price quote for the corpus and the contract document. Thank you for your understanding.

The list and outline of the corpora in this series are as follows:

■ 48 kHz/16-bit/WAV, microphone (Shure SM10 and others), recorded in a studio booth
★ 16 kHz/16-bit/WAV, microphone (iPhone 13 or other smartphones), recorded in a quiet room
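
The two recording formats above can be checked on a delivered file. The following is a minimal sketch using Python's standard `wave` module, assuming the files are uncompressed PCM WAV. The profile names `studio` and `smartphone` and the helper functions are hypothetical labels introduced for illustration, not part of the corpus distribution.

```python
import wave

# Expected formats taken from the two legend lines above.
# The profile names are hypothetical labels, not official identifiers.
EXPECTED_FORMATS = {
    "studio": {"sample_rate": 48000, "sample_width_bytes": 2},      # 48 kHz / 16-bit
    "smartphone": {"sample_rate": 16000, "sample_width_bytes": 2},  # 16 kHz / 16-bit
}


def read_wav_format(path):
    """Return sample rate, sample width, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / rate,
        }


def matches_profile(path, profile):
    """Check whether a file has the sample rate and bit depth of the given profile."""
    info = read_wav_format(path)
    expected = EXPECTED_FORMATS[profile]
    return all(info[key] == value for key, value in expected.items())
```

For example, `matches_profile("sample.wav", "studio")` would return `True` only for a 48 kHz, 16-bit file. The channel count is reported but not checked, since the listing does not state it.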

CMSJ: Corpus of Monologue Speech in Japanese (300 hours in total)

1. Duration > 15-20 minutes per speech file

2. Speaker > 1 Adult/session (file) [1,200 Adults in total]

3. Topics >

① My Life Story
② My Miracle Year
③ My daily life or Favorite things

CDSJ: Corpus of Dialogue Speech in Japanese (300 hours in total)

1. Duration > 20-30 minutes per speech file

2. Speaker > 2 Adults/session (file) [1,200 Adults in total]

① 2 Adults in the roles of Shop clerk and Customer
② 2 Adults as Friends
③ 2 Adults as Brothers or Sisters / Parents and their child

3. Topics >

① Conversation between Shop clerk and Customer
② Conversation between Friends
③ Conversation between Brothers or Sisters / Parents and their child

CCSJ: Corpus of Conversation Speech in Japanese (100 hours in total)

1. Duration > 30-40 minutes per speech file

2. Speaker > 3 Adults/session (file) [600 Adults in total]

① 3 Adults in the roles of Shop clerk and Customers
② 3 Adults as Friends
③ 3 Adults as Brothers or Sisters / Parents and their child

3. Topics >

① Conversation among Shop clerk and Customers
② Conversation among Friends
③ Conversation among Brothers or Sisters / Parents and their child

CLSJ: Corpus of Lecture Speech in Japanese (80 hours in total)

1. Duration > 15-30 minutes per speech file

2. Speaker > 1 Professor/session (file) [5 Professors in total]

3. Topics >

① Humanities
② Economics
③ Advanced science and technology

CKSJ: Corpus of Kids Speech in Japanese (50 hours in total)

1. Duration > 15-20 minutes per speech file

2. Speaker > 1 Kid (ages 4 to 10)/session (file) [200 Kids in total]

3. Topics >

◎ Daily life or Favorite things

CPSJ: Corpus of Telephony Speech in Japanese (300 hours in total)

1. Duration > 10-30 minutes per speech file

2. Speaker > 2 Adults/session (file) [1,200 Adults in total]

① 2 Adults in the roles of Shop clerk and Customer
② 2 Adults as Friends
③ 2 Adults as Brothers or Sisters / Parents and their child

3. Topics >

① Conversation between Shop clerk and Customer
② Conversation between Friends
③ Conversation between Brothers or Sisters / Parents and their child
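
The total hours and per-file durations listed for the six corpora imply a rough range for the number of files in each. The sketch below works this out in Python. It assumes the stated totals are exact, that every file falls within the stated duration range, and that the `>` in the listing is a label separator rather than a comparison. The `Corpus` class and the `file_count_range` helper are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    code: str
    total_hours: int
    min_minutes: int  # shortest stated file duration
    max_minutes: int  # longest stated file duration


# Figures copied from the listing above.
CORPORA = [
    Corpus("CMSJ", 300, 15, 20),
    Corpus("CDSJ", 300, 20, 30),
    Corpus("CCSJ", 100, 30, 40),
    Corpus("CLSJ", 80, 15, 30),
    Corpus("CKSJ", 50, 15, 20),
    Corpus("CPSJ", 300, 10, 30),
]


def file_count_range(corpus):
    """Estimate the fewest and most files needed to reach the stated total hours."""
    total_minutes = corpus.total_hours * 60
    fewest = math.ceil(total_minutes / corpus.max_minutes)  # all files at maximum length
    most = total_minutes // corpus.min_minutes              # all files at minimum length
    return fewest, most


def total_hours(corpora):
    """Sum the stated hours across all corpora."""
    return sum(c.total_hours for c in corpora)
```

Under these assumptions, `file_count_range(CORPORA[0])` gives `(900, 1200)` for CMSJ, and `total_hours(CORPORA)` gives `1130` hours for the series as a whole. These are estimates derived from the listing, not figures published by the provider.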

(Spontaneous speech and text-to-speech data in 20 languages, no annotations)

Speech Corpus List (20 Languages)

■The Corpus of Dialogue Speech in Japanese (CDSJ)


Provider: Time Hill Inc.


<Listen to a sample>

Corpus Title: 【CDSJ】
Serial Number: 【00001_RT-G】


Corpus Title: 【CDSJ】
Serial Number: 【00001_RT-C】


Corpus Title: 【CDSJ】
Serial Number: 【00002_WD-G】


Corpus Title: 【CDSJ】
Serial Number: 【00002_WD-C】


License Agreement

[大文嶺] <Paid / For product development>

(Parallel Text Corpus - English to Japanese)

Coming soon!

[大語輪] <Paid / For product development>

(Electronic Dictionary of Words and Phrases - Multilingual)

Coming soon!

We are collecting speech data in many languages from around the world.
Please record your speech and send it to us by following the guidance on this page.

Division 1  ●Sales Section ●Production Section




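① <Covering the world's languages locally>

② <Large-volume recording in a short period>
・More than 10,000 speakers can be recorded in one month *Subject to conditions such as recording time and speaker balance
・Recording and delivery on the day of the order are also possible *For example, when the number of speakers is small

③ <Speaker balance and diversity>

④ <Low prices>


⑤ <Excellent team of recording engineers>

⑥ <Three-person language check system>

⑦ <Post-recording file preparation, transcription of spontaneous speech, and more>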

Division 2  ●Sales Section ●Production Section


Division 3  ●Sales Section ●Development Section


Copyright(c) 2016 Timehill Inc. All Rights Reserved.