ASR/MT | Time Hill Inc. (株式会社タイムヒル)

Always stand on the hill of the times.
Then the wind of the times will blow.
The wind of a new era will surely blow!

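Business Headquarters II

AI Training & Evaluation Data Business (Speech Corpora)
Contract production and sale of data for developing speech, image, video, and language recognition technology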

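We offer speech corpora and text corpora, which are indispensable for research and development of communication AI, both as free public releases and for paid sale. They are original data (compilation works) that Time Hill Inc. independently planned, collected, and organized, and the copyright belongs to Time Hill Inc. (and the relevant authors). They may be used after concluding a non-exclusive copyright license agreement (for commercial use or limited to research). Their eight main features are:
① Collection volume among the largest in the world
② Number of languages among the highest in the world
③ Comprehensive coverage of regional dialects in Japan
④ All generations, from young children to the elderly
⑤ Many forms, from monologue to multi-party conversation
⑥ Independently developed disfluency tags
⑦ A proprietary method for eliciting natural speech
⑧ Diverse recorded speech data, including read speech, video of speakers, and recordings with simultaneous interpretation
Beyond serving as training and evaluation data for (AI) applications such as speech recognition and machine translation, we hope the corpora will be used widely: in linguistics, phonetics, pragmatics, semantics, lexicology, cognitive science, psychology, sociology, cultural anthropology, dialect research, and language study; as reference material for speech therapists and translators; and as creative inspiration for novels and films.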

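This project is carried out in part under the "FY2020 (Reiwa 2) Third Supplementary Budget: Subsidy for Sustaining Small Businesses <Low Infection Risk Business Category / Round 1>" (Small and Medium Enterprise Agency), for which it was selected [No. 2097: Time Hill Inc. (Kyoto Prefecture)].
[List of selected applicants]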

<Watch a speech data recording session> 👀

Monologue (one speaker)

Dialogue (two speakers)

[大声解] <Free / academic research only>

(Spontaneous speech in English, with detailed disfluency annotation; comparable with the simulated public speaking in "The Corpus of Spontaneous Japanese")

■The Corpus of Oral Presentations in English (COPE)

Provider: Michiko Watanabe (NINJAL, Waseda Univ.) and Time Hill Inc.

<Listen to a speech sample>

Corpus Title: 【COPE】
Serial Number: 【COPE-001】

The most memorable experience in my life

Transcript without disfluencies or annotation

[大音泉] <Paid / for product development>

"Dai-Onsen" (大音泉), or the "Corpus of Spontaneous Speech in Japanese" series (CMSJ/CDSJ/CCSJ/CLSJ/CKSJ/CPSJ), is a database containing a large collection of Japanese speech data with transcriptions, originally developed by Time Hill Inc. (2016-2022, Japan).
It is the world's largest corpus of spontaneous (free) speech and has been used for both academic and commercial purposes, mainly in the US, the UK, and the EU.
The corpus can also serve a wide variety of research purposes, such as linguistics, phonetics, pragmatics, semantics, lexicology, cognitive science, psychology, sociology, cultural anthropology, dialect studies, and Japanese language studies, as well as reference material for speech therapists and translators and creative inspiration for producing dramas and films.
The price of the corpus varies depending on the purpose of use, period, region, quantity, and other factors. If you are interested, please contact us at:
info@timehill.biz
After some email correspondence, we will send you a price quotation and the contract document. Thank you for your understanding.

The list and outline of the corpora are as follows:

■ 48 kHz / 16-bit / WAV, microphone: Shure SM10 and others, recorded in a studio booth
★ 16 kHz / 16-bit / WAV, microphone: iPhone 13 or other smartphones, recorded in a quiet room
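
The two delivery formats in the legend above can be checked mechanically. The following is a minimal sketch using Python's standard `wave` module; the names `EXPECTED_FORMATS`, `wav_matches`, and `make_silent_wav` are hypothetical, and the mono channel count in the demo file is an assumption, since the page does not state the number of channels.

```python
import io
import wave

# Sample rate and bit depth taken from the legend above.
# The keys "studio" and "smartphone" are illustrative labels only.
EXPECTED_FORMATS = {
    "studio": {"sample_rate": 48000, "sample_width_bytes": 2},      # 48 kHz / 16-bit
    "smartphone": {"sample_rate": 16000, "sample_width_bytes": 2},  # 16 kHz / 16-bit
}


def wav_matches(source, kind):
    """Return True if the WAV header matches the expected rate and bit depth.

    `source` is a file path or a binary file-like object.
    """
    expected = EXPECTED_FORMATS[kind]
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == expected["sample_rate"]
            and wav.getsampwidth() == expected["sample_width_bytes"]
        )


def make_silent_wav(sample_rate, seconds=1):
    """Build an in-memory 16-bit WAV of silence (mono is an assumption)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(wav_matches(make_silent_wav(48000), "studio"))
    print(wav_matches(make_silent_wav(16000), "studio"))
```

Only the header is inspected, so the check is cheap enough to run over a whole delivery before any audio is decoded.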

■CMSJ: Corpus of Monologue Speech in Japanese (300 hours in total)

1. Duration > 15-20 minutes per speech file

2. Speaker > 1 Adult/session (file) [1,200 Adults in total]

3. Topics >

① My Life Story
② My Miracle Year
③ My daily life or Favorite things

■CDSJ: Corpus of Dialogue Speech in Japanese (300 hours in total)

1. Duration > 20-30 minutes per speech file

2. Speaker > 2 Adults/session (file) [1,200 Adults in total]

① 2 Adults in the roles of shop clerk and customer
② 2 Adults as friends
③ 2 Adults as brothers or sisters / parent and child

3. Topics >

① Conversation between a shop clerk and a customer
② Conversation between friends
③ Conversation between brothers or sisters / parent and child

■CCSJ: Corpus of Conversation Speech in Japanese (100 hours in total)

1. Duration > 30-40 minutes per speech file

2. Speaker > 3 Adults/session (file) [600 Adults in total]

① 3 Adults in the roles of shop clerk and customers
② 3 Adults as friends
③ 3 Adults as brothers or sisters / parents and their child

3. Topics >

① Conversation among a shop clerk and customers
② Conversation among friends
③ Conversation among brothers or sisters / parents and their child

■CLSJ: Corpus of Lecture Speech in Japanese (80 hours in total)

1. Duration > 15-30 minutes per speech file

2. Speaker > 1 Professor/session (file) [5 Professors in total]

3. Topics >

① Humanities
② Economics
③ Advanced science and technology

■CKSJ: Corpus of Kids Speech in Japanese (50 hours in total)

1. Duration > 15-20 minutes per speech file

2. Speaker > 1 Kid (ages 4 to 10)/session (file) [200 Kids in total]

3. Topics >

◎ Daily life or Favorite things

■CPSJ: Corpus of Telephony Speech in Japanese (300 hours in total)

1. Duration > 10-30 minutes per speech file

2. Speaker > 2 Adults/session (file) [1,200 Adults in total]

① 2 Adults in the roles of shop clerk and customer
② 2 Adults as friends
③ 2 Adults as brothers or sisters / parent and child

3. Topics >

① Conversation between a shop clerk and a customer
② Conversation between friends
③ Conversation between brothers or sisters / parent and child
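
The figures listed above for the six corpora can be cross-checked with a little arithmetic. The sketch below is illustrative only: the `Corpus` class and the helper functions are hypothetical names, and the session count assumes that every speaker appears in exactly one session, which the page does not state. That assumption clearly does not hold for CLSJ (5 professors, 80 hours), so the per-session average is not meaningful there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    code: str
    total_hours: int
    speakers_per_session: int
    total_speakers: int


# Hours and speaker counts copied from the list above.
CATALOG = [
    Corpus("CMSJ", 300, 1, 1200),
    Corpus("CDSJ", 300, 2, 1200),
    Corpus("CCSJ", 100, 3, 600),
    Corpus("CLSJ", 80, 1, 5),
    Corpus("CKSJ", 50, 1, 200),
    Corpus("CPSJ", 300, 2, 1200),
]


def total_hours(catalog):
    """Sum the stated hours over all corpora."""
    return sum(corpus.total_hours for corpus in catalog)


def implied_sessions(corpus):
    """Sessions implied if each speaker takes part in exactly one session (assumption)."""
    return corpus.total_speakers // corpus.speakers_per_session


def average_minutes_per_session(corpus):
    """Average session length in minutes under the one-session-per-speaker assumption."""
    return corpus.total_hours * 60 / implied_sessions(corpus)


if __name__ == "__main__":
    print("Total hours:", total_hours(CATALOG))
    for corpus in CATALOG:
        print(corpus.code, average_minutes_per_session(corpus))
```

Under that assumption, CMSJ works out to 15 minutes per file and CDSJ to 30 minutes per file, both at an edge of the duration ranges stated for those corpora.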

(Spontaneous Speech/Text to Speech data of 20 Languages - No Annotations)

■The Corpus of Dialogue Speech in Japanese (CDSJ)

Provider: Time Hill Inc.

<Listen to a speech sample>

Corpus Title: 【CDSJ】
Serial Number: 【00001_RT-G】

Restaurant-guest

Corpus Title: 【CDSJ】
Serial Number: 【00001_RT-C】

Restaurant-clerk

Corpus Title: 【CDSJ】
Serial Number: 【00002_WD-G】

Wedding-guest

Corpus Title: 【CDSJ】
Serial Number: 【00002_WD-C】

Wedding-clerk
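
The four sample serial numbers above appear to follow a `<session>_<scene>-<role>` pattern. The sketch below parses that pattern; the code tables (`RT`, `WD`, `G`, `C`) are inferred only from the four samples and their labels on this page and are not an official specification, and all function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class SampleSerial(NamedTuple):
    session: str
    scene: str
    role: str


# Inferred from the four samples listed above; not an official code table.
SCENES = {"RT": "Restaurant", "WD": "Wedding"}
ROLES = {"G": "guest", "C": "clerk"}

# Five digits, underscore, two-letter scene code, hyphen, one-letter role code.
SERIAL_PATTERN = re.compile(r"^(\d{5})_([A-Z]{2})-([A-Z])$")


def parse_serial(serial):
    """Split a serial such as '00001_RT-G' into session, scene, and role."""
    match = SERIAL_PATTERN.match(serial)
    if match is None:
        raise ValueError(f"unrecognized serial number: {serial!r}")
    session, scene_code, role_code = match.groups()
    # Unknown codes are passed through unchanged rather than guessed.
    return SampleSerial(
        session,
        SCENES.get(scene_code, scene_code),
        ROLES.get(role_code, role_code),
    )


def label(serial):
    """Rebuild the human-readable label shown next to each sample."""
    parsed = parse_serial(serial)
    return f"{parsed.scene}-{parsed.role}"


if __name__ == "__main__":
    for serial in ["00001_RT-G", "00001_RT-C", "00002_WD-G", "00002_WD-C"]:
        print(serial, "->", label(serial))
```

Grouping parsed serials by `session` would pair the two sides of each dialogue, assuming the shared five-digit prefix identifies one recording session.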

[大文嶺] <Paid / for product development>

(Parallel Text Corpus - English to Japanese)

Coming soon!

[大語輪] <Paid / for product development>

(Electronic Dictionary of Words and Phrases - Multilingual)

Coming soon!

Deliver your data

We are collecting speech data in many languages from around the world. Please record your speech and send it to us by following the guidance on this page.

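Division 1   ●Sales Section  ●Production Section

■Speech Information Processing■
For manufacturers developing speech recognition technology and products, and for universities and research institutions, we record (and transcribe) read speech and free speech data, that is, we build speech corpora.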

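[Features]

(1) We carry out wide-ranging, large-scale, and diverse speech recording in a short time and at low cost.

① <Covering the world's languages on location>
・Locations worldwide (more than 100 cities in 30 countries), covering nearly the entire world
・As a rule, we travel to where the language is spoken and record large numbers of local speakers

② <Large-volume recording in a short period>
・More than 10,000 speakers can be recorded in one month  *Subject to conditions such as recording time and speaker balance
・Recording and delivery on the day of the order are also possible  *For example, when the number of speakers is small

③ <Speaker balance and diversity>
・As a rule, gender is always balanced
・Ages can range widely, from young children to elderly speakers over 80
・Variation in place of origin and dialect within the same language is also possible
・Speakers meeting conditions such as educational background, work history, or experience with cars and car navigation systems can also be recruited
<Types and diversity of utterances>
・Recording of read speech from scripts prepared in advance
・Recording of free speech, with or without a scenario
★We specialize in recording free speech and have speaker-direction know-how that enables anyone to speak freely for more than 30 minutes (without a script or advance preparation). Dialogue between two speakers and conversation among three or more speakers can also be recorded.
・Recording of conversations between speakers of different languages using simultaneous (or consecutive) interpretation
・Recording of different speaking styles, such as whispering, loud voice, and fast speech

④ <Low prices>
・Minimum price per speaker: ¥1,000 per hour of the speaker's booked time
*Several conditions apply, such as short speaking time, no restriction on the input (recording) device or recording location, and no speaker balancing.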

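(2) Recording specialists from around the world with long careers in this work have come together.

⑤ <Outstanding recording engineers>
・"Sound craftsmen" who have been engaged in speech recording for this purpose since the 1980s
・A professional-grade narration studio with a long track record of use for this purpose

⑥ <Three-person language checking>
・Checking for speakers' misreadings in read speech
・Checking whether the speaker is using the specified language or dialect
・Triple check by three people, matched to each stage: [before recording (speaker selection) → during recording → after recording (listening check)]

⑦ <Post-recording file preparation, transcription of free speech, etc.>
・Delivery in the audio and file formats you request
・(High-accuracy) transcription of free speech
・Fine-grained tagging of unnecessary words (不用語)
・Annotation of phonemes and readings (yomi)
・Proposals for our proprietary filler tagging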
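Division 2   ●Sales Section  ●Production Section

■Language Information Processing■
For manufacturers developing machine translation technology and products, and for universities and research institutions, we collect various kinds of language data and parallel translation data.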

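Division 3   ●Sales Section  ●Development Section

■Corpus Sales■
For manufacturers developing speech recognition technology, machine translation technology, and related products, and for universities and research institutions
~We build and sell original speech corpora, text corpora, and parallel corpora that are among the largest in the world~

View the catalog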
