Friday, April 5, 2019
Malay Speech Corpus
CHAPTER 3
MALAY SPEECH CORPUS

3.1 Introduction

The structure of the rules and grammar of any language must be understood in depth before developing any Automatic Speech Recognition (ASR) system. This chapter discusses the issues concerning the Malay language and its speech sounds. The Malay corpus and the test collections used for this study are also presented in the following sections.

3.2 Malay Speech Sounds and Language Rules

Malay is an Austronesian language spoken by the Malay people, who are native to the Malay Peninsula, southern Thailand, Singapore and parts of Sumatra. It is known locally as Bahasa Melayu. It is the official language of Malaysia and is an agglutinative language, meaning that the meaning of a word can be changed by adding the necessary prefixes or suffixes, as explained throughout this section.

The smallest unit in any language is known as a phoneme. Substituting one such unit for another may change the meaning (Nong et al. 2001). Combining phonemes produces syllables and words. Generally, phonemes in the Malay language are classified into three major groups: Vowels (V), Consonants (C) and a miscellaneous group (Manaf Hamid 1996). This structure is broadly the same as that of the English language, as shown in Figure 3.1 (Karim 1996). The vowel class comprises six vowels: /a/, /ə/, /i/, /o/, /u/ and /e/. A vowel sound is produced when air exits from the lungs and mouth without any noise.

The second category, the consonant class, can be further divided into seven categories: the stops or plosives, affricates, nasals, glides, liquids, fricatives and the semivowels. Consonant sounds are produced by air from the lungs and contain noise.
The noise is generated in the mouth and nose, for instance in the phonemes /p/ and /b/. Figure 3.2 shows the classification of consonant sounds for the Malay language.

The last category, the miscellaneous category, consists of the diphthongs and vowel functions. A vowel function is a combination of two different vowels (ia, io and iu). It occurs most often in words absorbed directly from their English equivalents, such as radio and audio, and in some original Malay words such as nyiur (coconut) and hias (decorate) (Hussain, 1997).

3.2.1 Malay morphology

Malay morphology is defined as the study of word structures in the Malay language (Lutfi Abas, 1971). The basic term used in morphology is the morpheme, which is the smallest meaningful unit in a language. In other words, a morpheme is a combination of phonemes into a meaningful unit. A Malay word can comprise one or more morphemes. Any discussion of Malay morphology must cover the process of word formation in the Malay language. Malay is a derivative language, which allows affixes to be added to the base, root or primary word to form new words. In this respect it differs from English, where the process involves changes in the phonemes according to their groups. Word formation in Malay produces primary words, derivative words, compound words and reduplicative words.

3.2.1.1 Primary word

Primary or root words are either nouns or verbs that do not take any affixes or reduplication. A primary word can comprise one or more syllables. A syllable consists of a vowel (V), a vowel with a consonant (C), or a vowel with several consonants. The vowel can appear before or after the consonants. In the Malay language there are only about 500 primary words with one syllable (Nik Safiah Karim et al. 1995). Most of these primary words are taken from other languages such as English and Arabic.
The structures of the syllable are shown in Table 3.1. Primary words with two syllables form the majority in the Malay language. Their structures are shown in Table 3.2, with example words illustrated in Figure 3.3. Primary words with three or more syllables are few in number. Most of them are taken from other languages, as shown in Table 3.3.

Table 3.1 Structure of words with one syllable

Syllable Structure    Example of word
CV                    Ya (yes)
VC                    Am (common)
CVC                   Sen (cent)
CCVC                  Stor (store)
CVCC                  Bank (bank)
CCCV                  Skru (screw)
CCCVC                 Skrip (script)

Table 3.2 Structure of words with two syllables

Syllable Structure    Example of word
V + CV                Ibu (mother)
V + VC                Air (water)
V + CVC               Ikan (fish)
VC + CV               Erti (meaning)
VC + CVC              Empat (four)
CV + V                Doa (pray)
CV + VC               Diam (silent)
CV + CV               Guru (teacher)
CV + CVC              Telur (egg)
CVC + CV              Lampu (lamp)
CVC + CVC             Jemput (invite)

Figure 3.3: Syllable breakdown of example words: ER + TI (VC + CV) and JEM + PUT (CVC + CVC), where C = consonant and V = vowel.

Table 3.3 Structure of words with three syllables or more

Syllable Structure         Example of word
CV + V + CV                Siapa (who)
CV + V + CVC               Siasat (investigate)
V + CV + V                 Usia (age)
CV + CV + V                Semua (all)
CV + CV + VC               Haluan (direction)
CVC + CV + VC              Berlian (diamond)
V + CV + CV                Utara (north)
VC + CV + CV               Isteri (wife)
CV + CV + CV               Budaya (culture)
CVC + CVC + CV             Sempurna (perfect)
CVC + CV + CVC             Matlamat (aim)
CV + CV + VC + CV          Keluarga (family)
CV + CVC + CV + CV         Peristiwa (event)
CV + CV + V + CVC          Mesyuarat (meeting)
CV + CV + CV + CVC         Munasabah (reasonable)
V + CV + CVC + CV + CV     Universiti (university)

3.2.1.2 Derivative word

Derivative words are formed by adding affixes to primary words. The affixes can occur at the beginning (prefixes), within (infixes) or at the end (suffixes) of a word. They can also occur at the beginning and end of a word at the same time. Affixes of this kind are called confixes.
Examples of derivative words are berjalan (walking), mempunyai (having) and pakaian (clothes).

3.2.1.3 Compound word

Compound words combine two individual primary words and carry a particular meaning. There are many compound words in the Malay language. Examples are alat tulis (stationery), jalan raya (road), kapal terbang (aeroplane), Profesor Madya (associate professor), hak milik (ownership) and pita suara (vocal folds). Some Malay idioms are compound words, such as kaki ayam (bare feet), buah hati (sweetheart), berat tangan (lazy) and terima kasih (thank you).

3.2.1.4 Reduplicative word

Reduplicative words, as the name suggests, are formed by reduplicating primary words. There are three forms of reduplication in the Malay language: full, partial and rhythmic. Examples of reduplicative words are mata-mata (policeman) and sama-sama (you are welcome).

3.3 Malay Speech Corpus Design

Malay speech corpus design basically involves the proper selection of speech target sounds for speech recognition. The Malay phonemes can be analyzed using descriptive analysis or distinctive feature analysis. Generally, descriptive analysis is preferred over distinctive feature analysis because it is easier to implement. To develop a baseline system for spoken Malay utterances or a word model, we need a database of isolated spoken Malay words. However, very little of the literature and reference material in Malay is available in electronic form to support research and development work. The materials that do exist are sometimes unsuitable for real-life speech recognition systems because of their recording environments, and most of them are recordings of planned or read text.

Since no spoken Malay database exists, we develop the Malay corpus based on Hansard documents from the Parliament of Malaysia.
The Hansard documents consist of the Dewan Rakyat (DR) parliamentary debate sessions for the year 2008. They contain spontaneous and formal speech and are the daily records of the words spoken by the 222 elected members of the DR. The Hansard documents comprise 51 large raw video and audio files (.avi format) of daily recorded parliamentary sessions and 42 text files (.pdf format). Each parliamentary session contains six to eight hours of speech recorded in a noisy environment (less than 30 dB), with interruptions by speakers (Malay, Chinese and Indian) and different speaking styles (low, medium and high intonation, or shouting). This kind of data was chosen because of the spontaneous and natural way of speaking, in formal or standard Malay, during the debate sessions.

The analysis covered all the recorded sessions in the Hansard documents from the middle to the end of 2008. Out of 42 text documents and 51 video files, only 22 text documents and 22 video files were selected, because their contents matched perfectly between the text and the video and audio source files. The remaining text documents and video files were not chosen because some text documents could not be downloaded, some video files were corrupted while being saved, and some recorded videos had missing sound. This study focuses on the videos that have audio, since they are used to develop the Malay corpus and to evaluate the performance of the isolated spoken Malay speech recognition system. The quantitative information about the selected videos and text documents is given in Table 3.4.

Table 3.4 Quantitative information of Hansard documents selected

No.   Video / Text Document    No. of Issues   No. of Speakers   Total Words
1.    DR28052008 (MAY)         11              129               40,283
2.    DR29052008 (MAY)         15              114               39,612
3.    DR24062008 (JUNE)        13              154               49,212
4.    DR25062008 (JUNE)        10              118               38,053
5.    DR30062008 (JUNE)        10              175               58,013
6.    DR02072008 (JULY)        14              187               67,906
7.    DR03072008 (JULY)        12              120               48,411
8.    DR07072008 (JULY)        16              210               72,890
9.    DR10072008 (JULY)        13              132               42,350
10.   DR28082008 (AUGUST)      10              123               40,780
11.   DR03112008 (NOVEMBER)    17              232               78,750
12.   DR04112008 (NOVEMBER)    11              136               43,440
13.   DR10112008 (NOVEMBER)    10              105               39,560
14.   DR20112008 (NOVEMBER)    16              109               42,795
15.   DR26112008 (NOVEMBER)    10              186               38,880
16.   DR27112008 (NOVEMBER)    10              147               41,450
17.   DR01122008 (DECEMBER)    7               118               38,430
18.   DR02122008 (DECEMBER)    9               176               56,815
19.   DR03122008 (DECEMBER)    12              152               48,616
20.   DR04122008 (DECEMBER)    11              192               56,780
21.   DR10122008 (DECEMBER)    6               130               38,677
22.   DR11122008 (DECEMBER)    10              143               52,369

The document analysis shows that the majority of the Malay words are primary words with two syllables or one syllable. Among the Malay words, the syllable structures VC, CV and CVC are the most common. These structures are preferred because they are pronounced exactly as written and occur in substantial numbers in the Hansard documents. In order to get a good distribution of consonants and vowels in the dataset, the primary (root or base) words most frequently spoken during the parliamentary debates are used. As mentioned previously, most of the root words are primary words, either nouns or verbs, without any derivation (prefixes and suffixes) or reduplication. Thus, from the text document analysis, we determined the 100 primary words most frequently spoken by the committee members during the debates: 10 primary words of one syllable, four primary words of three or more syllables, and 86 primary words of two syllables, as shown in Table 3.5. The detailed quantitative analysis of the distribution of each word is presented in Appendix A.
Each primary word has a maximum of 50 repetitions, uttered by the same or different speakers. Thus, a total of 5,000 isolated spoken Malay words is used for this research. The challenging task is to capture and segment the exact words being uttered from the audio in the video files. The process of creating the isolated spoken Malay corpus is illustrated in Figure 3.4 and briefly explained in the following sections.

Table 3.5 Selection of 100 isolated spoken Malay words as the speech target sounds

No.   Word       Structure            No.   Word       Structure
1     ADA        V + CV               51    LAGI       CV + CV
2     AHLI       VC + CV              52    LAIN       CV + VC
3     AKAN       V + CVC              53    LAMA       CV + CV
4     AKTA       VC + CV              54    LANGKAH    CVCC + CVC
5     ARAH       V + CVC              55    LEBIH      CV + CVC
6     ATAS       V + CVC              56    MAKLUM     CVC + CVC
7     ATAU       V + CVV              57    MANA       CV + CV
8     BAGI       CV + CV              58    MASA       CV + CV
9     BAIK       CV + VC              59    MASIH      CV + CVC
10    BAKAL      CV + CVC             60    MESTI      CVC + CV
11    BANK       CVCC                 61    MUNGKIN    CVCC + CVC
12    BARU       CV + CV              62    NANTI      CVC + CV
13    BEKAS      CV + CVC             63    OLEH       V + CVC
14    BERI       CV + CV              64    ORANG      V + CVCC
15    BINCANG    CVC + CVCC           65    PADA       CV + CV
16    BOLEH      CV + CVC             66    PIHAK      CV + CVC
17    BUAT       CV + VC              67    PRINSIP    CCVC + CVC
18    BUKAN      CV + CVC             68    PULA       CV + CV
19    DALAM      CV + CVC             69    PUN        CVC
20    DAN        CVC                  70    RAMAI      CV + CVV
21    DASAR      CV + CVC             71    RIBU       CV + CV
22    DATANG     CV + CVCC            72    RUJUK      CV + CVC
23    DENGAN     CV + CCVC            73    SAH        CVC
24    DIA        CVV                  74    SAMA       CV + CV
25    EKONOMI    V + CV + CV + CV     75    SANGAT     CV + CCVC
26    ESOK       V + CVC              76    SAYA       CV + CV
27    HADIR      CV + CVC             77    SEBAB      CV + CVC
28    HAK        CVC                  78    SEBUT      CV + CVC
29    HAL        CVC                  79    SEDANG     CV + CVCC
30    HARI       CV + CV              80    SEDIA      CV + CVV
31    HENDAK     CVC + CVC            81    SUDAH      CV + CVC
32    IAITU      VV + V + CV          82    SUSAH      CV + CVC
33    IALAH      VV + CVC             83    TADI       CV + CV
34    INGAT      VC + CVC             84    TAHU       CV + CV
35    INGIN      VC + CVC             85    TAHUN      CV + CVC
36    INI        V + CV               86    TIDAK      CV + CVC
37    ISU        V + CV               87    TANYA      CV + CCV
38    ITU        V + CV               88    TELAH      CV + CVC
39    IZIN       V + CVC              89    TENTANG    CVC + CVCC
40    JADI       CV + CV              90    TERIMA     CV + CV + CV
41    JANGAN     CV + CCVC            91    TIDAK      CV + CVC
42    JAWAB      CV + CVC             92    TIPU       CV + CV
43    JUGA       CV + CV              93    TUAN       CV + VC
44    JUTA       CV + CV              94
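The structure column in the word table above appears to label each letter of the spelling as C or V (for example, the "ng" in LANGKAH is counted as two consonants). Under that assumption, the flat consonant and vowel pattern of a word can be checked against its listed structure with a short Python sketch. The function names `cv_pattern`, `flatten_structure` and `matches_table` are hypothetical and are not part of the original study. The sketch does not attempt syllable division, only the letter-by-letter pattern.

```python
# Assumption: structures are letter-based, with A, E, I, O, U as vowels
# and every other letter as a consonant.
VOWELS = set("AEIOU")


def cv_pattern(word: str) -> str:
    """Map each letter of a word to 'V' (vowel letter) or 'C' (any other letter)."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in word.upper() if ch.isalpha()
    )


def flatten_structure(structure: str) -> str:
    """Drop the syllable separators, e.g. 'CVCC + CVC' becomes 'CVCCCVC'."""
    return "".join(ch for ch in structure.upper() if ch in "CV")


def matches_table(word: str, structure: str) -> bool:
    """Check that a word's letter pattern agrees with its listed structure."""
    return cv_pattern(word) == flatten_structure(structure)


# A few rows copied from the word table.
rows = [
    ("ADA", "V + CV"),
    ("BANK", "CVCC"),
    ("LANGKAH", "CVCC + CVC"),
    ("EKONOMI", "V + CV + CV + CV"),
    ("IAITU", "VV + V + CV"),
]

for word, structure in rows:
    print(word, cv_pattern(word), matches_table(word, structure))
```

A check of this kind only confirms that a word and its listed structure are consistent letter by letter. Placing the syllable boundaries themselves would need Malay syllabification rules, which this sketch does not cover.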