Score Normalisation in Voice Biometrics Term Paper

Abstract

Speaker verification determines whether a speaker is who he or she claims to be, and speaker identification determines which registered speaker matches the input voice. Score normalisation techniques transform a system's output scores to reduce misalignments between the score distributions of different speaker models. These misalignments are caused by speaker-dependent or speaker-independent factors, such as test data conditions and training conditions. The two main families of score normalisation are Bayesian methods and standardisation of score distributions. Bayesian methods include cohort normalisation, world model normalisation, and unconstrained cohort normalisation. Standardisation methods include Z-norm and T-norm. Score normalisation helps achieve separation between the score distributions of known and unknown speakers. A reduction in equal error rate is achieved by the use of score normalisation methods.

Introduction

Speaker recognition is required in applications that operate in uncontrolled environments or that transmit speech over communication channels. Speaker verification involves assessing similarity scores between users, registered or unregistered, and reference models. Verification scores are expected to be high for true speakers and low for impostors. However, true speaker scores can be adversely affected by background noise, variations in the speaker's speech, variations introduced by the recording apparatus, and effects of the communication channel. Score distribution plots show true speaker scores and impostor scores relative to each other.

Figure 1: True and Impostor Speaker Score Distribution (Ariyaeeinia, 2006)

Test utterances from true speakers and impostors obtained experimentally can be used to generate score distribution plots (see fig. 1). Because the true and impostor score distributions overlap, an acceptance threshold must be chosen.
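The effect of the acceptance threshold on overlapping score distributions can be sketched in a few lines of Python. This is a minimal illustration only: the scores are invented, the function names are hypothetical, and the convention that a score at or above the threshold is accepted is an assumption. The sketch counts the fraction of impostor scores that would be accepted and the fraction of true speaker scores that would be rejected, and sweeps the observed scores to find the threshold at which the two fractions are closest.

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Return (fraction of impostors accepted, fraction of true speakers rejected).

    Assumption: a score is accepted when it is >= threshold.
    """
    accepted_impostors = sum(1 for s in impostor_scores if s >= threshold)
    rejected_true = sum(1 for s in true_scores if s < threshold)
    return (accepted_impostors / len(impostor_scores),
            rejected_true / len(true_scores))


def closest_balance_threshold(true_scores, impostor_scores):
    """Try every observed score as a threshold and return the one at which
    the two error fractions are closest to each other."""
    def imbalance(t):
        far, frr = error_rates(true_scores, impostor_scores, t)
        return abs(far - frr)

    candidates = sorted(set(true_scores) | set(impostor_scores))
    return min(candidates, key=imbalance)


# Hypothetical scores, invented for illustration only.
true_scores = [0.9, 0.8, 0.7, 0.6, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5, 0.65]

threshold = closest_balance_threshold(true_scores, impostor_scores)
far, frr = error_rates(true_scores, impostor_scores, threshold)
print(threshold, far, frr)
```

Raising the threshold in this sketch lowers the first fraction and raises the second, which is the trade-off that any single threshold has to balance.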
Verification accuracy improves as the separation between the two score distributions increases. Overlap between the distributions results in two kinds of error: false acceptances, in which impostors are accepted as true speakers, and false rejections, in which true speakers are rejected. Adjusting the threshold reduces one type of error while increasing the other. A common compromise is to set the threshold so that the two error rates are equal. This operating point is known as the equal error rate (see fig. 2), where the false acceptance rate equals the false rejection rate.

Figure 2: Setting Threshold for Equal Error Rate (Ariyaeeinia, 2006)

The accuracy of speaker verification is represented by a detection error trade-off plot (see fig. 3).

Figure 3: Detection Error Trade-off Plot (Ariyaeeinia, 2006)

Variations in speech characteristics are caused by background noise and channel noise. Together with speaker-generated variations, these can cause a mismatch between training and test utterances, which reduces accuracy (Ariyaeeinia, 2006).

Score Normalisation Methods

Score normalisation methods have been widely used to improve accuracy. Several methods exist, depending on the approximation approach. Most are based on the mean of the scores for background speaker models:

Snorm = (score for target model) / (mean of scores for background models)

where Snorm is the normalised score. Using a ratio of scores rather than absolute scores has improved verification performance, since conditions that shift the score for the target model shift the background scores in a similar way, leaving the ratio largely unchanged.

The cohort normalisation method uses scores for a cohort of speaker models, in which the competing speakers are selected before testing according to the closeness of their models to the target model.
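Cohort normalisation as described above can be sketched as follows. This is a rough illustration under stated assumptions, not the implementation used in the cited work: the speaker ids, distances and scores are invented, the function names are hypothetical, the way closeness between models is measured is left open, and scores are assumed to be positive likelihood-like values so that a ratio is meaningful.

```python
from statistics import mean


def select_cohort(distances_to_target, cohort_size):
    """Return the ids of the background models closest to the target model.

    This selection is made before testing. distances_to_target maps a
    background speaker id to a model distance (how it is computed is an
    assumption left open here).
    """
    ranked = sorted(distances_to_target, key=distances_to_target.get)
    return ranked[:cohort_size]


def ratio_normalise(target_score, background_scores):
    """Snorm = score for target model / mean of scores for background models."""
    return target_score / mean(background_scores)


# Hypothetical values, invented for illustration only.
distances_to_target = {"spk_a": 0.2, "spk_b": 0.9, "spk_c": 0.4}
cohort = select_cohort(distances_to_target, cohort_size=2)

# Scores of one test utterance against the target and background models.
test_scores = {"target": 0.75, "spk_a": 0.5, "spk_b": 0.125, "spk_c": 0.25}
s_norm = ratio_normalise(test_scores["target"],
                         [test_scores[spk] for spk in cohort])
print(cohort, s_norm)
```

Because the cohort is fixed in advance from the target model alone, the same competing models are used for every test utterance, whoever produced it.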
In cohort normalisation, an impostor's test utterance may be equally dissimilar from the competing models and the target model, which gives rise to the possibility of false acceptance. Unconstrained cohort normalisation uses scores for the cohort of background speaker models closest to the test utterance. The background speaker models are selected during testing, which reduces impostor scores relative to true speaker scores. The method has been successful in reducing both false acceptance and false rejection. However, when the number of background models is increased, the capability to suppress impostor scores is diminished.

Other score normalisation methods are based on standardisation of score distributions. It is desirable to use a single threshold for all registered speakers, but the score distributions for impostors and true speakers have different characteristics. A widely used practice is to standardise the impostor score distribution. T-norm is an effective normalisation method in which the normalisation parameters are determined dynamically during testing:

Snorm = (St - µT) / σT

where Snorm is the normalised score, St is the initial score for the target speaker model, µT is the mean of the scores for the background speaker models, and σT is the standard deviation of those scores (Ariyaeeinia, 2006).

Snelick (2005) has described other common score normalisation methods, which include:

Min-max method: raw scores are mapped to the range 0 to 1. It can be expressed as n = (s - min(S)) / (max(S) - min(S)), where n is the normalised score, s is the raw matching score, and max(S) and min(S) are the maximum and minimum points of the score range.

Z-score method: the method is given by the expression n = (s - mean(S)) / std(S), where n is the normalised score, s is the raw matching score, mean(S) is the arithmetic mean and std(S) is the standard deviation.
Tanh method: the method is given by the expression n = 0.5 [tanh(0.01 (s - mean(S)) / std(S)) + 1], where n is the normalised score, s is the raw matching score, mean(S) is the arithmetic mean, std(S) is the standard deviation, and tanh() is the hyperbolic tangent function.

Score Normalisation Advantages

Ariyaeeinia et al. (2006) have emphasised that score normalisation helps achieve separation between the score distributions of known and unknown speakers. The two main categories of score normalisation are Bayesian methods and standardisation of score distributions. Bayesian methods include cohort normalisation, world model normalisation, and unconstrained cohort normalisation. Standardisation methods include Z-norm and T-norm. Speaker identification involves determining the correct speaker from a registered population. Speaker verification involves determining whether a speaker is who he or she claims to be. In the study, cohort methods exhibited the best performance. Score normalisation is used to overcome problems in scores that are affected by distortions in test utterance characteristics, speaker model misalignment, and unseen data.

In a comparative study of decoupled and adapted Gaussian mixture models in open-set, text-independent speaker identification, Fortuna et al. (2005) found that cohort approaches, particularly unconstrained cohort normalisation, were equally capable of good performance with both models. The normalisations in the study were world model normalisation, cohort normalisation, unconstrained cohort normalisation, T-norm and Z-norm. T-norm was among the worst performers with decoupled Gaussian mixture models and among the best performers with adapted Gaussian mixture models.

Score normalisation techniques transform a system's output scores to reduce misalignments between the score distributions of different speaker models. The misalignments are caused by speaker-dependent or speaker-independent factors, such as test data conditions and training conditions.
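The min-max, z-score and tanh methods attributed to Snelick (2005) earlier in this paper can be sketched as below. The expressions are assumed here in their commonly quoted forms, including the 0.01 scale factor inside the tanh; the use of the population standard deviation, the function names and the sample scores are likewise assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def min_max(s, scores):
    """Map s linearly so that min(scores) -> 0 and max(scores) -> 1."""
    lowest, highest = min(scores), max(scores)
    return (s - lowest) / (highest - lowest)


def z_score(s, scores):
    """Centre on the arithmetic mean and scale by the standard deviation.

    Assumption: the population standard deviation (pstdev) is used.
    """
    return (s - mean(scores)) / pstdev(scores)


def tanh_norm(s, scores, scale=0.01):
    """Squash the z-score into the open interval (0, 1) with tanh.

    Assumption: scale=0.01 follows the commonly quoted form of the method.
    """
    return 0.5 * (math.tanh(scale * z_score(s, scores)) + 1.0)


# Hypothetical raw matching scores, invented for illustration only.
raw_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

for s in raw_scores:
    print(s, min_max(s, raw_scores), z_score(s, raw_scores),
          tanh_norm(s, raw_scores))
```

Min-max depends on the two extreme scores, so a single outlier changes every normalised value, whereas the z-score and tanh forms depend on the mean and spread of the whole set.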
T-norm has been widely deployed as a score normalisation technique for improving the performance of speaker verification systems because of its low false acceptance rates. It is a test-dependent normalisation technique that estimates the score distribution of the test speech from a set of impostor models. A novel speaker-adaptive technique based on T-norm has been proposed for speaker verification. The technique uses a fast approximation of the Kullback-Leibler divergence for Gaussian mixture models. Stable reductions in error rates were obtained for all conditions (Ramos-Castro, 2005).

Score Normalisation Case Studies

In a study of text-dependent speaker verification, Ariyaeeinia et al. (1997) examined the verification performance of various types of vector quantisation and dynamic time warping classifiers, together with algorithmic issues and verification accuracy. Performance degradation caused by the linear filtering effect of a telephone channel was minimised by the cepstral mean normalisation approach, in which the average cepstral feature vector is computed and subtracted from the individual feature vectors.

Ma (2003) compared discriminative training methods for speaker verification. Widely used score normalisation techniques, such as T-norm and Z-norm, have been used in speaker verification systems to perform channel and handset compensation. When a discriminative score normalisation technique was applied, the methods performed better. However, additional speech data or external speakers were needed. In the experiment, a logistic regression model was used. Logistic regression proved to be an effective score normalisation technique that could be combined with other model training methods.

A normalised discriminant analysis method for speaker verification was presented by Li et al. (1996) to address problems in the use of linear discriminant analysis in the design of classifiers.
Because the training data set was small, discriminant scores from different classifiers were scaled differently. In the technique, a weight vector was used to maximally separate the projected data of the true speaker and the impostors. An equal error rate of 6.13 percent was achieved, whereas Fisher linear discriminant analysis resulted in an error rate of 18.18 percent. Combining the method with Hidden Markov Models in a hybrid speaker verification system resulted in an error rate of 4.32 percent, which was lower than the 5.30 percent of the Hidden Markov Model with cohort normalisation.

Alsaade et al. (2008) have proposed unconstrained cohort normalisation for multimodal biometrics in the score-level fusion process. The study examined the application of score normalisation methods widely used in voice biometrics to other biometrics. The normalisation methods considered were cohort normalisation, unconstrained cohort normalisation, universal background model normalisation, T-norm and Z-norm. Speaker recognition involves computing the probability of the target model given the test utterance, where statistical classifiers provide the verification score. Another approach to score normalisation is based on standardisation of score distributions. The aim is to facilitate the use of a single threshold for all speakers. However, the impostor and true speaker score distributions have different characteristics for different speakers. Standardising the impostor score distribution has been the current practice.

In a study of speaker verification using mixture decomposition discrimination, Sukkar et al. (2000) showed that an error reduction of 46 percent was achieved with a hybrid verification system involving speaker-dependent Hidden Markov Modelling with cohort normalisation. In the experiment, the same word spoken by different speakers caused different Hidden Markov Model mixture components to dominate.
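Standardisation of the impostor score distribution, referred to above, can be sketched for the two variants named in this paper, Z-norm and T-norm. The sketch assumes the usual distinction between them: Z-norm estimates the impostor mean and standard deviation by scoring impostor utterances against the target model, while T-norm estimates them at test time by scoring the test utterance against impostor models. The toy scoring function, which treats models and utterances as single numbers, is purely hypothetical, as are all names and values; the sample standard deviation is an arbitrary choice.

```python
from statistics import mean, stdev


def standardise(raw_score, impostor_scores):
    """Snorm = (St - mu) / sigma, with mu and sigma taken from impostor scores.

    Assumption: the sample standard deviation (stdev) is used.
    """
    return (raw_score - mean(impostor_scores)) / stdev(impostor_scores)


def z_norm(test_utterance, target_model, impostor_utterances, score):
    """Parameters come from impostor utterances scored against the target model."""
    raw = score(target_model, test_utterance)
    impostor_scores = [score(target_model, u) for u in impostor_utterances]
    return standardise(raw, impostor_scores)


def t_norm(test_utterance, target_model, impostor_models, score):
    """Parameters come from the test utterance scored against impostor models."""
    raw = score(target_model, test_utterance)
    impostor_scores = [score(m, test_utterance) for m in impostor_models]
    return standardise(raw, impostor_scores)


def toy_score(model, utterance):
    """Hypothetical stand-in for a real classifier score: closer is higher."""
    return -abs(model - utterance)


# Hypothetical one-dimensional "models" and "utterances" for illustration.
target_model, test_utterance = 1.0, 1.25
impostors = [3.0, 4.0, 5.0]

z = z_norm(test_utterance, target_model, impostors, toy_score)
t = t_norm(test_utterance, target_model, impostors, toy_score)
print(z, t)
```

In this arrangement the Z-norm parameters depend only on the target model and could be computed once in advance, whereas the T-norm parameters must be recomputed for every test utterance.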
Speaker verification output scores are transformed during score normalisation, which serves to enhance the effectiveness of the detection threshold. This is achieved by aligning the score distributions of individual speaker models and reducing the effects of speaker-dependent and speaker-independent modifications of the signal. T-norm and Z-norm are common normalisation techniques. T-norm estimates its parameters from scores derived from impostor models. Z-norm estimates its parameters from scores for a set of impostor utterances. In an experiment, T-norm was extended to Adaptive T-norm, which offers advantages over standard T-norm. This was achieved by adapting the speaker set to the target model, and it resulted in lower error rates than traditional T-norm (Sturim, 2005).

Hébert et al. (2005) have described a T-norm technique for text-dependent speaker verification. T-norm is an extension of cohort normalisation that has proved to be very effective in normalising verification scores. In a text-dependent task, mismatch between the lexicon of the target speaker and that of the cohort speaker models has made the deployment of T-norm a challenge. The researchers proposed a hybrid scoring scheme using T-norm and a background model to overcome the problem. This resulted in a 31 percent relative error rate reduction compared with the use of T-norm alone.

Score Normalisation Applications

The Naval Research Laboratory has embarked on a study of voice biometrics, with the ultimate goal of enabling the use of voice as a password. The speech normalisation methods used in the study included normalisation of the peak amplitude of the speech waveform, adaptive boosting of high-frequency speech for spectral analysis, a fixed rule for cropping the speech waveform, a wider bandwidth for extracting more voice features, and removal of the speech distortion caused by the use of a gas mask.
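Of the speech normalisation steps listed above, peak-amplitude normalisation is the simplest to illustrate. The sketch below is only a generic example of the idea and is not taken from the cited study: the function name, the target peak of 1.0 and the handling of silent input are all assumptions.

```python
def normalise_peak(samples, target_peak=1.0):
    """Scale a waveform so that its largest absolute sample equals target_peak.

    Assumption: an all-zero (silent) waveform is returned unchanged.
    """
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [x * gain for x in samples]


# Hypothetical waveform samples, invented for illustration only.
waveform = [0.1, -0.5, 0.25]
print(normalise_peak(waveform))
```

Scaling every utterance to the same peak removes differences in recording level before features are extracted, without altering the shape of the waveform.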
A voice biometrics system has been designed in which the speaker selects a test phrase and carries his or her own speech template, and which involves pre-processing of the speech waveform for normalisation, optimisation of voice biometrics performance, and calibration of the self-test score (Kang, 2002).

Arslan et al. (2009) proposed a speaker authentication and identification system in the VOICIFY project. Speaker verification involves determining the identity of the speaker from a voice sample. Speaker identification involves determining matches to the input voice. Other systems include text-dependent systems and vocabulary-dependent systems. The system was designed for high precision and for robustness against channel transformations and noise, making it suitable for telephony applications for security purposes. Speaker verification was proposed in three steps. The first step was to extract features that were speaker dependent. The second step was to build a statistical model representing the characterisation of the feature set. The third step involved making a decision about the input voice by comparing it with previously developed speaker models. The benefits of the proposed project include improved security, reduced costs, improved service and time savings across industry sectors such as financial services, telecom, retail, enterprise and information technology, travel, internet, hospitals, insurance, government, and military.

Conclusion

Score normalisation methods include Bayesian methods and standardisation of score distributions. Score normalisation helps achieve separation between the score distributions of known and unknown speakers. A reduction in equal error rate is achieved by the use of score normalisation methods (see fig. 4).

Figure 4: Effectiveness of Score Normalisation (Ariyaeeinia, 2006)

References

Alsaade, F. (2008). Enhancement of multimodal biometric segregation using unconstrained cohort normalisation. Pattern Recognition. 41 (2008), 814-820.

Ariyaeeinia, A. (2006). Verification effectiveness in open-set speaker identification. IEE Proc.-Vis. Image Signal Process. 153 (5), 618-624.

Ariyaeeinia, A. (1997). Comparison of VQ and DTW Classifiers for Speaker Verification. European Conference on Security and Detection. Conference Publication No. 437 (28-30 April), 142-146.

Arslan, L. (2009). Handset Normalization for Voice Authentication (VOICIFY). GVZ Speech Technologies Co. 2009, 1-5.

Fortuna, J. (2005). On the Use of Decoupled and Adapted Gaussian Mixture Models for Open-Set Speaker Identification. Proceedings of The Third COST 275 Workshop. Biometrics on The Internet (2005), 41-44.

Hébert, M. (2005). T-Norm for Text-Dependent Commercial Speaker Verification Applications: Effect of Lexical Mismatch. ICASSP 2005. ICASSP (0-7803-8874-7/05/), 729-732.

Kang, G. (2002). Voice Biometrics for Information Assurance Applications. Naval Research Laboratory. NRL/FR/5550--02-10,044 (December 5), 1-44.

Li, Q. (1996). Normalized Discriminant Analysis with Application to a Hybrid Speaker-Verification System. IEEE. 1996 (0-7803-3192-3), 681-684.

Ma, C. (2003). Comparison of Discriminative Training Methods for Speaker Verification. ICASSP 2003. ICASSP (0-7803-7663-3/0), 192-195.

Ramos-Castro, D. (2005). Speaker Verification Using Fast Adaptive Tnorm Based on Kullback-Leibler Divergence. Proceedings of The Third COST 275 Workshop. Biometrics on The Internet (2005), 49-52.

Snelick, R. (2005). Large Scale Evaluation of Multimodal Biometric Authentication Using State-of-the-Art Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (3), 450-455.

Sturim, D. (2005). Speaker Adaptive Cohort Selection for Tnorm in Text-Independent Speaker Verification. ICASSP 2005. ICASSP (0-7803-8874-7/05/), 741-744.

Sukkar, R. (2000). Speaker Verification Using Mixture Decomposition Discrimination. IEEE Transactions on Speech and Audio Processing. 8 (3), 292-299.
