Alumni
Bio
I’m from Pittsford, a suburb of Rochester, New York, and am currently a senior in the cognitive science BS program at SUNY Oswego, with minors in audio production and design, and in computer science. Outside of academics, I write and produce hip-hop/electronic music under the name DISHMINT. I’m considering graduate study in either human-computer interaction or brain and cognitive sciences. I’m interested in belief systems, the hard problem of consciousness, sleep and dreaming, and the framework for cognitive codification.
Project: Who Speaks When?
Given a speech recording, the program must split the recording into speaker-homogeneous segments and display them on a timeline. This process is known as speaker diarization, and it answers the question “Who spoke when?” In this project I will use unsupervised learning, which makes the algorithm text-independent.
Speaker recognition has two parts, enrollment and verification (a toy sketch follows this list):
Enrollment → Short or relatively long phrases from each speaker are used to build a model of that speaker.
Verification → A new voice sample is accepted or rejected depending on how well it matches a speaker’s model.
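As a toy illustration of these two steps, the sketch below enrolls two “speakers” and then verifies a new clip against the model, assuming a recent Wolfram Language version in which Classify accepts Audio objects directly. Pure tones at different pitches stand in for real voices; the speaker names, frequencies and clip counts are all illustrative assumptions, not part of the project.

(* Toy enrollment/verification sketch: pure tones stand in for voices. *)
(* All names and parameters here are illustrative assumptions.         *)
lowVoice  = Table[AudioGenerator[{"Sin", 110 + RandomReal[5]}, 1], 5];
highVoice = Table[AudioGenerator[{"Sin", 220 + RandomReal[5]}, 1], 5];

(* enrollment: build one model covering both "speakers" *)
model = Classify[<|"speakerA" -> lowVoice, "speakerB" -> highVoice|>];

(* verification: accept or reject a new clip via its class probabilities *)
model[AudioGenerator[{"Sin", 112}, 1], "Probabilities"]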
In the Wolfram Language, FeatureExtraction can be used to build a feature representation of each speaker’s audio, and ClusterClassify can then cluster the audio segments by speaker. Because of the dynamics of real conversations, the audio must be preprocessed or filtered: noise, relative volume, gender differences (pitch and formants) and the number of simultaneous speakers all need to be accounted for. The machine learning functions in the Wolfram Language automate much of this preprocessing. With the audio displayed as an audio plot, the speaker labels can be overlaid on the timeline.
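The following is a minimal sketch of this unsupervised pipeline, not a definitive implementation. It assumes a two-speaker recording in a hypothetical file conversation.wav, one-second segments, and mean MFCC vectors as segment features; the file name, segment length, feature choice and cluster count are all illustrative assumptions.

(* Minimal unsupervised diarization sketch; file name, segment length,  *)
(* feature choice and cluster count are all illustrative assumptions.   *)
audio = AudioNormalize[Import["conversation.wav", "Audio"]];

(* split the recording into short, roughly speaker-homogeneous chunks *)
segments = AudioPartition[audio, 1.];

(* summarize each chunk by its mean MFCC vector *)
mfcc[seg_] := Mean[AudioLocalMeasurements[seg, "MFCC"]["Values"]];
features = mfcc /@ segments;

(* learn a reduced representation, then cluster the segments into speakers *)
extractor = FeatureExtraction[features];
clusterer = ClusterClassify[extractor[features], 2];  (* assume two speakers *)
labels = clusterer[extractor[features]];

(* labels can then be overlaid on AudioPlot[audio] to show who speaks when *)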
References
[1] Wolfram Research, Inc., “FeatureExtraction,” Wolfram Language System & Documentation Center. (Sep 14, 2016) reference.wolfram.com/language/ref/FeatureExtraction.html.
[2] Wolfram Research, Inc., “ClusterClassify,” Wolfram Language System & Documentation Center. (Sep 14, 2016) reference.wolfram.com/language/ref/ClusterClassify.html.
[3] M. Lutter, “Mel-Frequency Cepstral Coefficients,” SR Wiki. (Nov 25, 2014) recognize-speech.com/feature-extraction/mfcc.
[4] R. Galgon, “Now Available: Speaker & Video APIs from Microsoft Project Oxford,” Cortana Intelligence and Machine Learning Blog. (Dec 14, 2015) blogs.technet.microsoft.com/machinelearning/2015/12/14/now-available-speaker-video-apis-from-microsoft-project-oxford.
[5] A. C., “Uniquely Determining Identity Using Computer-Based Analysis of Human Speech,” Advanced Authentic Research. (Sep 14, 2016) aar.pausd.org/project/cheungchen.
[6] H. Bredin and C. Barras, “PhD—Machine Learning/Structured Prediction for Speaker Diarization,” Hervé Bredin. (Sep 14, 2016) herve.niderb.fr/students/phd/structured_prediction.html.
[7] B. Marr, “Machine Learning: What It Is and the Milestones Everyone Should Know About?” LinkedIn. (Feb 24, 2016) www.linkedin.com/pulse/machine-learning-what-milestones-everyone-should-know-bernard-marr.
[8] D. A. Reynolds, “Automatic Speaker Recognition: Current Approaches and Future Trends,” Lexington, MA: MIT Lincoln Laboratory. www.ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/020314_Reynolds.pdf.
[9] S. Furui, “Speaker Recognition,” Scholarpedia.org. (Oct 10, 2007) www.scholarpedia.org/article/Speaker_recognition.
Favorite 3-Color 2D Totalistic Cellular Automaton
Rule 3109912
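One possible way to render this rule in the Wolfram Language is sketched below, assuming the standard 3-color, 9-neighbor totalistic rule numbering; the single-cell initial condition and the step count are arbitrary choices.

(* Sketch: rule 3109912 as a 3-color, 9-neighbor totalistic 2D CA, run for *)
(* 50 steps from a single nonzero cell; assumptions as noted above.        *)
ArrayPlot[Last[CellularAutomaton[{3109912, {3, 1}, {1, 1}}, {{{1}}, 0}, 50]]]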