An Explainable AI Framework for Voice Command Classification
DOI: https://doi.org/10.71366/ijwos03022659307

Keywords: Explainable AI, Voice Command Classification, SHAP, LIME, Grad-CAM, MFCC, Attention Mechanism, CNN, Phoneme Alignment, Interpretability, Deep Learning
Abstract
Voice-command classifiers are rarely scrutinized: because they work adequately most of the time, few users probe how they reach their decisions. For high-stakes applications, however, passive acceptance carries too much risk. Our goal was not to show that neural networks can recognize spoken commands, which is well established, but to extract from a trained model reasoning clear enough for clinicians, engineers, or regulators to act on. We trained a convolutional neural network augmented with an attention mechanism on MFCC features from Google Speech Commands v2, a corpus of more than one hundred thousand utterances across thirty-five labels, reaching 97.3% precision. On top of this model we applied three interpretability techniques: SHAP, LIME, and Grad-CAM. A separate alignment stage mapped their temporal relevance scores onto individual phonemes. The resulting explanations remained faithful to model behavior while staying plausible in phonetic terms. Unexpectedly, the three techniques diverged sharply on utterances near category boundaries.
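To make the pipeline described above concrete, the sketch below wires MFCC extraction into a small attention-augmented CNN and computes a Grad-CAM relevance map over the MFCC time-frequency plane. It is a minimal sketch, not the authors' published implementation: the layer sizes, n_mfcc=40, and the names mfcc_features, AttentiveCNN, grad_cam, and right.wav are illustrative assumptions, and the SHAP, LIME, and phoneme-alignment stages are omitted.

# Minimal sketch of the pipeline in the abstract: MFCC extraction, a CNN
# with soft attention over time frames, and a Grad-CAM relevance map.
# All architecture details and helper names here are illustrative
# assumptions, not the authors' implementation.
import librosa
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 35  # Google Speech Commands v2 label count

def mfcc_features(path, sr=16000, n_mfcc=40):
    """Load a one-second utterance and return a (1, n_mfcc, frames) tensor."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=sr)              # pad/trim to 1 s
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(m).float().unsqueeze(0)      # add channel dim

class AttentiveCNN(nn.Module):
    """Two conv blocks over the MFCC 'image', then attention over time."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.attn = nn.Linear(64, 1)          # scores each time frame
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):                     # x: (B, 1, n_mfcc, T)
        h = self.conv(x)                      # (B, 64, F', T')
        h = h.mean(dim=2).transpose(1, 2)     # pool freq axis -> (B, T', 64)
        w = F.softmax(self.attn(h), dim=1)    # attention weights over time
        ctx = (w * h).sum(dim=1)              # weighted temporal context
        return self.head(ctx), w.squeeze(-1)

def grad_cam(model, x, target_class):
    """Gradient-weighted activation map over the last conv layer."""
    acts, grads = {}, {}
    layer = model.conv[3]                     # second Conv2d in the stack
    fh = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    bh = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits, _ = model(x)
    logits[0, target_class].backward()
    fh.remove(); bh.remove()
    weights = grads['g'].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * acts['a']).sum(dim=1))       # (B, F', T')
    return cam / (cam.max() + 1e-8)

# Usage on a hypothetical recording of the command "right":
x = mfcc_features("right.wav").unsqueeze(0)   # (1, 1, 40, ~32)
model = AttentiveCNN()
logits, attn = model(x)
cam = grad_cam(model, x, logits.argmax().item())

The time axis of such a map, projected back through the pooling layers to frame timestamps and intersected with phoneme boundaries from a forced aligner, would correspond to the separate phoneme-alignment stage mentioned in the abstract.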
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


