Revista de Logopedia, Foniatría y Audiología, Vol. 45, No. 1 (January–March 2025)

Original article
User profiles and auditory-perceptual evaluations upon the launch of the All-Voiced app for voice evaluation: Initial insights and training potential
Neus Calaf
Department of Basic, Developmental and Educational Psychology, Autonomous University of Barcelona, Spain
Abstract
Introduction

Auditory-perceptual evaluation is key for diagnosing voice disorders, but variability in judgments underscores the need for improved training materials. The All-Voiced app was developed to enhance consistency in evaluations through real-time feedback and data-driven training.

Objective

To present preliminary findings from the first 75 days following the launch of the latest version of the app (September 2024), focusing on user profiles and voice evaluations, exploring the app's potential for deeper insights as more data is gathered.

Method

The latest version of the All-Voiced app, launched in September 2024, includes a fully integrated backend for data collection from users who provide consent. The app enables users to practice voice evaluations, receive feedback, and contribute to research. Descriptive statistics were used to analyze user profiles and evaluations from the first 75 days post-launch of this latest version. Box plots and scatter plots were used to compare the evaluations of All-Voiced users with PVQD ratings (Walden, 2022) across different competence levels.

Results

A total of 264 participants registered in the app, with daily registration patterns showing consistent activity. Most participants were aged 21–30 (49%), identified with “she/her” pronouns (88%), and were from the United States (51%). The majority were speech-language pathologists (78%), and 40% were beginners in terms of competence level. A total of 557 evaluations were collected across 112 voice samples. Analysis of the three most frequently evaluated voice samples revealed distinct patterns of consensus among evaluators with varying levels of expertise, hinting at trends that could have significant implications as more data is collected.

Conclusion

The All-Voiced app has been well received, particularly by speech-language pathologists, highlighting its promise as a tool for auditory-perceptual training. By collecting and analyzing large-scale data, the app holds potential to address limitations in the subjective nature of evaluations, enhance reliability, and support evidence-based practices in voice disorder management.

Keywords:
Auditory-perceptual evaluation
Voice disorders
CAPE-V
All-Voiced app
Voice assessment training
Introduction

Auditory-perceptual evaluation is a critical diagnostic tool in assessing and documenting voice disorders (Barsties & De Bodt, 2015; Kreiman, Gerratt, Kempster, Erman, & Berke, 1993; Oates, 2009; Roy et al., 2013). Despite its widespread use, this method faces significant challenges, particularly due to variability influenced by both the listener's perception and the characteristics of the vocal stimulus (Kreiman, Vanlancker-Sidtis, & Gerratt, 2003). To address these challenges and improve the reliability of judgments, the CAPE-V (Consensus Auditory-Perceptual Evaluation of Voice) was developed as a standardized tool for consistently evaluating voice quality across clinical settings (Kempster, Gerratt, Verdolini Abbott, Barkmeier-Kraemer, & Hillman, 2009).

To ensure broader applicability across diverse populations, the CAPE-V has been adapted into multiple languages, including Brazilian Portuguese (Behlau, 2004; Behlau, Rocha, Englert, & Madazio, 2022), European Portuguese (de Almeida, Mendes, & Kempster, 2019; Jesus et al., 2009), Persian (Salary Majd et al., 2014), Italian (Mozzanica, Ginocchio, Borghi, Bachmann, & Schindler, 2014), Spanish (Núñez-Batalla, Morato-Galán, García-López, & Ávila-Menéndez, 2015), Mandarin (Chen et al., 2018), Turkish (Ertan-Schlüter, Demirhan, Ünsal, & Tadıhan-Özkan, 2020; Özcebe, Aydinli, Tiğrak, İncebay, & Yilmaz, 2019), Kannada (Gunjawate, Ravi, & Bhagavan, 2020), Hindi (Joshi, Baheti, & Angadi, 2020), Japanese (Kondo et al., 2021), Tamil (Venkatraman, Mahalingam, & Boominathan, 2022), Malay (Mohd Mossadeq, Mohd Khairuddin, & Zakaria, 2022), Catalan and Spanish (Calaf & Garcia-Quintana, 2024), and European French (Pommée, Mbagira, & Morsomme, 2024). These adaptations allow for greater cultural and linguistic relevance while maintaining the tool's core objective of delivering reliable perceptual assessments. The language-specific versions have significantly expanded the CAPE-V's usability, making it accessible to clinicians and researchers worldwide.

However, while these adaptations extend the CAPE-V's reach to a wider range of populations, the reliability of auditory-perceptual judgments remains a key concern. Ensuring strong inter- and intra-rater reliability is critical to the tool's effective application in diverse contexts.

Inter-rater reliability of the auditory-perceptual judgments

Inter-rater reliability—which measures the consistency of evaluations between different raters—has generally been reported as high for the CAPE-V across various languages and vocal attributes. For instance, the original English version demonstrated strong inter-rater reliability for overall severity (Karnell et al., 2007). Adaptations of the CAPE-V have also shown high inter-rater reliability in certain cases. The European Portuguese version (Jesus et al., 2009) exhibited excellent inter-rater reliability for breathiness and loudness, while the Kannada version (Gunjawate et al., 2020) displayed consistently high reliability across all attributes. Similarly, the Italian adaptation (Mozzanica et al., 2014) reported high inter-rater reliability for overall severity and breathiness, and the Turkish version (Özcebe et al., 2019) demonstrated strong reliability for overall severity and loudness. The Japanese adaptation (Kondo et al., 2021) also showed high inter-rater reliability across all attributes. Finally, the European French version (Pommée, Shanks, Morsomme, Michel, & Verduyckt, 2024) demonstrated good reliability for overall severity.

However, not all attributes performed consistently across studies. The original English version exhibited low inter-rater reliability for strain, pitch, and loudness (Zraick et al., 2011). Likewise, the Persian (Salary Majd et al., 2014), Brazilian Portuguese (Behlau et al., 2022), and Turkish versions (Ertan-Schlüter et al., 2020) demonstrated low inter-rater reliability for pitch and loudness. Roughness also showed variability, with lower inter-rater reliability reported in the English (Kelchner et al., 2010; Zraick et al., 2011), Brazilian Portuguese (Behlau et al., 2022), Tamil (Venkatraman et al., 2022), and Hindi (Joshi et al., 2020) versions. The Tamil version further showed lower inter-rater reliability for pitch and strain. Additionally, in the bilingual Catalan/Spanish version (Calaf & Garcia-Quintana, 2024), inter-rater reliability for breathiness and loudness was notably low, further highlighting the challenge of achieving consistency in perceptual voice assessments. Similarly, the European French version (Pommée, Shanks, et al., 2024) showed low inter-rater reliability for most parameters except overall severity.

Intra-rater reliability of the auditory-perceptual judgments

Intra-rater reliability—which measures the consistency of the same evaluator over time—has generally been reported as high for the CAPE-V across different languages and vocal attributes, though some variability persists. For instance, the original English version demonstrated strong intra-rater reliability for overall severity (Karnell et al., 2007). Subsequent adaptations, such as the Italian version (Mozzanica et al., 2014), consistently showed high intra-rater reliability across attributes like overall severity, roughness, and breathiness. Similarly, the Spanish (Núñez-Batalla et al., 2015), Turkish (Özcebe et al., 2019), Japanese (Kondo et al., 2021), and Kannada versions (Gunjawate et al., 2020) displayed strong intra-rater reliability across most attributes. Finally, the European French version (Pommée, Shanks, et al., 2024) also demonstrated good intra-rater reliability for overall severity, strain, and pitch.

However, some studies have reported lower intra-rater reliability in specific dimensions. The English version exhibited lower intra-rater reliability for strain (Zraick et al., 2011). Likewise, the Persian version (Salary Majd et al., 2014) showed reduced intra-rater reliability for pitch and loudness, while the Brazilian Portuguese version (Behlau et al., 2022) also reported lower intra-rater reliability for breathiness and loudness. The Tamil version (Venkatraman et al., 2022) demonstrated moderate intra-rater reliability for pitch and strain, while the bilingual Catalan/Spanish version (Calaf & Garcia-Quintana, 2024) reported particularly low intra-rater reliability for loudness, highlighting the ongoing need for tools and strategies that can help evaluators achieve more consistent auditory-perceptual judgments.

The need for training materials development

The variability in auditory-perceptual judgments underscores the need for further standardization and the development of reference materials to improve reliability. Efforts such as the simulations created by the University of Wisconsin-Madison (Connor, Bless, Dardis, & Vinney, 2008), the Perceptual Voice Qualities Database (Walden, 2022), and the recent anchor and training sample set for auditory-perceptual voice evaluation (Labaere, De Bodt, & Van Nuffelen, 2023) aim to reduce this variability by providing structured materials that help standardize perceptual judgments.

The University of Wisconsin-Madison simulations consist of 45 cases, each including CAPE-V sentence samples, conversational speech, and sustained vowels /a/ and /i/. Each case also provides a detailed patient history, which can be used separately for class discussions or reflections on how medical conditions or symptoms might influence a patient's voice and what further assessments are needed for a complete voice evaluation. While the quality of this material is excellent, the application of the CAPE-V remains on paper, is not digitized, does not collect data, and offers limited feedback, as it only allows users to compare their evaluations to a single reference.

The Perceptual Voice Qualities Database (Walden, 2022) offers greater potential, with each sample being evaluated twice by 3–4 evaluators. However, the raw data from these evaluations require processing to be effectively used in training programs. Unlike the simulations, this resource presents a larger dataset and the possibility of deeper analysis, but its full integration into educational settings remains dependent on further development to transform the raw data into actionable training materials.

The anchor and training sample set developed by Labaere et al. (2023) uses the GRBAS scale (Hirano, 1981), which, while still widely used, is now superseded by the standardized CAPE-V. The CAPE-V is internationally recognized for its standardized protocol for sample collection, refined vocal attributes, and use of a visual analog scale (VAS), allowing parametric statistical tests and improving reliability across clinical settings. A strength of Labaere et al.’s material is the strict reliability criterion for anchor samples, requiring 90% expert agreement on at least one attribute. However, relying on the GRBAS scale limits its alignment with more up-to-date practices like the CAPE-V, which offers better standardization in auditory-perceptual evaluations and allows for a more nuanced assessment than a four-category scale.

In response to these challenges, the All-Voiced app (Calaf, 2024b) was developed to provide a more accessible, scalable, and data-driven platform for both training and research in voice evaluation. The app, which is currently free to use, focuses on improving the consistency and accuracy of auditory-perceptual ratings by allowing users to practice their evaluations, receive real-time feedback, and compare their assessments with those of previous evaluators. This feedback system, combined with the app's interactive features, is designed to enhance reliability in both inter-rater and intra-rater assessments, addressing key gaps in the field.

The All-Voiced app is designed not only as a training tool but also as a research platform, with the goal of gathering large volumes of perceptual data from users with varying levels of expertise. This data collection is intended to support ongoing research aimed at refining auditory-perceptual voice evaluation techniques and identifying patterns that could inform the development of more targeted training materials and clinical tools in the future.

The primary aim of this study is to present preliminary findings on the user profiles and perceptual judgments collected during the first 75 days following the launch of the latest All-Voiced app version. Specifically, this study analyzes the user registration patterns and voice evaluations of the most frequently assessed samples, and highlights the potential of the app to provide valuable insights into the perception of vocal qualities as more data is collected over time.

Method

App description

Development and versions

The All-Voiced app (Calaf, 2024b) has been developed iteratively, with significant milestones marking its progression. The app is built using modern web technologies, with React for the frontend and Node.js with Express for the backend, ensuring a responsive and scalable user experience. The first version of the app was tested from January to April 2024 with smaller groups of users, including students and participants in research projects led by the author.

On April 16th, 2024, coinciding with World Voice Day, the app was publicly launched with limited functionalities. A major update followed on May 31st, 2024, with the release of CAPE-V resources, published after obtaining permission from ASHA, and aligned with the 53rd Annual Symposium of the Voice Foundation. Notably, the All-Voiced app was cited during the special session Describing Voices, moderated by Nancy Solomon (Nagle, 2024, May 31).

The most recent version of the app, which is the focus of this study, was launched on September 4th, 2024. This version includes a fully integrated backend system allowing users to create accounts and contribute to science and education by sharing their evaluation data. This version was presented on September 6th at the Voice Lab of the Pan-European Voice Conference (PEVOC) in Santander, Spain (Calaf, 2024a, September 6).

User roles and access

The app can be accessed partially without the need for a user account, allowing users to explore some features freely. However, creating an account unlocks additional functionalities. Upon registration, users can choose from four options to declare their voice competence level:

1. Beginner: Limited or no experience in the field of voice.
2. Intermediate: Some practical experience with a basic understanding of voice concepts.
3. Advanced: Good experience in voice, able to handle most cases independently.
4. Expert: Extensive experience in voice, recognized as a specialist in the field.

All users, when registering, can choose to agree that their data may be used in research projects, in accordance with the app's Privacy Policy. For users who declare their voice competence level as “Advanced” or “Expert,” an additional option is also available: they can choose to share their evaluations anonymously with the community, providing valuable feedback to other users. Before sharing each evaluation, these users are prompted to give explicit consent, ensuring that they consider that specific evaluation valuable as feedback for others.

Features and functionalities

All registered users have access to the full range of resources available in the app, regardless of their declared competence level. These resources include:

- Resources for Autonomous Training: These resources allow users to practice voice evaluations independently. After completing each evaluation, users receive feedback in the form of comparison data between their own evaluations and those provided by other evaluators. The samples in this section are presented randomly from a pre-existing database, allowing users to gain experience with diverse voices.

- Resources for Teaching and Learning: Instructors choose specific voice samples for students to evaluate, guiding the training toward particular learning goals. These resources create an interactive and engaging environment, where participants can compare and discuss their evaluations in group sessions, encouraging thoughtful reflection. Students also have access to home training, where they evaluate instructor-selected samples and record their results in the required format, reinforcing the skills learned in class.

- Resources for Research: This section is dedicated to gathering essential resources and offering collaborations and personalized services for auditory-perceptual evaluation studies. In contrast to the other two resource types, users in the research section utilize their own voice samples, contributing original data to studies, rather than drawing from the pre-existing database.

The app's comprehensive set of resources ensures that users at all levels—whether beginners or experts—can benefit from structured learning, autonomous training, or active contributions to research. By integrating opportunities for feedback, collaboration, and data sharing, the app serves as a valuable platform for advancing skills in auditory-perceptual evaluation and promoting continuous improvement in the field.

Vocal samples

In this first phase of the app implementation, 120 vocal samples were selected from the Perceptual Voice Qualities Database (PVQD) (Walden, 2022) for use within the app. This selection was carefully curated to ensure a balanced representation of age groups, gender, diagnoses, and varying severities of dysphonia, providing a comprehensive dataset for auditory-perceptual evaluations. To standardize the listening conditions, all vocal samples were normalized to 70 dB, ensuring a consistent loudness across evaluations. Additionally, the sociodemographic labels were standardized—for example, unifying variations such as “f”, “F”, “female”, and “Female” under a single label for gender, and harmonizing diagnosis terms to improve consistency and comparability across different groups.
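Although the analysis code is not part of the manuscript, the kind of label harmonization described above can be sketched in a few lines of pandas; the column names, values, and mapping below are illustrative assumptions rather than the actual PVQD schema.

```python
import pandas as pd

# Hypothetical sample metadata with the kinds of inconsistent labels
# described above; the real PVQD column names may differ.
meta = pd.DataFrame({
    "sample_id": ["BL01", "PT003", "LA1003", "NYU1028"],
    "gender": ["f", "F", "female", "Female"],
})

# Unify spelling variants under a single canonical label.
meta["gender"] = (meta["gender"].str.strip().str.lower()
                  .map({"f": "female", "female": "female"}))

print(meta["gender"].unique())  # -> ['female']
```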

Linguistic adaptations

To ensure accessibility across different regions, the app supports multiple languages. All versions of the CAPE-V available within the app have been included with permission from the authors and copyright holders. The currently available CAPE-V versions are English (Kempster et al., 2009), Spanish (Calaf & Garcia-Quintana, 2024), Catalan (Calaf & Garcia-Quintana, 2024), French (Pommée, Mbagira, et al., 2024), and Japanese (Kondo et al., 2021). The entire app has been translated into these languages, ensuring comprehensive accessibility. To ensure cultural and linguistic appropriateness, the app includes a feedback form in the footer, inviting users to contribute to improving the language adaptations.

At the time of writing this manuscript, negotiations are ongoing with the authors to include additional language adaptations of the CAPE-V.

Study design

Participants

The study involved 264 registered users of the All-Voiced app who consented to the use of their data for research purposes. Participants come from diverse professional backgrounds, including speech-language pathologists, SLP students, and other voice-related professionals. They also represent a wide range of experience levels in voice evaluation, categorized as beginner, intermediate, advanced, and expert.

Only users who provided explicit consent for research participation are included in this study. All evaluations were analyzed anonymously to ensure privacy and compliance with ethical standards. No identifying information was used in either the analysis or reporting of results.

Instruments

Auditory-perceptual voice evaluations were conducted using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) (Kempster et al., 2009). The CAPE-V was integrated into the app, enabling users to evaluate vocal attributes such as overall severity, roughness, breathiness, strain, pitch, and loudness, along with other specific vocal characteristics like diplophonia, vocal fry, and hypernasality.

Procedure

Upon registration in the All-Voiced app, users are prompted to provide consent for their data to be used for research. Only those who agree to participate are included in the study. Evaluations are submitted during regular app usage, either for personal practice or classroom assignments, and are securely stored for analysis. Data analysis focuses on evaluations submitted during the first 75 days following the launch of the most recent version of the app on September 4th, 2024.

Statistical analysis

The statistical analyses and visualizations were conducted using Python (version 3.10.12) in a Google Colab environment. Key libraries included pandas (version 2.2.2) for data manipulation, and matplotlib (version 3.8.0) and seaborn (version 0.13.2) for visualization.

For the participants’ sociodemographic data, descriptive statistics were calculated to summarize the distribution of age, gender pronouns, country of residence, language preference, profession, and voice competence levels. The results are presented in counts and percentages.
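As a rough illustration of this step, counts and percentages can be obtained directly with pandas value_counts; the records and column names below are hypothetical, not the app's actual user table.

```python
import pandas as pd

# Hypothetical registration records; the real user schema is not published.
users = pd.DataFrame({
    "age_group": ["21-30", "21-30", "31-40", "41-50", "21-30"],
    "competence": ["beginner", "intermediate", "beginner", "expert", "advanced"],
})

# Counts and percentages for each sociodemographic variable.
for col in ["age_group", "competence"]:
    summary = pd.DataFrame({
        "count": users[col].value_counts(),
        "percent": users[col].value_counts(normalize=True).mul(100).round(1),
    })
    print(summary, end="\n\n")
```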

To analyze the gathered auditory-perceptual evaluations, scatter plots and box plots (Figs. 1–4) were created to compare the evaluation results from All-Voiced users across different voice competence levels (beginner, intermediate, advanced, and expert) with those of the Perceptual Voice Qualities Database (PVQD) by Walden (2022) for standard perceptual attributes such as Overall Severity, Roughness, Breathiness, and Strain.

Figure 1. Distribution of overall severity scores for samples with 10 or more unique All-Voiced raters, showing expertise levels of All-Voiced participants and PVQD data. Note. PVQD=Perceptual Voice Qualities Database (Walden, 2022).

Figure 2. Distribution of roughness scores for samples with 10 or more unique All-Voiced raters, showing expertise levels of All-Voiced participants and PVQD data. Note. PVQD=Perceptual Voice Qualities Database (Walden, 2022).

Figure 3. Distribution of breathiness scores for samples with 10 or more unique All-Voiced raters, showing expertise levels of All-Voiced participants and PVQD data. Note. PVQD=Perceptual Voice Qualities Database (Walden, 2022).

Figure 4. Distribution of strain scores for samples with 10 or more unique All-Voiced raters, showing expertise levels of All-Voiced participants and PVQD data. Note. PVQD=Perceptual Voice Qualities Database (Walden, 2022).

In this comparison, all available evaluations from the PVQD database were utilized, as each rater provided two independent assessments per sample. This approach ensured the maximum use of the limited data available from this source (3–4 raters per sample). In contrast, for the All-Voiced database, only one evaluation per rater was included to avoid potential biases from raters contributing multiple assessments, given the larger number of raters in this dataset. This methodological choice balanced the differences in dataset structures while preserving the integrity and richness of the PVQD data.
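A minimal pandas sketch of this balancing step might look as follows, assuming a hypothetical long-format table with one row per submitted evaluation; the actual field names used by the app are not published.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per submitted evaluation.
evals = pd.DataFrame({
    "source":    ["all_voiced"] * 4 + ["pvqd"] * 4,
    "sample_id": ["BL01"] * 8,
    "rater_id":  ["u1", "u1", "u2", "u3", "r1", "r1", "r2", "r2"],
    "overall_severity": [10, 12, 8, 21, 15, 13, 12, 20],
})

# Keep a single evaluation per All-Voiced rater and sample, while retaining
# both independent PVQD ratings from each PVQD rater.
av = (evals[evals["source"] == "all_voiced"]
      .drop_duplicates(subset=["sample_id", "rater_id"], keep="first"))
pvqd = evals[evals["source"] == "pvqd"]
combined = pd.concat([av, pvqd], ignore_index=True)
print(combined)
```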

The evaluations from All-Voiced users were visually compared to the PVQD expert ratings, with voice competence levels distinguished by color. For each voice sample and attribute, variability across evaluators was illustrated. Cut-off values for overall severity, as established by Calaf and Garcia-Quintana (2024), were applied to facilitate the interpretation of the data. In addition to the visual analyses, descriptive statistics were calculated to further explore the evaluation data for voice samples, enabling the analysis of both standard perceptual attributes and less commonly evaluated features such as pitch, loudness, hypernasality, and fry.
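Under the same assumed schema, a comparison in the spirit of Figs. 1–4 (PVQD distributions as box plots, individual All-Voiced ratings colored by competence level) could be drawn with seaborn along these lines; the scores are invented for illustration and the published cut-off lines are not reproduced.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical ratings for one sample; column names are assumptions.
df = pd.DataFrame({
    "sample_id": ["BL01"] * 8,
    "source": ["PVQD"] * 4 + ["All-Voiced"] * 4,
    "competence": [None] * 4 + ["beginner", "intermediate", "advanced", "expert"],
    "overall_severity": [13, 15, 20, 35, 4, 8, 10, 21],
})

fig, ax = plt.subplots(figsize=(5, 4))
# Box plot summarizing the PVQD reference ratings for each sample.
sns.boxplot(data=df[df["source"] == "PVQD"], x="sample_id",
            y="overall_severity", color="lightgrey", ax=ax)
# Individual All-Voiced ratings overlaid, colored by competence level.
sns.stripplot(data=df[df["source"] == "All-Voiced"], x="sample_id",
              y="overall_severity", hue="competence", size=8, ax=ax)
ax.set_ylabel("Overall severity (VAS, 0–100)")
plt.tight_layout()
plt.show()
```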

Results

Participants’ sociodemographic characteristics

A total of 264 participants took part in the study, with at least one new user registering each day since the app's launch. In terms of age distribution, the majority of participants were between 21 and 30 years old (49%), followed by 31–40 years (22%), 41–50 years (17%), and 51–60 years (7%). A smaller proportion of participants were under 21 years old (1%), 61–70 years (3%), and only 2 participants (1%) were between 71 and 80 years. There were no participants over 80 years old. With regard to gender pronouns, 88% of participants identified with “she/her,” while 8% identified with “he/him.” Additionally, 2% preferred “she/they,” 1% preferred “he/they”, and 1 user preferred other pronouns.

Looking at the country of residence, the majority of participants were from the United States (51%), followed by Spain (18%) and Ecuador (12%). Other countries with notable representation included Canada, Peru, and France (7 users each, 3%), as well as the United Kingdom and Australia (6 users each, 2%). Countries with fewer participants included Italy, Albania, and Belgium (2 users each, 1%) and a diverse range of countries with one participant each, including Estonia, Colombia, Greece, Pakistan, Brazil, Chile, Turkey, Argentina, Bolivia, Costa Rica, and Slovakia.

In terms of language preference, 57% of participants chose English, followed by Spanish (25%) and Catalan (6%). French was also notably represented (4%), while Italian and Portuguese were each selected by 2 participants (1%). Other languages, such as Persian, Turkish, Greek, Pashto, Tamil, and Slovak, were each selected by 1 participant.

Regarding profession, the majority of participants were speech-language pathologists (78%). Other professions included “other” (17%), singing teachers (1%), one otolaryngologist (ENT specialist), and one vocal coach. Finally, regarding voice competence level, 40% of participants classified themselves as beginners, 25% as intermediate, 20% as advanced, and 14% as experts.

Gathered evaluations

A total of 557 evaluations were collected from users, reflecting a strong initial engagement with the platform. These evaluations encompassed 112 different voice samples, providing a diverse dataset for analysis. Sixty evaluations, spanning 42 unique vocal samples, were voluntarily shared with the community, contributing to a collaborative feedback loop that enhances the training process for future participants. While the majority of samples received 1 or 2 evaluations (32 samples had 1 evaluation and 33 samples had 2 evaluations), certain voice samples were evaluated significantly more frequently, with some receiving evaluations from up to 23 unique raters.

Figs. 1–4 illustrate the distribution of Overall Severity, Roughness, Breathiness, and Strain scores for samples evaluated by 10 or more unique All-Voiced raters, compared with evaluations from the Perceptual Voice Qualities Database (PVQD) by Walden (2022). In these graphs, the voice competence levels of the study participants are distinguished by different colors. This visual representation highlights how evaluators with different levels of voice competence assessed the same voice samples.

Complementing these visualizations, Table 1 presents descriptive statistics for the same attributes across three sources: All-Voiced raters, PVQD raters, and the combined dataset (All-Voiced+PVQD). This table offers detailed insights into central tendencies (mean, median) and variability (standard deviation [Std], interquartile range [IQR]) for each source.
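For readers who wish to reproduce this kind of summary, the Table 1 columns map directly onto standard pandas aggregations; the sketch below uses invented scores and, for brevity, omits the combined (All-Voiced+PVQD) rows, which would simply add a third grouping over the concatenated data.

```python
import pandas as pd

# Minimal long-format ratings table (hypothetical values, for illustration).
evals = pd.DataFrame({
    "sample_id": ["BL01"] * 5,
    "attribute": ["Overall severity"] * 5,
    "source": ["All-Voiced"] * 3 + ["PVQD"] * 2,
    "score": [4, 8, 21, 13, 20],
})

def summarize(scores: pd.Series) -> pd.Series:
    """Column set matching Table 1: central tendency and dispersion."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    return pd.Series({"Mean": scores.mean(), "Std": scores.std(),
                      "Min": scores.min(), "Q1": q1, "Median": scores.median(),
                      "Q3": q3, "IQR": q3 - q1, "Max": scores.max(),
                      "Count": scores.count()})

table1 = (evals.groupby(["sample_id", "attribute", "source"])["score"]
               .apply(summarize).unstack())
print(table1.round(1))
```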

Table 1.

Summary statistics for samples evaluated by 10 or more unique all-voiced raters, including comparisons with PVQD and combined (all-voiced+PVQD) datasets.

Sample  Attribute  Source  Mean  Std  Min  Q1  Median  Q3  IQR  Max  Count 
BL01  Overall severity  All-voiced  10  21  71  17 
BL01  Overall severity  PVQD  15  12  13  20  15  35 
BL01  Overall severity  Combined  11  19  14  14  71  23 
BL01  Roughness  All-voiced  21  16 
BL01  Roughness  PVQD  10  12  14 
BL01  Roughness  Combined  21  22 
BL01  Breathiness  All-voiced  21  16 
BL01  Breathiness  PVQD  11 
BL01  Breathiness  Combined  21  22 
BL01  Strain  All-voiced  32  16 
BL01  Strain  PVQD  10 
BL01  Strain  Combined  32  22 
LA1003  Overall severity  All-voiced  29  18  15  24  40  26  65  11 
LA1003  Overall severity  PVQD  29  18  24  27  37  13  40 
LA1003  Overall severity  Combined  29  15  18  25  40  22  65  17 
LA1003  Roughness  All-voiced  25  16  17  21  29  13  65  11 
LA1003  Roughness  PVQD  22  16  19  22  27  28 
LA1003  Roughness  Combined  24  13  17  21  28  11  65  17 
LA1003  Breathiness  All-voiced  10  31  11 
LA1003  Breathiness  PVQD  13  10  13  22  18  24 
LA1003  Breathiness  Combined  10  17  17  31  17 
LA1003  Strain  All-voiced  18  18  14  23  16  66  11 
LA1003  Strain  PVQD  27  12  16  18  24  36  18  44 
LA1003  Strain  Combined  22  17  11  18  28  17  66  17 
LA2007  Overall severity  All-voiced  13  11  12  16  41  10 
LA2007  Overall severity  PVQD  29  17  13  16  23  42  26  51 
LA2007  Overall severity  Combined  19  15  11  14  23  12  51  16 
LA2007  Roughness  All-voiced  10  10 
LA2007  Roughness  PVQD  23 
LA2007  Roughness  Combined  23  16 
LA2007  Breathiness  All-voiced  21  18  15  19  28  13  62  10 
LA2007  Breathiness  PVQD  29  16  11  18  26  36  18  55 
LA2007  Breathiness  Combined  24  17  16  21  30  15  62  16 
LA2007  Strain  All-voiced  13  34  10 
LA2007  Strain  PVQD  19  19  14  35  32  42 
LA2007  Strain  Combined  12  16  25  25  42  16 
LA7008  Overall severity  All-voiced  46  30  20  56  69  49  77  10 
LA7008  Overall severity  PVQD  58  10  41  53  60  65  12  69 
LA7008  Overall severity  Combined  50  24  42  59  66  24  77  16 
LA7008  Roughness  All-voiced  44  24  38  48  51  13  82  10 
LA7008  Roughness  PVQD  56  44  49  59  63  14  65 
LA7008  Roughness  Combined  49  20  44  49  62  18  82  16 
LA7008  Breathiness  All-voiced  16  12  15  26  17  35  10 
LA7008  Breathiness  PVQD  28  18  14  30  39  25  51 
LA7008  Breathiness  Combined  21  15  10  18  33  24  51  16 
LA7008  Strain  All-voiced  11  12  11  15  13  37  10 
LA7008  Strain  PVQD  36  11  23  29  34  38  56 
LA7008  Strain  Combined  20  17  19  33  25  56  16 
NYU1025  Overall severity  All-voiced  63  23  10  61  65  81  20  87  10 
NYU1025  Overall severity  PVQD  74  14  53  69  75  81  12  94 
NYU1025  Overall severity  Combined  67  21  10  61  70  81  20  94  16 
NYU1025  Roughness  All-voiced  64  25  30  44  57  88  44  93  11 
NYU1025  Roughness  PVQD  69  14  55  58  66  78  20  91 
NYU1025  Roughness  Combined  66  21  30  54  65  87  33  93  17 
NYU1025  Breathiness  All-voiced  40  25  24  30  67  43  73  11 
NYU1025  Breathiness  PVQD  56  31  46  69  78  32  80 
NYU1025  Breathiness  Combined  46  28  27  43  72  45  80  17 
NYU1025  Strain  All-voiced  21  28  31  31  81  11 
NYU1025  Strain  PVQD  55  28  25  30  56  79  49  87 
NYU1025  Strain  Combined  33  32  28  61  56  87  17 
NYU1028  Overall severity  All-voiced  10  16  15  33  23 
NYU1028  Overall severity  PVQD 
NYU1028  Overall severity  Combined  10  14  14  33  29 
NYU1028  Roughness  All-voiced  10  12  15  12  58  23 
NYU1028  Roughness  PVQD  10  17 
NYU1028  Roughness  Combined  12  14  13  58  29 
NYU1028  Breathiness  All-voiced  12  53  23 
NYU1028  Breathiness  PVQD 
NYU1028  Breathiness  Combined  10  53  29 
NYU1028  Strain  All-voiced  13  59  23 
NYU1028  Strain  PVQD  10  10  14 
NYU1028  Strain  Combined  12  59  29 
NYU1029  Overall severity  All-voiced  51  26  32  48  76  44  84  17 
NYU1029  Overall severity  PVQD  58  20  16  52  62  70  18  80 
NYU1029  Overall severity  Combined  53  24  37  60  76  39  84  25 
NYU1029  Roughness  All-voiced  24  27  18  28  23  78  16 
NYU1029  Roughness  PVQD  36  23  23  34  46  23  80 
NYU1029  Roughness  Combined  28  26  21  43  36  80  24 
NYU1029  Breathiness  All-voiced  49  24  36  47  74  38  83  16 
NYU1029  Breathiness  PVQD  57  21  22  44  62  70  25  82 
NYU1029  Breathiness  Combined  52  23  37  54  74  37  83  24 
NYU1029  Strain  All-voiced  31  23  27  49  40  70  16 
NYU1029  Strain  PVQD  34  18  22  36  42  20  64 
NYU1029  Strain  Combined  32  21  12  31  49  37  70  24 
NYU1030  Overall severity  All-voiced  46  18  15  30  43  63  33  77  17 
NYU1030  Overall severity  PVQD  51  23  26  30  50  68  37  82 
NYU1030  Overall severity  Combined  48  19  15  30  43  65  35  82  23 
NYU1030  Roughness  All-voiced  34  19  21  29  49  28  63  15 
NYU1030  Roughness  PVQD  54  22  27  35  58  68  33  83 
NYU1030  Roughness  Combined  39  22  23  29  61  38  83  21 
NYU1030  Breathiness  All-voiced  28  22  26  45  36  67  15 
NYU1030  Breathiness  PVQD  34  26  17  27  40  23  81 
NYU1030  Breathiness  Combined  29  23  12  27  44  32  81  21 
NYU1030  Strain  All-voiced  35  28  13  28  51  38  86  15 
NYU1030  Strain  PVQD  45  33  21  49  67  46  85 
NYU1030  Strain  Combined  38  29  12  33  57  45  86  21 
PT003  Overall severity  All-voiced  13  13  24  22  37  10 
PT003  Overall severity  PVQD  11  10  10  16  14  26 
PT003  Overall severity  Combined  12  12  21  20  37  16 
PT003  Roughness  All-voiced  22 
PT003  Roughness  PVQD  10  10  15 
PT003  Roughness  Combined  22  15 
PT003  Breathiness  All-voiced  18  23  31  31  68 
PT003  Breathiness  PVQD  16  14  15  25  21  35 
PT003  Breathiness  Combined  17  19  29  27  68  15 
PT003  Strain  All-voiced  10  11  10  35 
PT003  Strain  PVQD  17  12  15  27  20  33 
PT003  Strain  Combined  13  12  19  14  35  15 
PT012  Overall severity  All-voiced  46  17  48  50  52  59  10 
PT012  Overall severity  PVQD  53  15  32  42  55  64  22  72 
PT012  Overall severity  Combined  49  16  47  50  57  10  72  16 
PT012  Roughness  All-voiced  29  21  33  43  34  58  10 
PT012  Roughness  PVQD  28  33  13  53  49  74 
PT012  Roughness  Combined  28  25  26  47  43  74  16 
PT012  Breathiness  All-voiced  28  26  20  55  49  64  10 
PT012  Breathiness  PVQD  34  31  31  60  52  70 
PT012  Breathiness  Combined  30  27  28  61  56  70  16 
PT012  Strain  All-voiced  56  26  18  34  61  70  36  96  10 
PT012  Strain  PVQD  66  22  36  51  67  83  31  95 
PT012  Strain  Combined  60  24  18  45  64  74  29  96  16 

Note. Std=standard deviation; Min=minimum; Q1=first quartile (25th percentile); Q3=third quartile (75th percentile); IQR=interquartile range (Q3–Q1); Max=maximum.

As shown in Fig. 1 and detailed in Table 1, Overall Severity ratings reveal varying levels of agreement among raters across samples. High agreement is observed in samples such as PT012 (IQR=10), LA2007 (IQR=12), BL01, and NYU1028 (both IQR=14). Conversely, substantial dispersion is evident in NYU1029 (IQR=39) and NYU1030 (IQR=35). Notably, PT012 and LA2007 present smaller IQRs in the All-Voiced dataset (3 and 9, respectively) compared to the PVQD dataset (22 and 26). In contrast, samples like LA7008 and NYU1029 exhibit greater variability among All-Voiced raters (IQRs of 49 and 44) than PVQD raters (IQRs of 12 and 18).

As depicted in Fig. 2 and detailed in Table 1, samples with lower Roughness levels, such as LA2007, BL01, and PT003 (median=2 for all), exhibit the least variability, with IQRs of 4, 8, and 9, respectively. Conversely, PT012, which showed the highest agreement for Overall Severity (IQR=10), exhibits the greatest variability for Roughness (IQR=43). In this sample, agreement is higher among All-Voiced raters (IQR=34) compared to PVQD raters (IQR=49). Conversely, NYU1025 shows lower overall agreement (IQR=33), with PVQD raters exhibiting greater consensus (IQR=20) than All-Voiced raters (IQR=44).

As illustrated in Fig. 3 and summarized in Table 1, samples with lower levels of Breathiness (median=0), such as NYU1028 and BL01, exhibit the least variability, with IQRs of 1 and 3, respectively. Interestingly, PT012 displays the highest variability for Breathiness (IQR=56), consistent with its high variability in Roughness (IQR=43). However, this same sample displayed the highest agreement in Overall Severity (IQR=10). Comparisons between datasets reveal differing patterns: for LA1003, All-Voiced raters showed greater consensus, while for NYU1029, NYU1030, NYU1025, and PT003, PVQD raters exhibited higher agreement.

For Strain, samples with lower medians, such as BL01 (median=0, IQR=4) and NYU1028 (median=1, IQR=8), exhibit the least variability. In contrast, NYU1029, NYU1030, and NYU1025 show the greatest variability, with IQRs of 37, 45, and 56, respectively. Notably, NYU1029 had greater agreement among PVQD raters (IQR=20) compared to All-Voiced raters (IQR=40). Conversely, PT003, NYU1025, and LA2007 show higher agreement among All-Voiced raters (Fig. 4, Table 1).

Finally, outliers were observed across samples and attributes, originating from both datasets (Figs. 1–4). Some correspond to All-Voiced raters with varying levels of expertise (e.g., BL01, LA2007, LA7008, NYU1025), while others stem from PVQD raters (e.g., BL01, LA2007, PT012).

As shown in Table 2, twenty voice samples with at least three evaluations were identified for less commonly assessed features, including Pitch (Too High, Too Low), Loudness (Too Loud, Too Soft), resonance-related comments (e.g., Hypernasality), and additional features (e.g., Fry). The number of evaluations per sample ranged from 3 to 9, with most evaluated by 3 to 5 raters and a few by as many as 7 or 9.

Table 2.

Summary statistics for samples with at least 3 evaluations of pitch, loudness, resonance comments, or additional features.

Sample  Attribute  Mean  Std  Min  Q1  Median  Q3  IQR  Max  Count 
NYU1028  Hypernasality  45  19  30  30  41  56  26  69 
PT133  Hypernasality  12  10  13  20 
LA1003  Fry  25  16  15  26  35  20  42 
LA7006  Fry  12 
LA7008  Fry  48  32  35  40  65  30  91 
LA1003  Pitch too low  11  10  13  15 
LA7008  Pitch too low  50  35  16  33  50  68  35  85 
NYU1016  Pitch too low  21  12  18  24  26  27 
NYU1025  Pitch too low  38  29  18  18  20  48  30  94 
NYU1028  Pitch too low  10  11  13 
NYU1029  Pitch too low  45  30  26  50  69  43  77 
PT004  Pitch too low  36  35  13  16  19  48  32  77 
PT087  Pitch too low  36  31  22  40  53  31  65 
NYU1028  Loudness too loud  14  12  14  14  19 
LA2007  Loudness too soft  14  14  14  14  14  14 
LA5003  Loudness too soft  40  37  38  39  41  43 
NYU1025  Loudness too soft  39  28  23  32  59  36  76 
NYU1029  Loudness too soft  27  13  11  21  28  34  13  49 
PT003  Loudness too soft  11  12 
PT012  Loudness too soft  30  16  14  16  26  43  27  51 

Note. Std=standard deviation; Min=minimum; Q1=first quartile (25th percentile); Q3=third quartile (75th percentile); IQR=interquartile range (Q3–Q1); Max=maximum.

Variability in the evaluation of these less commonly assessed features is considerable. For example, for “Pitch too low,” the IQR ranges from 3 to 43, revealing significant disagreement among raters for some samples. “Loudness too soft” shows IQRs ranging from 0 to 36, while “Fry” exhibits IQRs ranging from 6 to 30, reflecting varying levels of dispersion across samples. With respect to “Hypernasality,” although its significance is limited because it was evaluated in only 2 samples, its assessment shows notable variability, with IQRs ranging from 5 to 26.

Discussion

The registration data highlights a strong interest in the app, with a consistent daily enrollment since the launch of the latest version in September 2024. This successful turnout suggests that the application meets an existing need within the professional and academic communities—specifically, the need for reliable, high-quality material for training auditory-perceptual skills.

These findings are consistent with prior research that identified the need for consensus-oriented training, structured forums for aligning internal standards, and the use of external reference points to improve the reliability of auditory-perceptual voice evaluations (Calaf & Garcia-Quintana, 2024; Gerratt, Kreiman, Antonanzas-Barroso, & Berke, 1993; Iwarsson & Reinholt Petersen, 2012; Nagle, 2022; Nagle, Kempster, & Solomon, 2024). The app appears to address this identified need, providing valuable resources for both professional development and educational training, thus contributing to the broader goal of standardizing and improving auditory-perceptual evaluation practices.

Regarding profession, 78% of participants identified as speech-language pathologists (SLPs), though this percentage could be higher, as some participants did not declare their profession. The strong representation of SLPs is likely due to the app's focus on tools like the CAPE-V (Kempster et al., 2009), widely used in speech-language pathology practice, and its promotion through social media networks, where many contacts are SLPs. Additionally, 17% of participants selected ‘other’ as their profession, which may include both professionals who chose not to specify their field and students. To reduce any ambiguity, a revision of the registration form is planned, encouraging students to select the discipline they are studying and ensuring clarity in professional categorization.

The fact that participants from other professions, including singing teachers (4 participants), one otolaryngologist, and one vocal coach, have also shown interest is a positive sign. This highlights the potential for the app to expand its appeal beyond speech-language pathology practice in the strict sense. Moving forward, it will be worth exploring the possibility of adding resources tailored to the needs of these other professional groups, further broadening the app's utility.

Educational training

The variation in the number of evaluations per sample suggests that the app is being actively used in academic settings. While the majority of the 120 samples had received only 1 or 2 evaluations 75 days after the launch, certain voice samples were evaluated significantly more frequently, with some reaching up to 23 unique raters. This disparity indicates that instructors may have selected specific samples for use in structured classroom activities, leveraging the app as a teaching tool. This aligns with the app's dual purpose: supporting autonomous training through randomly selected samples and facilitating guided instruction by enabling educators to target specific voice characteristics. These findings underline the app's potential for enhancing both individual learning and collaborative classroom-based training.

The distribution of voice competence levels among participants further supports the idea that the app is being incorporated into educational environments. A notable 40% of participants identified themselves as beginners, followed by 25% as intermediate, 20% as advanced, and 14% as experts. The high proportion of beginner users aligns with the app's potential use in teaching settings, where students learn auditory-perceptual skills under the supervision of instructors.

Importantly, it should be noted that registering for the app is not required for classroom use, yet many beginner participants chose to register and provide consent for their data to be used in research. This suggests that these users found additional value in the app beyond their initial class activities, motivating them to engage more deeply with the platform.

Cultural diversity

The distribution of participants by country and language can be attributed to several factors. First, the majority of participants in this study were from the United States (51%), which can largely be explained by the fact that the vocal sample database used in the app (Walden, 2022) is in English, making it more accessible to English-speaking professionals. While the app has been developed to support multiple languages with different CAPE-V adaptations, the sentences in the vocal samples remain limited to English at this stage. Efforts are underway to gather vocal samples in additional languages or to source from global databases that meet ethical requirements, including informed consent. This stepwise approach ensures immediate availability and utility, while laying the groundwork for greater linguistic diversity in future updates.

Additionally, Spain accounted for 18% of participants, half of whom selected Catalan as their preferred language. This distribution is likely influenced by the professional connections of the app's developer, whose strong network in Spain includes many Catalan speakers and likely contributed to early registrations from this region.

Beyond these main regions, the app is also attracting users from a diverse array of countries and languages. This diversity highlights the app's growing international reach and its potential to engage a broad spectrum of users across linguistic and cultural contexts, even as efforts continue to expand the linguistic diversity of the vocal sample database.

Comprehensive analysis of vocal attributes

An advantage of the data collection process in All-Voiced lies in its ability to digitize the CAPE-V comprehensively, capturing not only the core attributes of overall severity, roughness, breathiness, and strain but also additional features such as distinctions in pitch (e.g., too high or too low) and loudness (e.g., too loud or too soft), resonance-related comments, and additional features. While these attributes are already part of the CAPE-V framework, they are often overlooked in research due to the absence of tools that facilitate their systematic collection. By enabling the efficient capture of all CAPE-V attributes, All-Voiced allows for the identification of voice samples that exemplify these characteristics, which can be utilized in training programs and further studies. This comprehensive approach represents a significant step forward in fully leveraging the CAPE-V's potential for evaluation and research.

In this respect, the results of pitch and loudness evaluations from the PVQD (Walden, 2022) have not been included because it is unclear whether its ratings of inadequate pitch or loudness refer to an alteration that is too high, too low, too loud, or too soft. Such ambiguity complicates the study of reliability in these perceptual judgments. However, by introducing directional distinctions in the data collection process, the All-Voiced app offers the potential to gather robust data for future studies on the reliability of perceptual judgments of pitch and loudness alterations as well, thus addressing a critical gap in the current body of research.
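To make the contrast concrete, an evaluation record with explicit directional fields might look like the following sketch; this is a hypothetical illustration, not the app's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record structure illustrating the directional distinctions
# discussed above; the app's real schema is not published.
@dataclass
class CapeVEvaluation:
    sample_id: str
    rater_id: str
    competence: str              # beginner / intermediate / advanced / expert
    overall_severity: float      # VAS scores on a 0-100 scale
    roughness: float
    breathiness: float
    strain: float
    # Direction is explicit, unlike a single "inadequate pitch/loudness" score.
    pitch_too_high: Optional[float] = None
    pitch_too_low: Optional[float] = None
    loudness_too_loud: Optional[float] = None
    loudness_too_soft: Optional[float] = None
    comments: list[str] = field(default_factory=list)  # e.g., "hypernasality", "fry"

rating = CapeVEvaluation("NYU1025", "u42", "advanced",
                         overall_severity=63, roughness=64,
                         breathiness=40, strain=21, pitch_too_low=38)
```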

The preliminary results of this study provide further insight into the variability observed in less commonly assessed features, underscoring the complexity of these evaluations. For instance, while some samples showed high consensus among raters, such as for “Pitch too low” with IQRs as low as 3, others exhibited substantial dispersion, with IQRs up to 43 for the same feature. Similarly, “Loudness too soft” displayed variability ranging from 0 to 36, and “Fry” had IQRs from 6 to 30. This variation suggests that certain attributes are inherently more challenging for raters to assess consistently, emphasizing the need for continued research to better understand the factors contributing to this variability and to refine training approaches for these specific perceptual dimensions.

Insights from the gathered evaluations

The analysis of Overall Severity ratings reveals patterns of agreement among raters that do not fully align with previous published reports, which suggested greater agreement when evaluating either normal voices or severely dysphonic voices compared to mild and moderate dysphonias (Gerratt et al., 1993; Kreiman & Gerratt, 2000; Kreiman et al., 1993). In this study, the samples BL01 and NYU1028, classified as normal based on the cutoff values proposed by Calaf and Garcia-Quintana (2024), exhibit high agreement (IQR=14), which is consistent with this tendency. However, LA2007 (classified as mild, IQR=12) and PT012 (classified as moderate, IQR=10) show an even greater agreement, presenting a notable incongruence with the literature. This discrepancy suggests that factors beyond overall severity, such as specific perceptual attributes, sample characteristics, or evaluator familiarity with certain vocal patterns, may influence agreement levels, underscoring the need for further investigation.

When analyzing specific perceptual attributes, the tendency for normal or less altered samples to achieve higher agreement among raters becomes more evident. For instance, Roughness evaluations in samples LA2007, BL01, and PT003 show the least variability (median=2, IQRs=4, 8, and 9, respectively), consistent with their closer alignment to non-altered voice qualities. Similarly, for Breathiness, NYU1028 and BL01 (median=0, IQRs=1 and 3, respectively) exhibit a high level of agreement. For Strain, BL01 (median=0, IQR=4) and NYU1028 (median=1, IQR=8) show strong agreement, whereas LA2007 (median=2, IQR=35) presents greater variability, which may be attributed to its classification as mild in Overall Severity compared to BL01 and NYU1028, which were classified as normal. Further investigation is needed to fully understand the factors influencing variability, particularly in attributes like Strain, where agreement levels appear to fluctuate depending on the sample's overall severity classification.

Despite these insights, a key limitation of this study is the limited volume of data at this early stage, which restricts the scope of inter-rater and intra-rater reliability analyses. Many voice samples received only a single evaluation, making it impossible to assess reliability across evaluators for those samples. Moreover, the current dataset size does not yet allow for meaningful exploration of trends based on demographic factors such as age, gender, profession, or language. However, as data collection expands, these analyses will become feasible, offering deeper insights into the patterns of agreement and variability observed in this study.

The observed variability in evaluations across samples and attributes, even among expert evaluators, aligns with findings from prior studies on CAPE-V adaptations across different languages. While high inter-rater reliabilities have been reported for certain attributes like overall severity (Karnell et al., 2007; Mozzanica et al., 2014; Pommée, Mbagira, et al., 2024), attributes such as strain, pitch, and loudness consistently exhibit lower inter-rater reliability across multiple languages, including English (Zraick et al., 2011), Brazilian Portuguese (Behlau et al., 2022), and Tamil (Venkatraman et al., 2022). Similarly, intra-rater reliability has varied across studies, particularly for attributes like pitch and loudness (Behlau et al., 2022; Salary Majd et al., 2014).

Outliers observed across all levels of expertise, including expert evaluators, further underscore the inherent challenges in achieving consistency. Factors such as differences in internal standards and interpretations of perceptual attributes likely contribute to these discrepancies, reflecting the persistent difficulty of ensuring consensus among evaluators, even within the structured framework of the CAPE-V.

The results for PT012 exemplify how All-Voiced can identify voice samples that pose special challenges for evaluators. Despite displaying the highest agreement in Overall Severity ratings (IQR=10), this sample exhibited substantial variability in the evaluation of specific perceptual attributes, such as Breathiness (IQR=56), Roughness (IQR=43), and Strain (IQR=29). This discrepancy suggests differences in how evaluators interpret and apply these specific perceptual attributes, reflecting persistent challenges in achieving consensus. According to Nagle, Kempster, et al. (2024), clinicians have reported difficulties with attributes like strain, overall severity, and pitch in the context of the CAPE-V, indicating a need for clearer definitions and guidelines. To address this issue, the same authors have proposed the CAPE-Vr [Pre-print] (Kempster, Nagle, & Solomon, 2024), a revised version of the CAPE-V framework that aims to resolve these conceptual ambiguities and provide more precise guidance for clinical practice.

While standardizing definitions and procedures is essential, the findings presented here emphasize the need to address the inherently subjective nature of auditory-perceptual evaluations. In this respect, All-Voiced emerges as a useful tool to identify voice samples with poor expert consensus, providing a unique opportunity to investigate the root causes of these discrepancies. Such disagreements may stem from disparities in raters’ cognitive representations of voice attributes, or from specific acoustic features in the samples that influence raters’ perceptions differently. Future work will focus on implementing targeted training modules and validated auditory anchors, aiming to create a shared training platform that enhances the reliability of auditory-perceptual evaluations through consensus building and the adoption of universally shared criteria.

Conclusions

The All-Voiced app has been well received since its launch, particularly by speech-language pathologists, highlighting its potential as a valuable resource for auditory-perceptual training. The steady influx of daily registrations underscores the demand for standardized tools to enhance evaluation reliability, a need well documented in previous research. The high number of beginner-level users, together with the presence of frequently evaluated samples, suggests that the app is being used in academic settings, reinforcing its role as a promising educational tool. Through large-scale data collection and analysis, the All-Voiced app provides a foundation for addressing broader limitations in the subjective nature of auditory-perceptual evaluations, and has the potential to enhance consistency and to support evidence-based practices in voice disorder management.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the author used GPT-4, a generative AI tool developed by OpenAI, to assist in the drafting and refinement of the manuscript. After using this tool, the author reviewed and edited the content as necessary, taking full responsibility for the content of the publication.

Funding

The author declares no funding.

Conflict of interest

The author is the developer of the app discussed in this manuscript. This potential conflict of interest has been addressed by ensuring that the data and analysis are presented objectively, with no financial interests involved.

As a member of the journal's Editorial Board, Neus Calaf had no involvement in the peer review of this article and has no access to information regarding its peer review.

Data availability statement

The datasets generated and analyzed during the current study are available from the corresponding author upon request.

Acknowledgments

Special thanks to Noel Warren for the introduction to app development, which made this project possible, and to David Garcia-Quintana for his valuable assistance in revising the manuscript.

References
[Barsties and De Bodt, 2015]
B. Barsties, M. De Bodt.
Assessment of voice quality: Current state-of-the-art.
[Behlau, 2004]
M. Behlau.
Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), ASHA 2003 [Refletindo sobre o novo].
Revista Da Sociedade Brasileira de Fonoaudiologia, 9 (2004), pp. 187-189
[Behlau et al., 2022]
M. Behlau, B. Rocha, M. Englert, G. Madazio.
Validation of the Brazilian Portuguese CAPE-V Instrument—Br CAPE-V for auditory-perceptual analysis.
[Calaf, 2024a]
N. Calaf.
All-Voiced [Conference presentation].
Voice Lab 3, 15th Pan-European Voice Conference (PEVOC)
[Calaf, 2024b]
N. Calaf.
All-Voiced [Web application].
(2024)
[Calaf and Garcia-Quintana, 2024]
N. Calaf, D. Garcia-Quintana.
Development and Validation of the Bilingual Catalan/Spanish cross-cultural adaptation of the consensus auditory-perceptual evaluation of voice.
Journal of Speech, Language, and Hearing Research, 67 (2024), pp. 1072-1089
[Chen et al., 2018]
Z. Chen, R. Fang, Y. Zhang, P. Ge, P. Zhuang, A. Chou, J. Jiang.
The Mandarin version of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and its reliability.
Journal of Speech, Language, and Hearing Research, 61 (2018), pp. 2451-2457
[Connor et al., 2008]
N. Connor, D. Bless, C. Dardis, L. Vinney.
Voice disorders: Simulations.
Resources for Teaching & Learning, (2008)
[de Almeida et al., 2019]
S.C. de Almeida, A.P. Mendes, G.B. Kempster.
The consensus auditory-perceptual evaluation of voice (CAPE-V) psychometric characteristics: II European Portuguese Version (II EP CAPE-V).
[Ertan-Schlüter et al., 2020]
E. Ertan-Schlüter, E. Demirhan, E.M. Ünsal, E. Tadıhan-Özkan.
The Turkish version of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V): A reliability and validity study.
[Gerratt et al., 1993]
B.R. Gerratt, J. Kreiman, N. Antonanzas-Barroso, G.S. Berke.
Comparing Internal and External Standards in Voice Quality Judgments.
Journal of Speech, Language, and Hearing Research, 36 (1993), pp. 14-20
[Gunjawate et al., 2020]
D.R. Gunjawate, R. Ravi, S. Bhagavan.
Reliability and validity of the Kannada version of the Consensus Auditory-Perceptual Evaluation of Voice.
Journal of Speech, Language, and Hearing Research, 63 (2020), pp. 385-392
[Hirano, 1981]
M. Hirano.
Clinical examination of voice.
Springer-Verlag, (1981)
[Iwarsson and Reinholt Petersen, 2012]
J. Iwarsson, N. Reinholt Petersen.
Effects of consensus training on the reliability of auditory perceptual ratings of voice quality.
[Jesus et al., 2009]
L.M.T. Jesus, A. Barney, R. Santos, J. Caetano, J. Jorge, P.S. Couto.
Universidade de Aveiro's voice evaluation protocol. Interspeech 2009.
[Joshi et al., 2020]
A. Joshi, I. Baheti, V. Angadi.
Cultural and linguistic adaptation of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) into Hindi.
Journal of Speech, Language, and Hearing Research, 63 (2020), pp. 3974-3981
[Karnell et al., 2007]
M.P. Karnell, S.D. Melton, J.M. Childes, T.C. Coleman, S.A. Dailey, H.T. Hoffman.
Reliability of Clinician-Based (GRBAS and CAPE-V) and Patient-Based (V-RQOL and IPVI) Documentation of Voice Disorders.
[Kelchner et al., 2010]
L.N. Kelchner, S.B. Brehm, B. Weinrich, J. Middendorf, A. deAlarcon, L. Levin, R. Elluru.
Perceptual Evaluation of Severe Pediatric Voice Disorders: Rater Reliability Using the Consensus Auditory Perceptual Evaluation of Voice.
[Kempster et al., 2009]
G.B. Kempster, B.R. Gerratt, K. Verdolini Abbott, J. Barkmeier-Kraemer, R.E. Hillman.
Consensus auditory-perceptual evaluation of voice: Development of a standardized clinical protocol.
American Journal of Speech-Language Pathology, 18 (2009), pp. 124-132
[Kempster, Nagle, & Solomon, 2024]
G.B. Kempster, K.F. Nagle, N.P. Solomon.
Development and rationale for the Consensus Auditory-Perceptual Evaluation of Voice – Revised (CAPE-Vr) [Pre-print].
PsyArXiv Preprints, (2024)
[Kondo et al., 2021]
K. Kondo, M. Mizuta, Y. Kawai, T. Sogami, S. Fujimura, … T. Kojima, T. Haji.
Development and validation of the Japanese version of the Consensus Auditory-Perceptual Evaluation of Voice.
Journal of Speech, Language, and Hearing Research, 64 (2021), pp. 4754-4761
[Kreiman and Gerratt, 2000]
J. Kreiman, B. Gerratt.
Sources of listener disagreement in voice quality assessment.
Journal of the Acoustical Society of America, 108 (2000), pp. 1867-1876
[Kreiman et al., 1993]
J. Kreiman, B.R. Gerratt, G.B. Kempster, A. Erman, G.S. Berke.
Perceptual evaluation of voice quality: Review, tutorial, and a framework for future research.
Journal of Speech, Language, and Hearing Research, 36 (1993), pp. 21-40
[Kreiman et al., 2003]
J. Kreiman, D. Vanlancker-Sidtis, B.R. Gerratt.
Defining and measuring voice quality.
ISCA tutorial and research workshop on voice quality: Functions, analysis and synthesis
[Labaere et al., 2023]
A. Labaere, M. De Bodt, G. Van Nuffelen.
Construction of an anchor and training sample set for auditory-perceptual voice evaluation with the GRBAS-scale.
[Mohd Mossadeq et al., 2022]
N. Mohd Mossadeq, K.A. Mohd Khairuddin, M.N. Zakaria.
Cross-cultural adaptation of the Consensus Auditory-perceptual Evaluation of Voice (CAPE-V) into Malay: A validity study.
[Mozzanica et al., 2014]
F. Mozzanica, D. Ginocchio, E. Borghi, C. Bachmann, A. Schindler.
Reliability and validity of the Italian version of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).
Folia Phoniatrica et Logopaedica, 65 (2014), pp. 257-265
[Nagle, 2022]
K.F. Nagle.
Clinical use of the CAPE-V scales: Agreement, reliability and notes on voice quality.
[Nagle, 2024]
K. Nagle.
Describing voice quality in 5 minutes: Balancing feasibility and psychometric validity [Conference presentation].
53rd Annual Symposium of the Voice Foundation
[Nagle, Kempster, & Solomon, 2024]
K.F. Nagle, G.B. Kempster, N.P. Solomon.
Survey of Voice-Focused Speech-Language Pathologists’ Usage of the Consensus Auditory Perceptual Evaluation of Voice (CAPE-V).
[Núñez-Batalla et al., 2015]
F. Núñez-Batalla, M. Morato-Galán, I. García-López, A. Ávila-Menéndez.
Adaptación fonética y validación del método de valoración perceptual de la voz CAPE-V al español.
Acta Otorrinolaringológica Española, 66 (2015), pp. 249-257
[Oates, 2009]
J. Oates.
Auditory-perceptual evaluation of disordered voice quality.
Folia Phoniatrica et Logopaedica, 61 (2009), pp. 49-56
[Özcebe et al., 2019]
E. Özcebe, F.E. Aydinli, T.K. Tiğrak, Ö. İncebay, T. Yilmaz.
Reliability and validity of the Turkish version of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).
[Pommée et al., 2024a]
T. Pommée, D. Mbagira, D. Morsomme.
French-language adaptation of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).
[Pommée et al., 2024b]
T. Pommée, M. Shanks, D. Morsomme, S. Michel, I. Verduyckt.
Validation of the European French version of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-Vf).
[Roy et al., 2013]
N. Roy, J. Barkmeier-Kraemer, T. Eadie, M.P. Sivasankar, D. Mehta, D. Paul, R. Hillman.
Evidence-based clinical voice assessment: A systematic review.
American Journal of Speech-Language Pathology, 22 (2013), pp. 212-226
[Salary Majd et al., 2014]
N. Salary Majd, S. Maryam Khoddami, M. Drinnan, M. Kamali, Y. Amiri-Shavaki, N. Fallahian.
Validity and rater reliability of Persian version of the Consensus Auditory Perceptual Evaluation of Voice.
Audiology, 23 (2014), pp. 65-74
[Venkatraman et al., 2022]
Y. Venkatraman, S. Mahalingam, P. Boominathan.
Development and validation of sentences in Tamil for psychoacoustic evaluation of voice using the consensus auditory-perceptual evaluation of voice.
Journal of Speech, Language, and Hearing Research, 65 (2022), pp. 4539-4556
[Walden, 2022]
P.R. Walden.
Perceptual Voice Qualities Database (PVQD). Mendeley Data, V4.
[Zraick et al., 2011]
R.I. Zraick, G.B. Kempster, N.P. Connor, S. Thibeault, B.K. Klaben, Z. Bursac, C.R. Thrush, L.E. Glaze.
Establishing validity of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).
Copyright © 2024. Elsevier España, S.L.U. y Asociación Española de Logopedia, Foniatría y Audiología e Iberoamericana de Fonoaudiología