The products are not shown, to avoid other influences on the subjects' judgements such as colour and design. As the subjects are listening over headphones, any reasonably quiet room can be used for the listening tests. It is possible to use specialist listening rooms, but in this case the experiment has been designed to make this unnecessary. It is important that the reproduction equipment does not generate a significant amount of noise.
It is important that the sounds are played at the same volume level as would be heard from the real device. If you have recorded a calibration tone at a known level and have a dummy head, you can use this calibration tone to ensure that the sound is reproduced at the correct level. If not, it is difficult to do this exactly. The only way to achieve approximately the right level is to set the headphone level by ear, judging it to be similar to the level of the original sounds. While not ideal, useful opinions can still be extracted from tests set up using this method.
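As a rough illustration of the calibration step, the level correction can be worked out from the known level of the calibration tone and the level measured at the dummy head during playback. This is only a sketch: the function name and the figures of 94 and 88 dB are assumptions made for the example, not values from our tests.

```python
def playback_gain(target_db, measured_db):
    """Return (gain in dB, linear amplitude factor) needed to reach the target level."""
    gain_db = target_db - measured_db
    # A level change in dB maps to an amplitude ratio of 10^(dB/20).
    return gain_db, 10 ** (gain_db / 20)

# Hypothetical figures: the calibration tone was recorded at 94 dB,
# but reads 88 dB at the dummy head when replayed over the headphones.
gain_db, factor = playback_gain(94.0, 88.0)
print(f"Raise playback by {gain_db:.1f} dB (amplitude factor {factor:.2f})")
```

The same correction is then applied to every product recording made with the same recording settings as the calibration tone.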
There is a choice of test methods for identifying which product sounds best. Most commonly, the choice is between a paired comparison test and a magnitude estimation test.
In paired comparison, sounds from two products are directly compared with each other. The test only requires that jurors state a preference between two stimuli, presented one after the other. Pairs of sounds are played and the juror is asked which one sounds more pleasant. Pairs are played in both orders: sound A then sound B and, later in the test, sound B then sound A. Reversing the order makes it possible to check how difficult the test is, and so to check the reliability of the answers. There are standard statistical methods for taking the results from paired comparisons and forming a rank order of the products. The sound samples used are usually short (<10 seconds). Even with such short files, however, paired comparison is too time consuming, because all the different combinations of products need to be compared.
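The growth in the number of presentations, the order-reversal check and a rank order can be sketched in a few lines of Python. This is a minimal illustration only: the product labels and the juror's choices are made up, and simple win counting stands in for the standard statistical methods mentioned above (such as Thurstone or Bradley-Terry scaling), which a real analysis would use.

```python
from itertools import permutations

products = ["A", "B", "C"]  # hypothetical product labels

# Every pair in both orders: n * (n - 1) presentations for n products.
presentations = list(permutations(products, 2))

def rank_by_wins(products, choices):
    """Rank products by how often they were preferred (simple win count)."""
    wins = {p: 0 for p in products}
    for first, second, preferred in choices:
        wins[preferred] += 1
    return sorted(products, key=lambda p: wins[p], reverse=True)

def order_consistency(choices):
    """Fraction of pairs where one juror preferred the same product in both orders."""
    by_pair = {}
    for first, second, preferred in choices:
        by_pair.setdefault(frozenset((first, second)), []).append(preferred)
    repeated = [v for v in by_pair.values() if len(v) == 2]
    return sum(v[0] == v[1] for v in repeated) / len(repeated)

# One juror's made-up answers: (played first, played second, preferred).
choices = [
    ("A", "B", "A"), ("B", "A", "A"),
    ("A", "C", "A"), ("C", "A", "C"),
    ("B", "C", "B"), ("C", "B", "B"),
]

print(len(presentations))               # number of presentations needed
print(rank_by_wins(products, choices))  # rank order by win count
print(order_consistency(choices))       # share of pairs answered consistently
```

A low consistency figure suggests that the juror found the sounds hard to tell apart, or that their answers are unreliable.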
To avoid the time cost of paired comparisons, we developed a magnitude estimation method to identify which product sounds best. This is also sometimes referred to as a semantic differential test. It is used in many areas of acoustics, for example in the testing and comparison of loudspeakers. The subject listens to the recording of the product and then completes a questionnaire. The questionnaire asks people to judge the sound on a scale between two opposites, for example ranging from pleasant to unpleasant.
They place a mark on the line wherever they feel is appropriate. This method gives noisier judgements than paired comparison (the experimental error due to the variability of the subjects' opinions is greater), but it is much more efficient, and the detailed answers to the questions can be more revealing as to why people like particular sounds. There are several other problems with this method:
It assumes that the adjectives chosen mean the same to everyone. This is overcome by proper questionnaire design, either ensuring the use of unambiguous terms, or ensuring the terms are properly defined on the questionnaire.
It is assumed that people use the extremes of the scales in the same way, whereas some people tend to score higher than others. This is resolved in the statistical analysis (see later).
Respondents give consistently moderate answers. The tendency to score in the middle is difficult to avoid, but using the scales shown above makes it harder for the respondent to always give a middling answer. Another solution is to use tick boxes, but give only a limited number of answer boxes, none of which is neutral.
Respondents always use one extreme of the scale or the other, effectively repeating the same judgement for all scales whether or not this is correct. This is overcome to a certain extent by switching the desirability of the ends of the scales (so that the best product sounds would not all result in ticks on the right-hand side of the scales).
Respondents give socially desirable responses: what they think the experimenter wants to hear. This is overcome by blind testing, whereby the subject does not know which product they are listening to. It is important to stress the anonymity of the responses. You should also emphasise the importance of the work, so subjects feel a responsibility to be honest.
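Two of the remedies in the list above lend themselves to a short sketch: re-aligning scales whose desirable end was switched, and removing each subject's tendency to score high or low. The code below is an illustration under stated assumptions, not the analysis used for our tests: the 0 to 10 scale range and the per-subject standardisation (z-scores) are assumed here as one common approach.

```python
from statistics import mean, pstdev

def reverse_score(score, scale_min=0.0, scale_max=10.0):
    """Flip a score from a scale whose desirable end was switched."""
    return scale_max + scale_min - score

def standardise_per_subject(scores):
    """Convert each subject's scores to z-scores using that subject's own mean and spread."""
    result = {}
    for subject, values in scores.items():
        m = mean(values)
        s = pstdev(values)
        # A subject who gave identical scores everywhere has no spread to scale by.
        result[subject] = [(v - m) / s if s > 0 else 0.0 for v in values]
    return result

# Hypothetical data: subject s2 scores every sound 4 points higher than s1.
raw = {"s1": [2, 4, 6], "s2": [6, 8, 10]}
z = standardise_per_subject(raw)
print(z["s1"])
print(z["s2"])
```

After standardisation the two hypothetical subjects give the same pattern of scores, so the offset between them no longer affects the comparison of products.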
It is unlikely that jury members will have carried out sound tests before. It is important to run through a number of dummy sounds with the subjects to ensure they are familiar with the method. The results from this training are disregarded. The preliminary training sheets are included in the questionnaire files given below. Total test time should not exceed 20-30 minutes for any one subject in one sitting; otherwise fatigue sets in and judgements become unreliable.
The questionnaires used for our tests can be downloaded from the links below. The questionnaire for the leaf blowers was the same as for the kettles, and so isn't included here.
The difference between the two questionnaires was that for the kettles we used one sound, whereas for the washing machine we asked people to judge the washing, rinsing and spinning sounds separately, and then used a fourth set of questions to gain the overall impression.
The design of the questionnaire illustrates some important procedures for jury testing. The instructions given to the listeners are in written form on pages 1 and 2. You should always use written instructions (which you can read out as well) to ensure that every subject receives exactly the same instructions and to avoid biasing some results inadvertently.
Page one gathers the contextual data that is normally collected before jury testing, such as age and gender. You need to consider carefully who to use as subjects.
Page three contains two types of questions. Part 1 is completed while listening to the sound and part 2 during the 30 second break after each sound. The first section asks subjects to tick adjectives that describe the sound; the second section allows the subjects to score the sound on more detailed scales concerning issues such as robustness.
The adjective section is used as an informative way of understanding why subjects made the judgements they did on the more detailed scales (such as robustness) lower down the page. Subjects select adjectives that describe the sound, or provide their own. Because these questions only produce nominal data (a two-point scale, yes or no), they are not ideal for statistical analysis. Multi-point scales were not used because the questionnaire would then become too long. Consequently, the answers should be used in a qualitative sense to get some idea of why judgements were made.
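For qualitative use, it is usually enough to tally how many subjects ticked each adjective for a given sound. The sketch below shows one way to do this. The adjectives and responses are invented for the example and are not taken from our questionnaires.

```python
from collections import Counter

def adjective_frequencies(responses):
    """Return the fraction of subjects who ticked each adjective, most common first."""
    # Each subject counts at most once per adjective.
    counts = Counter(adj for ticked in responses for adj in set(ticked))
    n = len(responses)
    return {adj: count / n for adj, count in counts.most_common()}

# Hypothetical ticks from four subjects for one sound.
responses = [
    {"harsh", "rattly"},
    {"harsh"},
    {"smooth"},
    {"harsh", "rattly"},
]
freqs = adjective_frequencies(responses)
for adjective, share in freqs.items():
    print(f"{adjective}: ticked by {share:.0%} of subjects")
```

Reading these shares alongside the scale scores for the same sound gives some idea of why it was rated as it was.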
The scales at the bottom of the questionnaire highlight the most important issues: function, pleasantness, loudness, robustness, quality and purchase influence. The subject places a mark on each scale.