66.4, November 2019

Effects of Visual Signaling in Screenshots: An Eye Tracking Study

By Michael Meng


Purpose: Screenshots are an important means of visualization in software documentation. One question technical communicators need to address when dealing with screenshots is whether visual signaling elements, such as arrows or frames, should be added in order to highlight relevant information. This article reports the results of an experimental study that examined whether signaling elements successfully guide the visual attention of readers to relevant screenshot information as intended. A second goal was to find out whether visual signaling has a positive impact on how accurately and quickly users execute the tasks that the screenshots support.

Method: Two versions of a software tutorial were constructed that included screenshots with or without signaling elements. Participants’ eye movements were recorded while they studied the tutorial and executed the tasks described therein. In addition to eye movement measures, accuracy of task execution and time to complete the tasks were determined as measures of overall success on the tasks.

Results: Participants working with tutorials that used visual signaling executed more tasks correctly. No differences were found regarding the time needed to complete the tasks. Analysis of the eye tracking data showed that participants fixated relevant screenshot areas longer and more often if highlighted by signaling elements.

Conclusions: The results provide evidence that adding signaling elements to screenshots is an effective means to guide the visual attention of users. As predicted by the Cognitive Theory of Multimedia Learning, visual signaling does not simply increase interest in pictures but helps users to select relevant information.

Keywords: screenshots, software tutorials, visual signaling, eye tracking

Practitioner’s Takeaway:

  • Research on factors that modulate the effects of screenshots on user performance can help technical communicators to make informed decisions regarding screenshot usage and design.
  • The article provides empirical evidence that adding signaling elements to screenshots helps the user to select relevant information from screenshots and improves user performance.
  • More research is needed to determine which signaling techniques are effective and whether effects of signaling interact with other factors such as task complexity and user experience. Collecting eye movements while users can freely switch their attention between reading and acting can help to address these questions.


Planning and creating graphics that assist the user in executing a procedure correctly and efficiently is a routine task of technical communicators. In the area of software documentation, the type of graphic most commonly used for this purpose is screenshots. Screenshots provide an efficient means to visualize the state of the graphical user interface of a software system at a certain point, such as the initial state from which a procedure starts, the overall goal state, or intermediate states that result if the procedure is carried out successfully (Farkas, 1999; van der Meij & Gellevij, 2004).

When adding screenshots to software manuals, online help systems or tutorials, technical communicators have to address several important questions (van der Meij & Gellevij, 1998). A first set of decisions concerns when screenshots should be used and which function each individual screenshot should serve. For example, when developing a procedure consisting of several action steps, decisions have to be made whether none, some, or all action steps should be supported by a screenshot, and whether the individual screenshots should help the user to locate and identify user interface elements (e.g., by depicting a menu entry) or help the user to compare the current state of the user interface with the intended goal state (e.g., by depicting the screen that results from carrying out an action step).

Once the decision to use a screenshot at a certain point has been made, several additional decisions follow that relate to screenshot design (van der Meij, 2000), such as which portion of the user interface the screenshot should depict or where the screenshot should be positioned. This article focuses on the effects of a design decision that is particularly important: the decision of whether signaling elements should be added to a screenshot or not.

Signaling refers to a set of cueing techniques that have been discussed extensively both in the domain of text design and text comprehension (Lorch, 1989; Spyridakis, 1989) and in multimedia learning (de Koning, Tabbers, Rikers, & Paas, 2009; Mautone & Mayer, 2001; Moreno, 2007; Richter, Scheiter, & Eitel, 2016). Common to all signaling techniques is that they are added to material in order to help the user know how to process the material (Mayer, 2009). Signaling elements do not add content. Their main function is to direct the attention of the user to content that is relevant in the context of a certain task or goal.

Two types of signaling have been distinguished in the literature: verbal signaling and visual signaling (Mayer, 2009). Verbal signaling includes cues that are added to a text, such as headings, highlighted words, or outlines presented before a text. Headings, highlighted words, and outlines help users to understand the organization of a text. They are added to direct the user’s attention to particular terms, concepts, or propositions that the author regards as important. In technical documentation, verbal signaling is also commonly used to indicate the function that a particular unit of text serves. For example, numbers or bullets may be used to identify a text unit as an action step, and check marks could be used to identify a text unit as a result statement.

Visual signaling, on the other hand, refers to cues that are added to pictures, such as frames, arrows, magnifiers, or distinctive colors. Frames, arrows, and similar signaling techniques are added to pictures to direct the visual attention of users to specific areas of the picture containing important information. The goal of the current study is to investigate the effect of visual signaling in screenshots contained in procedures on user behavior. More specifically, the study asks whether signaling indeed guides visual attention, and whether this “guiding effect” improves user performance.

Knowing whether and why users benefit from visual signaling in screenshots is not only of interest from a theoretical point of view but also from an applied perspective (Martin-Michiellot & Mendelsohn, 2000). Adding frames, arrows, or similar cues to screenshots is associated with a specific cost, which reflects the effort it takes to implement the signaling elements. This cost increases considerably if not only the initial efforts in designing a screenshot are taken into account but also subsequent efforts related to documentation maintenance and localization. In order to decide whether these efforts are justified, technical communicators have to know whether visual signaling works as intended. The present study is intended to provide arguments on which such a decision can be based. It contributes the results of research that—in the sense of Boekelder and Steehouder (1999)—was designed to serve practice by studying the functional relation between a design aspect (visual signaling in screenshots) and reading behavior of documentation users. The study asks whether a specific design change (adding visual signaling) changes reading behavior in the intended way, which is that users pay more attention to relevant information on a screenshot and ignore irrelevant information. In addition, the study asks whether this change in reading behavior results in documentation that enables users to execute a procedure more effectively.

By focusing on user behavior, the type of research reported here is closely related to efforts to make documentation more usable, which is a fundamental concern of technical communicators (Acharya, 2017; Alexander, 2013; Guillemette, 1989; Redish, 2010). However, unlike standard usability testing, the aim is not to find out whether a particular information product is effective or not or where potential trouble spots are. Rather, the goal is to reveal more fundamental principles of communication whose consequences can be applied to information products containing screenshots in general.

Prior Research on Screenshots

The question whether screenshots effectively support users of software documentation has been addressed in a series of studies by Hans van der Meij, Mark Gellevij, and colleagues conducted in the late 1990s and early 2000s (see Gellevij & van der Meij, 2004, and van der Meij, Karreman, & Steehouder, 2009, for overviews). One of the first experiments demonstrating that screenshots may positively affect task execution was reported in van der Meij (1996). The study tested users on two versions of a training manual for a database application. The study found that users of the manual version containing screenshots completed the training tasks significantly faster compared to the group working with a text-only control version. No effect of screenshots on the accuracy of task execution was found.

Evidence confirming this finding was reported by several later studies which focused more closely on individual screenshot functions. For example, Gellevij and van der Meij (2004) examined the effect of screenshots that were intended to support users comparing the current state of the software user interface with the intended goal state. They report that users working with a manual containing screenshots made fewer errors and executed action steps supported by screenshots faster. Gellevij, van der Meij, de Jong, and Pieters (2002) investigated whether screenshots also support the development of a mental model of the user interface, again contrasting a manual version containing screenshots with a text-only version. Users working with the screenshot manual completed training faster and scored higher on tasks designed to measure learning effects. Like van der Meij (1996), both studies provide important hints regarding global effects of screenshots on task execution and learning. However, since all studies discussed so far directly contrasted a visual manual version with a text-only version, it remains open why these effects arise and what contributions individual design variables such as visual signaling make.

While the studies discussed so far clearly demonstrate that including screenshots in manuals can have a positive effect on task performance and learning, other studies failed to find such effects (van der Meij & Gellevij, 2002; Nowaczyk & James, 1993), suggesting that effects induced by screenshots are subject to boundary conditions. For example, the results reported in Sweller and Chandler (1994) provide preliminary evidence that screenshots only support user performance if the task to be solved is sufficiently complex. The critical role of design variables as a possible boundary condition for screenshot effects has been emphasized in a different line of studies by van der Meij, Gellevij, and colleagues. Gellevij, van der Meij, de Jong, and Pieters (1999) addressed screenshot coverage as a design variable. Their study compared a text-only version of a manual with two types of visual manuals that differed with respect to the screen area the screenshots depicted: a version in which screenshot coverage was restricted to elements relevant in the context of a specific task, and a version containing screenshots covering the full screen. Overall, the results suggest that coverage had an effect: Participants working with the partial-screenshot manual performed worse on several measures, such as performance on untrained tasks, compared to participants working with the manual containing screenshots that captured the full screen. Interestingly, participants using text-only manuals also tended to outperform participants using manuals with partial screenshots.

In addition to screenshot coverage, van der Meij (2000) also tested effects of the positioning of screenshots relative to related text segments by comparing different layout variants that arranged screenshots and text segments in adjacent columns. The study replicated the finding that full-screen screenshots lead to better performance than partial-screen screenshots, but only if the instructions were positioned in the left column and the screenshots to the right. Similarly, Martin-Michiellot and Mendelsohn (2000) included a manual version with juxtaposed screenshots and a manual version in which text elements and corresponding screenshot areas were integrated more closely by using callout lines. In addition, this study also manipulated the complexity of the tasks that subjects had to perform for test purposes. Although both types of visual manuals accelerated training time compared to a text-only control version, there was no significant difference between the juxtaposed and integrated screenshot conditions on task performance, either for simple or for more complex tasks.

Taken together, prior research clearly backs the use of screenshots in software documentation. This research also points to conditions that screenshots have to meet in order to be effective. However, whether visual signaling contributes to effects of screenshots on user performance is still an open question.

Background on Eye Tracking

Besides addressing an issue that is still unexplored, the current study also extends prior research on screenshots by using eye tracking. The eye tracking method makes it possible to determine with a high degree of accuracy where people look at a certain point in time (Holmqvist et al., 2011; Williams et al., 2005). To determine where people look, special eye tracking hardware is used which measures the gaze position of a user in very short intervals continuously over time. The eye tracker used in the current study—the RED 250mobile system from SensoMotoric Instruments (SMI)—recorded the gaze position at a rate of 120 Hz, which means that a gaze position measure was taken roughly every 8 milliseconds. Based on this continuous data stream, analysis software can then model various types of eye movement events.

The eye movement events of relevance to the current study (and for most studies in the field of usability) are fixations and saccades. Fixations are intervals of 200–500 milliseconds in which the gaze position remains rather stable and in which the eyes can take in visual information (Cooke, 2005; Goldberg & Wichansky, 2003). Saccades refer to the rapid movements which the eyes must perform in order to fixate on another position in the visual field. Based on the observed pattern of fixations and saccades, it is possible to infer which information observers attend to (e.g., while reading, examining a webpage, or interacting with a software system). Research using eye tracking is based on the general assumption that the information a user visually attends to correlates highly with the thoughts or mental activities the user is involved with (“eye-mind hypothesis,” Goldberg & Wichansky, 2003, p. 507).
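To make the distinction concrete, fixations can be recovered from a raw gaze stream with a simple dispersion-threshold procedure (the I-DT idea): the gaze is considered fixated while consecutive samples stay within a small spatial window for a minimum duration. The sketch below is illustrative only and is not the event detection implemented in the BeGaze software used in this study; the function name and thresholds are assumptions.

```python
def detect_fixations(samples, max_dispersion=25, min_duration=100):
    """Dispersion-based fixation detection (I-DT sketch).

    samples: list of (time_ms, x, y) gaze samples, ordered by time.
    Returns a list of fixations as (start_ms, end_ms, centroid_x, centroid_y).
    """
    fixations = []
    n = len(samples)
    i = 0
    while i < n:
        # Grow a window until its dispersion (width + height) exceeds the threshold.
        j = i
        while j < n:
            xs = [s[1] for s in samples[i:j + 1]]
            ys = [s[2] for s in samples[i:j + 1]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        duration = samples[j - 1][0] - samples[i][0]
        if j - 1 > i and duration >= min_duration:
            # Window qualifies as a fixation; record its span and centroid.
            xs = [s[1] for s in samples[i:j]]
            ys = [s[2] for s in samples[i:j]]
            fixations.append((samples[i][0], samples[j - 1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j
        else:
            # Window too short or too noisy: slide forward by one sample.
            i += 1
    return fixations
```

With samples arriving roughly every 8 milliseconds (as at 120 Hz), a stable cluster of samples yields one fixation, and the jump between two clusters corresponds to a saccade.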

Eye tracking has been used successfully for many years to study reading, text comprehension, scene perception, or multimedia learning (Rayner & Pollatsek, 2006; Rayner, 2009; Mayer, 2010). Eye tracking has also become a prominent method to explore various aspects of Web usability, such as screen design, navigation architecture, or processes of visual search (Cooke, 2005; Cooke et al., 2008; Duchowski, 2017; Nielsen & Pernice, 2010). Moreover, eye tracking can make verbalizations obtained from participants in concurrent think-aloud tests more informative (Cooke, 2010; Elling, Lentz, & de Jong, 2012).

The current study uses eye tracking to study the impact of a specific design decision (adding signaling elements to screenshots) on which information users attend to when working with technical documentation to solve a problem. At a more general level, the study can be viewed as an attempt to extend the range of applications for eye tracking in technical communication research and to explore a new way in which this method can be used to further our understanding of how people read and apply technical information.

Research Objectives and Predictions

To sum up, the research reported here was designed to address two objectives. One objective was to test whether visual signaling indeed guides visual attention to relevant information in a screenshot, or whether signaling more generally increases interest in pictures. A second objective was to find out whether visual signaling has a positive impact on the accuracy and efficiency of task execution.

The hypotheses guiding the current study are based on the Cognitive Theory of Multimedia Learning (CTML; Mayer, 2009; Moreno & Mayer, 2007), which attempts to model the process of learning from material containing text and pictures. CTML distinguishes three types of processes that compete for working memory capacity when learning from multimedia material. CTML uses the terms “essential processing” and “generative processing” to refer to processes which are necessary for learning to take place, such as selecting information for processing in working memory, developing verbal and pictorial representations for incoming information, as well as integrating verbal and pictorial representations with information that already exists in long-term memory. Besides essential and generative processing, working with multimedia material may also trigger processes which are unrelated to learning and which result from the way multimedia material is presented to the learner, such as efforts to identify relevant information in texts or pictures, or to connect text and picture elements that contain related information. These processes are referred to as “extraneous processing.”

According to CTML, signaling helps the learner because it reduces extraneous processing (Mayer, 2009, p. 108ff.). By reducing extraneous processing, signaling frees up working memory capacity which then becomes available for cognitive processes related to learning. Several studies have already demonstrated that both verbal and visual signaling can foster learning, and more specifically, can successfully direct the attention of the learner to relevant information (see Mayer, 2009, and Richter et al., 2016, for reviews). So far, however, this research has focused on “reading-to-learn” scenarios in the sense of Redish (1989), e.g., scenarios in which learners study material in order to understand how lightning is formed or how a pump works. However, procedures contained in manuals, online help, or tutorials are often not read with the intention to learn the series of action steps described there but with the goal to directly execute the steps in order to accomplish a certain goal, a scenario which Redish (1989) characterizes as “reading-to-do.”

Only a few attempts have been made so far to investigate whether results obtained by research on multimedia learning in “reading-to-learn” scenarios generalize to procedural text that is “read-to-do,” but at least preliminary evidence is available suggesting that they do (Irrazabal, Saux, & Burin, 2016; van Genuchten, Hooijdonk, Schüler, & Scheiter, 2014). With respect to the current study, I therefore take CTML to motivate the prediction that visual signaling supports the user in allocating more visual attention to relevant areas of a screenshot. Hence, this theory leads us to expect that more fixations and longer fixation times should be observed in a relevant screenshot area if signaling is used. Since signaling reduces the working memory capacity required for extraneous processing, making this capacity available for other cognitive processes involved in solving the current task, I also derive the hypothesis that visual signaling improves overall task performance.

Method

Materials

For the test, I developed a short tutorial that described how to add a colored picture frame around a digital image using GIMP (www.gimp.org), an open source image manipulation program similar to Adobe Photoshop. The tutorial was created in German, as the participants were native speakers of German (see section Participants below).

The tutorial consisted of 12 pages. It started with a short introduction (2 pages) which explained the goal of the tutorial, illustrated the initial state and the intended result, and described relevant elements of the user interface. The process to create the picture frame was divided into 8 sub-tasks (procedures) that contained 2 to 5 action steps each. The tutorial concluded with a page that congratulated the participants on completing the tutorial and again displayed the intended result. An example illustrating the tasks is given in Figure 1. The task describes how to apply one of the filters offered by GIMP in order to create a blur effect on a layer that will later be part of the picture frame. The four action steps instruct the reader to select the correct layer from a list, to launch the filter, to add values that control how intense the blur effect will be, and to apply the settings by clicking OK.

In each task, one or two of the action steps were supported by screenshots. Screenshots were inserted at points that were judged by the author to potentially lead to errors. Screenshots were intended to help participants to locate and identify user interface elements, and to compare the current state of the user interface with the intended goal state. Following design recommendations in Gellevij et al. (2002), screenshot size and coverage were optimized depending on screenshot function. Each tutorial task was designed to fit on a single screen page, except one task which had to be distributed over two pages due to the size of the screenshots used.

To address the research questions of the current study, I created two versions of the tutorial that contained identical instructions but that differed with respect to whether the screenshots used visual signaling elements or not. The first version (condition “signaled”) included screenshots to which arrows, frames, or magnifiers were added in order to highlight relevant information, such as fields in the GIMP user interface for which values had to be checked or changed, buttons on which participants were to click, or elements to select from a list. For example, in the task shown in Figure 1, an arrow was added to the screenshot following step 1 to help the user identify which of the two elements to select from a list of available layers. The screenshot following step 3 uses a frame to emphasize the fields into which values have to be entered to control how intense the blur effect will be. The second tutorial version (condition “nonsignaled”; see Figure 2) contained the same screenshots, but without signaling elements.

Figure 1. Example page of the tutorial in condition “signaled.” The task describes how to apply one of the filters offered by GIMP in order to create a blur effect. The blur effect is used on the layer that will later serve as part of the picture frame. The steps instruct the reader to select the correct layer from a list, to launch the filter, to add values that control how intense the blur effect will be, and to apply the settings by clicking OK.
Figure 2. Example page of the tutorial in condition “nonsignaled”

Participants were instructed to work through the tutorial and to execute the tasks described therein. Special care was taken to create a task setting that was as natural as possible and encouraged normal reading. Therefore, the tutorial was presented on a computer screen adjacent to the GIMP program (see Figure 3); hence, the tutorial and the GIMP software were available to the user simultaneously.

Figure 3. Positioning of tutorial and GIMP on the stimulus presentation monitor

Procedure

The experiment was run in the usability lab of Merseburg University of Applied Sciences and comprised two parts. Participants first received a short questionnaire that contained four questions regarding age, level of computer experience, general experience with image manipulation software, and specific experience with GIMP. After the questionnaire was completed, participants received instructions for the test. They were then positioned in front of the test monitor, and a calibration procedure was run to prepare the eye movement recording. Participants then started to work on the tasks described in the tutorial. Participants were instructed to work with the tutorial as they would normally. No special emphasis was put on reading the tutorials. A statement of consent was obtained from each participant at the beginning of the session.

While participants worked through the tutorial, their eye movements were recorded using an RED 250mobile eye tracker. The RED 250mobile is a video-based eye tracker that operates in head-free mode. For the test, the eye tracker was attached to a 24-inch monitor that served for stimulus presentation. Tracking rate was set to 120 Hz, and data were recorded from both eyes simultaneously. The software package ExperimentSuite Scientific 3.6 (also from SMI) was used for data recording (ExperimentCenter) and analysis (BeGaze).

The tutorial was displayed as a series of JPEG pictures using the Windows 7 Photo Viewer. The Windows 7 Photo Viewer was chosen because it features salient back and forward navigation buttons that are easy to operate and that participants used to navigate through the document. The tutorials were presented alongside GIMP on the stimulus monitor using windows of fixed size and position. The tutorial was presented to the right of GIMP and occupied about one third of the monitor area. The remaining monitor area was occupied by GIMP (see Figure 3).

During each session, the SMI software recorded a screen video onto which gaze data were mapped later for analysis purposes.

Participants

Thirty-two students (14 female, 18 male) were recruited from Merseburg University of Applied Sciences as participants for the experiment. Participants were native speakers of German and unaware of the purpose of the experiment. All participants had normal or corrected-to-normal vision. The participants were randomly assigned to one of the two conditions, resulting in 16 participants per condition. Mean age was 27.6 years (median = 27, min = 21, max = 42). Mean experience with image manipulation programs, rated on a Likert scale ranging from 1 (“very experienced”) to 6 (“no experience”), was 2.78 (median = 3). Mean experience with GIMP rated on the same scale was 5.40 (median = 6), and mean general computer experience was 1.75 (median = 2). In sum, participants were experienced computer users with some experience with image manipulation programs but no significant prior exposure to GIMP. Rating data are summarized in Table 1 individually for the two conditions. Mann-Whitney U-tests on the rating data confirmed that there were no significant differences between the groups regarding general computer experience (W = 136, p = 0.76), general experience with image manipulation programs (W = 136, p = 0.76), or experience with GIMP (W = 132, p = 0.87).

Table 1. Participants’ self-assessments of general computer experience, general experience with image manipulation programs and experience with GIMP. Means per experimental condition with medians in parentheses.

Rating                                                Condition
                                                      signaled      nonsignaled
General computer experience                           1.75 (1.5)    1.75 (2.0)
General experience with image manipulation programs   2.75 (2.5)    2.81 (3.0)
Experience with GIMP                                  5.31 (6.0)    5.5 (6.0)
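The Mann-Whitney comparisons above rest on a rank-sum computation that can be sketched in a few lines. The following is an illustrative pure-Python version of the U statistic (which R’s wilcox.test reports as W); the normal-approximation p-value is omitted for brevity, and the function name is an assumption.

```python
def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U statistic for group_a vs. group_b, with average ranks for ties."""
    # Pool both groups, tagging each value with its group (0 = a, 1 = b).
    pooled = sorted((v, g) for g, vals in ((0, group_a), (1, group_b)) for v in vals)
    n = len(pooled)
    avg_rank = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # ranks are 1-based; ties share the mean rank
        for k in range(i, j):
            avg_rank[k] = mean_rank
        i = j
    rank_sum_a = sum(r for r, (_, g) in zip(avg_rank, pooled) if g == 0)
    n_a = len(group_a)
    # U for group a: rank sum minus the minimum possible rank sum for n_a items.
    return rank_sum_a - n_a * (n_a + 1) / 2
```

For example, two completely separated groups such as [1, 2, 3] and [4, 5, 6] yield U = 0 for the lower group, the most extreme value possible.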

Dependent Variables

Four dependent variables were used to examine effects of visual signaling. Accuracy of task execution and the time participants needed to complete the tasks were intended as measures to inform about the participants’ overall performance on the tutorial task. Fixation times and number of fixations were selected as eye tracking measures to examine whether or not visual signaling effectively guides visual attention.

Accuracy on tasks

To determine the accuracy of task execution, I coded for each of the 8 tasks whether participants had executed the task correctly or not. Task execution was scored as correct (coded by the value “1”) if participants correctly executed all action steps required by the task. If one or more of the action steps were not executed as described in the tutorial, the task was scored as incorrect (coded by the value “0”). Scoring was done by post-hoc inspection of the screen videos produced by the SMI software during the eye tracking session for each participant. All coding was done by the author; no additional coders were involved.

Time for task completion

The screen videos also formed the basis for analyzing the time participants needed to complete the tutorial. Like task accuracy, time for task completion was obtained on a per-task basis using the time stamps that were recorded automatically with the screen videos. For each task, the start time was defined by the time stamp of the first video frame on which the tutorial page for the respective task was open. Likewise, the end time was defined as the time stamp of the last video frame, immediately before participants opened the tutorial page for the next task. The time for task completion for each task is then the difference between the end-time and start-time stamps.
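The timestamp arithmetic just described amounts to differencing consecutive page-onset times. A minimal sketch, assuming millisecond timestamps and illustrative names (this is not the SMI tooling):

```python
def task_durations(page_onsets_ms, session_end_ms):
    """Per-task completion times from video time stamps.

    page_onsets_ms: time stamp of the first frame on which each task page is open,
                    one entry per task, in presentation order.
    session_end_ms: time stamp marking the end of the last task.
    """
    boundaries = page_onsets_ms + [session_end_ms]
    # Each task lasts from its own page onset to the next page onset.
    return [boundaries[i + 1] - boundaries[i] for i in range(len(page_onsets_ms))]
```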

Fixation times and number of fixations

To determine the number of fixations and fixation times, two different types of so-called “areas of interest” (AOIs) were defined: “screenshot” and “relevant area.” The AOI “screenshot” encompassed the screenshots on each page of the tutorial. The AOI “relevant area” was embedded into the AOI “screenshot.” This AOI marked the area that was relevant in the context of the current task and was therefore highlighted in the condition “signaled.” All AOIs were applied to the conditions “signaled” and “nonsignaled,” and drawn manually using the BeGaze AOI editor.

Since the pages for tasks 1–8 of the tutorial differed regarding how many screenshots were used, as well as their size and position, special care had to be taken to ensure that the AOIs used on the different tutorial pages were identical in size and occupied identical positions across the conditions “signaled” and “nonsignaled.” I therefore first defined the AOIs in the condition “signaled,” then exported the AOI definitions and afterwards reimported them to the data set of condition “nonsignaled.” Positioning of AOIs is illustrated in Figure 4, which also shows an individual scan path pattern for one of the tutorial pages.

Figure 4. Positioning of AOIs “screenshot” and “relevant area” and sample scan path

For identifying fixations and saccades, BeGaze was used with unmodified default settings for event detection. Prior to analysis, data for fixation times and number of fixations were inspected visually for outliers. Since no clear outliers could be identified, no data were excluded.
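The two eye tracking measures (number of fixations and fixation time per AOI) reduce to point-in-rectangle tests over the detected fixations. The sketch below is illustrative only, not the BeGaze implementation; names, the millisecond fixation format, and the rectangle convention are assumptions. Note that a fixation inside a nested AOI is counted for every AOI containing it, matching the embedded “relevant area” AOI used here.

```python
def aoi_measures(fixations, aois):
    """Fixation count and total fixation time per area of interest (AOI).

    fixations: list of (start_ms, end_ms, x, y) fixation events.
    aois: dict mapping AOI name -> (left, top, width, height) in pixels.
    """
    measures = {name: {"fixations": 0, "fixation_time_ms": 0} for name in aois}
    for start, end, x, y in fixations:
        for name, (left, top, width, height) in aois.items():
            # A fixation is attributed to every AOI whose rectangle contains it.
            if left <= x < left + width and top <= y < top + height:
                measures[name]["fixations"] += 1
                measures[name]["fixation_time_ms"] += end - start
    return measures
```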

Data Analysis

All statistical analyses reported in this paper were conducted using the statistics software R, version 3.4.2 (R Core Team, 2017). The data were analyzed with mixed-effects regression modeling (Baayen, Davidson, & Bates, 2008) using the lme4 R package (Bates, Mächler, Bolker, & Walker, 2015). Over the last decade, this technique has become a standard method of analysis in psycholinguistic studies, both for the analysis of continuous outcomes (such as fixation times in this study) and for the analysis of categorical data (such as task accuracy). There are various reasons why mixed-effects modeling has been adopted so widely, which are discussed in detail in Baayen et al. (2008) and Jaeger (2008), among others (for a nontechnical introduction, see Balling, 2018). I selected this technique because, in contrast to alternative methods such as analysis of variance (ANOVA), mixed-effects regression modeling makes it possible to control for multiple random effects in a single analysis, such as random variation that is due to the particular selection of stimuli or systematic differences between participants. In the present study, random effects for tasks were included in addition to random effects for participants in order to capture variance that is due to individual properties of the 8 tutorial tasks and the screenshots they contained, such as differences regarding screenshot size, function, or positioning, as well as size and proportion of AOIs. A significant effect of a fixed factor in such a model therefore reflects differences that persist after the variance introduced by the particular set of participants and tasks has been identified and controlled for.

Regarding the procedure to follow when constructing a mixed-effects model for confirmatory hypothesis testing, several approaches have been proposed which differ mainly with respect to the random effects structure that models should include (Barr, Levy, Scheepers, & Tily, 2013; Bates, Kliegl, Vasishth, & Baayen, 2015; Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017). I followed Matuschek et al. (2017) and Bates, Kliegl et al. (2015), who recommend using the most parsimonious model that can be assumed to have generated the observed pattern of results. To identify the most parsimonious model, I first constructed a model with the maximum random effects structure supported by the factorial design (Barr et al., 2013). The model was then simplified in an iterative manner by identifying and removing parameters from the random effects structure with the smallest variance contribution. This process was continued as long as the simpler model did not differ significantly from the preceding model in terms of goodness of fit.
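The goodness-of-fit comparison between a simpler and a more complex nested model is typically a likelihood-ratio test. The study itself performed these comparisons in R with lme4; the following is only a minimal Python sketch of the decision rule for the common case of models differing by a single parameter (1 degree of freedom), with hypothetical log-likelihood values:

```python
from math import erfc, sqrt

def lrt_1df(loglik_simple, loglik_complex):
    """Likelihood-ratio test for nested models that differ by one
    parameter (1 df). The test statistic is twice the log-likelihood
    difference; for 1 df the chi-square survival function reduces to
    erfc(sqrt(stat / 2))."""
    stat = 2.0 * (loglik_complex - loglik_simple)
    p = erfc(sqrt(stat / 2.0)) if stat > 0 else 1.0
    return stat, p

# Hypothetical log-likelihoods: dropping one random-effect variance
# parameter barely changes the fit, so the simpler model is retained.
stat, p = lrt_1df(loglik_simple=-1523.4, loglik_complex=-1522.9)
keep_simpler = p > 0.05  # fit did not degrade significantly
```

A non-significant result means the extra random-effect parameter does not buy a reliably better fit, so the simpler model is kept, which is the parsimony criterion described above.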

In the results tables below, the final models arrived at are provided using the notation of Bates, Mächler et al. (2015), which specifies the dependent variable, the independent variable(s) entered as fixed effect(s) and—using parentheses—the random effect terms. Note that the random effect terms may differ from analysis to analysis, which is a consequence of the iterative approach used here.


Accuracy on Tasks

Mean accuracy scores and standard errors for each of the two tutorial conditions are provided in Table 2. As Table 2 shows, accuracy on tasks was higher in condition “signaled” compared to condition “nonsignaled.”

Table 2. Mean accuracy scores as proportion of tasks solved correctly (in %) and standard errors

signaled      nonsignaled
96.88 (1.40)  82.03 (3.60)

To analyze the effect of visual signaling on how accurately participants solved the tutorial tasks, a logistic mixed-effects analysis was conducted using the glmer function of the lme4 package, which is suited for the analysis of binary dependent variables (Jaeger, 2008). Since each participant contributed accuracy values for each of the 8 tutorial tasks, the analysis was based on a total of 256 observations. The mixed-effects model included the factor “screenshot” with levels “signaled” and “nonsignaled” as a fixed effect. Table 3 reports the parameter estimates of the model, the standard errors, the resulting z-values, and the associated probabilities. The analysis reveals that accuracy on tasks was significantly higher in condition “signaled” than in condition “nonsignaled.”

Table 3. Results of mixed-effects model for task accuracy

Formula: score_correct ~ screenshot + (1|task)

Contrast                      Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                   3.0423    0.5915      5.143    < 0.001
“signaled” vs. “nonsignaled”  2.1184    0.5696      3.719    < 0.001
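Because a logistic model estimates effects on the log-odds scale, it can help to map estimates back to probabilities with the inverse-logit function. This is a minimal Python sketch with round illustrative numbers, not the glmer output above; note also that conditional estimates from a mixed model need not reproduce the raw condition means exactly:

```python
from math import exp

def inv_logit(log_odds):
    """Map a value on the log-odds scale back to a probability."""
    return 1.0 / (1.0 + exp(-log_odds))

# Log-odds of 0 corresponds to a 50% success probability; positive
# log-odds push the probability nonlinearly toward 1.
p0 = inv_logit(0.0)  # 0.5
p3 = inv_logit(3.0)  # roughly 0.95
```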

Time for Task Completion

As described above, time for task completion was also determined on a by-task basis, again yielding 8 observations for each of the 32 participants. The mean duration per task for each condition, along with standard errors, is shown in Table 4.

Table 4. Mean time per task (in milliseconds) to complete the tasks and standard errors

signaled      nonsignaled
38864 (1054)  37846 (1441)

The data for time to task completion reveal a slight advantage for condition “nonsignaled” compared to condition “signaled.” Task duration data were analyzed using the lmer function of the lme4 package, with the factor “screenshot” included as a fixed effect. Table 5 reports the parameter estimates, the standard errors, and the resulting t-values. Because exact p-values cannot be computed for such models, I consider contrasts with a t-value greater than 2 as significant (Baayen et al., 2008).
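The “t greater than 2” criterion is roughly a two-sided test at the 5% level under a normal approximation, which is reasonable given many observations. A small stdlib-only Python sketch of that approximation (not part of the study's analysis):

```python
from math import erfc, sqrt

def approx_two_sided_p(t_value):
    """Approximate two-sided p-value for a t-statistic via the normal
    distribution: p ~= 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))."""
    return erfc(abs(t_value) / sqrt(2.0))

# |t| = 2 gives p of about 0.0455, just under the .05 threshold,
# which is why contrasts with |t| > 2 are treated as significant.
p_at_2 = approx_two_sided_p(2.0)
p_small_t = approx_two_sided_p(-0.501)  # clearly non-significant
```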

Table 5. Results of mixed-effects model for time to task completion

Formula: duration ~ screenshot + (1|subject) + (1|task)

Contrast                      Estimate  Std. Error  t-value
(Intercept)                   38355     3062        12.528
“signaled” vs. “nonsignaled”  -1019     2034        -0.501

As Table 5 shows, the difference between the means is not significant. I therefore conclude that the time participants needed to complete the tasks was not dependent on whether they worked with a tutorial containing screenshots with or without visual signaling.

Fixation Times

In order to determine whether signaling techniques, such as colored frames or arrows, indeed direct visual attention to screenshot areas that are relevant in the context of a task, a second factor was defined that distinguishes relevant and irrelevant screenshot areas. Relevant screenshot areas are highlighted by signaling and contain the information that readers should attend to. In this study, these areas are designated by the AOI “relevant area” (see Figure 4). By definition, all screenshot information not included within the AOI “relevant area” is considered irrelevant in the context of the task at hand. Fixation times for irrelevant areas were computed by subtracting all fixations that landed in the AOI “relevant area” from the fixations collected by the AOI “screenshot,” which marked the entire screenshot. This yielded a 2×2 design for the subsequent analyses, crossing the factor “area” (levels “relevant” and “irrelevant”) with the factor “screenshot” (levels “signaled” and “nonsignaled”).
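The subtraction used to derive fixation times for irrelevant areas can be sketched as follows. The per-trial AOI totals and field names here are hypothetical; the actual computation was done on BeGaze AOI exports:

```python
# Hypothetical per-trial AOI totals in milliseconds: the AOI
# "screenshot" covers the whole image, and the AOI "relevant area"
# is the signaled region nested inside it.
trials = [
    {"screenshot_ms": 2900, "relevant_ms": 1500},
    {"screenshot_ms": 2400, "relevant_ms": 800},
]

# Fixation time on irrelevant areas = time on the whole screenshot
# minus time on the nested relevant AOI.
for trial in trials:
    trial["irrelevant_ms"] = trial["screenshot_ms"] - trial["relevant_ms"]
```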

The mean fixation times for each of the four conditions are visualized in Figure 5 and are provided numerically along with factor means and grand mean as well as respective standard errors in Table 6.

Figure 5. Mean fixation times per task in relevant or irrelevant areas of the screenshots depending on the availability of visual signaling. Bars represent 95% confidence intervals.

Table 6. Mean fixation times per task with standard errors in parentheses

                            area
screenshot     relevant     irrelevant   mean
signaled       1560 (112)   1414 (106)   1487 (77)
nonsignaled     828 (81)    1568 (131)   1198 (80)
mean           1194 (73)    1491 (84)    1343 (56)

For statistical analysis, the lmer function of the lme4 package was used. The model included “screenshot” and “area” as fixed effects. Results of mixed-effects modeling are provided in Table 7.

Table 7. Mixed-effects model for fixation times

Formula: fixation_time ~ area * screenshot + (1|subject) + (1 + area||task)

Contrast         Estimate  Std. Error  t-value
(Intercept)      1342.6    195.0       6.886
area             296.8     276.4       1.074
screenshot       -289.0    184.0       -1.570
area:screenshot  887.1     176.4       5.028

As Table 7 shows, main effects were not significant, but there was a significant interaction between “area” and “screenshot.” To explore the interaction further, I performed post-hoc Tukey HSD comparisons using the lsmeans R package (Lenth, 2016). The comparisons revealed a significant difference between the conditions “signaled” and “nonsignaled” for relevant screenshot areas (1560 ms vs. 828 ms, t(45.21) = 3.59, p < .01), whereas the difference was not significant for irrelevant areas (1414 ms vs. 1568 ms, t(45.21) = -0.757, p = 0.45). I conclude that relevant screenshot areas indeed attract longer fixation times if visual signaling is used compared to when no visual signaling is used.

Number of Fixations

Number of fixations was analyzed using the same 2×2 factorial design and the same approach to mixed-effects modeling. The respective condition means are shown in Figure 6 and provided along with factor means, grand mean, and standard errors in Table 8.

Table 8. Mean number of fixations per task with standard errors in parentheses

                            area
screenshot     relevant     irrelevant   mean
signaled       4.21 (0.26)  5.09 (0.30)  4.65 (0.20)
nonsignaled    2.73 (0.25)  6.19 (0.47)  4.46 (0.29)
mean           3.47 (0.19)  5.64 (0.28)  4.55 (0.17)

Figure 6. Mean number of fixations per task in relevant or irrelevant areas of the screenshots depending on the availability of visual signaling. Bars represent 95% confidence intervals.

The data resulting from mixed-effects modeling are summarized in Table 9. As is evident from Table 9, the analysis revealed a very similar pattern, which is not surprising as there is a very strong overall correlation between fixation times and number of fixations (r=0.93). The only difference is that the main effect of “area” now reached significance as well, but it is qualified by a significant interaction.

Table 9. Mixed-effects model for number of fixations

Formula: fixation_count ~ area * screenshot + (1|subject) + (1 + area||task)

Contrast         Estimate  Std. Error  t-value
(Intercept)      4.55      0.64        7.166
area             2.16      0.85        2.540
screenshot       -0.19     0.57        -0.328
area:screenshot  2.58      0.52        4.916

Post-hoc comparisons show that the mean number of fixations is significantly different between the conditions “signaled” and “nonsignaled” for relevant screenshot areas (4.21 vs. 2.73, t(43.79) = 2.34, p < .05) and marginally significant for irrelevant screenshot areas (5.09 vs. 6.19, t(43.79) = -1.75, p = 0.09). I therefore derive the additional conclusion that relevant screenshot areas highlighted by visual signaling are not only fixated longer but also more often.
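The strong correlation between fixation times and number of fixations reported above (r = 0.93) is a standard Pearson coefficient. As a minimal, self-contained sketch (Python rather than the R used in the study, with made-up per-task values):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-task values: longer total fixation time tends to
# come with more fixations, so r comes out close to 1.
times = [820, 1400, 1560, 980, 1700]
counts = [3, 5, 6, 4, 7]
r = pearson_r(times, counts)
```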


Main Findings

The research reported here was designed to investigate whether and how visual signaling affects user performance in a “reading-to-do” scenario. More specifically, I wanted to know whether visual signaling in screenshots indeed directs visual attention of documentation users to the signaled areas. Related to that, I also examined whether directing visual attention to relevant screenshot areas would improve performance on executing the tutorial tasks.

Regarding overall user performance, I found a significant effect of visual signaling on overall task accuracy. Participants working with the tutorial containing screenshots with visual signaling made fewer errors compared to participants in the “nonsignaled” condition. No reliable difference between conditions was found for the time participants needed to execute the tutorial tasks.

Consistent with the effect of visual signaling on task accuracy, I found evidence that signaling triggers users to allocate more visual attention to screenshot areas that are relevant in the context of a certain task and therefore highlighted by arrows, frames, or similar techniques. If relevant screenshot areas are emphasized, users fixate on them more often and longer than when no signaling is used. At the same time, I observed a tendency for irrelevant screenshot regions to be fixated less often and with overall shorter fixation times. Taken together, the results imply that participants do not generally look at a screenshot more often or longer when visual signaling is used; rather, they look more often and longer at the right place, as intended. This pattern of effects is consistent with the account of visual signaling in the Cognitive Theory of Multimedia Learning (Mayer, 2009). According to this account, visual signaling does not simply increase overall interest in pictures but specifically helps the user to identify and select relevant information from a picture, thereby reducing the amount of working memory capacity that has to be devoted to extraneous processing.

Practical Implication and Suggestions for Future Research

When designing screenshots for software documentation, whether or not to use visual signaling is one of the important design decisions technical communicators have to make. The main practical implication of the current study is that the results, if confirmed by future studies, support the use of visual signaling in screenshots. The results suggest that signaling is a design dimension that strengthens the positive effects of screenshots on user performance: visual signaling elements guide the user’s visual attention to relevant information and thereby help the user identify it, which improves performance. In light of the findings reported here and findings from prior research, practitioners can derive the recommendation to consider using screenshots in software documentation containing procedures, but to design those screenshots carefully to leverage their full potential. Screenshots are more helpful if they are enriched with visual signaling elements that relate to the task at hand and appropriately support the communicative function of the screenshot.

Given that implementing and maintaining visual signaling elements requires an investment of time and resources, and given further that technical communicators typically work under pressure to reduce costs, the results of this study may provide an additional argument to justify these investments: they contribute to making users more effective. In this sense, the study attempts to contribute to a line of research that validates design decisions by demonstrating that they are grounded in empirical research supporting their supposed effects. This line of research has a long tradition in the field of technical communication (as shown, e.g., by van der Meij et al., 2009), and its relevance from a practitioner’s point of view has been confirmed recently by Carliner, Coppola, Grady, and Hayhoe (2011), and St.Amant and Meloncon (2016).

The effect of visual signaling on the accuracy of task execution was statistically significant, but it was rather small numerically. Note that accuracy on tasks was fairly high in general and reached an almost perfect score in condition “signaled,” which may have reduced the effect size. A follow-up study could use more complex tasks to check whether the effect size increases with increasing task complexity.

Although the current study demonstrates the effectiveness of signaling, it does not allow conclusions regarding which specific signaling techniques are effective in guiding the visual attention of the user and which aren’t. Another question that future research needs to address is whether the effect of visual signaling is modulated by other factors, such as user experience or reading goals of the user. As discussed above, prior research on screenshots suggests that such interactions with other factors can play a role. Of course, knowing about relevant factors is of interest both from a theoretical and from an applied perspective. Note that the importance of understanding boundary conditions for design decisions has also been emphasized in research on multimedia learning (Mayer, 2009; Sweller, Ayres, & Kalyuga, 2011).

The eye tracking method and the specific experimental setup developed to study the reading behavior of users in a “reading-to-do” scenario have proven very useful, which suggests that eye tracking can help to address the questions for future research raised above as well. A key feature of the experimental setup used here is that the software system and the documentation were presented simultaneously on the same screen. An advantage of this setup is that remote eye tracking systems can be used for data collection and that techniques for quantitative data analysis, such as the definition of areas of interest and comparisons across areas of interest, can be applied in an efficient way, simply because the screen provides a fixed point of reference.

This opens new possibilities to leverage the potential of eye tracking for determining accurately where users look while still allowing users to work with the software system and the respective documentation simultaneously in a fairly unconstrained way. In particular, the setup does not enforce a fixed sequence of reading and acting. Consequently, the setup described here opens the possibility of applying eye tracking to new areas of technical communication research that extend the scenarios discussed in Cooke (2005). One such possibility is to use eye tracking to study reading strategies and aspects of information selection in manuals beyond visual signaling. For example, eye tracking investigations could reveal which types of information users access spontaneously in manuals, at which point they access certain information (e.g., before, during or after carrying out an action), and which variables (e.g., reading goal or level of expertise) influence the selection process (van der Meij et al., 2009; Ummelen, 1999).

Another area in which eye tracking—when used in a setup that enables parallel use of documentation and software system—can prove very useful is to study how users coordinate reading and acting in task execution, and how they switch attention between documentation and software system when working with procedures. Switching attention is an important process that mediates between selecting information contained in a manual and putting this information to actual use (Boekelder & Steehouder, 1998; van der Meij, 1998; van der Meij & Gellevij, 1998). An important task for future research is to determine which design techniques support the attention switching process effectively and, therefore, contribute to more effective and efficient task execution. For example, this study has shown that visual signaling in screenshots helps the user to identify screenshot information that is relevant in the context of the current task. With respect to attention switching and the coordination between reading and acting, the question arises whether visual signaling also helps the user to locate the relevant parts of the user interface that are depicted by the screenshot and emphasized by signaling. Eye tracking may pave the way for a deeper understanding of attention switching processes, which in turn could lead to specific design recommendations that help add additional value to technical documentation.


References

Acharya, K. R. (2017). User value and usability in technical communication: A value-proposition design model. Communication Design Quarterly Review, 4(3), 26–34.

Alexander, K. P. (2013). The usability of print and online video instructions. Technical Communication Quarterly, 22, 237–259.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.

Balling, L. W. (2018). No effect of writing advice on reading comprehension. Journal of Technical Writing and Communication, 48, 104–122.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv.org. Retrieved from https://arxiv.org/pdf/1506.04967.pdf.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Boekelder, A., & Steehouder, M. (1998). Selecting and switching: Some advantages of diagrams over tables and lists for presenting instructions. IEEE Transactions on Professional Communication, 41, 229–241.

Boekelder, A., & Steehouder, M. (1999). Switching from instructions to equipment: the effect of graphic design. In H. Zwaga, T. Boersema, & H. Hoonhout (Eds.), Visual information for everyday use: Design and research perspectives (pp. 67–73). London, UK: Taylor & Francis.

Carliner, S., Coppola, N., Grady, H., & Hayhoe, G. (2011). What does the transactions publish? What do transactions’ readers want to read? IEEE Transactions on Professional Communication, 54, 341–359.

Cooke, L. (2005). Eye tracking: How it works and how it relates to usability. Technical Communication, 52, 456–463.

Cooke, L. (2010). Assessing concurrent think-aloud protocol as a usability test method: A technical communication approach. IEEE Transactions on Professional Communication, 53, 202–215.

Cooke, L., Taylor, A. G., & Canny, J. (2008). How do users search Web home pages? Technical Communication, 55, 176–194.

Duchowski, A. T. (2017). Eye tracking methodology (3rd ed.). Cham, Switzerland: Springer International Publishing.

Elling, S., Lentz, L., & Jong, M. de. (2012). Combining concurrent think-aloud protocols and eye-tracking observations: An analysis of verbalizations and silences. IEEE Transactions on Professional Communication, 55, 206–220.

Farkas, D. K. (1999). The logical and rhetorical construction of procedural discourse. Technical Communication, 46, 42–43.

Gellevij, M., & van der Meij, H. (2002). Screen captures to support switching attention. IEEE Transactions on Professional Communication, 45, 115–122.

Gellevij, M., & van der Meij, H. (2004). Empirical proof for presenting screen captures in software documentation. Technical Communication, 51, 224–258.

Gellevij, M., van der Meij, H., Jong, T. de, & Pieters, J. (1999). The effects of screen captures in manuals: A textual and two visual manuals compared. IEEE Transactions on Professional Communication, 42, 77–91.

Gellevij, M., van der Meij, H., Jong, T. de, & Pieters, J. (2002). Multimodal versus unimodal instruction in a complex learning context. The Journal of Experimental Education, 70(3), 215–239.

Guillemette, R. A. (1989). Usability in computer documentation design: Conceptual and methodological considerations. IEEE Transactions on Professional Communication, 32, 217–229.

Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & van de Weijer, J. (2011). Eye tracking: A comprehensive guide to methods and measures. Oxford, UK: Oxford University Press.

Irrazabal, N., Saux, G., & Burin, D. (2016). Procedural multimedia presentations: The effects of working memory and task complexity on instruction time and assembly accuracy. Applied Cognitive Psychology, 30, 1052–1060.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4), 434–446.

Koning, B. B. de, Tabbers, H. K., Rikers, R. M., & Paas, F. (2009). Towards a framework for attention cueing in instructional animations: Guidelines for research and design. Educational Psychology Review, 21(2), 113–140.

Lenth, R. V. (2016). Least-Squares Means: The R package lsmeans. Journal of Statistical Software, 69(1), 1–33.

Lorch, R. F. (1989). Text-signaling devices and their effects on reading and memory processes. Educational Psychology Review, 1(3), 209–234.

Martin-Michiellot, S., & Mendelsohn, P. (2000). Cognitive load while learning with a graphical computer interface. Journal of Computer Assisted Learning, 16(4), 284–293.

Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.

Mautone, P. D., & Mayer, R. E. (2001). Signaling as a cognitive guide in multimedia learning. Journal of Educational Psychology, 93(2), 377.

Mayer, R. E. (2009). Multimedia learning (2nd ed.). Cambridge, UK: Cambridge University Press.

Mayer, R. E. (2010). Unique contributions of eye-tracking research to the study of learning with graphics. Learning and Instruction, 20(2), 167–171.

Moreno, R. (2007). Optimising learning from animations by minimising cognitive load: Cognitive and affective consequences of signalling and segmentation methods. Applied Cognitive Psychology, 21(6), 765–781.

Moreno, R., & Mayer, R. (2007). Interactive multimodal learning environments. Educational Psychology Review, 19(3), 309–326.

Nielsen, J., & Pernice, K. (2010). Eyetracking web usability. Berkeley, CA: New Riders.

Nowaczyk, R. H., & James, E. C. (1993). Applying minimal manual principles for documentation of graphical user interfaces. Journal of Technical Writing and Communication, 23, 379–388.

Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology (2006), 62(8), 1457–1506.

Rayner, K., & Pollatsek, A. (2006). Eye-movement control in reading. In M. Traxler & M. Gernsbacher (Eds.), Handbook of psycholinguistics (pp. 613–657). Cambridge, MA: Academic Press.

R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Redish, J. C. (1989). Reading to learn to do. IEEE Transactions on Professional Communication, 32, 289–293.

Redish, J. C. (2010). Technical communication and usability: Intertwined strands and mutual influences. IEEE Transactions on Professional Communication, 53, 191–201.

Richter, J., Scheiter, K., & Eitel, A. (2016). Signaling text-picture relations in multimedia learning: A comprehensive meta-analysis. Educational Research Review, 17, 19–36.

Spyridakis, J. H. (1989). Signaling effects: A review of the research—Part I. Journal of Technical Writing and Communication, 19, 227–240.

St.Amant, K., & Meloncon, L. (2016). Reflections on research: Examining practitioner perspectives on the state of research in technical communication. Technical Communication, 63, 346–364.

Sweller, J., Ayres, P., & Kalyuga, S. (Eds.). (2011). Cognitive load theory. New York, Dordrecht, Heidelberg, London: Springer.

Sweller, J., & Chandler, P. (1994). Why some material is difficult to learn. Cognition and Instruction, 12(3), 185–233.

Ummelen, N. (1999). Studying the process of information selection in manuals: a review of four instruments. Document Design, 1(2), 119–130.

van der Meij, H. (1996). A closer look at visual manuals. Journal of Technical Writing and Communication, 26, 371–383.

van der Meij, H. (1998). Optimizing the joint handling of manual and screen. In J. M. Carroll (Ed.), Minimalism beyond the Nurnberg Funnel (pp. 275–309). Cambridge, MA: MIT Press.

van der Meij, H. (2000). The role and design of screen images in software documentation. Journal of Computer Assisted Learning, 16, 294–306.

van der Meij, H., & Gellevij, M. (1998). Screen captures in software documentation. Technical Communication, 45, 529–543.

van der Meij, H., & Gellevij, M. (2002). Effects of pictures, age, and experience on learning to use a computer program. Technical Communication, 49, 330–339.

van der Meij, H., & Gellevij, M. (2004). The four components of a procedure. IEEE Transactions on Professional Communication, 47, 5–14.

van der Meij, H., Karreman, J., & Steehouder, M. (2009). Three decades of research and professional practice on printed software tutorials for novices. Technical Communication, 56, 265–292.

van Genuchten, E., Hooijdonk, C., Schüler, A., & Scheiter, K. (2014). The role of working memory when ‘learning how’ with multimedia learning material. Applied Cognitive Psychology, 28(3), 327–335.

Williams, T. R., Mulligan, C., Koprowicz, K., Miller, J., Reimann, C., & Wang, D.-S. (2005). Does isolating a visual element call attention to it? Results of an eye-tracking investigation of the effects of isolation on emphasis. Technical Communication, 52, 21–27.

About the Author

Michael Meng is a professor of Applied Linguistics at Merseburg University of Applied Sciences (Germany) where he teaches text production, research methods, and usability. Before joining the university faculty, he worked as a technical writer and localization specialist for an international software company. His research focuses on using empirical methods to study the effects of linguistic and design variables on the usability of information products in technical communication. Contact: michael.meng@hs-merseburg.de.

Manuscript received 12 March 2018, revised 29 May 2018; accepted 3 July 2018.