68.1 February

The Prevalence and Utility of Formal Features in Screen-Capture Tutorial Videos

By Christopher Brett Jaeger, Joshua Little, and Daniel T. Levin


Purpose: Screen-capture video is an increasingly popular vehicle for communicating information online. We posit that screen-capture video represents a distinct genre of technical communication, which leverages a specific set of formal features to communicate information to viewers. We propose and evaluate an initial catalog of formal features, grouping them in four categories: attention cues, segmentation cues, content features, and vocal performance.

Method: To evaluate our catalog, we completed a systematic survey of the features of 200 screen-capture tutorial videos from YouTube.

Results: We found that many of the features in our catalog are already being leveraged by screen-capture creators and, further, that use of these features is correlated with video viewership.

Conclusion: We provide a practical catalog of formal features that screen-capture creators use to effectively convey information to viewers, and demonstrate that these features are predictive of video viewership. Further, our results suggest that certain features, like vocal performance and segmentation cues, are especially predictive of viewership.

KEYWORDS: screencast, screen-capture video, tutorial, online communication, attention

Practitioner’s Takeaway:

  • The catalog of formal features we describe is a resource for authors creating instructional screen-capture videos and for researchers investigating those videos’ effectiveness.
  • Authors of instructional screen-capture videos should pay special attention to vocal expressiveness and disfluencies, to their use of mouse movements and “gestures” to guide visual attention, and to their use of segmentation cues to break content up for viewers.
  • Many screen-capture videos do not include vocal narration, and, for these videos, the authors’ use of attention and segmentation cues may be especially important.



Screen-capture video is an increasingly common vehicle for communicating technical information online (Morain & Swarts, 2012; Selber, 2010). Screen-capture videos, sometimes called “screencasts,” are digital recordings of the output displayed on a computer screen, often accompanied by vocal narration (Udell, 2004; Oxford English Dictionary, 2018). Screen-capture videos are a staple of online education: Teachers frequently screen-capture PowerPoint slides accompanying lectures or illustrate problems using an interactive whiteboard. Screen-capture videos are also frequently used in less formal instruction. This article focuses on screen-capture tutorials, a subset of screen-capture videos in which an instructor demonstrates uses or features of a software application for viewers. On YouTube, screen-capture tutorials are finding vast audiences for applications ranging from word processing to video editing.

But what attributes make for an effective screen-capture tutorial? To tackle this question, it is helpful to consider the basic features and structural principles that are available to, and used by, screencast authors. It is difficult to assess the qualities that differentiate effective screen-capture tutorials from less effective ones without first understanding the array of formal features that authors use to communicate information to viewers.

In this paper, we offer a psychologically grounded framework for understanding the formal features of screen-capture tutorials. Drawing upon research in cognitive science, technical communication, and cinema, our framework includes four categories of features: attention cues, segmentation cues, content features, and vocal performance. We then present data from a survey of 200 instructional screencasts from YouTube, assessing their use of features in these categories and statistically evaluating the degree to which these features predict video viewership.


Research on Online Instructional Videos

In recent years, researchers across a variety of fields have shown increased interest in online instructional videos, including—but not limited to—screencasts (e.g., Morain & Swarts, 2012; Selber, 2010; Giannakos, 2013; Ritzhaupt et al., 2015). Some researchers have cataloged properties of online instructional videos in specific content domains. For example, Chan, Choo, and Woods (2013) searched YouTube for “principles of animation” and noted that the search results included live lectures (i.e., videos of an instructor speaking), screen-capture demonstrations, and “mash-ups,” and that almost all of the videos were accompanied by audio. The researchers concluded that YouTube provided animation students with a wide variety of animation-related videos, but that most of the videos were exemplars that would be of limited use to students who did not already have a basic understanding of animation principles. Tewell (2010) surveyed 1,070 online tutorial videos produced by libraries for visual arts students, finding that the majority of such videos (778) were screencasts. Tewell reported that the videos tended to be brief, averaging approximately 4 minutes in duration, and typically addressed topics suitable for step-by-step instruction such as database search strategies.

Other researchers have studied the preferences of viewers of online instructional videos, using a variety of methods for assessing viewer preferences. For example, Guo, Kim, and Rubin (2014) examined how video features related to viewer engagement (i.e., watch time) using data from the edX MOOC (massive open online course) platform. The best predictor of engagement was video length, with students watching a greater portion of shorter videos. Specifically, students’ engagement dropped off in videos longer than 12 minutes: Students’ median engagement time hovered around 6 minutes for videos between 6 and 12 minutes in length, but fell to less than 4 minutes for videos more than 12 minutes in length. Guo, Kim, and Rubin also found that students engaged longer with videos in which the instructor spoke rapidly (faster than 185 wpm). In another study, ten Hove and van der Meij (2015) surveyed the physical design characteristics (e.g., resolution, use of on-screen text) of a sample of 75 instructional videos on YouTube, identifying a variety of differences in these characteristics across video popularity classes (e.g., more popular videos tended to be higher resolution and include more on-screen texts, among other things).

Morain and Swarts (2012) conducted a descriptive study of 46 user-rated screen-capture tutorials and developed a rubric for assessing their instructional content, noting certain characteristics common to highly rated, poorly rated, and average videos. Some other researchers have examined relations between instructional video features and viewer preferences by soliciting post-video feedback from viewers or testing content learning (Ritzhaupt et al., 2008; Cross et al., 2013; Ilioudi et al., 2013). Growing interest in the relation between the properties of online tutorial videos and viewer preferences has yielded valuable rubrics and “best practice” guidelines for instructors (e.g., Morain & Swarts, 2012; Swarts, 2012; Sugar et al., 2010; Pflugfelder, 2013). Recently, Chong (2018) surveyed the ten most-viewed beauty tutorials on YouTube and found that a majority of those videos followed best practices suggested by Swarts (2012).

This paper presents a survey of YouTube screen-capture tutorials that contributes to the existing literature in three ways. First, this paper focuses exclusively on screen-capture tutorials. As the screen-capture tutorial continues to grow in popularity, we aim to identify central formal features that define the genre—features that authors might leverage to improve communication. We discuss these features in the next section and use them to structure our survey. Second, this paper statistically examines the relation between formal features of screen-capture tutorials and the number of views they get on YouTube, identifying the types of features that are most predictive of viewership. Third, our survey includes 200 videos, making it, to our knowledge, the largest survey to date of screen-capture tutorials on YouTube.

Screen-Capture as a Distinct Genre

Screen-capture video dates to the 1990s, when companies like TechSmith (developer of SnagIt and Camtasia) and IBM/Lotus (developer of ScreenCam) pioneered the technology (Williams & Goodwin, 2007; TechSmith, 2018). As increasingly widespread broadband internet access facilitated distribution of screen-capture videos, an online community interested in the genre developed. Indeed, the term screencast originated when a blog post by journalist Jon Udell asked readers what he should call the developing genre (Udell, 2004; Williams & Goodwin, 2007). Screen-capture software tutorials have become a common tool for users who want to learn new software features and for companies training employees to use challenging software.

The premise of this article is that screen-capture videos can be usefully understood as a distinct genre of communication: “Genres are identified both by their socially recognized communicative purpose and by common characteristics of the form” (Yates et al., 1997; for useful discussions of genre, see Henze, 2012; Miller, 1984). Communication within a particular genre often uses a set of associated formal features to help communicators convey, and viewers understand and interpret, content (Getto et al., 2011).

For example, in the early 20th century, D. W. Griffith developed the core formal features that define cinema as a genre (Slide, 2012). He did so by carefully observing how his audiences responded to his films, then iteratively tailoring the films based on his observations. Ultimately, this produced a set of principles of editing that filmmakers still use to shape viewers’ cognition for desired effect (e.g., the use of a close-up to accentuate a dramatic moment). Collectively, these practices constitute a set of structural principles that support both perceptual continuity and conceptual integration of meaning by using broadly applicable visual cues that interact with meaningful events (for review see Smith et al., 2012; Levin & Baker, 2017).

Screen-capture authors share many functional goals with filmmakers. For example, like filmmakers, screen-capture authors need to direct viewer attention to important objects, properties, and concepts. But screen-capture authors have more limited visual tools at their disposal to achieve these goals, as the video component of screencasts typically consists of output from a single computer screen. Thus, authors have been forced to devise new methods for using the limited set of tools at their disposal to shape viewers’ cognition in support of communication. We suggest that one potentially fruitful way to analyze screen-capture video is by treating it as a unique genre with its own distinct set of formal features.

Existing literature in technical communication and cognitive science suggests that certain types of formal features might be especially important in orienting the attention of screencast viewers, both to on-screen events and to meaningful internal representations that viewers build as they integrate information from the screencast. Drawing on the existing literature, we identify four categories of formal features of screencasts: attention cues, segmentation cues, content features, and vocal performance. In the following paragraphs, we define each category and relate it to existing research, flagging links between categories of formal features and basic cognitive skills used in viewing screen-capture tutorials. We list particular formal features in each of our four categories later in the paper.

Attention cues

When we refer to attention cues, we mean methods for directing viewers’ visuospatial attention. Filmmakers frequently use the gaze of on-screen individuals to cue viewers’ attention. Humans track, follow, and interpret the gaze of others, often by default (Butterworth & Jarrett, 1991; Moll & Tomasello, 2004; Samson et al., 2010; Baker et al., 2016), and filmmakers exploit this tendency by using the gaze of people on-screen to direct the viewer’s attention to the important part of a scene (Levin & Baker, 2017). But without people on-screen, screen-capture authors must use other cues to direct visuospatial attention. These other cues might include clicking and dragging to highlight areas of interest on the screen or circling those areas with the cursor. Further, screen-capture authors might communicate information typically conveyed by hand gestures in the physical world through analogous “gestures” with on-screen objects, such as the cursor.

Segmentation cues

Research suggests that viewers will best understand and remember video content if they can organize it within a coherent event structure (Zacks et al., 2007; Zacks et al., 2010). When researchers ask participants to segment videos into discrete events (for example, by pressing the space bar when a new event begins), they find that “better” event segmentation (i.e., segmentation more in line with other participants’ segmentation) predicts better subsequent memory for the videos’ contents (Sargent et al., 2013; Bailey et al., 2013; Kurby & Zacks, 2011). Viewers are also more likely to notice changes to objects and properties in videos at event boundaries (Baker & Levin, 2015), while encoding of details within an event can be surprisingly sparse (Levin & Varakin, 2004). These findings suggest that providing clear event structure—for example, by visually or verbally marking breaks between steps in a sequence—can help filmmakers convey information in a way that viewers will remember. But it can be difficult to provide clean, coherent event boundaries in screencasts, since they conventionally lack cinema-style edits (e.g., cuts between scenes). However, screencast creators have devised some cues that can be used to create clearer event structure. For instance, creators can use the appearance of on-screen text to signify the beginning of a new event, or use “fast-forward” effects or “fades” to separate and highlight the critical steps in a procedure. By using these cues, screencast creators might help viewers segment the video into meaningful steps or units and, consequently, facilitate understanding.

Content features

The way content is structured in screencasts can vary in a number of ways (Morain & Swarts, 2012). For example, some videos use introductions and closings to structure content, while others do not (Tewell, 2010). Some explicitly state learning objectives at the outset, often through the use of nonspatial text (e.g., text superimposed on the screen-capture for purposes other than labeling on-screen objects) (Chong, 2018; Swarts, 2012). Some screencast tutorials even show footage of the author talking before or after screen-capture components to provide context. The choices that authors make in structuring their instructional content will inevitably influence viewers’ cognition (Mayer, 2002), just as the ways in which filmmakers choose to structure their narratives (e.g., through heavy foreshadowing or by combining action with voiceover) influence audience experience.

Vocal performance

Viewers usually do not see content creators during screencasts, but content creators frequently provide vocal narration of the actions. Because this off-screen narration often plays a critical role in guiding viewers’ attention and providing context for on-screen actions, aspects of the narrator’s vocal performance might be especially important for effective communication (Mohamad Ali et al., 2011). Relevant aspects might include the narrator’s enthusiasm, speech rate, vocal expressiveness, and fluidity. Guo, Kim, and Rubin (2014) found that, in the context of MOOC videos, the speaker’s rate of speech—which they interpreted as a proxy for enthusiasm level—was predictive of viewer engagement.

In this paper, we use the preceding four categories to create and organize a catalog of formal features of screen-capture tutorials (presented in the Method section). We then survey 200 instructional screencasts from YouTube and assess the degree to which they used features falling within our categories. We report the prevalence of formal features in each category, then investigate how the use of formal features relates to video viewership. Our results suggest that our catalog captures many formal features that are predictive of video viewership.


Video Sample

We selected a sample of 200 YouTube screen-capture tutorials that met several criteria. We began by compiling a list of the top 10 best-selling applications from each category in the Apple App Store, excluding the Games category, giving us a total of 190 applications. Then, for each application, we searched YouTube for “[application] tutorial,” limiting results to videos four minutes long or shorter. Based on these search results, we chose a subset of 34 applications for which the results suggested it would be relatively easy to find relevant videos (i.e., the first two search results that we screened appeared to be relevant screen-capture tutorials).

For each of these 34 applications, we randomly[1] viewed videos on the first page of search results to locate up to seven relevant screen-capture tutorials. We moved to the next application either when we found a seventh relevant video or when five attempts failed to produce a relevant video.[2] This methodology produced an initial list of 126 screen-capture tutorials.

Fourteen of the 34 applications required, at most, eight random viewings to find seven relevant screen-capture tutorials. For these 14 applications, we repeated our process to find up to nine additional screen-capture tutorials. This resulted in 109 additional screen-capture tutorials.

Thus, we found a total of 235 relevant YouTube screen-capture tutorials relating to 34 applications. Our goal was to code 200 screen-capture tutorials; we oversampled to 235 in anticipation of having to eliminate some during the coding process. Thirty videos were eliminated: iPad and mobile device screen-capture tutorials were eliminated due to the absence of a cursor, and videos that served as global introductions to a series of specific tutorials were eliminated due to their non-instructive function. This left us with 205 screen-capture tutorials, of which we randomly cut 5 to reach our target of 200.

Our final sample of 200 screen-capture tutorials included tutorials for software in the following content areas: graphics & design (e.g., Affinity Designer), medical (e.g., Human Anatomy Atlas), music (e.g., GarageBand), photography (e.g., Photoshop), productivity (e.g., 1Password), social networking and communication (e.g., iMessage), utilities (e.g., Disk Doctor), and video (e.g., Final Cut Pro). The majority of videos were posted by individual users of the software, though some were posted by the companies producing the software. The average “age” (time since posting) of the tutorials in our sample was 6.7 months. The mean number of views for the videos in the sample was 9,359, but that was skewed by one video that had been viewed over 1,000,000 times and another that had been viewed over 300,000. The median number of views for the videos in the sample was 457.
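The gap between the mean and the median reported above is what one would expect when a few videos attract very large audiences. The following minimal sketch illustrates the effect; the view counts are hypothetical values chosen for illustration and are not drawn from our sample.

```python
from statistics import mean, median

# Hypothetical view counts for five videos (illustrative only, not survey data).
# One heavily viewed video dominates the total.
views = [120, 300, 457, 900, 1_000_000]

mean_views = mean(views)      # pulled far upward by the single outlier
median_views = median(views)  # middle value, unaffected by the outlier's size

print(f"mean = {mean_views:.1f}, median = {median_views}")
```

Because a single extreme value can move the mean by orders of magnitude, the median is the more representative summary of typical viewership in a sample like this one.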

Video Coding

Each of the 200 screen-capture tutorials in our sample was viewed and coded, independently, by two raters. First, each video was coded by one of three omnibus raters. Omnibus raters viewed the videos on YouTube using their personal laptop computers, recorded video statistics (e.g., duration on YouTube, number of views), and coded for all of the formal features included in our catalog (described below).[3] Subsequently, a blind rater independently viewed the videos on a laboratory computer, where they were embedded in a custom program that hid video statistics (e.g., number of views and subscribers). The blind rater rated each video only on three dimensions of quality: content quality, production quality, and effective guidance of attention.

The omnibus raters’ primary task was to code each video for a variety of features within the four categories of our catalog (attention cues, segmentation cues, content features, and vocal performance). These features are described below.

Attention cues. Omnibus raters coded the frequency of seven attention cues:

  1. Cursor Location Highlighting: use of highlighting or other graphical cues (e.g., a circle around the cursor) to call attention to the cursor as it moves around the screen. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  2. Cursor Click Highlighting: use of highlighting or other graphical cues (e.g., a circle around the cursor) to call attention to the cursor only when the cursor is clicked. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  3. Screen Zooms: transitioning from a full-screen view to a “close-up” of a region of interest. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  4. Goal Region Highlighting: highlighting a particular region of the screen relevant to the narrator’s task. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  5. Spatial Text: text appears on-screen in a spatial location relevant to its meaning (e.g., labeling a region of interest). Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = 4–10 times, 4 = frequent.
  6. Deictic Mouse Gestures: intentional cursor movements that refer to or highlight part of the screen. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = 4–10 times, 4 = frequent.
  7. Deictic Hand Gestures: movements of the narrator’s hand, visible in the video, that refer to or highlight part of the screen. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = 4–10 times, 4 = frequent.

We calculated a composite attention cue score for each video by taking the mean of the video’s standardized[4] ratings for each of the seven attention cues. Raters also responded to a catch-all “other attention cues” category, which was coded dichotomously: Videos were scored ‘1’ if the rater identified the use of at least one type of attention cue not included in the catalog, and ‘0’ if the rater did not.
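The composite calculation can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the function names and the example ratings are hypothetical, and we assume here that standardizing means converting each feature's ratings to z-scores across videos using the population standard deviation.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert one feature's raw ratings to z-scores across videos."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; a sample SD would also work
    if sd == 0:
        # A feature that never varies cannot distinguish videos; score it as 0.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def composite_scores(ratings_by_feature):
    """Average each video's z-scores across all features.

    ratings_by_feature maps a feature name to a list of raw ratings,
    one per video, with videos in the same order for every feature.
    """
    z_columns = [standardize(col) for col in ratings_by_feature.values()]
    # zip(*...) regroups the per-feature columns into per-video rows.
    return [mean(row) for row in zip(*z_columns)]

# Hypothetical ratings for four videos on three of the seven attention cues.
example = {
    "screen_zooms": [0, 1, 3, 0],
    "spatial_text": [0, 0, 4, 2],
    "deictic_mouse_gestures": [1, 4, 4, 0],
}
scores = composite_scores(example)
```

Standardizing before averaging keeps features rated on the four-point guideline from being outweighed by features rated on the five-point guideline.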

Segmentation cues. Omnibus raters coded the frequency of five segmentation cues:

  1. Speed Changes: the speed at which the video plays is sped up or slowed down for specific parts of the video (e.g., the video “fast forwards” through a part that demonstrates a redundant step). Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  2. Between-Scene Cuts: visible cuts (breaks in recording) between scenes or events. Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  3. Ellipses: omitting intervals of video within an event (e.g., the video completely skips a part that would demonstrate a redundant step). Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = frequent.
  4. Non-Cut Cues: manipulations of the screen other than cuts to signify that one event is ending and another is beginning (e.g., fade or dissolve effects). Rating guideline: 0 = none, 1 = some.
  5. Nonspatial Text to Segment: text appears on-screen to signify the beginning or end of an event (e.g., the words “Step Two” appearing on-screen as the narrator begins a new step).

We calculated a composite segmentation cue score for each video by taking the mean of the video’s standardized ratings for each of the five segmentation cues. Raters also responded to a catch-all “other segmentation cues” category, which was coded dichotomously: Videos were scored ‘1’ if the rater identified the use of at least one type of segmentation cue not included in our catalog, and ‘0’ if the rater did not.

Content features. Raters coded the content of the videos on nine dimensions.

  1. Face Shown: the narrator’s face appears on-screen at some point during the video. Rating guideline: 0 = absent, 1 = present.
  2. Vocal Narration: the video (or at least some part of it) is accompanied by vocal narration. Rating guideline: 0 = absent, 1 = present.
  3. Instructor Introduction: the instructor introduces himself or herself to the viewer either vocally (e.g., “Hello, my name is Bill”) or through on-screen text or graphics (e.g., Bill’s name appears on screen). Rating guideline: 0 = absent, 1 = present.
  4. Graphical Introduction: an on-screen graphic related to the instructor or video content appears early in the video (e.g., a logo for the company creating the video; a title screen reflecting the video topic). Rating guideline: 0 = absent, 1 = present.
  5. Content Introduction: the instructor describes, either vocally or through on-screen text or graphics, what the video will cover before beginning the substantive instruction. Rating guideline: 0 = absent, 1 = present.
  6. Background Music: the video (or at least some part of it) is accompanied by music that was not part of the tutorial itself. Rating guideline: 0 = absent, 1 = present.
  7. Mouse Click Errors: the narrator clicks in the wrong place on the screen (e.g., the narrator attempts to click an “OK” button and misses; the narrator clicks on the wrong menu before correcting himself or herself and selecting the correct menu). Rating guideline: Raters recorded a count of the total number of mouse click errors.
  8. Non-Deictic Mouse Movements: mouse movements that are expressive but lack spatial reference (e.g., moving the mouse rapidly around the screen to signify frustration). Rating guideline: 0 = never, 1 = once, 2 = 2–3 times, 3 = 4–10 times, 4 = frequent.
  9. Nonspatial Text: on-screen text that does not label a particular area or serve a segmentation function (e.g., a ‘This Will Not Work in Windows’ disclaimer on the bottom of the screen).

We calculated a composite content features score for each video by taking the mean of the video’s standardized ratings for each of the nine dimensions of content features (with the Mouse Click Errors dimension reversed).

Vocal performance. Of the 200 surveyed videos, 139 included narration. For these videos, the omnibus raters coded two aspects of narration quality: vocal expressiveness and the prevalence of disfluencies in the narrator’s speech. By disfluencies, we mean interruptions in the flow of a narrator’s speech, such as long pauses, repetitions of words or syllables, or distracting use of vocal fillers such as “um” and “like.” For vocal expressiveness, raters used a seven-point scale with anchor points of 1 = inexpressive and 7 = highly expressive. For disfluencies, raters used the following seven-point scale: 1 = none; 2 = very few, not disruptive; 3 = very few, slightly disruptive; 4 = few, somewhat disruptive; 5 = few, disruptive; 6 = many, very disruptive; 7 = many, extremely disruptive. We calculated a composite vocal performance score for each video by averaging its standardized rating for expressiveness and its reverse-scored, standardized rating for disfluencies.
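A minimal sketch of the reverse-scoring step follows. The function names and the example ratings are hypothetical, and we assume that reverse-scoring flips the seven-point scale (so that 1 becomes 7) before standardization, so that higher values indicate more fluent speech.

```python
from statistics import mean, pstdev

SCALE_MIN, SCALE_MAX = 1, 7  # both vocal ratings used seven-point scales

def reverse_score(rating):
    """Flip a 1-7 rating so that 7 becomes 1, 6 becomes 2, and so on."""
    return SCALE_MIN + SCALE_MAX - rating

def zscores(values):
    """Standardize ratings across videos (assumes the ratings vary)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def vocal_performance(expressiveness, disfluencies):
    """Average standardized expressiveness with standardized fluency."""
    z_expressive = zscores(expressiveness)
    z_fluent = zscores([reverse_score(d) for d in disfluencies])
    return [mean(pair) for pair in zip(z_expressive, z_fluent)]

# Hypothetical ratings for three narrated videos.
expressiveness = [2, 5, 7]
disfluencies = [6, 3, 1]   # higher = more disruptive, hence reverse-scored
vocal = vocal_performance(expressiveness, disfluencies)
```

After reverse-scoring, both components point in the same direction, so a higher composite reflects a more expressive and more fluent narrator.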

Ratings Evaluating Quality

In addition to being coded for the cataloged features, each video tutorial was also rated on three dimensions of quality: content quality, production quality, and effective guidance of attention. These quality ratings were provided both by the video’s original omnibus rater and, more importantly, by the blind rater. We had the blind rater evaluate quality because of the potential that the omnibus raters’ quality ratings were biased: Omnibus raters had access to possible heuristics for estimating video quality (e.g., the omnibus raters could have used views as a basis for quality ratings), and also could have been biased by the way they viewed the videos (e.g., omnibus raters on the lookout for deictic mouse gestures and segmentation cues may view videos in fundamentally different ways than the typical YouTube viewer, who watches the video more holistically). To address these concerns, the blind rater viewed videos on a desktop computer in our lab, using a program that ensured that the blind rater was blind to video statistics (views, subscriptions, etc.) and to all of the omnibus raters’ coding. Thus, these factors could not influence the blind rater’s quality ratings. The blind rater provided evaluations of content quality, production quality, and effective guidance of attention for 195 of the 200 videos (five of the videos were no longer available when the blind rater began coding).

All raters were given anchor points to guide their quality ratings. For content quality, raters were told that “1 = poor – worst – very unclear, not useful, inaccurate,” that “4 = average,” and that “7 = excellent – best – very clear, very useful, highly accurate.” For production quality, raters were told that “1 = poor – worst – amateurish looking and sounding,” that “4 = average,” and that “7 = excellent – best – professional looking and sounding.” For effective guidance of attention, raters were told that “1 = poor – worst – less than necessary/ineffective guidance,” that “4 = average – amount of guidance reasonable/somewhat effective guidance,” and that “7 = excellent – best – appropriate amount of guidance/very effective guidance.”

We calculated two composite quality ratings for each video, one based on scores awarded by the omnibus rater and one based on the scores awarded by the blind rater. We computed these scores by averaging the relevant rater’s standardized ratings of content quality, production quality, and effective guidance of attention. We use the blind rater’s composite quality ratings in all viewership analyses reported below to avoid the possibility, mentioned above, that the omnibus raters’ quality ratings may have been influenced by their coding of other video features.[5]

Inter-rater Reliability

For the purpose of calculating inter-rater reliability, we asked each of the three omnibus raters to provide a “second opinion” (a second complete set of ratings) for 20 of the videos in our sample. We used these “second opinion” ratings only for the purpose of calculating inter-rater reliability. We assessed reliability with a one-way random-effects, consistency, single-measure intraclass correlation coefficient (ICC; see Hallgren, 2012), generated using IBM’s SPSS software platform. Table 1 summarizes the ICCs for the coders’ ratings of attention cues, segmentation cues, content features,[6] vocal expressiveness, vocal disfluency, and quality. All were highly significant and indicated at least “fair” agreement (Cicchetti, 1994).[7]
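For readers unfamiliar with the statistic, the following sketch shows how a one-way random-effects, single-measure ICC can be computed from a table of ratings. It is an illustration only: the reported values were generated in SPSS, the function names and example data here are hypothetical, and the qualitative bands are the thresholds commonly attributed to Cicchetti (1994).

```python
def icc_one_way_single(ratings):
    """One-way random-effects, single-measure ICC.

    ratings is a list of rows, one row per video; each row holds the
    ratings that video received (the same number, k, for every video).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Mean square between videos and mean square within videos.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def agreement_label(icc):
    """Qualitative bands commonly attributed to Cicchetti (1994)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"

# Hypothetical example: two raters in perfect agreement on three videos.
perfect = [[1, 1], [2, 2], [3, 3]]
icc_perfect = icc_one_way_single(perfect)
```

In this formulation, the ICC approaches 1 when ratings differ far more between videos than between raters of the same video, and it falls toward zero or below when raters disagree as much as the videos themselves differ.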

In addition, with respect to the ratings of video quality, we evaluated inter-rater reliability between the omnibus rater and the blind rater for the 195 videos the blind rater coded. Our results are summarized in Table 2. The ICC for overall quality ratings was “good,” and ICCs among the three component quality ratings were all highly significant and indicated at least “fair” agreement (Cicchetti, 1994). We note these results because of their interesting implication: While there was potential for bias in omnibus raters’ quality ratings (given their access to views, likes, etc.), their ratings were actually fairly similar to those of the blind rater.


Frequency of Video Features

First, we report the prevalence of the formal features coded in our survey. Table 3 shows the percentage of screen-capture tutorials in our sample that used (at least once) the attention cues in our catalog. A substantial majority (88.5%) of videos used at least one attention cue. The most common attention cue was deictic gestures with the mouse cursor, which were used in 56.5% of videos surveyed. By comparison, software highlighting features such as cursor click highlighting and screen zooming were each used in approximately a quarter of videos.

Table 4 shows the percentage of screen-capture tutorials in our sample that used (at least once) the segmentation cues in our catalog. Segmentation cues were used less frequently than attention cues: Only 31.5% of the videos in our survey used at least one segmentation cue. Non-cut cues—for example, fadeouts between topics—were the most common type of segmentation cue.

We observed a few cues that were not included in our catalog. In three cases, animations were added to videos to direct spatial attention. Two of these were cartoon characters, and one was an animated arrow. Another video incorporated an unusual 3-D zooming effect. Unusual segmentation cues included a whooshing sound effect and a change in music to signal a new concept. In one video, a new section was introduced by blurring the screen in addition to using text to introduce a new concept.

Table 5 shows the percentage of screen-capture tutorials in our sample that contained the content features in our catalog. As expected, the narrator’s face was rarely shown in screen-capture videos. Most videos (69.5%) included vocal narration. Among the videos that did not include vocal narration, some communicated instructions via on-screen text, but most simply demonstrated program features and left it to viewers to infer steps from viewing the action. A majority of videos included some sort of introduction of the video’s content or its creator. Slightly over one-third of the screencasts sampled were set to music.

Formal Features as Predictors of Video Viewership

In addition to surveying formal features, we investigated the relation between the instructor’s use of various types of formal features and video viewership. Views have long been of critical importance for content creators, as views are needed to generate revenue from advertisements.[8] Viewership on YouTube is influenced by an array of factors outside the scope of our catalog of formal features, including things like video tags, search algorithms, and channel inclusion. Nevertheless, we expected that use of the formal features that we cataloged would predict[9] some portion of the variability in video viewership. We reasoned that this might occur either because formal features tend to make videos more effective, or because the content creators who leverage formal features in their videos tend to be more thoughtful in designing and structuring their videos in general.

To quantify video viewership, we calculated each video’s views per month available. (We did this to account for differences in total number of views related to video age.) The mean number of views per month was 960 (range 1.25 to 85,194), but the distribution was strongly right-skewed, with a median of 80 views per month. We normalized the distribution by log-transforming the monthly views data. All viewership analyses were performed on the log-transformed data.
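The viewership measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ actual processing: the function names are hypothetical, and the base-10 logarithm is an assumption, since the text does not state which base was used (a different base only rescales the values by a constant).

```python
import math


def views_per_month(total_views, months_available):
    """Total views divided by the number of months the video has been online."""
    if months_available <= 0:
        raise ValueError("months_available must be positive")
    return total_views / months_available


def log_views_per_month(total_views, months_available):
    """Log-transformed monthly views (base 10 is an assumption)."""
    monthly = views_per_month(total_views, months_available)
    if monthly <= 0:
        raise ValueError("log transform requires a positive views-per-month value")
    return math.log10(monthly)


if __name__ == "__main__":
    # Hypothetical video: 960 total views over 12 months online.
    print(views_per_month(960, 12))
    print(log_views_per_month(960, 12))
```

The log transform compresses the long right tail, so a handful of very popular videos no longer dominates the analysis.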

We hypothesized that viewership would be predicted by videos’ use of the features we cataloged: attention cues, segmentation cues, content features, and vocal performance. We recognized, however, that the hypothesized relations might be indirect, explained—at least in part—by relations between formal features and viewers’ perceptions of video quality. That is, it could be the case that the effective use of formal features tends to leave viewers with the impression that a video is of high quality, and that perceptions of high quality, in turn, predict viewership. Nevertheless, we posited that formal features might also predict viewership directly, separate and apart from their relation with perceived quality (for a discussion of indirect versus direct effects, see Loehlin, 2011; MacKinnon et al., 2007).

To test our hypotheses, we performed three sets of analyses. We performed three sets because, while 69.5% of videos in our sample featured vocal narration (and were therefore coded for vocal expressiveness and disfluencies), 30.5% did not (and were not). Thus, our first set of analyses includes the full sample of screen-capture tutorials, our second set examines the 139 screen-capture tutorials featuring vocal narration, and our third set examines the 61 videos without vocal narration.[10]

Our first two sets of analyses begin with path analyses. Path analyses are not frequently used in the technical communication literature, but they are useful tools for analyzing complex relations among variables. Specifically, path analysis builds upon multiple regression to allow researchers to probe whether relations between variables (for example, the relation between childhood abuse and aggression as an adult) are direct (e.g., increased experience with childhood abuse predicts increased aggression as an adult) or indirect (e.g., increased experience with childhood abuse predicts altered processing of social behaviors, which in turn predicts increased aggression as an adult; see MacKinnon et al., 2007). Here, we opted to perform path analyses because they allow us to evaluate direct relations between formal features and video viewership, while also evaluating potential indirect relations in which formal features predict perceived quality, which in turn predicts viewership. Our path analyses were conducted with IBM SPSS Amos.

We summarize the results of our path analyses by presenting path models. Path models graphically illustrate the direct and indirect relations among the relevant exogenous variables (i.e., predictor variables, which are not explained by other variables in the model) and endogenous variables (i.e., outcome, or predicted, variables, which are explained by other variables in the model). Path models capture these relations by using path coefficients (standardized values akin to beta weights in regressions[11]) to communicate the relative strength of relations, and arrows to display the hypothesized direction.[12] In Figures 1 and 2, these numerical coefficients are printed adjacent to arrows illustrating relations between variables. The coefficients reflect how strongly the variable at the arrow’s source predicts the variable at the arrow’s head, expressed in standard deviation units and holding the model’s other predictors constant. Because the coefficients are standardized, they typically fall between +1 and -1. The farther the coefficient is from 0, the stronger the relation; statistically significant coefficients are printed in black text in the figures. The sign on the coefficient represents the direction of the relation. Positive coefficients indicate that increases in one variable are associated with increases in the other, while negative coefficients indicate that increases in one variable are associated with decreases in the other. A useful introduction to path analysis is provided by Loehlin (2011).
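To make the distinction between direct and indirect relations concrete, the following Python sketch works through the simplest possible case: one predictor, one mediator, and one outcome. It is a simplified illustration, not the Amos analysis reported here (which includes more than one predictor); the function names and toy values are hypothetical. With standardized variables, the three path coefficients can be computed from the pairwise correlations, and the direct and indirect paths sum to the overall predictor-outcome correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def standardized_paths(predictor, mediator, outcome):
    """Standardized path coefficients for predictor -> mediator -> outcome.

    a        : predictor -> mediator
    b        : mediator -> outcome, controlling for the predictor
    direct   : predictor -> outcome, controlling for the mediator
    indirect : a * b
    """
    r_pm = pearson(predictor, mediator)
    r_po = pearson(predictor, outcome)
    r_mo = pearson(mediator, outcome)
    denom = 1 - r_pm ** 2  # zero only if predictor and mediator are perfectly correlated
    a = r_pm
    b = (r_mo - r_pm * r_po) / denom
    direct = (r_po - r_pm * r_mo) / denom
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": direct + a * b}


if __name__ == "__main__":
    # Hypothetical scores for six videos.
    cue_score = [0, 1, 1, 2, 3, 3]
    quality = [2, 3, 2, 4, 4, 5]
    log_views = [1.0, 1.5, 1.2, 2.0, 2.6, 2.4]
    for name, value in standardized_paths(cue_score, quality, log_views).items():
        print(name, round(value, 3))
```

In this three-variable case the "total" entry always equals the simple correlation between predictor and outcome, which is a convenient check that the decomposition was done correctly.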

The variables in our path analyses represent the composite variables for each feature category (as many of the individual features that we coded were too rare to be meaningfully incorporated into the analyses on their own). Because segmentation cues, as a category, were rare, they were combined with attention cues for the purposes of our path analyses. For any given video, the combined “Attention and Segmentation Cues” variable represents the average of the video’s composite attention cue score and its composite segmentation cue score, the “Content Features” variable represents the composite content features score, and (in our analyses of the subset of videos featuring vocal narration) the “Vocal Performance” variable represents the composite vocal performance score. As described above, Attention and Segmentation Cues, Content Features, and Vocal Performance were coded by omnibus raters. The “Quality” variable represents the composite quality rating from our blind rater, as the blind rater’s ratings were not susceptible to influence from video statistics such as views. Quality was positioned as a potentially mediating variable in all of our path models.
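As a rough illustration of how such scores can be assembled, the Python sketch below standardizes ratings as described in note 4 and then averages two composite scores into a combined cue score. The function names are hypothetical, and the use of the sample standard deviation in `standardize_all` is an assumption; the text does not say whether the sample or population formula was used.

```python
from statistics import mean, stdev


def standardize(rating, ratings_mean, ratings_sd):
    """Number of standard deviations a rating falls above or below the mean."""
    return (rating - ratings_mean) / ratings_sd


def standardize_all(ratings):
    """Standardize every rating of one feature (sample SD is an assumption)."""
    m = mean(ratings)
    s = stdev(ratings)
    return [standardize(r, m, s) for r in ratings]


def combined_cue_score(attention_composite, segmentation_composite):
    """Average of a video's composite attention and segmentation cue scores."""
    return (attention_composite + segmentation_composite) / 2


if __name__ == "__main__":
    # Worked example from note 4: rating 4, mean 2, standard deviation 1.5.
    print(round(standardize(4, 2, 1.5), 2))
    # Hypothetical composites for one video.
    print(combined_cue_score(1.0, -0.5))
```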

Analysis of the full sample

Our first path analysis examined the relations between formal features (Attention and Segmentation Cues; Content Features) and viewership, including direct relations and also indirect relations mediated by Quality. This analysis included the 195 videos for which we had Quality data (all but the five videos that our blind rater was unable to rate). The resulting, just-identified path model can be seen in Figure 1. The variables in our model explain 7.7% of the observed variability in video viewership (R² = .077 for Views/Month). In the model, significant relations between variables are represented by black lines, while non-significant relations are represented by gray lines.

As shown in Figure 1, Attention and Segmentation Cues and Quality were both directly and positively related to viewership (Attention and Segmentation Cues: β = 0.179, p = .012; Quality: β = 0.171, p = .017). Attention and Segmentation Cues only predicted viewership directly; they were not related to Quality (β = 0.071, p = .326). Content Features also predicted video viewership, but only indirectly via Quality.[13]

Having observed a direct link between Attention and Segmentation Cues and video viewership, we were interested in whether that direct link was more attributable to attention cues or to segmentation cues. To investigate, we conducted a multiple regression with viewership as our outcome variable. The predictor variables of interest were Attention Cues (the composite attention cues score) and Segmentation Cues (the composite segmentation cues score); we also controlled for Content Features and Quality. Our results, reported in Table 6, indicate that Segmentation Cues significantly predicted video viewership while Attention Cues did not.

Finally, we wanted to confirm that the patterns of relations among the variables in Figure 1 were not driven by the presentation styles or popularity of particular screencast authors. Although the majority of the screen-capture tutorials in our sample were created by different screencast authors, there were 26 authors who created more than one tutorial in the data set. To account for this, we trimmed our data set to allow for only one entry per author. For those who authored multiple tutorials in the sample, we created a single entry by averaging the coded variables across all of the authors’ tutorials. We then re-ran the path analysis reflected in Figure 1 using this trimmed data set. The pattern of significant and insignificant paths was the same as in Figure 1, with one exception: the direct link between Attention and Segmentation Cues and viewership, which was statistically significant in Figure 1, fell just short of significance with less power in the model based on the trimmed data set (β = .154, p = .069).
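The trimming step can be sketched as a simple group-and-average operation. The Python below is a hypothetical illustration: the field names (`author`, `cues`, `log_views`) and the choice of which variables to average are assumptions, not a record of the authors’ actual data handling.

```python
from collections import defaultdict
from statistics import mean


def one_entry_per_author(videos, variables):
    """Collapse videos to one entry per author by averaging each variable.

    videos    : list of dicts, each with an "author" key plus numeric variables
    variables : names of the numeric variables to average
    """
    by_author = defaultdict(list)
    for video in videos:
        by_author[video["author"]].append(video)

    trimmed = []
    for author, author_videos in by_author.items():
        entry = {"author": author}
        for var in variables:
            entry[var] = mean(v[var] for v in author_videos)
        trimmed.append(entry)
    return trimmed


if __name__ == "__main__":
    # Hypothetical data: author "a" has two tutorials, author "b" has one.
    sample = [
        {"author": "a", "cues": 1.0, "log_views": 2.0},
        {"author": "a", "cues": 3.0, "log_views": 4.0},
        {"author": "b", "cues": 0.5, "log_views": 1.0},
    ]
    for row in one_entry_per_author(sample, ["cues", "log_views"]):
        print(row)
```

Averaging keeps every author in the analysis while preventing prolific authors from contributing multiple correlated observations, at the cost of a smaller sample and therefore less statistical power.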

Analysis of videos with vocal narration

We conducted a second path analysis on the 139 screen-capture tutorials in our sample that featured vocal narration. This path analysis included Vocal Performance as an additional predictor.

The variables in this path model explain 14.2% of the observed variability in video viewership (R² = .142 for Views/Month). The pattern of statistically significant and insignificant paths illustrates that Vocal Performance is quite predictive of viewership of screen-capture tutorial videos. As shown in Figure 2, Vocal Performance is the only significant predictor of Quality (β = .390, p < .001) and of video viewership (β = .356, p < .001) in the model. Thus, it seems that among the videos with vocal narration, Vocal Performance is such a powerful predictor of perceived Quality and viewership that including it in the model eliminates the predictive value of Attention and Segmentation Cues and Content Features.

Once again, we re-ran this path analysis using the trimmed data set that included a maximum of one entry per author. The pattern of significant and insignificant paths was identical to the pattern reported in Figure 2.

Analyses of videos with no vocal narration

We expected that in videos with no narration, Attention and Segmentation Cues would be especially relevant to audience response. To test this hypothesis, we conducted a multiple regression predicting viewership for the videos without narration in our sample. (We conducted a multiple regression because the smaller subsample of videos without narration did not provide enough power for a meaningful path analysis.) Of the videos without narration, 59 were among those coded for quality by our blind rater.

Our multiple regression included Attention and Segmentation Cues, Content Features, and Quality as predictors of viewership. Overall, despite limited power, our model was nearly statistically significant, F(3, 55) = 2.593, p = .062. As reported in Table 7, Attention and Segmentation Cues was the only significant predictor in the model (β = .277, p = .046), with more use of attention and segmentation cues predicting more views.


In this article, we have presented a catalog of formal features that creators of screen-capture tutorial videos can use to facilitate communication to viewers. A survey of 200 screen-capture tutorial videos from YouTube revealed that screencast authors are using many of the formal features in our catalog, to varying degrees. The catalog we have developed can no doubt be revised and expanded in future work. There are other characteristics of screen-capture tutorials that can contribute to variability in viewership, including the topic itself, the creator or channel, physical characteristics (e.g., screen resolution), accessibility characteristics (e.g., captions or subtitles), and characteristics relating to the broader YouTube ecology (e.g., video tags). Further, the catalog will need to change as screen-capture software develops and provides authors with additional flexibility. However, we suggest that the categories of formal features in our catalog are likely to be stable and useful as screen-capture technology changes, in part because of their close relations to basic cognitive skills used when viewing screencasts.

Path analyses and regression analyses indicated that creators’ use of formal features in our catalog predicted video viewership. Certain types of features—vocal performance and segmentation cues—were especially predictive. When considering the relations between formal features and viewership, one interesting possibility is that some of the features available to screencast authors may be systematically underused. Segmentation cues may be an example. In the multiple regression reported in Table 6, segmentation cues had a bigger positive relation with viewership (β = .390, p = .013) than other categories of formal features. Yet, segmentation cues were used with the least frequency, appearing in only 31.5% of the videos rated. It is important to keep in mind that our analyses do not allow us to draw inferences about the direction of this relation between segmentation cues and views. It could be that segmentation cues themselves help videos draw an audience. Perhaps by reducing the viewers’ burden to mentally organize the events in the tutorial, segmentation cues allow viewers to more readily extract the critical information (see Zacks et al., 2010). Alternatively, perhaps creators who go through the effort to add segmentation cues to their videos simply tend to produce more thoughtful videos with better content. In any case, our findings tentatively suggest that screen-capture authors might benefit from using segmentation cues more frequently.

One caveat to our observation about segmentation cues is that our survey sample was deliberately restricted to brief (under 4-minute) videos. It is possible that longer videos make more use of segmentation cues, if only because they contain more content to segment. Nevertheless, even among the brief videos we surveyed, there were many lessons that were complex enough for segmentation to have been useful. It may be that, regardless of video length, authors who are familiar with the content they are teaching tend to underestimate the usefulness of segmentation cues to their audience of novices.

Our path analyses also suggest the importance of vocal performance in screencasts. Among screencasts with narration, vocal expressiveness and disfluency were such strong predictors of video viewership that they effectively washed out the predictive value of every other category of formal feature (see Figure 2). This was surprising. Vision is widely considered to be the most important sense humans use to obtain information (e.g., Hershberger, 1992; Moore, 1996). Yet, our results suggest that the auditory component of the screencast may be more important than the visual. One potential explanation relates to the fact that screencast videos typically involve step-by-step demonstrations of the performance of an operation. The sequence of the steps is important. Thus, the relationship between viewership and narrative expressiveness/fluency may reflect the role of linguistic working memory in sequence comprehension (see Magliano et al., 2016). Furthermore, content creators who are knowledgeable and enthusiastic about a topic may tend to produce both better vocal performances and better overall information, which is ultimately reflected in viewership.

While vocal performance appears to be a significant contributor to the popularity of screen-capture tutorials, there is hope for screencast authors who are less able (or less confident in their ability) to deliver fluent, expressive narration. Over 30% of the screencasts we surveyed had no narration, and many had received a lot of views. Indeed, in our sample, video viewership did not significantly differ between videos with vocal narration and videos without vocal narration. The videos without narration were viewed an average of 384 times per month, and four were viewed more than 1,000 times per month. The regression analyses we performed on the videos without vocal narration indicated that, in the absence of vocal narration, attention cues and segmentation cues may take on added importance. Thus, if a screencast author does not wish to provide voice-over, the author may want to pay careful attention to his or her use of visual cues to guide viewers’ attention and help them parse events.

Finally, it is noteworthy that, in our data set, the relation between content features and viewership was different from the relation between attention/segmentation cues and viewership. There was no direct relation between content features and viewership—only an indirect relation mediated by perceived quality. In contrast, attention/segmentation cues had a more direct relation with viewership, independent of their relation with perceived quality. This distinction should be interpreted with some caution, given that the mediating perceived quality variable is based on the ratings of a single blind rater. More generally, our single study cannot definitively specify all of the links in the path models we propose. Nevertheless, screencast creators should be aware of the possibility that using some formal features may directly relate to viewership, while other formal features may relate to viewership if they are executed effectively enough to contribute to perceived quality.


Our goal in this paper was to begin describing the formal features of screen-capture instructional videos. We have cataloged many of the formal features that screencast authors use to communicate with their viewers, and we offer an empirically grounded framework for thinking about these features according to function and modality. In addition, we identified relations between authors’ use of these features and video viewership. Our results suggest that certain features, like segmentation cues and vocal performance, are particularly closely related to viewership. It is our hope that our findings provide additional scaffolding for future research on screencasts—research that might help content creators leverage formal features to communicate more effectively with viewers.


Bailey, H. R., Zacks, J. M., Hambrick, D. Z., Zacks, R. T., Head, D., Kurby, C. A., & Sargent, J. Q. (2013). Medial temporal lobe volume predicts elders’ everyday memory. Psychological Science, 24(7), 1113–1122.

Baker, L. J., & Levin, D. T. (2015). The role of relational triggers in event perception. Cognition, 136, 14–29.

Baker, L. J., Levin, D. T., & Saylor, M. M. (2016). The extent of default visual perspective taking in complex layouts. Journal of Experimental Psychology: Human Perception and Performance, 42(4), 508–516.

Borghol, Y., Ardon, S., Carlsson, N., Eager, D., & Mahanti, A. (2012). The untold story of the clones: Content-agnostic factors that impact YouTube video popularity. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1186–1194.

Borghol, Y., Mitra, S., Ardon, S., Carlsson, N., Eager, D., & Mahanti, A. (2011). Characterizing and modeling popularity of user-generated videos. Performance Evaluation, 68(11), 1037–1055.

Butterworth, G., & Jarrett, N. (1991). What minds have in common is space: Spatial mechanisms serving joint visual attention in infancy. British Journal of Developmental Psychology, 9(1), 55–72.

Chan, Y. M., Choo, K. A., & Woods, P. C. (2013). YouTube videos for learning principles of animation. In Proceedings of the 2013 International Conference on Informatics and Creative Multimedia, 43–46.

Chong, F. (2018). YouTube beauty tutorials as technical communication. Technical Communication, 65(3), 293–308.

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.

Cross, A., Bayyapunedi, M., Cutrell, E., Agarwal, A., & Thies, W. (2013, April). TypeRighting: Combining the benefits of handwriting and typeface in online educational videos. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 793–796.

Getto, G., Cushman, E., & Ghosh, S. (2011). Community mediation: Writing in communities and enabling connections through new media. Computers and Composition, 28(2), 160–174.

Giannakos, M. N. (2013). Exploring the video-based learning research: A review of the literature. British Journal of Educational Technology, 44(6), 191–195.

Guo, P. J., Kim, J., & Rubin, R. (2014). How video production affects student engagement: An empirical study of MOOC videos. In Proceedings of the First ACM Conference on Learning at Scale, 41–50.

Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640.

Henze, B. (2013). What do technical communicators need to know about genre? In J. Johnson-Eilola & S. Selber (Eds.), Solving problems in technical communication (pp. 337–361). University of Chicago Press.

Hershberger, P. J. (1992). Information loss: The primary psychological trauma of the loss of vision. Perceptual and Motor Skills, 74(2), 509–510.

Ilioudi, C., Giannakos, M. N., & Chorianopoulos, K. (2013). Investigating differences among the commonly used video lecture styles. In CEUR Workshop Proceedings, 983, 21–26.

Kurby, C. A., & Zacks, J. M. (2011). Age differences in the perception of hierarchical structure in events. Memory & Cognition, 39(1), 75–91.

Levin, D. T., & Baker, L. J. (2017). Bridging views in cinema: A review of the art and science of view integration. WIREs Cognitive Science, 8(5). doi: 10.1002/wcs.1436

Levin, D. T., & Varakin, D. A. (2004). No pause for a brief disruption: Failures of visual awareness during ongoing events. Consciousness and Cognition, 13(2), 363–372.

Loehlin, J. C. (2011). Latent variable models: An introduction to factor, path, and structural analysis (5th ed.). Routledge.

MacKinnon, D. P., Fairchild, A. J., & Fritz, M. S. (2007). Mediation analysis. Annual Review of Psychology, 58, 593–614.

Magliano, J. P., Larson, A. M., Higgs, K., & Loschky, L. C. (2016). The relative roles of visuospatial and linguistic working memory systems in generating inferences during visual narrative comprehension. Memory & Cognition, 44(2), 207–219.

Mayer, R. E. (2002). Multimedia learning. In Psychology of learning and motivation (Vol. 41, pp. 85–139). Academic Press.

Miller, C. R. (1984). Genre as social action. Quarterly Journal of Speech, 70(2), 151–167.

Mohamad Ali, A. Z., Samsudin, K., Hassan, M., & Sidek, S. F. (2011). Does screencast teaching software application need narration for effective learning? The Turkish Online Journal of Educational Technology, 10, 76–82.

Moll, H., & Tomasello, M. (2004). 12- and 18-month-old infants follow gaze to spaces behind barriers. Developmental Science, 7(1), F1–F9.

Moore, J. (1996). The visual system and engagement in occupation. Journal of Occupational Science: Australia, 3(1), 16–17.

Morain, M., & Swarts, J. (2012). YouTutorial: A framework for assessing instructional online video. Technical Communication Quarterly, 21(1), 6–24.

Pflugfelder, E. H. (2013). The minimalist approach to online instructional videos. Technical Communication, 60, 131–146.

Preacher, K. J., & Kelley, K. (2011). Effect size measures for mediation models: Quantitative strategies for communicating indirect effects. Psychological Methods, 16(2), 93–115.

Ritzhaupt, A. D., Gomes, N. D., & Barron, A. E. (2008). The effects of time-compressed audio and verbal redundancy on learner performance and satisfaction. Computers in Human Behavior, 24, 2434–2445.

Ritzhaupt, A. D., Pastore, R., & Davis, R. (2015). Effects of captions and time-compressed video on learner performance and satisfaction. Computers in Human Behavior, 45, 222–227.

Samson, D., Apperly, I. A., Braithwaite, J. J., Andrews, B. J., & Bodley Scott, S. E. (2010). Seeing it their way: Evidence for rapid and involuntary computation of what other people see. Journal of Experimental Psychology: Human Perception and Performance, 36(5), 1255–1266.

Sargent, J. Q., Zacks, J. M., Hambrick, D. Z., Zacks, R. T., Kurby, C. A., Bailey, H. R., Eisenberg, M. L., & Beck, T. M. (2013). Event segmentation ability uniquely predicts event memory. Cognition, 129(2), 241–255.

Selber, S. A. (2010). A rhetoric of electronic instruction sets. Technical Communication Quarterly, 19(2), 95–117.

Slide, A. (2012). The Encyclopedia of Vaudeville. University Press of Mississippi.

Smith, T. J., Levin, D., & Cutting, J. E. (2012). A window on reality: Perceiving edited moving images. Current Directions in Psychological Science, 21(2), 107–113.

Sugar, W., Brown, A., & Luterbach, K. (2010). Examining the anatomy of a screencast: Uncovering common elements and instructional strategies. The International Review of Research in Open and Distributed Learning, 11(3), 1–20.

Swallow, K. M., Zacks, J. M., & Abrams, R. A. (2009). Event boundaries in perception affect memory encoding and updating. Journal of Experimental Psychology: General, 138(2), 236–257.

Swarts, J. (2012). New modes of help: Best practices for instructional video. Technical Communication, 59(3), 195–206.

TechSmith (2018). About us. https://www.techsmith.com/about.html

ten Hove, P., & van der Meij, H. (2015). Like it or not. What characterizes YouTube’s more popular instructional videos? Technical Communication, 62(1), 48–62.

Tewell, E. (2010). Video tutorials in academic art libraries: A content analysis and review. Art Documentation: Journal of the Art Libraries Society of North America, 29, 53–61.

Udell, J. (2004, November 15). Name that genre [Blog Post]. http://jonudell.net/udell/2004-11-15-name-that-genre.html

Udell, J. (2004, November 17). Name that genre: screencast [Blog Post]. http://jonudell.net/udell/2004-11-17-name-that-genre-screencast.html

Williams, J. M., & Goodwin, S. P. (2007). Teaching with technology: An academic librarian’s guide. Chandos Publishing.

Yates, J., Orlikowski, W. J., & Rennecker, J. (1997, January). Collaborative genres for collaboration: Genre systems in digital media. In Proceedings of the Thirtieth Hawaii International Conference on System Sciences (Vol. 6, pp. 50–59). IEEE.

Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., & Reynolds, J. R. (2007). Event perception: A mind-brain perspective. Psychological Bulletin, 133(2), 273–293.

Zacks, J. M., Speer, N. K., Swallow, K. M., & Maley, C. J. (2010). The brain’s cutting-room floor: Segmentation of narrative cinema. Frontiers in Human Neuroscience, 4, 1–15.


Christopher Brett Jaeger is an Acting Assistant Professor at New York University’s School of Law. He earned his JD from Vanderbilt Law School and his PhD in Psychology from Vanderbilt University. His research explores, among other areas, the role that knowledge, beliefs, and expectations play in shaping cognitive processes ranging from perception to decision-making. Contact: christopher.b.jaeger@nyu.edu

Joshua Little, originally from Charlotte, Tennessee, is a recent graduate of Florida State University, with an MFA in Film Production. He received a BA from Vanderbilt University, majoring in Psychology and Cinema & Media Arts. He is now pursuing a career in the film industry as a visual effects artist based in Los Angeles. Contact: joshua.w.little@vanderbilt.edu

Daniel Levin is a Professor of Psychology at Vanderbilt University. He received his BA from Reed College, and his PhD at Cornell University. His research explores the role of knowledge in visual perception, especially in naturalistic settings. Contact: daniel.t.levin@vanderbilt.edu

The research in this report was supported by NSF grant 1623625 to DTL.


[1] Specifically, we used a random number generator to generate numbers between 1 and 20, and viewed the video in the position corresponding to the randomly-generated number.

[2] We considered a video relevant if it was a screen-capture tutorial containing instruction about how to perform at least one task in, or use at least one feature of, the application.

[3] Two omnibus raters coded 70 screen-capture tutorials and the third coded 60 screen-capture tutorials.

[4] A standardized rating represents the number of standard deviations above or below the mean that a particular rating falls. To standardize any particular rating X for formal feature Y, we subtracted the mean of all ratings of formal feature Y, then divided the difference by the standard deviation of all ratings of formal feature Y. For example, assume that omnibus rater A rated Video 1 a “4” for deictic mouse gestures. Further assume that the mean of all ratings for deictic mouse gestures across all videos was 2, with a standard deviation of 1.5. We would standardize omnibus rater A’s rating of Video 1 as follows: (4 – 2) / 1.5 = 1.33.

[5] We note, however, that the pattern of results in our viewership analyses is substantially similar if the omnibus raters’ quality ratings are used instead of the blind rater’s quality ratings.

[6] Because many of the individual features that we coded in the videos were rare, we used overall category-level scores for the first three categories of features in our assessment (attention cues, segmentation cues, and content features).

[7] “The guidelines state that, when the reliability coefficient is below 0.40, the level of clinical significance is poor; when it is between .40 and .59, the level of clinical significance is fair; when it is between .60 and .74, the level of clinical significance is good; and when it is between .75 and 1.00, the level of clinical significance is excellent” (Cicchetti, 1994, p. 286).

[8] The number of times a video is viewed may be a useful proxy for perceived effectiveness, as viewers tend to recommend and share links to videos that they find useful.

[9] When we discuss formal features “predicting” views in the context of our statistical analyses, we are not suggesting that use of formal features causes increased viewership. Indeed, our analyses are built on correlations and do not allow us to make any definitive claims about the direction of causal relations among variables. When we say that one variable “predicts” another, we mean that a change in one variable (referred to as a “predictor” variable in regression and path analysis) is associated with a change in another variable (referred to as an “outcome” variable).

[10] On a descriptive note, the 139 videos with vocal narration were viewed an average of 1,214 times per month (with a standard deviation of 7,651 views per month). Videos without vocal narration were viewed an average of 384 times per month (with a standard deviation of 909 views per month). This difference in views between videos with vocal narration and videos without vocal narration was not statistically significant, t(198) = 0.843, p = .40.

[11] Some path models use unstandardized path coefficients, though these are less common than path models using standardized path coefficients (see Loehlin, 2011). All of the path models in this article show standardized path coefficients.

[12] Double-headed arrows reflect correlations with no direction hypothesized.

[13] To test for a significant indirect relation between Content Features and viewership (via Quality), we ran a mediation analysis using Hayes’ PROCESS macro for IBM SPSS. Our analysis indicated a significant indirect relation (completely standardized indirect effect: β = .0378, SE = .0201, BCa CI [.0084, .0925], p < .05). For discussion of completely standardized indirect effects and their interpretation, see Preacher & Kelley, 2011. The relation between Content Features and viewership in our data was only indirect. As shown in Figure 1, when Quality is included in the model as a mediator, there is no direct relation between Content Features and viewership (β = .033, p = .653).