62.1, February 2015

Like It or Not. What Characterizes YouTube’s More Popular Instructional Videos?

Petra ten Hove and Hans van der Meij


Purpose: There is a tremendous growth in the production of instructional videos. This study investigates whether popular YouTube instructional videos for declarative knowledge development differ in their physical characteristics from unpopular and average ones.

Method: Sampling followed a three-step procedure. First, 250 YouTube videos aiming for declarative knowledge development were selected. Next, a formula for popularity rating was developed. After distinguishing three classes of popularity, the five most viewed videos for five types of declarative knowledge were selected. This resulted in a sample of 75 videos. After coding and scoring, statistical analyses were performed to discover differences between popularity classes in the physical characteristics of videos.

Results: Popular videos differed significantly from unpopular and/or average videos in the following ways: (1) higher production quality (that is, resolution); (2) more frequent presence of static pictures (both iconic and analytic); (3) more frequent presence of a combination of static and dynamic pictures; (4) more often short on-screen texts; (5) more often subtitling with different languages; (6) more frequent inclusion of background music; (7) less background noise; (8) faster speaking rate (that is, words per minute).

Conclusion: The sampled videos strongly varied in their physical characteristics. There were also many significant differences across popularity classes. The findings can be used to optimize video designs for popularity. In addition, they provide a starting point for further research on how physical characteristics may affect knowledge development.

Keywords: instructional video, design characteristics, declarative knowledge, YouTube, popularity ratings

Practitioner’s Takeaway

  • A formula with both viewer appraisals and viewing rates is introduced for gauging YouTube popularity.
  • Seventy-five videos on declarative knowledge development varied hugely in the physical characteristics of resolution, visuals, verbal & sound, and tempo.
  • Popular YouTube videos differed from average and unpopular videos on most of these physical characteristics.
  • Designers may find the frequency findings and their discussion helpful for constructing (more) popular videos.


This paper reports on a study of a particular type of YouTube video, namely instructional videos. More precisely, we closely examine the physical characteristics of YouTube videos that aim to support declarative knowledge development. Declarative knowledge includes factual and conceptual information. It is also characterized as “knowing that” and contrasted with procedural knowledge that refers to “knowing how” (Smith & Ragan, 2005).

YouTube provides statistics without distinguishing between video types. Even so, a conservative estimate of the production rate of instructional videos for declarative knowledge development can be made. On YouTube, 100 hours of video are uploaded every hour (“Statistics,” 2015). Assuming that this is at least a stable production rate (it is likelier to go up than down), and that 10 percent of these videos address declarative knowledge development, this amounts to a production rate of 1,680 new videos each week and 87,360 each year. Because YouTube started in 2005, there is thus little doubt that a vast number of instructional videos can be found on its website.

Videos in general vary considerably in how well they are appreciated and how often they are viewed. Instructional videos are probably no different in this respect. This made us wonder whether there are certain characteristics that make some instructional videos more popular than others. To investigate this issue, we decided to conduct a systematic analysis, concentrating on the physical characteristics of video. This paper describes our approach to investigating these characteristics and their possible relationships with popularity. Our primary purpose was that of mapping the field. In addition, we believed that the findings might help designers in creating instructional videos that reach a large audience. Furthermore, the outcomes could give a glimpse of video characteristics that contribute to (stronger) knowledge development.

First, we address the question of how to gauge the popularity of YouTube videos. Then follows a discussion of the physical characteristics of the videos that we studied. A substantial part of the method section addresses the sampling of the 75 videos that were analyzed in detail. In the results, we define, illustrate, and discuss each physical characteristic. The conclusion discusses what may be driving video popularity. In addition, we address some limitations of the study.

Popularity Ratings

YouTube gathers numerous statistics about the videos that are uploaded to its Web site. From the data that are publicly available, two quickly come to mind for gauging popularity: viewer appreciation and viewing rates. Both factors were included in the formula that we set out to create for obtaining popularity ratings.

The obvious first choice for a popularity rating is viewer appreciation. When we conducted our study, YouTube had changed its original five-star viewer rating into the Like or Dislike dichotomy. A like, displayed as a thumbs-up icon, means that the viewer positively valued the video. A dislike, displayed as a thumbs-down icon, means the opposite. We included both variables in our formula.

Another factor that we included was viewing rates. From the data that YouTube provides on usage we took the variables Views and Shares. The View statistic represents the number of views a video has accumulated over its life span. The Share statistic stands for the number of times a video has been made available to others. There are at least three ways that YouTube counts a “share”: (1) a Web address is mailed to others, (2) a video is embedded in another Web page, and (3) a video or link is sent to other parties via Facebook, Twitter, Blogger, LinkedIn, and other social media.

We found mind-boggling statistics for both viewer appreciation and viewing rates. Therefore, the basic frequency data for the variables in the formula (that is, Likes, Dislikes, Shares, and Views) were classified into five ordinal categories. With these values as input, the formula allowed us to distinguish between unpopular, average, and popular videos (see Method).

Physical Characteristics

Two recent attempts to classify instructional videos on their physical characteristics are directly relevant for the present study. One is the review of Ploetzner and Lowe (2012) on expository animations. The other investigation is the research of Swarts (2012), and Morain and Swarts (2012) on “how to” videos for software training.

Ploetzner and Lowe (2012) decided to conduct their inventory study of expository animations used in educational research because “there is still no systematic account of the main characteristics” (p. 781). An important characteristic of these animations is their instructional purpose. The analyzed studies predominantly used animations for developing declarative knowledge. The authors distinguished six main dimensions in how the subject matter was presented to the user: representations employed, abstraction, explanatory focus, viewer perspective, spatio-temporal arrangement, and duration. Representations employed stands for how information is conveyed to the user. The basic distinction is that between a visual and an auditory mode. Abstraction refers to the level of concreteness of the images. Iconic pictures are distinguished from analytic ones. Explanatory focus refers to the kind of information that is represented. A distinction is made between behavioral, structural and function-oriented expositions. With this dimension, the authors qualify the kind of knowledge that a video aims for. Viewer perspective concerns the issue of whether an animation consistently retained a single perspective on the subject matter, or whether there were multiple views. The dimension spatio-temporal arrangement covers a broad range of features related to spatial organization and timing. Among others, the dimensionality of pictures (for example, 2-D or 3-D) and the handling of pauses and chronology fall within this dimension. Also subsumed are aspects of production quality, such as the resolution of an animation. Duration is simply the length of the animation.

Swarts (2012), and Morain and Swarts (2012) analyzed YouTube videos for procedural knowledge development. More specifically, they concentrated on “how to” videos for software training. According to these authors, the rubric physical design encompasses three facets, namely accessibility, viewability and timing.

Accessibility refers to features that provide navigational support within the video. In other words, this facet deals with features that direct viewers to pertinent screen areas. Examples are croppings, headings, voice-overs, and zooms and pans. Viewability refers to the issue of production quality (that is, audio, video, and text). Features that belong to this facet are the presence or absence of imperfect recordings of sound and image, and resolution. Timing is characterized as the issue of pacing. Features are speed of the video, pace of the narration, and well-synchronized audio and video tracks.

The aforementioned studies investigated a fairly extensive set of physical characteristics. We decided to look for a framework with a more limited set of features that still represented the most critical basic characteristics. Our first step therefore involved pruning.

The dimension explanatory focus from Ploetzner and Lowe (2012) received a different place because we wanted to qualify the goal of the video before looking at physical characteristics. This dimension is now presented in the method section, where we discuss different types of declarative knowledge development supported by video. Duration was also considered part of the descriptive data of each video. Finally, we decided to exclude the dimension of viewer perspective because such perspective changes were rarely found by the authors (and we also hardly found any).

In our view, the categories discussed by Swarts (2012) mainly represented more advanced physical characteristics of video (such as zooming and cropping) rather than the basic ones that we were interested in. In addition, most of the features that Swarts discussed hinged on interpretation, while we aimed for features that could be objectively assessed. For instance, his interest in the voice-over primarily stemmed from the functional role this feature could play in directing the viewer’s attention to pertinent screen information.

With these considerations in mind, we decided to investigate the following physical categories: resolution, visuals, verbal & sound, and tempo (see Table 1). Below we briefly describe each category and we point out the relationships with the aforementioned frameworks. Detailed descriptions and illustrations of our own framework are presented in the results section.


Resolution is a critical feature for assessing the production quality of a video. Swarts (2012) briefly mentions video resolution. Ploetzner and Lowe (2012) distinguish between two types of resolution: temporal and spatial. Temporal resolution denotes the number of picture elements per unit of time, such as the number of frames per second. Among other things, this resolution is important for perceiving animations as continuous. Spatial resolution denotes the number of picture elements per unit of length, such as dots per inch. It is important for perceiving pictures as sharp and for distinguishing details. YouTube videos give no information about temporal resolution, but they do provide the data for spatial resolution. In this study we therefore concentrated on this aspect of resolution.

The category visuals stands for the pictorial representations used in the video. Within this category a further distinction is made between static and dynamic displays, and between real or realistic and abstract representations. This main category, and the distinction between static and dynamic displays, is also present in Ploetzner and Lowe’s (2012) dimension of “representations employed.” In their classification abstractness is a separate dimension. Since it referred only to pictures, we grouped it under visuals.

The category verbal & sound stands for the written text and auditory mode in which the videos convey information. We have chosen to include all written text under this category because the information primarily comes from the verbal representation. Ploetzner and Lowe (2012) classify written texts as visuals. The inclusion of subtitles makes our classification for written text more extensive than that of Ploetzner and Lowe. The difference probably stems from the fact that their animations were (made) suitable only for a dedicated audience, whereas YouTube videos are intended for a large and worldwide audience that can be reached only if the videos include subtitles and language translations. Ploetzner and Lowe (2012) distinguish three types of audio, namely sound (that is, nonverbal information), speech (that is, verbalizations that were part of the animation), and narration (that is, verbalizations about the animation). Our framework simply refers to the latter two as narration.

Tempo is a vital aspect of the temporal characteristics of video. The feature that we extracted from Ploetzner and Lowe’s (2012) discussion on this matter is pauses. Pauses are important signs of event boundaries. In addition, they can help viewers better process the video by giving them some time to let the information sink in. Only Swarts (2012) specifically addresses narrative speed, arguing that pace should be neither too quick nor too slow. Our own framework includes both pauses and narrative speed as features of tempo.

We believe that the four categories of resolution, visuals, verbal & sound, and tempo exemplify a representative and meaningful choice from the physical constituents of video. Another reason for choosing precisely these characteristics was that the analyses could be done objectively. Coding required little interpretation. Analyzing YouTube videos that vary in popularity on these physical characteristics is seen as a first step toward better understanding what makes some videos more popular than others. Because of the exploratory nature of our study, no a priori hypotheses were formulated.


Data Sampling

Sampling the YouTube videos was done in three steps. First, a large database was formed with instructional videos for declarative knowledge development. The leading selection principle was that the videos should have an instructional aim and address factual and conceptual knowledge rather than procedures or attitudes. Second, a formula was created to obtain a popularity rating for each video. The formula combines viewer appreciation and viewing rates. It yields a score that is a combination of the number of likes, dislikes, views and shares. Third, a classification rubric was constructed for grouping each video. This rubric distinguished between five cognition types and three popularity classes. After thus classifying each video from our initial sample of 250, the five videos with the largest number of views in each cell were selected for detailed analyses, giving a total of 75 videos. The steps are detailed below.

Step 1: Selection of Instructional Videos on Factual and Conceptual Knowledge. Swarts (2012) recently conducted an analysis of YouTube videos for procedural knowledge development, or “knowing how.” Our focus was on a different set, namely YouTube videos for factual and conceptual knowledge development. An important characteristic of these videos is their emphasis on explanation. These videos primarily address “knowing that.”

For classifying the videos on the type of knowledge that was presented, we took the widely accepted, adapted version of Bloom’s taxonomy (Anderson et al., 2001) as our starting point. The taxonomy makes a fundamental distinction between factual and conceptual knowledge. Factual information concerns isolated bits of information. These bits are the basic elements that people need to understand things. Conceptual information is more complex. It revolves around relationships between elements and the larger structure that enables elements to function together. Within each class, a further subdivision is possible.

For factual information, a distinction can be made between terms and facts. The first includes discussions about domain-specific terms (for example, an explanation of the terms row, field, data value in a video on database terms). The latter refers to an enumeration of facts or findings (for example, historic dates in a presentation on World War 1).

For conceptual information, a distinction can be made between concepts (classifications and categorizations), principles (principles and generalizations), and models (models, theories and structures). Videos about concepts may present different learning styles or music styles. Videos on principles may discuss how inflation comes about, or how gravity works. Videos about models may discuss evolution theory, or explain Maslow’s theory of need.

Two broad classes of instructional videos were excluded from sampling: lectures and documentaries. Both have unique characteristics that set these videos apart from other instructional videos (see Guo, Kim, & Rubin, 2014).

In addition to the main aspect of content, our first selection of videos was based on four criteria. One was language. Only videos in Dutch or English were included because the researchers were fluent in only these languages. Another criterion concerned video length. A maximum of 30 minutes was adopted, mainly for practical reasons. As we will discuss later, video length is an important factor in viewing rates and presumably also for viewer appreciation. The 30-minute criterion safeguards against low scores on both factors due to video length alone. A third criterion was intended audience. We decided to focus on videos for an audience of young adults of 12 years and older. When words like “for kids” and “children” appeared in the title, the video was excluded from sampling. The fourth criterion concerned baseline measures. Only videos that had been online for one month or longer and had received at least 1000 views and 25 ratings (likes and/or dislikes) were selected. This criterion was chosen to enhance the validity of the sample. For the same reason we included a maximum of five videos per subject matter, and per channel or uploading account.

Our search began with putting the browser setting on incognito mode. This setting avoids obtaining personalized results depending on cookies, browsing history, networks and the like. Next, we opened the YouTube Web site and typed keywords in the search window. The taxonomy helped us in formulating keywords that would yield instructional videos. That is, we searched for videos that included words like “explanation,” “understanding,” “why,” “terms,” “principle of,” “structure,” “categories,” “models,” and “theories.”

The search results were then screened on their title and number of views, followed by a quick perusal of the content of the videos. When this initial inspection did not lead to a clear decision on inclusion or exclusion, the video was viewed in more detail. For all five cognition types, we assembled the same number of videos, until a total of 250 videos was initially sampled. These videos were downloaded with the program ‘Free YouTube Download v. 3.2.29. build 303’ and saved as mp4-files. Each video also received a unique ID that was kept along with pertinent YouTube statistics on channel, upload date, duration, view count, and viewer rating.

Step 2: Assessment of Video Popularity. YouTube collects and publicly posts data that can be used to assess video popularity. One of these measures is viewer ratings, of course. When we started our study, YouTube had already changed its original five-star rating into a system in which the viewer can express appreciation by selecting a thumbs-up (Like) or thumbs-down (Dislike) icon.

Another measure that can be used to gauge popularity is viewing rates. The number of times a video has been seen is another signal of its popularity. As indicated earlier, the sampling procedure excluded videos with fewer than 1000 views. This criterion was chosen to safeguard against low viewing rates. In addition, it prevented the inclusion of videos with the problematic score of 301 views (see “Here’s why the view count,” 2014). The third measure of popularity that we included in our formula was the number of times shared. Sharing a video is a sign of appreciation. However, the statistic does not reveal whether this is done because the video is liked or disliked.

The following formula for assessing a popularity rating was constructed: PR = (2Lr + V + S)/4. PR stands for popularity rating. The variable Lr in the formula represents a Like-ratio. For likes (L) and dislikes (D) we computed this Like-ratio with the formula Lr = (L/(L + 2D)) * 100. Dislikes were counted twice in the Like-ratio because we assumed that such ratings are less common, if only because viewers may shy away (‘not worth the time’) from videos that receive a considerable number of dislikes. The variables V and S in the formula stand for, respectively, the number of views and the number of times shared. By including both appraisals and usage data, the formula should give a more robust assessment of popularity than when only one of these measures is included.

The variables in the formula (that is, Lr, V and S) showed tremendous variations in their frequencies. This prompted us to create an ordinal scale with five categories (1 to 5) for each variable. For instance, we coupled the frequency data for views to a category coding in the following way: 1,000 – 10,000 views = category 1; 10,001 – 100,000 views = category 2; 100,001 – 1,000,000 views = category 3; 1,000,001 – 10,000,000 views = category 4; > 10,000,000 views = category 5. In addition to clustering the diverse raw scores, the ordinal scaling also served the purpose of giving an equal weight to user appraisals and viewing rates in the formula. The resulting popularity rating (PR) was a score between 1 and 5 for each video. When the PR-score of a video fell in the range of 1 – 2.3, 2.4 – 3.6, or 3.7 – 5, it was classified as unpopular, average, or popular, respectively.
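The popularity rating can be illustrated with a minimal Python sketch. It is not the authors' implementation. The view-count boundaries follow the text, with every count above 10,000,000 treated as category 5 (an assumption about the top boundary). The category cut-offs for the Like-ratio and for Shares are not reported, so the sketch takes those two ordinal categories as given inputs. All function names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of view categories 1-4 as given in the text.
# Assumption: any count above the last bound falls in category 5.
VIEW_UPPER_BOUNDS = [10_000, 100_000, 1_000_000, 10_000_000]


def views_category(views: int) -> int:
    """Map a raw view count (at least 1,000 in the sample) to category 1-5."""
    return bisect_left(VIEW_UPPER_BOUNDS, views) + 1


def like_ratio(likes: int, dislikes: int) -> float:
    """Like-ratio Lr = L / (L + 2D) * 100; dislikes are counted twice."""
    return 100 * likes / (likes + 2 * dislikes)


def popularity_rating(lr_cat: int, v_cat: int, s_cat: int) -> float:
    """PR = (2Lr + V + S) / 4, computed on the ordinal categories (1-5)."""
    return (2 * lr_cat + v_cat + s_cat) / 4


def popularity_class(pr: float) -> str:
    """Classify a PR-score using the ranges 1-2.3, 2.4-3.6, and 3.7-5."""
    # PR-scores are multiples of 0.25, so no score falls between the ranges.
    if pr < 2.4:
        return "unpopular"
    if pr < 3.7:
        return "average"
    return "popular"


# Hypothetical video: Like-ratio category 5, 2,500,000 views, Shares category 3.
pr = popularity_rating(lr_cat=5, v_cat=views_category(2_500_000), s_cat=3)
print(pr, popularity_class(pr))
```

Because the Like-ratio category is weighted twice and then averaged with the two usage categories, appraisals and usage each contribute half of the final score, which matches the equal-weight intention described above.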

Although the formula uses a ratio to compute popularity ratings, older videos on YouTube might still score higher because they have had more time to accumulate views. To check this possibility, we computed the correlation between days on YouTube (based on the upload date) and our popularity rating. This yielded a significant but negative correlation, r (N=75) = – 0.32, p = 0.002. The sampled unpopular and average videos had been posted on YouTube for a mean of 1317 days (s.d. 733) and 1355 days (s.d. 705), respectively. Time online did not differ for these two popularity classes. In contrast, popular videos had been posted for a mean period of 752 days (s.d. 721), which differed significantly from the unpopular ones, F (1,49) = 7.54, p = 0.008, as well as the average ones, F (1,49) = 8.91, p = 0.004. This finding therefore indicates that the popularity rating does not favor videos with a longer YouTube presence.

After classifying each video as primarily addressing one declarative knowledge type and determining its popularity class, all videos were organized accordingly. That is, each of the 250 videos was placed in the proper cell of the 5 × 3 (knowledge type × popularity class) matrix. This led to a minimum of 15 videos in each cell.
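The cell assignment, together with the per-cell selection by view count outlined in the sampling overview, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`knowledge_type`, `popularity_class`, `views`) and the function name are hypothetical, and the sample records are invented for the example.

```python
from collections import defaultdict


def select_top_viewed(videos, per_cell=5):
    """Group videos by (knowledge type, popularity class) and keep the
    per_cell most viewed videos in each cell."""
    cells = defaultdict(list)
    for video in videos:
        cells[(video["knowledge_type"], video["popularity_class"])].append(video)
    return {
        cell: sorted(members, key=lambda v: v["views"], reverse=True)[:per_cell]
        for cell, members in cells.items()
    }


# Invented records, for illustration only.
sample = [
    {"id": "a", "knowledge_type": "terms", "popularity_class": "popular", "views": 500_000},
    {"id": "b", "knowledge_type": "terms", "popularity_class": "popular", "views": 2_000_000},
    {"id": "c", "knowledge_type": "facts", "popularity_class": "average", "views": 40_000},
]
selected = select_top_viewed(sample, per_cell=1)
print({cell: [v["id"] for v in vids] for cell, vids in selected.items()})
```

With five knowledge types, three popularity classes, and `per_cell=5`, a fully populated matrix yields 5 × 3 × 5 = 75 videos.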

Step 3: Selection of the Most Viewed Videos. From each cell of the 5 × 3 matrix, the five videos with the largest number of views were selected for inclusion. This resulted in a total of 75 videos. Inter-rater agreement on the classifications for knowledge type was computed by comparing the scores of the two researchers on all videos. Cohen’s (weighted) Kappa was computed to assess reliability. A Kappa score above 0.61 is generally considered a sign of satisfactory reliability. An outcome of κ = 0.68 was found for the basic distinction between factual and conceptual information. This indicates that the basic classification could reliably be made. Within subclasses, coding was unreliable, however. For terms and facts the outcome was κ = 0.45. For concepts, principles and models the score was κ = 0.28. Here the distinction between principles and models was especially problematic. When these were grouped together, a Kappa score of 0.66 was obtained. In view of these findings, we report only the outcomes for the basic distinction between factual and conceptual information.
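For readers unfamiliar with the agreement statistic, the following Python sketch computes the unweighted form of Cohen’s Kappa for two raters. It is an illustration, not the procedure used in the study, which reports a (weighted) Kappa. For a two-category distinction such as factual versus conceptual, the weighted and unweighted forms coincide. The function name and the example codings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same, non-empty set of items.")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when both raters use a single label.")
    return (observed - expected) / (1 - expected)


# Invented codings of four videos by two raters.
coder_1 = ["factual", "factual", "conceptual", "conceptual"]
coder_2 = ["factual", "conceptual", "conceptual", "conceptual"]
print(cohens_kappa(coder_1, coder_2))
```

In this invented example the raters agree on three of four videos (0.75), chance agreement is 0.5, and Kappa is therefore 0.5.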


Registration and analyses of the characteristics of the videos were supported with a codebook that described how to code and score a video on its external properties and physical characteristics.

External Properties. These properties include descriptive data and viewer statistics that are important to locate and identify a video, along with findings about usage and appraisals. After giving each video a unique ID, pertinent data on these aspects provided by YouTube were recorded. Thus, we registered descriptive data such as the URL, channel, title, subject, language, video length, upload date, and download date. Also, we included statistical data about appraisals and usage such as views, likes, dislikes, times shared, average time watched, and channel members.

Physical Characteristics. An objective description is given of the physical quality, types of words and pictures, and temporal aspects. The four main categories are: resolution, visuals, verbal & sound, and tempo (see Table 1).

The codebook provides a detailed description and illustration for coding and scoring each physical facet. Two coders (one of the researchers and a university graduate) assessed inter-rater agreement on the physical characteristics that were studied. First, each coder independently coded six randomly selected videos on all the physical characteristics. Next, Cohen’s (weighted) Kappa was computed to assess reliability. Only natural pauses yielded a very low Kappa (that is, κ = 0.37). This flags that feature as unreliably coded. All other features yielded Kappa scores of κ = 0.66 and higher.

Data Analyses

Comparisons between the three popularity classes for descriptive variables involved Chi-square (χ2), Mann-Whitney (U-statistic), Kruskal-Wallis, or ANOVA tests, depending on the nature of the data. Before conducting an ANOVA, the assumption of homogeneity of variance was examined (Levene’s statistic). All analyses were two-sided with alpha set at 0.05. Only the test outcomes for statistically significant findings are reported.
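As an illustration of the most frequently reported test, the sketch below computes Pearson’s Chi-square statistic for a contingency table in plain Python. It is not the authors' analysis pipeline, the function names are hypothetical, and the example counts are invented. The p-value helper uses the closed form that holds only for two degrees of freedom, which is the case for a comparison of three popularity classes on a present/absent feature.

```python
from math import exp


def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    Assumes every row and column total is positive; otherwise an expected
    count of zero would make the statistic undefined.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Upper-tail probability of a Chi-square value; valid for 2 df only."""
    return exp(-statistic / 2)


# Invented counts: rows are popularity classes, columns are present/absent.
counts = [[12, 13], [13, 12], [24, 1]]
stat, dof = chi_square(counts)
print(round(stat, 2), dof, round(p_value_two_dof(stat), 4))
```

For other degrees of freedom, a statistics library would be the safer choice than a hand-written p-value.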


Description of the Sampled Videos

Seventy of the 75 sampled videos were in English, five were in Dutch. The videos originated from 62 unique YouTube channels. They covered a diverse set of 72 unique topics that included atoms, black holes, diabetes, inflation, impressionism, kidneys, learning styles, music styles, neurons, satisfaction, stress, telephone, types of irony, and World War 1.

The average video length was 5 minutes and 35 seconds (range 0:29 – 20:41). Video length was slightly, but not significantly, longer for more popular videos (Unpopular 4:42; Average 5:07; Popular 6:55). Mean watch time was 55.6% (available only for 20 videos). Video length and watch time correlated significantly, r = – 0.51. This finding indicates that viewers watched a smaller percentage of longer videos. This is a common outcome in studies where viewer statistics are mined for viewing patterns (e.g., Guo et al., 2014; Wistia, 2012). The overall mean time online was 3 years and 2 months (range 39 days – 7 years and 7 months). As discussed earlier, there was a significant difference in time online, with popular videos having been online for a shorter period than unpopular or average ones.

Table 2 presents the PR-scores for each popularity class across the two main types of declarative knowledge (that is, factual, conceptual). The three popularity classes differed significantly on this rating, F (2, 69) = 427.802, p < 0.001. Planned contrasts revealed that the PR-score for unpopular videos was significantly lower than for average videos, t (72) = 15.00, p < 0.001. Likewise, the PR-score for average videos was significantly lower than the score for popular videos, t (72) = 13.70, p < 0.001. These findings support the validity of the PR-score.

Analyses further revealed an unexpected overall main effect for knowledge type, F (1, 69) = 7.45, p = 0.008. Videos with factual information received a higher PR-score than videos with conceptual information. There was no significant interaction between knowledge type and popularity class on the PR-score.


Raw Scores for the Variables in the PR-Formula

Table 3 presents the basic statistics of the variables in the PR-formula. The data show that the sampled videos have been seen by hundreds of thousands of viewers. Only a very small percentage of these viewers gave like or dislike ratings. Also, like ratings (1.1%) were considerably more common than expressions of dislike (0.03%). Times shared took on a middle position with a mean of 0.17%.

Invariably the standard deviation was higher than the mean score. This signals huge frequency differences within and across groups. These differences also show up in the findings for range. For instance, where the least viewed video had been watched 1,451 times, the most viewed one had been watched more than nine million times (9,340,314, to be exact).


For views, a striking difference between groups was that popular videos had been watched at least ten times more often than unpopular or average ones. The comparison is statistically significant, U = 1,212, p < 0.001. For likes and dislikes, unpopular videos stand out in comparison with average and popular videos, U = 1,247, p < 0.001. Unpopular videos received almost the same percentage of likes (49.3%) and dislikes (50.7%). In contrast, average and popular videos both have a high percentage of favorable ratings (89.5% and 97.3% likes, respectively).

Physical Characteristics: Resolution

The resolution is the number of distinct pixels in which screen objects are presented. As of November 2008, YouTube supported 720p HD. From that time, it also changed its display ratio to the current widescreen format of 16:9. The resolution data shown in Table 4 are those of the highest production quality of the sampled videos. As can be seen from the table, popular videos are predominantly (84%) produced in High Definition (HD) quality.

There is a significant difference between popularity classes for resolution, χ2 (6, N=75) = 37.0, p < .001. Unpopular videos have a lower mean resolution than average videos, χ2 (3, N=50) = 19.0, p < .001. In turn, average videos have a lower mean resolution than popular ones, χ2 (3, N=50) = 19.9, p < .001.


Physical Characteristics: Visuals

The category visuals encompasses the pictorial information in the videos. Two videos did not contain any pictures at all. One of these dealt with the topic of irony. The other discussed types of music. Both were strictly verbal presentations with some of the spoken text also appearing on screen.

We discuss the various types of pictures below. In coding, we systematically registered only their physical presence or absence. We did not also code their instructional relevance. Because we believe that readers will nevertheless want to get an impression of this facet, we describe pictures serving a functional role and pictures with a decorative function. In visuals, a distinction was made between static and dynamic pictures.

Physical Characteristics: Visuals – Static Pictures. Static pictures, or stills, are single images without motion. A significant difference was found between popularity classes for static pictures, χ2 (2, N=75) = 14.1, p = 0.001. As Table 4 shows, static pictures were nearly always present in popular videos, whereas their presence in unpopular and average videos was about fifty-fifty.

Two kinds of static pictures are distinguished: iconic and analytic ones (compare Ploetzner & Lowe, 2012). The term iconic picture refers to illustrations that resemble real objects (see Figure 1). Iconic pictures include displays such as schematic, realistic and photo-realistic pictures. An example of a functional iconic picture is the display of a Van Gogh painting in a video on impressionism. An example of a decorative one is the display of a discount label in a video that explains the concept of discount in the thermodynamic system of entropy. Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 9.7, p = 0.008. Iconic pictures appear more often in popular videos than in unpopular or average ones.

The term analytic picture refers to illustrations that symbolize objects or states. Analytic pictures include displays such as charts, diagrams, graphs and maps (see Figure 2). An example of a functional analytic picture is the display of a chart illustrating the increased use of hydraulic fracturing in gas and oil recovery from deep layers of the earth in a video on fracturing. Another example of a functional analytic picture is a display of a shared field for customer identity in two databases in a video on database concepts. Analytic pictures were predominantly functional. A rare decorative analytic picture that we encountered was a flow chart of a computer program. It served to illustrate that computers can do innovative things. Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 915.8, p < 0.001. Analytic pictures appear more often in popular videos than in unpopular or average ones.

Figure 1. An Iconic Picture from a Video Titled “Why Things Are Creepy” (source: https://www.youtube.com/watch?v=PEikGKDVsCc)
Figure 2. An Analytic Picture from a Video Titled “Water – Liquid Awesome” (source: https://www.youtube.com/watch?v=HVT3Y3_gHGg&list=PLIc7tSCO2aDENsUh4rOUE49RUtcLrBfb0).

Physical Characteristics: Visuals – Dynamic Pictures. Dynamic pictures show change over time. There is considerable educational debate on the question of whether dynamic pictures achieve instructional goals better than static ones. It is generally argued that dynamic representations are favored for conveying temporal order and spatial relations (e.g., Arguel & Jamet, 2009; Höffler & Leutner, 2007; Tversky, Bauer-Morrison, & Bétrancourt, 2002). Because it has proven very difficult to realize equal conditions in experiments, empirical evidence supporting this stance is so far lacking. About three out of four videos included dynamic pictures. No difference was found between popularity classes for these pictures. Two kinds of dynamic pictures are distinguished: real and animated.

Real or realistic dynamic pictures present real-world images that display time-related modifications. A good example of functional usage of such dynamic pictures is a video in which the viewer gets to experience creepiness from watching three humans who turn their face towards the viewer to reveal the scary masks they are wearing. A decorative usage of dynamic pictures is the use of flipping or rotating screenshots of newspaper articles on loss of data and stolen CDs in a video on computer security. Comparisons between the three popularity classes revealed no significant difference for these pictures.

Animations consist of sets of highly similar stills whose rapid presentation creates the illusion of change over time. An example of functional usage is a video that employs realistic animations to show how messages are transmitted across the human nervous system. Functional animations were also found in a TedEd video about the production of tears. Animations mainly served a decorative role in a video where an animated presenter made movements that had nothing to do with the talk he was giving on differentiated instruction. There were no significant differences between popularity classes for the presence of these pictures.

Physical Characteristics: Verbal & Sound

The category verbal & sound includes the presence of written and spoken words in the video. The main distinction here is that between title, on-screen text, subtitles and audio.

Physical Characteristics: Verbal & Sound – Title. The subcategory title refers to the presence or absence of a video title or name. Table 4 shows that a considerable percentage (38%) of the videos did not display their title. Popular videos had the lowest score here, but there was no statistically significant difference between popularity classes.

Physical Characteristics: Verbal & Sound – On-Screen Text. The subcategory on-screen text refers to all verbal information presented on the screen, the title excepted. Almost eighty percent of all videos presented some verbal information to the viewers. Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 8.2, p = 0.016. Unpopular videos less often included on-screen text than popular ones, χ2 (1, N=50) = 7.0, p = 0.008.

Within the class of on-screen text, a further distinction was made between short and long texts. The difference between the two shows up in their presentation and in the main role that such texts appear to play. Short texts come in the shape of labels or annotated pictures (see Figure 3). The information for viewers to read is very short. Usually just a single word is presented. The labels and annotations support the visual information on the screen. They provide a name or term to a displayed picture or object therein.

Fifty-two percent of the videos carried short texts (see Table 4). Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 6.0, p = 0.048. Short texts are more common in popular videos than in unpopular or average ones, respectively, χ2 (1, N=50) = 5.1, p = 0.023, and χ2 (1, N=50) = 4.0, p = 0.045.

Long texts generally come in the shape of slides or written messages (see Figure 4). The text on these slides or messages is presented all at once or gradually appears on the screen (for example, in an animation of writing). Long texts almost always carry the main message. When there are also visuals on the screen, these tend to support the text rather than the other way around. Thirty-six percent of the videos presented long texts. These texts appeared more commonly in average videos than in unpopular or popular ones. However, comparisons between the three popularity classes revealed no significant difference.

Physical Characteristics: Verbal & Sound – Subtitles. The subcategory subtitles refers to the affordance of presenting the spoken words in on-screen text. Depending on the setting, subtitles can be obtained in the original language, or in translated form.

Figure 3. An Annotated Static Picture from a Video Titled “The Seed Germination Process” (source: https://www.youtube.com/watch?v=3Ij1eW_gsrM).

YouTube can automatically generate subtitles. With poor audio this yields poor subtitles which can lead to funny mistakes such as when the spoken “human anatomy – neuron” is subtitled with “human anatomy you’re on”. (Later on YouTube got this keyword right and subtitling became excellent for this video on neurons.) By and large, the automatic subtitling worked fine, except for jargon. For instance, we found the word “fluid” subtitled as “do it,” the phrase “bearish and bullish” subtitled as “parish and bullish,” and “considerable energy” subtitled as “consider a battery.” We also found one odd instance in which a video with an English narrative, presumably spoken by someone from India, automatically caused YouTube to subtitle the video in German. YouTube seems to experience considerable difficulties in subtitling videos with a Dutch voice-over. That is, for the five Dutch videos in our sample, we found the subtitling extremely poor.

To optimize subtitling, it is recommended to include a transcript with the video. Besides improving the subtitling, transcripts play two other important roles. Both concern accessibility. One advantage is that a transcript can create more traffic to the video because the keywords in the transcript can easily be picked up by search engines. The other advantage is that the presence of a transcript makes a video more accessible for people who might not (be able to) listen to the audio or watch the video.

Figure 5. Subtitling in the Original Language Only from a Video Titled “New Gravity Understanding” (source: https://www.youtube.com/watch?v=1hPXivsrqnk).

Virtually all videos came with subtitles. Four percent of the videos had no subtitles. The affordances for handling the available subtitling options varied considerably. Twelve percent of the videos were equipped with fixed subtitles. Viewers cannot choose whether or not to display subtitles in these videos. When fixed, the subtitles automatically appear on the screen, always in the native language of the video. Fifty-one percent of the videos included the option to switch on or off the presentation of subtitles in the native language (see Figure 5). The remaining thirty-three percent of the videos offered the viewer the broadest choice. In these videos viewers could switch the presentation of subtitles on or off, and they could select the language of the subtitles (see Figure 6). With this option viewers can obtain subtitles in translated form.

Figure 6. Subtitling with the Option to Get Translations from a Video Titled “Fracking Explained” (source: http://www.youtube.com/watch?v=Uti2niW2BRA).

Popular videos predominantly gave viewers the most extensive set of choices for subtitling. In contrast, most unpopular and average videos merely gave the viewer the choice between switching native subtitling on or off. The difference is statistically significant, respectively χ2 (1, N=48) = 18.7, p < 0.001, and χ2 (1, N=39) = 16.0, p < 0.001. Unpopular and average videos did not significantly differ from one another.

Physical Characteristics: Verbal & Sound – Audio. The term audio refers to what viewers can hear. A distinction is made between four types of auditory information: narration, music, sound and noise. All the sampled videos included audio information; there were no completely silent videos.

The subcategory narration refers to the spoken words in a video. Nearly all the talk in the videos came from real people. We found only three instances where the talk seemed to have been computer-generated. Several researchers argue that a real human voice is preferable because of greater naturalness and attractiveness (e.g., Baylor, 2011; Mayer, 2005; van der Meij & van der Meij, 2013).

Just as for pictures, it is possible to distinguish between a functional and a ‘decorative’ narrative. The first refers to talk that explains the topic. An example of a functional narrative is the text “hormones keep your heart beating and your body heated” in a video discussing the influence of hormones on the human brain. That same video also included instances of decorative talk, for instance when the narrator stated “Being a teenager is hard and so is living with one, I’m told.”

Eighty-eight percent of the videos contained a narration (see Table 4). Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 9.0, p = 0.011. Narration is significantly less frequently employed in unpopular videos, χ2 (1, N=50) = 5.3, p = 0.021.

The subcategory music refers to all kinds of songs, instrumental works and the like in which rhythm, harmony and melody play a role. In most of the videos with music, it plays in the background. An example of functional usage of such background music supporting the content was the national anthem of the United Kingdom playing in a video on that country. Likewise, a video on why things are creepy used eerie music to create the right atmosphere for understanding what creepy means. More generally, a good choice of music can contribute to the flow of a video and support the kind of emotion that a video might want to evoke (compare Kämpfe, Sedlmeier, & Renkewitz, 2010).

Music is found in 61 percent of the videos. Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=75) = 8.5, p = 0.014. Average videos used music the least often (40%), next are unpopular videos (64%). Music commonly appears in popular videos (80%). Only the difference between popular and average videos was statistically significant, χ2 (1, N=50) = 8.3, p = 0.004.
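The music comparison can be retraced from the reported percentages. The sketch below is an illustration only, not the analysis script used in the study. It assumes 25 videos per popularity class (75 videos in three classes), which turns the reported 64%, 40%, and 80% into 16, 10, and 20 videos with music. The function names are hypothetical, the statistic is an uncorrected Pearson chi-square, and the closed-form p-values used here hold only for 1 and 2 degrees of freedom.

```python
import math


def pearson_chi2(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    `table` is a list of rows of observed counts; no continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


def chi2_p_value(chi2, dof):
    """Upper-tail p-value; this sketch covers 1 and 2 degrees of freedom only."""
    if dof == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if dof == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("only 1 or 2 degrees of freedom are supported here")


# Assumed counts per class: [with music, without music], 25 videos each.
music = {
    "unpopular": [16, 9],
    "average": [10, 15],
    "popular": [20, 5],
}

overall_chi2, overall_dof = pearson_chi2(list(music.values()))
overall_p = chi2_p_value(overall_chi2, overall_dof)

pair_chi2, pair_dof = pearson_chi2([music["popular"], music["average"]])
pair_p = chi2_p_value(pair_chi2, pair_dof)

print(f"all classes: chi2({overall_dof}, N=75) = {overall_chi2:.1f}, p = {overall_p:.3f}")
print(f"popular vs average: chi2({pair_dof}, N=50) = {pair_chi2:.1f}, p = {pair_p:.3f}")
```

Under these assumed counts, the sketch yields 8.5 (p = 0.014) for the three-class comparison and 8.3 (p = 0.004) for popular versus average videos, in line with the values reported above.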

A clear distinction could be made between videos in which music was played only in the opening and at closure and videos where music was (almost) consistently present. On average about one-third of the music appeared at the beginning and/or end of the videos. The majority (67%) played music throughout the video. In all three popularity classes, almost the same 33-67 ratio was found.

In about thirty percent of all videos, narration and music were jointly presented. In most cases, the music played a background role, in which it should not compete with the spoken message. To avoid such competition, it is advised not to use vocals and piano melodies, among others (“Choosing music for your video,” 2015). As far as we have been able to establish, the music in most videos did not draw the viewer’s attention away from the narrative.

The subcategory sounds refers to audio elements used to create a special sound effect as one might also find in plays (for example, flowing water, or the creaking sound of an opening door). The sounds that we found appeared to be used mainly to achieve a signaling or supportive function. That is, they tended to be used for drawing attention to a change on the screen. The viewer might hear a swoosh sound for a change of pictures in the video, or a ‘bleep’ or ‘pling’ would accompany the emergent display of a label with a picture.

The use of sounds in the videos is uncommon. On average, twenty-one percent of the videos used a sound effect. There were no differences between popularity classes.

The subcategory noise refers to undesirable mechanical sounds that happen to have been recorded. An example is the presence of a static buzzing or ambient sound presumably stemming from the microphone, in a video narration on Bernoulli’s principle. On average, the audio in twenty-four percent of the videos was slightly noisy to noisy. Comparisons between the three popularity classes revealed the presence of a significant difference, χ2 (2, N=73) = 8.1, p = 0.017. Noise is rare in popular videos and more common in unpopular videos, χ2 (1, N=48) = 6.7, p = 0.010, and average videos χ2 (1, N=49) = 7.6, p = 0.006.

Physical Characteristics: Tempo

The category tempo includes two characteristics that affect the speed of the video. One aspect is narrative speed. Simply put this is the pace of the talk by the presenter or narrator. The dynamics of the pictures in an instructional video should be aligned with this tempo (Plaisant & Shneiderman, 2005; van der Meij & van der Meij, 2013). That is, in normal playing mode the narrative speed should dictate the speed for presenting the visual information. In the present study, we operationalized the narrative speed as the number of words per minute (wpm-rating). For the word count, we turned to the YouTube transcripts, which provided a sufficiently accurate estimate for computing the wpm-rating.
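The wpm-rating described above amounts to a word count divided by the duration in minutes. The following is a minimal sketch under stated assumptions: words are counted as whitespace-separated tokens, the duration is given in seconds, and the function name `words_per_minute` is hypothetical. The exact counting rules used in the study are not specified.

```python
def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Return whitespace-separated word count divided by duration in minutes."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    word_count = len(transcript.split())
    return word_count / (duration_seconds / 60)


# Hypothetical example: a 290-word transcript for a two-minute video.
sample_transcript = " ".join(["word"] * 290)
rate = words_per_minute(sample_transcript, 120)
print(rate)  # 145.0
```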

For narrative speed, a mean score of 145 words per minute with a standard deviation of 40 was found for all videos together. This outcome is similar to what was reported in a recent empirical study involving the analysis of 6.9 million video watching sessions in four edX courses (Guo et al., 2014). That study found a mean speaking rate of 156 wpm and a standard deviation of 31.

Advice on the recommended narrative speed varies considerably (e.g., Drew, 2015; “How to write,” 2015; “Make a good video script,” 2015). At the lower end, a figure of 125-150 words per minute is mentioned. The boundary between the lower and middle range is also described as “The typical rule of thumb in the industry is 150 words per minute” (“9 Insider tips,” 2015). The middle range runs from 150 to 175 words per minute. Narratives with a faster speaking rate constitute the upper range. This includes the figure of 180 words per minute, which is considered “a common estimate for broadcasting” (“Timing for a broadcast script,” 2015). The aforementioned empirical study of Guo, Kim and Rubin (2014) found that students engaged more with videos where instructors spoke faster. More specifically, when presenters spoke at a rate of 185 words per minute or more, viewer engagement was found to increase significantly. The authors also found no evidence of more play and pause events in faster videos, which led them to the conclusion that these videos were not more confusing or harder to follow.

A comparison of the three popularity classes on narrative speed yielded a significant result, χ2 (2, N=63) = 16.41, p < 0.001. The difference lies in the contrast between the popular videos and the two other popularity classes. The narrative speed in popular videos is substantially higher.

The difference in narrative speed between the popularity classes also shows up when we look at videos that lie more than 1 standard deviation below or above the mean (that is, lower than 105 wpm or higher than 185 wpm). For unpopular videos such limits do not matter; all fall within this range, although fifteen of the sixteen unpopular videos with a wpm-rating fell below the mean score of 145. For average videos, six out of twenty-three videos scored below the lower limit, as opposed to a single video scoring above the upper limit. In contrast, none of the popular videos scored below the range, and exactly fifty percent (n=12) had a narrative speed above it. This signals that popular videos not only average a higher narrative speed, but also that a high speed is a common characteristic of a substantial proportion of these videos.
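The limits of 105 and 185 wpm follow directly from the reported mean (145) and standard deviation (40). As a small illustrative sketch, the snippet below derives the band and labels individual ratings; the function name and the sample ratings are hypothetical and are not data from the study.

```python
MEAN_WPM = 145  # reported mean narrative speed
SD_WPM = 40     # reported standard deviation

LOWER_LIMIT = MEAN_WPM - SD_WPM  # 105
UPPER_LIMIT = MEAN_WPM + SD_WPM  # 185


def classify_rate(wpm: float) -> str:
    """Label a wpm-rating relative to the one-standard-deviation band."""
    if wpm < LOWER_LIMIT:
        return "below range"
    if wpm > UPPER_LIMIT:
        return "above range"
    return "within range"


# Hypothetical ratings, for illustration only.
sample_ratings = [98, 132, 145, 190]
labels = [classify_rate(rating) for rating in sample_ratings]
print(labels)
```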

The other temporal feature is the inclusion of natural pauses. During such pauses nothing visibly or audibly happens in the video. There is no talk and the visuals remain unchanged. Research suggests that such pauses help demarcate event boundaries and give the viewer time to reflect and let sink in the just-completed segment (e.g., Ertelt, 2007; Spanjers, van Gog, & van Merriënboer, 2010, 2012). Even pauses that last no longer than 2 to 5 seconds can contribute significantly to knowledge development.

On average, 70% of all videos included natural pauses on a regular basis. A significant negative correlation with speaking rate was found (Kendall’s tau (N=63) = –0.42, p < 0.001). In other words, where there is a higher speaking rate, there are fewer natural pauses.
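To make the direction of this correlation concrete, the sketch below computes a tie-corrected Kendall’s tau (tau-b) by counting concordant and discordant pairs. It rests on assumptions: the source does not state which tau variant was used or how pauses were scored, the function name is hypothetical, and the six data points are invented for illustration, so the result is not meant to reproduce the reported –0.42.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Return Kendall's tau-b for two equally long sequences of numbers."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical data: speaking rate (wpm) and regular pauses (1 = yes, 0 = no).
wpm = [110, 125, 140, 160, 175, 190]
pauses = [1, 1, 1, 0, 1, 0]

tau = kendall_tau_b(wpm, pauses)
print(round(tau, 2))  # -0.55
```

A negative value means that, across pairs of videos, the faster-spoken video tends to be the one without regular pauses.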

A comparison of the three popularity classes yielded a significant result, χ2 (2, N=67) = 14.54, p = 0.001. The difference lies in the contrast between the popular videos and the two other popularity classes. The presence of natural pauses is just over twice as frequent in unpopular and average videos. Unfortunately, as noted in the Method section, the coding of natural pauses did not reach a level of agreement between raters that merits drawing firm conclusions for this feature.


The formula that we constructed for assessing the popularity of YouTube videos served its purpose well. It helped create a meaningful distinction between three popularity classes, using a combination of data from viewer appreciation and viewing rate. The highly diverse frequency data for the main variables in the formula suggest that this is not self-evident.

When we started coding the videos, we had not expected to discover that popular videos would do so well on nearly all physical characteristics that we analyzed. But they did. As we progressed in our analyses, it also became increasingly apparent that what seemed like a straightforward inventory study had turned into a quest for meaning that is far from being rounded off.

For several measures that we looked at, questions about meaning emerged. For instance, we have yet to discover exactly how YouTube measures views. Is it, as some claim, enough that people load a video and keep it open for a minimum of five seconds? Or is the view count a more complex measure, perhaps even one that changes over time as data mining techniques advance? Questions about measurements also concern functionality. This issue was already raised in the presentation of the results, where we made a distinction between pictures playing a functional or decorative role.

These questions aside, we believe that our study yields fruitful insights on what makes an instructional YouTube video popular. Hopefully, it also provides a good starting point for further research on what turns a popular instructional video into an effective one.


9 Insider tips for creating a killer explainer video. (2015). Retrieved February 2, 2015, from https://blog.kissmetrics.com/creating-a-explainer-video/

Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., . . . Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. New York, NY: Pearson, Allyn & Bacon.

Arguel, A., & Jamet, E. (2009). Using video and static pictures to improve learning of procedural contents. Computers in Human Behavior, 25, 354-359. doi: 10.1016/j.chb.2008.12.014

Baylor, A. L. (2011). The design of motivational agents and avatars. Educational Technology Research & Development, 59, 291-300. doi: 10.1007/s11423-011-9196-3

Choosing music for your video. (2015). Retrieved February 7, 2015, from http://wistia.com/learning/choosing-music-for-your-video

Drew, P. (2015). Copywriting for instructional design narration and role playing. Retrieved February 2, 2015, from http://www.e-learningvoices.com/articles/copywriting.php

Ertelt, A. (2007). On-screen videos as an effective learning tool. The effect of instructional design variants and practice on learning achievements, retention, transfer, and motivation. (Doctoral dissertation), Albert-Ludwigs Universität Freiburg, Germany.

Guo, P. J., Kim, J., & Rubin, R. (2014, March 4–5). How video production affects student engagement: An empirical study of MOOC videos. Paper presented at the first ACM Conference on Learning @ Scale (L@S ’14), Atlanta, GA.

Here’s why the view count on new YouTube videos always stops at 301. (2014). Retrieved February 4, 2015, from http://businessetc.thejournal.ie/why-youtube-views-stop-301-1606621-Aug2014/

Höffler, T. N., & Leutner, D. (2007). Instructional animation versus static pictures: A meta-analysis. Learning and Instruction, 17, 722-738. doi: 10.1016/j.learninstruc.2007.09.013

How to write a killer explainer video script. (2015). Retrieved February 2, 2015, from http://www.videobrewery.com/blog/how-to-write-a-killer-explainer-video-script

Kämpfe, J., Sedlmeier, P., & Renkewitz, F. (2010). The impact of background music on adult listeners: A meta-analysis. Psychology of Music, 39(4), 424-448. doi: 10.1177/0305735610376261

Make a good video script great in 5 steps. (2015). Retrieved February 2, 2015, from http://www.sitepoint.com/make-a-good-video-script-great-in-5-steps/

Mayer, R. E. (2005). Principles of multimedia learning on social cues: Personalization, voice and image principles. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 201-212). New York, NY: Cambridge University Press.

Morain, M., & Swarts, J. (2012). YouTutorial: A framework for assessing instructional online video. Technical Communication Quarterly, 21, 6-24. doi: 10.1080/10572252.2012.626690

Plaisant, C., & Shneiderman, B. (2005). Show me! Guidelines for recorded demonstration. Paper presented at the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC’05), Dallas, Texas. http://www.cs.umd.edu/localphp/hcil/tech-reports-search.php?number=2005-02

Ploetzner, R., & Lowe, R. (2012). A systematic characterisation of expository animations. Computers in Human Behavior, 28, 781-794. doi: 10.1016/j.chb.2011.12.001

Smith, P. L., & Ragan, T. J. (2005). Instructional design (3rd ed.). Hoboken, NJ: Wiley.

Spanjers, I. A. E., van Gog, T., & van Merriënboer, J. J. G. (2010). A theoretical analysis of how segmentation of dynamic visualizations optimizes students’ learning. Educational Psychology Review, 22, 411-423. doi: 10.1007/s10648-010-9135-6

Spanjers, I. A. E., van Gog, T., & van Merriënboer, J. J. G. (2012). Segmentation of worked examples: effects on cognitive load and learning. Applied Cognitive Psychology, 26, 352-358. doi: 10.1002/acp.1832

Statistics. (2015). Retrieved February 8, 2015, from https://www.youtube.com/yt/press/statistics.html

Swarts, J. (2012). New modes of help: Best practices for instructional video. Technical Communication, 59(3), 195-206.

Timing for a broadcast script. (2015). Retrieved February 2, 2015, from https://jtoolkit.wordpress.com/2008/04/12/timing-for-a-broadcast-script/

Tversky, B., Bauer-Morrison, J., & Bétrancourt, M. (2002). Animation: Can it facilitate? International Journal of Human-Computer Studies, 57, 247-262. doi: 10.1006/ijhc.2002.1017

van der Meij, H., & van der Meij, J. (2013). Eight guidelines for the design of instructional videos for software training. Technical Communication, 60(3), 205-228.

Wistia. (2012). Does length matter? Retrieved from http://wistia.com/blog/does-length-matter-it-does-for-video-2k12-edition

About the Authors

Petra ten Hove is Junior Educational Scientist. She graduated from the University of Twente in 2014 with a Master’s degree in Educational Science and Technology. Her research interests are instructional technology and video designs for education. Contact: Petratenhove@gmail.com

Hans van der Meij is Senior Researcher and Lecturer in Instructional Technology at the University of Twente (The Netherlands). His research interests are: questioning, technical documentation (e.g., instructional design, minimalism, the development of self-study materials), and the functional integration of ICT in education. He received several awards for his articles, including a “Landmark Paper” award by IEEE for a publication on minimalism (with John Carroll). Contact: H.vanderMeij@utwente.nl

Manuscript received 9 February 2015; revised 20 February 2015; accepted 20 February 2015.