
The evolution of online news headlines

Corpora

Several datasets contribute to our collection of headlines: BIG4 combines headlines from four representative news outlets across two decades, covering the early years of digital journalism, while the News on the Web (NOW) corpus draws on a broader range of news websites but starts later, in 2010. We also included two reference corpora for comparison: one containing clickbait headlines and another containing scientific preprint titles.

The BIG4 corpus covers a longer time range that captures the transition from print to hybrid/online journalism. It includes headlines from The New York Times and The Guardian (both collected via the official APIs), plus two freely available datasets with headlines from The Times of India and ABC News Australia, summing to ca. 9 million headlines. The New York Times API covers both print and digital-only content and includes publications from as far back as 1851; in this study, we focus on the time range from 2000 onwards in order to cover the transition of the medium to online environments. While there is little documentation for the Times of India and ABC Australia datasets, we verified that searching for early article headlines brings up a digital version of the article, meaning that these articles also appeared online. The ABC Australia corpus raises some questions: for early years (e.g., 2003), some headlines could not be found online, and for later years (e.g., 2020), the headline matched an article's URL but differed from the headline currently shown on that article. For years in between (e.g., 2010, 2015), the headlines match both the URL and the current title of the article. This could reflect editorial curation practices, such as later selecting a better-performing headline, but it also raises doubts about the quality of this particular dataset. As the dataset appears to be preprocessed (lacking punctuation and being all lowercase), the headlines may have been extracted directly from the URLs, which would be problematic wherever URLs do not accurately reflect the headlines. With these caveats in mind, and because we present disaggregated descriptive results, we decided to include this corpus in our analyses.
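As an illustration of this collection step, the sketch below queries the Guardian content API for headlines in a given date range. The API key and date range are placeholders, and the analogous New York Times collection would use its own Archive/Search API; this is a minimal sketch of the approach, not the exact collection script.

```python
# Minimal sketch: collecting Guardian headlines via the official content API.
# GUARDIAN_API_KEY is a placeholder; keys are issued by the Guardian Open Platform.
import requests

API_KEY = "GUARDIAN_API_KEY"
ENDPOINT = "https://content.guardianapis.com/search"

def fetch_headlines(from_date: str, to_date: str, max_pages: int = 5):
    """Yield article headlines (webTitle) published in the given date range."""
    for page in range(1, max_pages + 1):
        resp = requests.get(ENDPOINT, params={
            "api-key": API_KEY,
            "from-date": from_date,   # e.g. "2000-01-01"
            "to-date": to_date,       # e.g. "2000-12-31"
            "page-size": 50,          # results per page
            "page": page,
        })
        resp.raise_for_status()
        data = resp.json()["response"]
        for item in data["results"]:
            yield item["webTitle"]
        if page >= data["pages"]:     # stop once all pages are exhausted
            break

headlines_2000 = list(fetch_headlines("2000-01-01", "2000-12-31"))
```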

We also purchased the Corpus of News on the Web (NOW), a curated collection of English-language news text ranging from 2010 to the present, sourced from news websites across 20 countries where English is used (Davies 2016) and summing to 30 million headlines. Each day, new URLs are retrieved via Google News and the respective articles are added to the corpus. This sampling strategy is intended to capture the current discourse across English-speaking media. In contrast to the corpora obtained through the Guardian and New York Times APIs, the NOW corpus does not offer a complete record of the output of any specific outlet. Perhaps due to the sampling strategy, the corpus composition changes over time: some outlets are heavily sampled in one year and not at all in another. Data quality is also an issue, as for some articles only part of the headline was scraped; these cases are easy to spot, as the headlines end in ‘…’. We therefore carried out our analysis on the entire, unfiltered NOW corpus as well as on various subsets of it to ensure the robustness of our findings. The figures in this paper report a cleaned version of the NOW corpus with two filtering steps: incomplete headlines were removed, then the four outlets for which we already had separate corpora were removed. Note that The Times of India is the most frequent source in the NOW corpus and seems to be relatively oversampled. As a robustness check, we include figures for the unfiltered NOW corpus in the Supplementary Information (Supplementary Fig. 1). We also applied an even stricter criterion, including only outlets that have consistently contributed at least 500 headlines per year since 2010, to ensure continuity in our data and to counteract artifacts of the sampling strategy (see Supplementary Table 6 for a list of these outlets, and Supplementary Table 2 for more information about the full set and the two subsets).
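The filtering steps are straightforward to express in pandas. The sketch below assumes a hypothetical table with one row per headline and 'outlet', 'year', and 'headline' columns; the file name and column names are illustrative, not the actual data layout.

```python
import pandas as pd

# Hypothetical layout of the NOW headline table: one row per headline,
# with 'outlet', 'year', and 'headline' columns.
now = pd.read_csv("now_headlines.csv")

# Step 1: drop truncated headlines, recognizable by a trailing ellipsis.
now = now[~now["headline"].str.endswith("…")]

# Step 2: drop the four outlets already covered by the BIG4 corpus.
BIG4_OUTLETS = {"The New York Times", "The Guardian",
                "The Times of India", "ABC News Australia"}
cleaned = now[~now["outlet"].isin(BIG4_OUTLETS)]

# Stricter subset: keep only outlets that contributed at least
# 500 headlines in every year since 2010.
counts = cleaned.groupby(["outlet", "year"]).size().unstack(fill_value=0)
years = list(range(2010, counts.columns.max() + 1))
consistent = counts.index[(counts[years] >= 500).all(axis=1)]
strict = cleaned[cleaned["outlet"].isin(consistent)]
```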

Besides serving as robustness checks, the full set and the two subsets represent different angles on the question of how headlines have changed. Every day, new articles are added to the NOW corpus based on the Google News API; in its entirety, the corpus may thus paint a representative picture of what the news sounds like at any given point in time. On the other hand, this aggregate view obscures the fact that different outlets contribute to this picture. Filtering the NOW corpus down to specific outlets that consistently contributed to the data, we can get a sense of how particular outlets have employed different language over time, providing a view analogous to the BIG4 dataset.

For benchmark comparisons, we included a clickbait-style corpus (Matias et al. 2021) and a corpus of 2,276,611 scientific preprint titles in STEM fields (arXiv dataset); for more information about these corpora, see Supplementary Table 2. The Upworthy dataset is an excellent benchmark for the clickbait features, as Upworthy is commonly considered the prime example of clickbait style (Chakraborty et al. 2016; Munger 2020; Scott 2021). Since this relatively small corpus spans a short time frame (2013–2015), we use it as a static benchmark without the time dimension.

Natural language processing

We used a Python pipeline to clean, tokenize, part-of-speech-tag, and analyze the headlines in terms of our selected features (see Supplementary Fig. 7 for an overview of the pipeline). For sentiment analysis, we used the Flair package (Akbik et al. 2018) to classify headlines as positive or negative. Flair returns a label (negative or positive) along with a confidence score for each headline. We set a relatively high threshold (0.9) for accepting the label suggested by Flair, labeling headlines with scores of 0.9 or lower as “neutral”.
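A minimal sketch of this labeling step with Flair's off-the-shelf English sentiment model; the 0.9 cutoff mirrors the threshold described above, while the example headline and function name are our own.

```python
from flair.data import Sentence
from flair.models import TextClassifier

# Load Flair's pre-trained English sentiment classifier.
classifier = TextClassifier.load("en-sentiment")

def label_headline(text: str, threshold: float = 0.9) -> str:
    """Return 'POSITIVE', 'NEGATIVE', or 'neutral' for a headline."""
    sentence = Sentence(text)
    classifier.predict(sentence)
    label = sentence.labels[0]      # predicted class plus confidence score
    if label.score > threshold:     # accept only confident predictions
        return label.value          # 'POSITIVE' or 'NEGATIVE'
    return "neutral"                # scores of 0.9 or lower

print(label_headline("You won't believe what happened next"))
```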

We also captured the syntactic structure of the headline as an important aspect of style. For this, we used a constituency parser built on the Python libraries spaCy (Honnibal et al. 2020) and benepar (Kitaev and Klein 2018; Kitaev et al. 2019), which constructs a hierarchical representation of the sentence with labeled parts (constituents). We focused on the top-most label, which captures whether the entire headline is a sentence (S), a noun phrase (NP), etc. For a visualization of constituency structures, see Supplementary Fig. 5; for a flowchart detailing the preprocessing and analysis steps, see Supplementary Fig. 7.
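A sketch of extracting the top-most constituency label with spaCy and benepar; the model names are the standard published ones (which may need downloading first), and the helper function is illustrative rather than taken from the paper's pipeline.

```python
import benepar
import spacy

# One-time setup (assumed already done):
#   python -m spacy download en_core_web_md
#   python -c "import benepar; benepar.download('benepar_en3')"
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

def top_label(headline: str) -> str:
    """Return the top-most constituency label (e.g. 'S', 'NP') of a headline."""
    doc = nlp(headline)
    sent = next(iter(doc.sents))
    # benepar exposes constituent labels on spans; for the full-sentence
    # span, the first label is the root of the parse tree.
    labels = sent._.labels
    return labels[0] if labels else "UNKNOWN"

print(top_label("The quiet rise of longer headlines"))
```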

Statistical analysis

To quantify the relationship between the individual features and time, we performed linear regressions for the continuous features (e.g., number of words) and logistic regressions for the binary features (e.g., occurrence of a wh-word), with year as the only predictor variable. Due to prohibitive running times, we abandoned a Bayesian approach and instead ran regression models using the lme4 package in R (Bates et al. 2015). While chosen for feasibility, we consider a frequentist approach appropriate given the several million data points at our disposal.
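The regressions were run in R with lme4; purely to illustrate the model specification, an equivalent in Python's statsmodels might look as follows. The synthetic data frame and its column names are hypothetical stand-ins for the headline table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the headline table: one row per headline.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=5000),
    "n_words": rng.poisson(9, size=5000),      # e.g. number of words
    "has_wh": rng.integers(0, 2, size=5000),   # e.g. wh-word present (0/1)
})

# Linear regression for a continuous feature, with year as sole predictor.
linear_fit = smf.ols("n_words ~ year", data=df).fit()

# Logistic regression for a binary feature.
logit_fit = smf.logit("has_wh ~ year", data=df).fit()

print(linear_fit.params["year"])   # estimated change in word count per year
print(logit_fit.params["year"])    # change in log-odds of a wh-word per year
```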

The rise of headline length over time was among the most robust and salient trends we observed. At the same time, we found a clear link between the length of a headline and the other linguistic features (Fig. 5). This raised the question of whether the increase in the other features was just a byproduct of increased headline length, which would be the case if length were the driver of these other features. If so, controlling for the effect of headline length, rather than regressing features on year as the sole predictor, would disentangle their contributions. This reasoning suggests a mediation analysis of the effect of year on linguistic variables, mediated by length. Although mediation analysis is a useful and popular tool, it would be misleading in this context because it implies a specific causal structure behind the data-generating process, and we know that this structure does not apply and that our variables do not lend themselves to causal inference.
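For concreteness, a textbook linear mediation model (a sketch of the setup we decided against, not a model we fit) would take the form

$$\text{length}_i = \gamma_0 + \gamma_1\,\text{year}_i + u_i, \qquad \text{feature}_i = \beta_0 + \beta_1\,\text{year}_i + \beta_2\,\text{length}_i + \varepsilon_i,$$

so that the effect of year decomposes into a direct effect $\beta_1$ and a length-mediated indirect effect $\gamma_1\beta_2$. The next paragraph explains why we do not endorse this causal reading.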

Firstly, the variable year is useful to describe a trend over time, but time itself is not a cause; rather, it bundles several causal factors, such as new technologies, falling production costs, and the accelerating dynamics of online attention. This already hinders causal inference with the variables at our disposal. Secondly, a mediation analysis would assume that year increases length, which in turn causes certain linguistic features to appear with some (generally higher) frequency (the indirect effect of year on features), alongside a (probably attenuated) direct effect of year on the features. For causal inference to be possible, the flow of causation would have to be unidirectional between these variables. However, the causal structure in question is more complex than the mediator diagram suggests. For instance, there is mutual causation between length and a linguistic feature: while each additional word provides an opportunity for the inclusion of a specific word or feature (length causes feature), each feature in turn affects the length (feature causes length). Unobserved variables, such as style, may also be at work: journalists and editors do not produce headlines one word after another, such that the result happens to be a full sentence or a noun phrase, a clickbaity or a dry headline. Instead, they make holistic decisions at the headline level, with consequences for the word count and the features used. Because holistic properties such as style are difficult to capture, we limited ourselves to tracking the frequency of measurable linguistic features over time.

Given these assumptions about the underlying process, we concluded that length would be a bad control, resulting in an uninterpretable and misleading regression (Achen 2005; Rohrer 2018); see Cinelli et al. (2022) for an introduction to Directed Acyclic Graphs. Instead, we used regressions as a means to quantify how features changed over time, without implying causality. To address the relationship between length and features, we include descriptive plots of feature occurrence as a function of the number of words (Fig. 5).

Classifying outlets by political leaning and journalistic quality

To classify outlets by political leaning, we relied on the AllSides Media Bias Chart (AllSides Media Bias Chart 2019). Based on online U.S. political content, this chart categorizes outlets into five categories (left, lean-left, centre, lean-right, right), which we collapsed into three: left-leaning, centre, and right-leaning. According to this classification, The New York Times is left-leaning, The Washington Times is right-leaning, and BBC News is a centre outlet.

To classify outlets by journalistic quality, we relied on the Ad Fontes Media Bias Chart (Application Version 2.7.2) (Interactive Media Bias Chart 2024). This chart has two axes: political leaning (left to right on the x axis, in numeric values from −42 to 42, without discrete categories) and journalistic quality (from less to more reliable on the y axis, in numeric values from 0 to 64, with demarcations into red, orange, yellow, and green). To avoid confounding with political leaning, we focused on the narrow middle section of this dimension (between −10 and 10). For journalistic quality, the Ad Fontes Media Bias Chart offers four sections: red (0–16), orange (16–24), yellow (24–40), and green (40–64). We used these cutoffs in our analysis, taking all outlets that fall within the green area as quality journalism and all outlets in the yellow area as of lower journalistic quality. According to this partition, the former category includes The Guardian, The New York Times, The LA Times, Forbes, The Wall Street Journal, and Al Jazeera, while the latter includes outlets such as The Mirror, The Daily Mail, Upworthy, and The New York Post (for the full list, see Supplementary Table 3). Note that we did not include outlets from the orange and red sections of lowest journalistic quality.
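A minimal sketch of this partition rule over the Ad Fontes chart coordinates; the function name and the example coordinates are hypothetical, while the cutoffs follow the sections described above.

```python
def classify_outlet(bias: float, reliability: float) -> str | None:
    """Apply the selection rule to Ad Fontes chart coordinates.

    bias: political leaning, -42 (left) to 42 (right)
    reliability: journalistic quality, 0 to 64
    Returns 'quality', 'lower quality', or None if excluded.
    """
    if abs(bias) > 10:         # keep only the narrow middle section
        return None
    if reliability >= 40:      # green area (40-64): quality journalism
        return "quality"
    if reliability >= 24:      # yellow area (24-40): lower quality
        return "lower quality"
    return None                # orange/red sections were excluded

# Hypothetical coordinates, for illustration only:
print(classify_outlet(-4.0, 46.5))   # -> 'quality'
print(classify_outlet(2.5, 31.0))    # -> 'lower quality'
```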

We make all code for the above analyses available and provide information about the datasets used in the Data Availability statement.

