How does Altmetric text mining work?


In most cases, Altmetric links online attention to research outputs by searching for URLs or unique identifiers across our tracked attention sources. For example, if a Facebook page or an X post includes a link to a published article, tracked book, data set, etc., we use the link to match the post with the publication.
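As a rough illustration of identifier-based matching, the sketch below pulls DOI-like strings out of the text of a post. It is not Altmetric's actual implementation: the pattern, the `find_dois` name, and the trailing-punctuation handling are all assumptions made for the example.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, a slash, then a suffix
# that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_dois(post_text):
    """Return DOI-like strings found in a post (hypothetical helper).

    Trailing punctuation is stripped because a link often sits at the
    end of a sentence.
    """
    return [match.rstrip(".,;:)") for match in DOI_PATTERN.findall(post_text)]


if __name__ == "__main__":
    print(find_dois("New study out today: https://doi.org/10.1000/xyz123."))
```

A post with no link or identifier yields an empty list, which is the case that text mining is meant to cover.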


While this approach works for most of our tracked sources, not all sources use links or scholarly identifiers when discussing research.


To accommodate this, three of our key sources - News, Policy, and Patents - rely on a combination of link matching (as described above) and text mining to pick up mentions of research.


Our text mining system evaluates the text in the news story, patent, or policy document and determines if it has the appropriate data to make a positive match to a research output.


The trick to making text mining work is being strict enough that we do not incorrectly match random text with random articles (leading to false positives), while allowing enough variation that we do not miss too many genuine mentions (leading to missed mentions). To strike this balance, our text mining technology requires a few basic pieces of metadata to create a successful match.


In order to match successfully, the text of one of these posts must include at least the name of an author, the title of a journal, and a publication date.


In the case of policy documents and patents, this information can usually be found in the References section or the footnotes of the document. In news articles, this information is often embedded in the body text of the article. If the publication date is not present, we look for research articles published within six weeks before or after the news article.
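The six-week fallback for news stories can be sketched as a simple date comparison. This is only an illustration: the function name is hypothetical, and treating the boundary day as inside the window is an assumption, since the text above does not say whether the limit is inclusive.

```python
from datetime import date, timedelta

# Six weeks on either side of the news story, as described above.
WINDOW = timedelta(weeks=6)


def within_window(news_date, publication_date, window=WINDOW):
    """Return True if the research output was published within `window`
    before or after the news story (boundary treated as inclusive; assumed)."""
    return abs(publication_date - news_date) <= window


if __name__ == "__main__":
    story = date(2024, 3, 1)
    print(within_window(story, date(2024, 2, 1)))  # 29 days earlier
    print(within_window(story, date(2024, 1, 1)))  # 60 days earlier
```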


The text that we extract from the news article, patent, or policy document is compared with metadata in the Crossref database. As such, text mining for this kind of metadata only works for items with DOIs that are registered in Crossref. However, if we do find a valid scholarly identifier (like an ISBN or a PubMed ID), we can typically resolve it to a Google Books page, PubMed landing page, etc.
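To make the comparison step more concrete, here is a minimal sketch of checking an extracted mention against a single metadata record using the three required fields (author name, journal title, publication date). The dictionary keys, the `normalize` helper, and the exact-match rules are assumptions for illustration only; they do not reflect the real Crossref schema or Altmetric's matching logic, which allows for more variation.

```python
from datetime import date


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def matches_record(mention, record):
    """Return True only if author, journal, and date all agree (hypothetical rule).

    `mention` holds fields extracted from a news story, patent, or policy
    document; `record` is a simplified, made-up stand-in for a metadata entry.
    """
    author_ok = normalize(mention["author"]) in {normalize(a) for a in record["authors"]}
    journal_ok = normalize(mention["journal"]) == normalize(record["journal"])
    date_ok = mention["published"] == record["published"]
    return author_ok and journal_ok and date_ok


if __name__ == "__main__":
    mention = {"author": "Smith", "journal": "Nature  Medicine", "published": date(2023, 5, 1)}
    record = {"authors": ["Smith", "Jones"], "journal": "nature medicine", "published": date(2023, 5, 1)}
    print(matches_record(mention, record))
```

Requiring all three fields to agree errs on the strict side, which reduces false positives at the cost of missing mentions that abbreviate the journal title or misspell an author name.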


What languages are supported?


Because text mining relies on references that match with metadata registered in Crossref, this approach can work on sources that are written in many languages. However, it is definitely most effective in English, not least because the majority of references in Crossref are presented in English. Nonetheless, Altmetric does successfully track many non-English sources using both text mining and link matching.