How to participate

Getting hold of the data

The data set can be obtained from Clarivate Analytics by sending an email to Jason Rollins. By using the data, you agree to the following license:

Data Use License (Clarivate Analytics)

While participating in the Web of Science comparative topic identification exercise, you will be provided with access to the Clarivate Analytics Web of Science comparative topic identification exercise dataset. You may access and use this dataset from March 1, 2017 through December 31, 2018 only for the exercise above, subject to the “Clarivate Analytics Terms”, including the 'Web of Science: Custom Data Set Product Terms' in the 'Product / Service Terms', available on our Terms of Business site. By accessing and/or using our data, you are legally bound by and hereby consent to these terms. If you do not agree to these terms, then you may not access or use our data. Any extension or further use of our data beyond December 31, 2018 is strictly prohibited unless you receive prior written permission from Clarivate Analytics.

How to submit your solution

You can submit your solution to the topic extraction challenge by sending the solution file in csv format to Theresa Velden. To be accepted, the solution file needs to be formatted as described below and accompanied by two additional files: one that describes the solution and one that documents how the solution was generated.

File 1: Solution in csv format

Please provide a file solution.csv with each row referring to a document in the data set, identified by the document's UT number (UT = Web of Science Unique Article Identifier). The second entry in a row specifies the topic the document has been assigned to. If applicable, a third entry specifies the strength of this assignment. For clarity, please include a header row with the column names in the file.



Note: If your solution includes the strength of assignment to a topic, please explain the permissible range of values and their interpretation in the documentation. If a solution allows for topic overlap and a document has been assigned to several topics, each of these assignments is to be listed on a separate line. If a document has not been assigned to any topic, the ClusterID is left empty.
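As a rough illustration of the layout described above, the following Python sketch writes a small solution file. The header names UT and Strength, the UT values and the strength range of 0 to 1 are assumptions made for this example only; ClusterID is the column name used in the note above.

```python
import csv
import io

# Hypothetical assignments as (UT, ClusterID, strength); the UT values are made up.
assignments = [
    ("WOS:000000000000001", "1", 0.8),
    ("WOS:000000000000001", "2", 0.2),  # overlapping topics: one line per assignment
    ("WOS:000000000000002", "1", 1.0),
    ("WOS:000000000000003", "", None),  # not assigned to any topic: ClusterID left empty
]


def write_solution(rows, handle):
    """Write a header row followed by one line per document-topic assignment."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["UT", "ClusterID", "Strength"])  # assumed column names
    for ut, cluster_id, strength in rows:
        writer.writerow([ut, cluster_id, "" if strength is None else strength])


# Written to an in-memory buffer here; pass an open file handle to create solution.csv.
buffer = io.StringIO()
write_solution(assignments, buffer)
solution_text = buffer.getvalue()
```

To produce the actual file, the same function could be called with `open("solution.csv", "w", newline="")` as the handle.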

File 2: Description of Solution

Please provide a file solution.txt with the following information:

  1. Preferred 2-letter acronym to label the solution
  2. Number of topics obtained
  3. Whether topics are overlapping or disjoint
  4. How strength of assignment of a document to a topic was calculated (if applicable)
  5. Contributors: who contributed to creating the solution
  6. Coverage: How many documents are in the union of all topics generated by your solution?

Note: The question of which documents are covered and not covered by a solution can be quite involved. Documents may be excluded during preprocessing, during data modeling, or even later in the process, for example due to assumptions about a reasonable topic size. If your solution covers less than 100% of the documents in the original data set, please share your insights on how many documents were excluded from your solution, at which step, and for what reasons.
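One possible way to compute the coverage figure is to count the distinct documents that have at least one non-empty topic assignment. The sketch below assumes a solution file with header columns named UT and ClusterID (the header names are an assumption for this example), and the total number of documents in the data set has to be supplied by you.

```python
import csv
import io


def coverage(solution_csv_text, total_documents):
    """Return (number of covered documents, share of the data set covered)."""
    reader = csv.DictReader(io.StringIO(solution_csv_text))
    # A document counts as covered if at least one of its rows has a ClusterID.
    covered = {row["UT"] for row in reader if row["ClusterID"].strip()}
    return len(covered), len(covered) / total_documents


# Toy input: document A sits in two topics, B in one, C is unassigned,
# and a fourth document was dropped before the solution file was written.
example = "UT,ClusterID,Strength\nA,1,0.8\nA,2,0.2\nB,1,1.0\nC,,\n"
covered_count, covered_share = coverage(example, total_documents=4)
```

In this toy example two of four documents are covered. Documents listed with an empty ClusterID and documents missing from the file altogether both count as not covered, so the step at which each was excluded still needs to be reported separately.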

File 3: Description of Approach

Please provide a file approach.txt that describes the approach you used, in particular:

  1. Data Pre-processing steps
  2. The data model
  3. Topic extraction algorithm
  4. Parameters and thresholds used

Further, please write a few paragraphs to describe the background for your approach, e.g. what considerations went into its design or selection of algorithms used, and for what purposes you are using its results.

Note: You can find an example of how to describe topic extraction approaches in a systematic manner in section 3 and table 2 of: Velden, Boyack, Gläser, Koopman, Scharnhorst & Wang. (forthcoming) "Comparison of Topic Extraction Approaches and Their Results" Special Issue of Scientometrics [preprint].


You are invited to present your solution to the topic extraction challenge and to participate in the discussion of how to compare and evaluate solutions through a variety of venues.

Call for Papers: Special session at ISSI 2017

We are planning a special session on the topic extraction challenge at the upcoming ISSI conference, to be held 16-20 October 2017 in Wuhan, China. Please note that the paper submission deadline for the conference is April 10, 2017. We invite you to contribute by submitting a paper on the comparison of topic extraction approaches and results. In particular, we encourage you to submit work-in-progress papers that present your solution to the topic extraction challenge using the Astro Data Set.

Topic Extraction Blog

We are looking for authors who would like to discuss their own topic extraction results for the Astro Data Set, share what they learned through the exercise about their own approach and how it compares to other approaches, discuss methods for comparing topic extraction approaches, or provide insights on the challenges of evaluating results given a multiplicity of potential ground truths and purposes of topic extraction. Please contact us if interested in contributing.

Mailing List

You are welcome to subscribe to our mailing list. This way you will be notified when new solutions to the topic extraction challenge get added to the website, or further opportunities to engage with the topic extraction challenge arise. To be added to the mailing list, please contact us, providing your name and email address.