JCDL 2020 Watch List

The 2020 ACM/IEEE Joint Conference on Digital Libraries (JCDL2020) will be held as an all-virtual event instead of taking place in Wuhan, China. To avoid getting confused by the different time zones, I have prepared a list of the talks that I do not want to miss. So here is my personal watch list for JCDL 2020, including abstracts and links to preprints (in chronological order):

Generate FAIR Literature Surveys with Scholarly Knowledge Graphs. Allard Oelen, Mohamad Yaser Jaradeh, Markus Stocker and Sören Auer.

August 2: 08:30-10:00 (UTC+1)

Reviewing scientific literature is a cumbersome, time consuming but crucial activity in research. Leveraging a scholarly knowledge graph, we present a methodology and a system for comparing scholarly literature, in particular research contributions describing the addressed problem, utilized materials, employed methods and yielded results. The system can be used by researchers to quickly get familiar with existing work in a specific research domain (e.g., a concrete research question or hypothesis). Additionally, it can be used to publish literature surveys following the FAIR Data Principles. The methodology to create a research contribution comparison consists of multiple tasks, specifically: (a) finding similar contributions, (b) aligning contribution descriptions, (c) visualizing and finally (d) publishing the comparison. The methodology is implemented within the Open Research Knowledge Graph (ORKG), a scholarly infrastructure that enables researchers to collaboratively describe, find and compare research contributions. We evaluate the implementation using data extracted from published review articles. The evaluation also addresses the FAIRness of comparisons published with the ORKG.

HybridCite: A Hybrid Model for Context-Aware Citation Recommendation. Michael Färber and Ashwath Sampath.

August 2: 08:30-10:00 (UTC+1)

Citation recommendation systems aim to recommend citations for either a complete paper or a small portion of text called a citation context. The process of recommending citations for citation contexts is called local citation recommendation and is the focus of this paper. Firstly, we develop citation recommendation approaches based on embeddings, topic modeling, and information retrieval techniques. We combine, for the first time to the best of our knowledge, the best-performing algorithms into a semi-genetic hybrid recommender system for citation recommendation. We evaluate the single approaches and the hybrid approach offline based on several data sets, such as the Microsoft Academic Graph (MAG) and the MAG in combination with arXiv and ACL. We further conduct a user study for evaluating our approaches online. Our evaluation results show that a hybrid model containing embedding and information retrieval-based components outperforms its individual components and further algorithms by a large margin.

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language. Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke and Bela Gipp.

August 2: 10:30-12:00 (UTC+1)

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
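The cluster purity figures quoted in this abstract can be made concrete with a small sketch. The code below uses the standard textbook definition of purity (each cluster is credited with its most frequent class, and the credited documents are divided by the total); it is not taken from the paper, and the function name and toy labels are my own assumptions.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Return the purity of a clustering against known class labels.

    Textbook definition (an assumption here, not the paper's code):
    sum the size of the majority class in each cluster, then divide
    by the total number of documents.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each class label occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(class_labels)


# Toy data (hypothetical): six documents assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["math", "math", "cs", "cs", "cs", "physics"]
print(cluster_purity(clusters, labels))
```

Under this definition, purity can only stay equal or rise as clusters get smaller, and it reaches 1.0 when every document sits in its own cluster. That is one plausible reason why the figure for an unspecified number of clusters is so much higher than the one where the number of clusters equals the number of classes.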

An Authoritative Approach to Citation Classification. David Pride and Petr Knoth.

August 2: 17:30-19:00 (UTC+1)

The ability to understand not only that a piece of research has been cited, but why it has been cited has wide-ranging applications in the areas of research evaluation, in tracking the dissemination of new ideas and in better understanding research impact. There have been several studies that have collated datasets of citations annotated according to type using a class schema. These have favoured annotation by independent annotators and the datasets produced have been fairly small. We argue that authors themselves are in a primary position to answer the question of why something was cited. No previous study has, to our knowledge, undertaken such a large-scale survey of authors to ascertain their own personal reasons for citation. In this work, we introduce a new methodology for annotating citations and a significant new dataset of 11,233 citations annotated by 883 authors. This is the largest dataset of its type compiled to date, the first truly multi-disciplinary dataset and the only dataset annotated by authors. We also demonstrate the scalability of our data collection approach and perform a comparison between this new dataset and those gathered by two previous studies.

Using Graph-based Visualizations. Corinna Breitinger, Birkan Kolcu, Monique Meuschke, Norman Meuschke and Bela Gipp.

August 3: 10:30-12:00 (UTC+1)

Literature search and recommendation systems have traditionally focused on improving recommendation accuracy through new algorithmic approaches. Less research has focused on the crucial task of visualizing the retrieved results to the user. Today, the most common visualization for literature search and recommendation systems remains the ranked list. However, this format exhibits several shortcomings, especially for academic literature. We present an alternative visual interface for exploring the results of an academic literature retrieval system using a force-directed graph layout. The interactive information visualization techniques we describe allow for a higher resolution search and discovery space tailored to the unique feature-based similarity present among academic literature. RecVis, the visual interface we propose, supports academics in exploring the scientific literature beyond textual similarity alone, since it enables the rapid identification of other forms of similarity, including the similarity of citations, figures, and mathematical expressions.

Open Access 2007 - 2017: Country and University Level Perspective. Bikash Gyawali, Nancy Pontika and Petr Knoth.

August 3: 10:30-12:00 (UTC+1)

Each year the number of Open Access (OA) papers is gradually increasing. We carried out a study investigating 400 universities from 8 countries to examine: i) the total number of OA papers per country, ii) proportion of OA papers published by representative universities in each country classified into three tiers of research quality: high, middle and low, iii) how universities within the same country compare to each other and iv) the growth of OA papers in countries per year. We conclude that among the analysed countries the UK and USA rank first and second respectively, while Russia and India are positioned towards the bottom of the list. We observe no link between the proportion of OA papers published by authors at a university and the university ranking, with some universities in the middle university rank tier having a larger proportion of OA papers than those in the high tier.

Towards Knowledge Maintenance in Scientific Digital Libraries with the Keystone Framework. Yuanxi Fu and Jodi Schneider.

August 3: 17:30-19:00 (UTC+1)

Scientific digital libraries speed dissemination of scientific publications, but also the propagation of invalid or unreliable knowledge. Although many papers with known validity problems are highly cited, no auditing process is currently available to determine whether a citing paper’s findings fundamentally depend on invalid or unreliable knowledge. To address this, we introduce a new framework, the keystone framework, designed to identify when and how citing unreliable findings impacts a paper, using argumentation theory and citation context analysis. Through two pilot case studies, we demonstrate how the keystone framework can be applied to knowledge maintenance tasks for digital libraries, including addressing citations of a non-reproducible paper and identifying statements most needing validation in a high-impact paper. We identify roles for librarians, database maintainers, knowledgebase curators, and research software engineers in applying the framework to scientific digital libraries.

Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data. Soumya Banerjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick and Partha Pratim Das.

August 4: 06:30-08:00 (UTC+1)

The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields like computer science. Explicit categories could be helpful for more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-bio domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories (BACKGROUND, TECHNIQUE, OBSERVATION) for an abstract because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed, then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution to the automatic segmentation of abstracts, where the labeled data is sparse.

Mining Semantic Subspaces to Express Discipline-Specific Similarities. Janus Wawrzinek, Jose Maria Gonzalez Pinto and Wolf-Tilo Balke.

August 4: 10:30-12:00 (UTC+1)

Word embeddings enable state-of-the-art NLP workflows in important tasks including semantic similarity matching, NER, question answering, and document classification. Recently, the biomedical field has also started to use word embeddings to provide new access paths for a better understanding of pharmaceutical entities and their relationships, as well as to predict certain chemical properties. The central idea is to gain access to knowledge embedded, but not explicated, in biomedical literature. However, a core challenge is the interpretability of the underlying embeddings model. Previous work has attempted to interpret the semantics of dimensions in word embeddings models to ease model interpretation when applied to semantic similarity tasks. To do so, the original embedding space is transformed to a sparse or a more condensed space, which then has to be interpreted in an exploratory (and hence time-consuming) fashion. However, little has been done to assess in real time whether specific user-provided semantics are actually reflected in the original embedding space. We solve this problem by extracting a semantic subspace from large embedding spaces that better fits the query semantics defined by a user. Our method builds on least-angle regression to rank dimensions according to given semantics properly, i.e., to uncover a subspace to ease both interpretation and exploration of the embedding space. We compare our methodology to querying the original space as well as to several other recent approaches and show that our method consistently outperforms all competitors.

For all accepted papers, please see the JCDL website: https://2020.jcdl.org/AcceptedPapers.html
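All session slots in this list are given in UTC+1. To avoid doing the offset arithmetic by hand, here is a minimal sketch that converts a slot to another fixed UTC offset using only the Python standard library. The function name and the example target offset (UTC+8, the offset used in Wuhan) are my own assumptions for illustration; for zones with daylight saving time, a named zone from the `zoneinfo` module would be the safer choice.

```python
from datetime import datetime, timedelta, timezone

# The offset in which the session times of this list are stated.
SCHEDULE_TZ = timezone(timedelta(hours=1))


def session_in_zone(month, day, start, end, target_tz, year=2020):
    """Convert a session slot given in UTC+1 to target_tz.

    start and end are "HH:MM" strings as they appear in the list.
    Returns a (start, end) tuple of timezone-aware datetimes.
    """
    def convert(hhmm):
        hour, minute = map(int, hhmm.split(":"))
        stated = datetime(year, month, day, hour, minute, tzinfo=SCHEDULE_TZ)
        return stated.astimezone(target_tz)

    return convert(start), convert(end)


# Example: the first slot of the list, "August 2: 08:30-10:00 (UTC+1)",
# viewed from a fixed UTC+8 offset (hypothetical viewer location).
utc_plus_8 = timezone(timedelta(hours=8))
slot_start, slot_end = session_in_zone(8, 2, "08:30", "10:00", utc_plus_8)
print(slot_start.isoformat(), "to", slot_end.isoformat())
```

Passing `timezone.utc` as `target_tz` gives the slot in plain UTC, which is handy for pasting into a calendar application.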

Workshops

Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2020)

My papers

To be updated