Some of my projects from the last few years:

  • llm-datasets: A collection of datasets for large language model pretraining, including scripts for downloading, preprocessing, and sampling.
  • DFKI Chat: A research prototype of a chat-optimized LLM with retrieval augmentation.
  • Open Legal Data: Free Access to Legal Information.
  • Open Redact: Semi-automatic anonymization of documents.
  • Citolytics: Citation Analysis for Wikipedia Articles.
  • Arms Trade Visualization: An interactive visualization of EU arms trade.
  • Leaflet.Sim: Visualize moving elements on a Leaflet-based map.

Some of the language models that I published: