Ben Lee
bcgl.bsky.social
Assistant Professor @ the University of Washington iSchool | formerly an Innovator in Residence @ Library of Congress | essays in WIRED, Gawker, The New Republic, Current Affairs, etc.

🌐 www.bcglee.com
4/ What does visual search do? Here’s a visual search for “redacted documents”
November 18, 2025 at 8:19 PM
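The visual search step can be sketched as nearest-neighbor lookup in CLIP's shared embedding space: the text query is embedded and compared against the image embeddings of rendered pages. A minimal sketch using NumPy with random placeholder vectors standing in for the real CLIP embeddings (GovScape itself serves this lookup through FAISS):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for CLIP embeddings: in GovScape these would be
# the CLIP image embeddings of rendered PDF pages and the CLIP text
# embedding of a query like "redacted documents".
page_embeddings = rng.normal(size=(1000, 512)).astype("float32")
query_embedding = rng.normal(size=(512,)).astype("float32")

# Normalize so the inner product equals cosine similarity, which is
# how CLIP embeddings are typically compared.
page_embeddings /= np.linalg.norm(page_embeddings, axis=1, keepdims=True)
query_embedding /= np.linalg.norm(query_embedding)

# Rank pages by similarity to the query and take the top 5.
scores = page_embeddings @ query_embedding
top5 = np.argsort(-scores)[:5]
print(top5)
```

With real embeddings, the top-ranked pages would be those whose rendered images CLIP places closest to the query text.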
3/ The full GovScape architecture is detailed in this figure, showing how the client interacts with the server, databases, and indices. We use FAISS indices for both the BGE text embeddings and the CLIP embeddings, and SQLite FTS5 for keyword indexing.
November 18, 2025 at 8:19 PM
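The keyword-indexing side of this architecture can be sketched with Python's standard-library `sqlite3` module (the table name, columns, and sample documents below are illustrative, not GovScape's actual schema; the vector side is handled separately by the FAISS indices):

```python
import sqlite3

# Build an in-memory FTS5 full-text index over extracted page text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(pdf_id, page_text)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("doc-001", "memorandum regarding budget appropriations"),
        ("doc-002", "environmental impact statement, heavily redacted"),
        ("doc-003", "annual report on fisheries management"),
    ],
)

# FTS5 MATCH query, ranked by the built-in BM25 ordering ("rank").
hits = conn.execute(
    "SELECT pdf_id FROM pages WHERE pages MATCH ? ORDER BY rank",
    ("redacted",),
).fetchall()
print(hits)
```

In a hybrid setup like the one in the figure, keyword hits from FTS5 and nearest neighbors from FAISS are retrieved independently and merged by the server before results reach the client.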
2/ The pre-processing pipeline ingests PDFs, renders them, generates CLIP and BGE embeddings of individual pages, and indexes the text. The total compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500. Our code is available at: github.com/bcglee/govsc....
November 18, 2025 at 8:19 PM
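The per-PDF stages described above can be sketched schematically. The render and embed functions below are placeholders (the real pipeline in the linked repo rasterizes pages and embeds them with CLIP for images and BGE for text); only the flow of data through the stages is what this sketch shows:

```python
from dataclasses import dataclass

@dataclass
class PageRecord:
    pdf_id: str
    page_num: int
    image_embedding: list  # stand-in for the CLIP embedding of the page image
    text_embedding: list   # stand-in for the BGE embedding of the page text
    text: str              # raw text, destined for the keyword index

def render_pages(pdf_id):
    # Placeholder: pretend every PDF renders to two pages of text.
    return [(1, f"{pdf_id} page one text"), (2, f"{pdf_id} page two text")]

def embed_image(page_num):
    # Placeholder for a CLIP image embedding.
    return [float(page_num)] * 4

def embed_text(text):
    # Placeholder for a BGE text embedding.
    return [float(len(text))] * 4

def preprocess(pdf_id):
    # One pass of the pipeline: render, embed, collect records for indexing.
    records = []
    for page_num, text in render_pages(pdf_id):
        records.append(
            PageRecord(pdf_id, page_num,
                       embed_image(page_num), embed_text(text), text)
        )
    return records

records = preprocess("doc-001")
print(len(records))
```

Run at scale, each `PageRecord` would feed the FAISS indices (the two embeddings) and the keyword index (the text).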
1/ GovScape is built on top of the End of Term Web Archive (eotarchive.org) and currently contains all renderable PDFs (50 pages or fewer) from the 2020 crawl, documenting the first Trump administration. An overview of GovScape’s search functionality can be found in this diagram.
November 18, 2025 at 8:19 PM
The public demo (digital-collections-explorer.com) enables search across more than 500,000 map images from the Library of Congress's API. For example, search for "tattered and worn map".
July 2, 2025 at 8:56 PM
With our Digital Collections Explorer, a collection steward can spin up a local viewer with just a few lines of code. An overview of the Digital Collections Explorer architecture can be found in this overview figure.

The full codebase is available here: github.com/hinxcode/dig...
July 2, 2025 at 8:56 PM
Today, I’ll be sharing from the public symposium, “AI and the Future of Holocaust Research and Memory,” hosted at @ischool.uw.edu on UW’s campus in Seattle. Wonderful to be here with such incredible colleagues across the world and across disciplines.
May 20, 2025 at 5:01 PM
In consultation with the Geography and Map Division at the Library of Congress, we demonstrate the utility of these embeddings for a range of search & discovery tasks, including natural language search, reverse image search, and multimodal search, like this one:
December 3, 2024 at 8:29 PM
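Because CLIP places images and text in one shared embedding space, a multimodal query can be sketched as combining an image embedding and a text embedding (one common approach: average the normalized vectors and renormalize) before the nearest-neighbor lookup. The vectors below are random placeholders, not the paper's actual embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v)

# Placeholder CLIP embeddings for a map collection, plus an example
# query image and query text (in CLIP these share one embedding space).
maps = rng.normal(size=(500, 512))
maps /= np.linalg.norm(maps, axis=1, keepdims=True)
query_image = normalize(rng.normal(size=512))
query_text = normalize(rng.normal(size=512))

# Multimodal query: average the normalized image and text embeddings.
query = normalize(query_image + query_text)

# Reverse image search is the same lookup with query_image alone;
# natural language search uses query_text alone.
scores = maps @ query
best = int(np.argmax(scores))
print(best)
```

All three search modes named in the post reduce to this same cosine-similarity lookup; only how the query vector is formed differs.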
In this paper, we introduce CLIP embeddings for these 562,842 images, as well as a dataset of 10,504 map-caption pairs. Here's an overview of our search implementation, which returns results nearly instantaneously on an M3 MacBook Pro:
December 3, 2024 at 8:29 PM