BookHub | Self-Hosted Book Search with VirusTotal and gVisor

BookHub

Search, Scan and Convert Books Without Trusting a Single File

I read on an e-reader, and getting books in EPUB usually means visiting sites where half the buttons are ads and every download is a lottery. Doing that on my main PC, on the same network as everything else I run at home, never felt right.

So, together with Claude, I built BookHub: a self-hosted PWA that runs on a dedicated VM in my homelab and treats every file as hostile until proven otherwise.

The search side is the easy part. It queries Anna's Archive, Libgen and the Internet Archive at the same time, filters to EPUB/PDF, deduplicates the results and shows them with covers, sorted by size, smallest first. There was a VK provider too, but VK killed the API it depended on, so that one is retired.
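
The merge step can be sketched roughly like this. It is a minimal illustration, not BookHub's actual code: the `Result` fields and the dedup key (title, author, format) are assumptions made for the example.

```python
from dataclasses import dataclass

ALLOWED_FORMATS = {"epub", "pdf"}


@dataclass(frozen=True)
class Result:
    # Hypothetical result shape; the real providers may expose other fields.
    title: str
    author: str
    fmt: str
    size_bytes: int
    source: str


def merge_results(*provider_results):
    """Filter to EPUB/PDF, deduplicate, and sort smallest first."""
    best = {}
    for results in provider_results:
        for r in results:
            fmt = r.fmt.lower()
            if fmt not in ALLOWED_FORMATS:
                continue
            # Assumed dedup key: same title, author and format.
            key = (r.title.casefold().strip(), r.author.casefold().strip(), fmt)
            # Keep the smallest copy of each duplicate.
            if key not in best or r.size_bytes < best[key].size_bytes:
                best[key] = r
    return sorted(best.values(), key=lambda r: r.size_bytes)
```

Each provider would hand in its own list, so a dead source just contributes an empty one.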

archive.org was the nice surprise: clean public API, no Cloudflare in the way, and a good stock of Portuguese titles. When an item has several files, the app picks the smallest, so a 1 MB reflowable EPUB wins over a 137 MB scan.
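
Picking the smallest file could look something like the sketch below. It assumes a list of file entries shaped like the `files` array of the archive.org metadata API (a `name` and a `size` given as a string); the function name and the extension filter are made up for the example.

```python
def pick_smallest_file(files, extensions=(".epub", ".pdf")):
    """Return the name of the smallest matching file, or None if there is none."""
    candidates = []
    for entry in files:
        name = entry.get("name", "")
        if not name.lower().endswith(extensions):
            continue
        try:
            size = int(entry["size"])
        except (KeyError, TypeError, ValueError):
            continue  # skip entries without a usable size
        candidates.append((size, name))
    if not candidates:
        return None
    # Tuples compare by size first, so min() yields the smallest file.
    return min(candidates)[1]
```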

When I pick a result, the file never touches my browser directly:

  • the server downloads it into a quarantine folder
  • verifies the format (zip-bomb and polyglot checks, nothing gets extracted)
  • hashes it and sends it to VirusTotal
  • only a fresh "clean" verdict gets served; malicious, suspicious or unverifiable files are deleted
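
The format and hashing steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the app's real checks: the thresholds are invented, the magic-byte sniff is only the first step of a polyglot check, and the sizes come from the ZIP central directory, which a hostile archive can falsify.

```python
import hashlib
import zipfile

# Assumed limits for the example; tune to taste.
MAX_UNCOMPRESSED = 500 * 1024 * 1024
MAX_RATIO = 100
MAX_ENTRIES = 10_000


def sniff_format(path):
    """Classify a file by its leading magic bytes, without parsing it."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # EPUB is a ZIP container
    return None


def looks_like_zip_bomb(path):
    """Inspect declared sizes in the central directory; nothing is extracted."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
    if len(infos) > MAX_ENTRIES:
        return True
    total = sum(i.file_size for i in infos)
    compressed = sum(i.compress_size for i in infos)
    if total > MAX_UNCOMPRESSED:
        return True
    return compressed > 0 and total / compressed > MAX_RATIO


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large downloads never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The SHA-256 is what a VirusTotal lookup would be keyed on.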

Files are temporary too: a TTL cleaner sweeps everything after 60 minutes.
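
A TTL sweep of this kind fits in a few lines. The sketch below assumes a flat folder and uses file modification time as the age; `sweep_expired` is a hypothetical name, and the real cleaner may work differently.

```python
import time
from pathlib import Path

TTL_SECONDS = 60 * 60  # 60 minutes, as described above


def sweep_expired(folder, now=None, ttl=TTL_SECONDS):
    """Delete files older than the TTL and return their names."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(folder).iterdir():
        if path.is_file() and now - path.stat().st_mtime > ttl:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Run it from a periodic task (every few minutes is plenty) and pass `now` explicitly in tests.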

Real use taught me one thing here: old books carry VirusTotal verdicts that are clean but ancient, and the app was deleting them as unverified. Now a clean verdict counts for up to 5 years, and there's a Re-scan button for the rest.
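
The verdict rule can be sketched as a small pure function. It assumes inputs shaped like the `last_analysis_stats` and `last_analysis_date` attributes of a VirusTotal v3 file report; the label names and the 365-day year are choices made for the example.

```python
from datetime import datetime, timedelta, timezone

MAX_VERDICT_AGE = timedelta(days=5 * 365)  # "up to 5 years"


def classify(stats, last_analysis_ts, now=None):
    """Map a report to 'clean', 'stale', 'malicious', 'suspicious' or 'unverified'."""
    now = now or datetime.now(timezone.utc)
    if stats is None or last_analysis_ts is None:
        return "unverified"  # no report at all
    if stats.get("malicious", 0) > 0:
        return "malicious"
    if stats.get("suspicious", 0) > 0:
        return "suspicious"
    analysed = datetime.fromtimestamp(last_analysis_ts, tz=timezone.utc)
    if now - analysed > MAX_VERDICT_AGE:
        return "stale"  # clean, but too old: offer a re-scan
    return "clean"


def should_serve(verdict):
    """Only a fresh clean verdict is ever served."""
    return verdict == "clean"
```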

bookhub1.png

There is also a PDF-to-EPUB converter (Calibre + OCR), and this is where the paranoia starts for real. Parsing an untrusted PDF is the most dangerous thing the app does, so each conversion runs in a disposable gVisor (runsc) container: no network, read-only, only that job's folder mounted, no secrets in the environment. gVisor puts a user-space kernel between the container and the real one, so even if a malicious PDF exploits Calibre, it lands in a sandbox with nothing to steal and no way out. The app never touches the raw Docker socket to launch these workers either: that goes through a socket proxy that only allows creating and starting containers, with everything else denied by default.

The same page also converts images to 8-bit grayscale BMP, the format my e-reader wants for screensavers.
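
For illustration, the conversion worker's constraints map onto `docker run` flags roughly as below. This is a CLI-equivalent sketch, not how the app launches workers (that goes through the Docker API behind the proxy). The image name and the `/job` paths are hypothetical; `ebook-convert` is Calibre's command-line converter.

```python
from pathlib import PurePosixPath


def worker_command(job_dir, image="bookhub-converter:latest"):
    """Build the argv for one disposable, sandboxed conversion."""
    if not PurePosixPath(job_dir).is_absolute():
        # A relative value would be treated as a named volume, not a folder.
        raise ValueError("job_dir must be an absolute path")
    return [
        "docker", "run", "--rm",         # disposable: removed on exit
        "--runtime=runsc",               # gVisor's user-space kernel
        "--network=none",                # no network at all
        "--read-only",                   # read-only root filesystem
        "--volume", f"{job_dir}:/job",   # only this job's folder
        image,                           # hypothetical image name
        "ebook-convert", "/job/input.pdf", "/job/output.epub",
    ]
```

No `--env` flags are passed, so nothing from the host environment reaches the worker.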

bookhub3.png
bookhub4.png

The main app is containerized with the same mindset: non-root UID, read-only filesystem, /tmp mounted noexec, every Linux capability dropped, no-new-privileges, capped at 2 GB of RAM and 256 processes, bound to loopback only. If something breaks out of the app code, there isn't much to grab inside the container.
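
One way to keep that list honest is to audit what Docker actually applied. The sketch below checks a dict shaped like one element of `docker inspect` output; the field names follow the Docker Engine inspect format as I understand it, and the function itself is hypothetical.

```python
def audit_container(inspect):
    """Return a list of hardening problems; an empty list means all checks pass."""
    host = inspect.get("HostConfig") or {}
    user = (inspect.get("Config") or {}).get("User") or ""
    problems = []

    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        problems.append("runs as root")
    if not host.get("ReadonlyRootfs"):
        problems.append("root filesystem is writable")
    tmp_opts = (host.get("Tmpfs") or {}).get("/tmp", "")
    if "noexec" not in tmp_opts.split(","):
        problems.append("/tmp is not a noexec tmpfs")
    if "ALL" not in [c.upper() for c in host.get("CapDrop") or []]:
        problems.append("capabilities not fully dropped")
    if not any(o.startswith("no-new-privileges") and not o.endswith("false")
               for o in host.get("SecurityOpt") or []):
        problems.append("no-new-privileges not set")
    if not 0 < (host.get("Memory") or 0) <= 2 * 1024**3:
        problems.append("memory not capped at 2 GB")
    if not 0 < (host.get("PidsLimit") or 0) <= 256:
        problems.append("process count not capped at 256")
    for port, bindings in (host.get("PortBindings") or {}).items():
        for binding in bindings or []:
            if binding.get("HostIp") != "127.0.0.1":
                problems.append(f"{port} is not bound to loopback")
    return problems
```

Feed it `json.loads(...)[0]` from `docker inspect <container>` and fail a deploy if the list is non-empty.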

The layer I enjoyed building most was the network. The VM sits on its own VLAN, and OPNsense blocks all RFC1918 egress: even if the whole thing gets compromised, it can't reach Proxmox, the NAS, or anything else on the LAN. gVisor and the VLAN are separate walls: one stops a parser escaping to the kernel, the other stops the app moving through the network. You need both. Access from outside goes through a Cloudflare Tunnel with an identity gate before the app's own login, and the app itself went through a security audit pass after the build: CSP and security headers, generic errors to the client, API docs off in production.
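
The egress rule itself lives in OPNsense, but the three ranges it covers are easy to spell out. The sketch below mirrors the rule in Python purely to show which destinations count as RFC1918; it is not part of the firewall config.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """True if the address falls in a private range the VLAN may not reach."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918)
```

Note that 172.16.0.0/12 ends at 172.31.255.255, so 172.32.0.1 is a public address.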

Stack: FastAPI, vanilla JS PWA, SQLite. No framework, no build step.

It comes with limits: files over 32 MB can't be scanned on VirusTotal's free tier, so they don't download at all (for archive.org the app at least hands you the direct public link instead of just failing). Calibre reflows text well but won't rebuild tables or multi-column layouts. And Libgen mirrors die every other week, so the app fails soft and leans on whichever source is alive.
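
The size rule amounts to a small decision made before any download starts. A minimal sketch, where the function name, the source label and the returned dict shape are all invented for the example:

```python
VT_MAX_BYTES = 32 * 1024 * 1024  # the 32 MB scan limit mentioned above


def download_plan(size_bytes, source, public_url=None):
    """Decide whether to download, hand over a link, or refuse."""
    if size_bytes <= VT_MAX_BYTES:
        return {"action": "download"}
    if source == "archive.org" and public_url:
        # Too big to scan: point the user at the public file instead.
        return {"action": "link", "url": public_url}
    return {"action": "refuse", "reason": "too large to scan"}
```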

One last thing: these sources distribute copyrighted material, so this stays private, behind the tunnel, for a few users. The responsibility is on whoever runs it.

My e-reader gets clean EPUBs, and my main PC never touches those sites again.