wrote a small http server in c++ that searches youtube video transcripts stored in sqlite and the binary is 2.1MB


i work at a robotics company and we have about 160 youtube videos. recorded design reviews, firmware walkthroughs, test procedure demos, safety training recordings, conference presentations. they're all in a shared google drive folder as links and the only way to find anything is scrolling through filenames like "DR_2024_03_14.txt" which tells you literally nothing about what was discussed.

i wanted a fast search tool with minimal dependencies so i wrote it in c++.

the server uses cpp-httplib which is a single header file for an http server. one GET endpoint for the search query, one for serving the static html page. the html is embedded as a string literal in the source so the binary is completely self-contained. no external files to deploy.
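
a rough sketch of the embedded-page idea, not the actual source from this project. the page markup, the `/search` route, the `q` parameter name, and the port are all assumptions made up for illustration. it needs C++17 for `std::string_view`. the cpp-httplib wiring is left as a comment so the snippet compiles with just the standard library.

```cpp
#include <string>
#include <string_view>

// hypothetical page: the real html from the project is not shown in the post.
// a raw string literal keeps the markup readable and compiles it into the binary.
constexpr std::string_view kIndexHtml = R"html(<!doctype html>
<html>
<head><meta charset="utf-8"><title>transcript search</title></head>
<body>
  <form action="/search" method="get">
    <input type="text" name="q" placeholder="search transcripts">
    <button type="submit">search</button>
  </form>
</body>
</html>
)html";

// returns the embedded page as a std::string, ready to hand to the http library
inline std::string index_page() {
    return std::string(kIndexHtml);
}

// with cpp-httplib the wiring would be roughly (assumed route and port):
//   httplib::Server svr;
//   svr.Get("/", [](const httplib::Request&, httplib::Response& res) {
//       res.set_content(index_page(), "text/html");
//   });
//   svr.listen("0.0.0.0", 8080);
```

since the page lives in the binary's read-only data, there is no file path to get wrong at deploy time.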

the database is sqlite with FTS5 for full text search. i have a metadata table with video_id, title, date, presenter, tags, and youtube_url, and an FTS5 virtual table that indexes the transcript text. the search runs a MATCH query on the FTS5 table, uses snippet() to produce an excerpt with the match highlighted, and joins back to the metadata table for the video info.
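
a sketch of what that schema and query might look like, kept as string constants so it compiles without sqlite installed. the table names (`videos`, `transcripts`), the `body` column, the `<mark>` tags, the 16-token excerpt length, and the LIMIT are assumptions. only the metadata column names come from the description above. `to_fts5_query` is a hypothetical helper: FTS5 has its own query syntax, so raw user input containing characters like `-` or `:` can produce a syntax error unless each word is wrapped in double quotes.

```cpp
#include <sstream>
#include <string>

// assumed schema: only the metadata column names are taken from the post
constexpr const char* kSchemaSql = R"sql(
CREATE TABLE IF NOT EXISTS videos (
    video_id    TEXT PRIMARY KEY,
    title       TEXT,
    date        TEXT,
    presenter   TEXT,
    tags        TEXT,
    youtube_url TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS transcripts
    USING fts5(body, video_id UNINDEXED);
)sql";

// snippet(table, column index, open tag, close tag, ellipsis, max tokens);
// column 0 is `body` in the assumed fts5 table above
constexpr const char* kSearchSql = R"sql(
SELECT v.title, v.date, v.presenter, v.youtube_url,
       snippet(transcripts, 0, '<mark>', '</mark>', '...', 16) AS excerpt
FROM transcripts
JOIN videos AS v ON v.video_id = transcripts.video_id
WHERE transcripts MATCH ?1
ORDER BY rank
LIMIT 20;
)sql";

// hypothetical helper: wraps each whitespace-separated word in double quotes
// (doubling any embedded quote) so fts5 treats it as a plain string, not syntax.
// fts5 combines the space-separated strings with an implicit AND.
inline std::string to_fts5_query(const std::string& input) {
    std::istringstream words(input);
    std::string word, out;
    while (words >> word) {
        if (!out.empty()) out += ' ';
        out += '"';
        for (char c : word) {
            if (c == '"') out += '"';  // escape by doubling
            out += c;
        }
        out += '"';
    }
    return out;
}
```

the result of `to_fts5_query` would be bound to `?1` with `sqlite3_bind_text`. an empty result means the input was blank, and the caller should skip the query in that case because an empty MATCH expression is an error.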

for pulling the transcripts i wrote a separate ingestion tool. it reads urls from a text file, calls a transcript api for each one using libcurl, parses the json with nlohmann/json, and inserts into sqlite. the ingestion tool is about 150 lines. the server is about 200 lines.
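
one small piece of an ingestion tool like this is turning each url into a video id for the metadata table. the post doesn't say how that is done, so `extract_video_id` below is a hypothetical helper, not the author's code. it assumes C++17 and only handles the `watch?v=` and `youtu.be/` url shapes. a url where `v=` is not the first query parameter would need extra handling.

```cpp
#include <optional>
#include <string>

// hypothetical helper: pulls the video id out of the two common youtube url shapes.
// returns std::nullopt when neither shape matches or the id is empty.
inline std::optional<std::string> extract_video_id(const std::string& url) {
    static const std::string markers[] = {"watch?v=", "youtu.be/"};
    for (const auto& marker : markers) {
        const auto pos = url.find(marker);
        if (pos == std::string::npos) continue;
        const auto start = pos + marker.size();
        // the id ends at the next delimiter, or at the end of the url
        const auto end = url.find_first_of("&?#/", start);
        std::string id = url.substr(
            start, end == std::string::npos ? end : end - start);
        if (!id.empty()) return id;
    }
    return std::nullopt;
}
```

returning an optional lets the tool log and skip a bad line in the url file instead of inserting a row with an empty key.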

the search query returns in under 2ms for 160 transcripts. i measured it with std::chrono out of habit and then realized it was pointless because the bottleneck is the network round trip, not the query. the FTS5 index makes the sqlite part basically free.
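
for reference, a measurement like that can be done with a tiny wrapper around `std::chrono::steady_clock`. this is a generic sketch, not the code used for the number above, and `time_ms` is a made-up name.

```cpp
#include <chrono>
#include <utility>

// runs fn once and returns the elapsed wall time in milliseconds.
// steady_clock is used because it never jumps backwards like system_clock can.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

usage would be something like `double ms = time_ms([&] { run_search(db, query); });` where `run_search` stands in for whatever function executes the prepared statement.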

the build uses cmake. three dependencies: cpp-httplib (header only, vendored), nlohmann/json (header only, vendored), and sqlite3 (system library). the final binary is 2.1MB statically linked. i copied it to our internal tools server along with the sqlite database file and that was the deployment. added a systemd unit file and it's been running for 3 months.

memory usage at idle is about 4MB. under load with a few concurrent searches it barely moves. the sqlite database with 160 transcripts is about 45MB on disk. the whole thing fits on basically any machine.

the team uses it before design reviews to search for what was discussed in previous reviews. the firmware engineers search for specific peripheral names to find the walkthrough where someone demonstrated the configuration. one of the test engineers told me she searches for failure mode descriptions from past test reviews which is a use case i hadn't thought of.

the part i like about this project is that it'll run unchanged for years. no package manager updates, no runtime version changes, no dependency hell. it's a binary and a database file. if the server dies i copy two files to a new machine and start it.

submitted by /u/straightedge23 to r/cpp