
I built a semantic search engine for LessWrong

For a while now, I’ve wanted something like Connected Papers for LessWrong and the broader rationalist community. I built a working prototype that has, so far, been pretty useful to me. You can find it at: situate.info/rational

Usage

You give it a LessWrong or Substack URL and receive a list of the top 100 similar documents from the database.

Notes:

- It was last updated on 4/19, so newer posts won’t work.
- I haven’t yet published the list of scraped Substacks (I should add it to the website), but it includes Astral Codex Ten, Overcoming Bias, and a number of other rationalist-adjacent publications.

How it works

I scraped ~372k documents:

- LessWrong: 36,220 posts, 6,329 top-level shortforms, and 289,721 comments, filtered by karma ≥ 5.
- Substack: 40,289 posts across 173 publications.

Each document was split into ~360-token chunks (heading-aware, sliding window, or whole-doc for short pieces), embedded with Voyage-3.5 (1024-d), then averaged into a single doc vector. Similarity is computed via cosine distance, returning the top 100 matches per query.

What’s next

I’m excited to receive feedback and ideas from you all for further iteration.

I have a personal setup of this tool that allows open-ended custom queries (embedded on demand) and BGE reranking for slightly improved accuracy. At the moment this would be too expensive and complicated for me to run at scale, but I do plan to build a polished version of the ingestion → chunking & embedding → search pipeline that I can open-source for others to use.

A semantic search on LessWrong alone would probably be an easy project for the Lightcone team, so I expect that specific function to eventually become obsolete.
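The chunk → embed → average → cosine-search pipeline described under “How it works” can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the embedding call itself (Voyage-3.5) is omitted, and the sliding-window parameters (`chunk_size=360`, `overlap=60`) are my assumptions based on the ~360-token description.

```python
import numpy as np

def chunk_tokens(tokens, chunk_size=360, overlap=60):
    """Sliding-window chunking; short documents become a single chunk.
    Overlap value is an assumption -- the post only specifies ~360-token chunks."""
    if len(tokens) <= chunk_size:
        return [tokens]
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens) - overlap, step)]

def doc_vector(chunk_embeddings):
    """Average per-chunk embeddings into a single document vector."""
    return np.mean(np.asarray(chunk_embeddings), axis=0)

def top_k_cosine(query_vec, doc_matrix, k=100):
    """Return indices of the k documents most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q  # cosine similarity of each doc against the query
    return np.argsort(-sims)[:k]
```

In practice the doc vectors would be precomputed and stored, so a query only costs one embedding call plus a matrix-vector product over the ~372k-row doc matrix.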
However, I plan to keep ingesting information from many different corners of the internet, so I think building an effective third-party system is still useful.

In general, I’m interested in working on interfaces that improve A) individual navigation and synthesis of information and B) collective epistemics (or both at once). I see this as a promising direction, especially amid the many intense efforts to raise x-risk awareness, which are likely to get tangled up in complicated debates. I will be sharing more ideas soon.

I also think it’s really important that talented newcomers to AI safety “situate” new posts in relevant past context. This is a broad and rapidly changing field of thought, with no standardized curriculum to get people up to speed. Many of the cutting-edge discussions on this forum happen within a small bubble of longstanding members, and it takes a while before newcomers can comfortably follow along. I think it’s worth investing in speeding up that process for any potential gain in high-quality strategic thinking.

Any broad advice or direction is appreciated too!

