Tomorrow at the May 2012 DC Python meetup, I’m giving a talk on gensim, a Python framework for topic modeling that I use at work and on my own for semantic similarity comparisons. (I’ll post the slides and example code for the talk soon.) I’ve found gensim to be a useful and well-designed tool, and pretty much all credit for it goes to its creator, Radim Rehurek. Radim was kind enough to answer a few questions I sent him about gensim’s history and goals, and about his background and interests.
WB: Why did you create gensim?
RR: Consulting gig for a digital library project (Czech Digital Mathematics Library, dml.cz), some 3 years ago. It started off as a few loosely connected Python scripts to support the “show similar articles” functionality. We wanted to use some of the statistical methods, like latent semantic analysis. Originally, gensim only contained wrappers around existing Fortran libraries for SVD, like PROPACK and SVDPACK.
But there were issues with that, and it scaled badly (all documents in RAM), so I started looking for more scalable, online algorithms. Running these popular methods shouldn’t be so hard, I thought!
In the end, I developed new algorithms for these methods for gensim. The theoretical part of this research later turned into a part of my PhD thesis.