Stack Exchange, the company that owns Stack Overflow, made a data dump available on http://archive.org. I downloaded the stackoverflow.com-Posts.7z file from https://archive.org/details/stackexchange. I wrote these scripts to get the posts by tag, extract the code snippets, and index the snippets by word.
My Ruby:
- ruby 2.2.1p85
My Gems:
- nokogiri (1.6.7)
- sqlite3 (1.3.11)
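Usage:
./create_tables.rb LANG TABLE
LANG is the tag to select posts by (for example, c). TABLE is the table to create: posts, snippets, or words, as in the example below.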
Example 1: Get the posts, snippets, and words used in posts tagged with c.
$ ./create_tables.rb c posts
$ ./create_tables.rb c snippets
$ ./create_tables.rb c words
- The first command creates the c_posts.db database file and a posts table.
- The second command creates the c_snippets.db database file and a snippets table.
- The third command creates the words and word_snippets tables in the c_snippets.db database.
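
The scripts themselves are not shown here, so the following is only a rough sketch of the idea behind the snippets and words steps: pull the code blocks out of a post body, then map each word to the snippets that contain it. It makes several assumptions. The helper names extract_snippets and index_by_word are hypothetical, a regular expression stands in for nokogiri so the sketch has no gem dependencies, and an in-memory Hash stands in for the sqlite3 tables. It relies on Stack Overflow post bodies being HTML with code blocks wrapped in pre and code tags.

```ruby
require 'set'

# The handful of HTML entities that commonly appear inside <code> blocks.
ENTITIES = {
  '&lt;' => '<', '&gt;' => '>', '&quot;' => '"', '&#39;' => "'", '&amp;' => '&'
}.freeze

# Hypothetical helper: return the text of every <pre><code> block in a post body.
# Assumption: a regex is good enough for a sketch; a real script would use an HTML parser.
def extract_snippets(body_html)
  body_html.scan(%r{<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>}m).map do |(code)|
    # Single-pass replacement, so "&amp;lt;" becomes "&lt;" rather than "<".
    code.gsub(/&(?:lt|gt|quot|#39|amp);/, ENTITIES)
  end
end

# Hypothetical helper: map each identifier-like word to the set of snippet ids
# (array positions here) whose text contains that word.
def index_by_word(snippets)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  snippets.each_with_index do |snippet, id|
    snippet.scan(/[A-Za-z_]\w*/).each { |word| index[word] << id }
  end
  index
end

# Made-up post body for illustration only.
body = '<p>Try this:</p><pre><code>int main(void) { return 0; }</code></pre>' \
       '<p>or</p><pre><code>printf(&quot;%d&quot;, main);</code></pre>'

snippets = extract_snippets(body)
index = index_by_word(snippets)

puts snippets
p index['main'].to_a # => [0, 1]
```

Looking up a word in the index gives the ids of the snippets that use it, which is presumably the kind of question the word_snippets table is meant to answer.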