Wikipedia corpus file

To support our traffic classification based on DNS domain names, we built a corpus of words and their frequencies from the 2012 English Wikipedia.

Installation and usage

$ sudo pip install wikiwords
$ python
>>> import wikiwords
>>> wikiwords.freq("monty")
6.348454761413523e-06
>>> wikiwords.occ("python")
18972
>>> wikiwords.freq("no such word", lambda x: 1./len(x))
0.08333333333333333
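The last call in the session above suggests that the optional second argument of `freq` is a callable applied to words that are missing from the corpus: the 12-character string "no such word" yields 1/12. The following is a minimal sketch of that behaviour, not the library's actual implementation. The counts in `TOY_COUNTS`, the parameter name `fallback`, and the return value of 0.0 for unknown words without a fallback are all assumptions made for illustration.

```python
# Toy occurrence counts (invented numbers, NOT the real Wikipedia corpus).
TOY_COUNTS = {"monty": 3, "python": 9, "the": 88}
TOTAL = sum(TOY_COUNTS.values())  # 100 tokens in this toy corpus


def occ(word):
    """Return how often `word` occurs in the toy corpus (0 if unseen)."""
    return TOY_COUNTS.get(word, 0)


def freq(word, fallback=None):
    """Return the relative frequency of `word`.

    If the word is unseen and a `fallback` callable is given (assumed
    semantics), the fallback is called with the word instead.
    """
    count = occ(word)
    if count == 0 and fallback is not None:
        return fallback(word)
    return count / TOTAL


print(freq("python"))
print(freq("no such word", lambda w: 1.0 / len(w)))
```

A length-based fallback like the one shown assigns lower probability to longer unknown strings, which is useful when the frequencies feed a segmentation model.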

Example application

This data can be used, for example, for word segmentation in domain names or for other natural language processing applications:

$ python
>>> import segment
>>> segment.segment("campaignmonitor")
['campaign', 'monitor']
>>> segment.segment("officesnapshots")
['office', 'snapshots']
>>> segment.segment("nauticaldictionary")
['nautical', 'dictionary']
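The internals of the `segment` module are not shown here. As a rough sketch of how word frequencies can drive segmentation, the example below picks the split that maximises the product of unigram probabilities, using dynamic programming. The function names `segment_words` and `word_prob`, the toy probabilities in `TOY_FREQ`, and the penalty for unknown words are all hypothetical. In practice the probabilities could come from `wikiwords.freq`.

```python
import math
from functools import lru_cache

# Toy unigram probabilities (made-up values for illustration only).
TOY_FREQ = {
    "campaign": 2e-5,
    "monitor": 1e-5,
    "camp": 3e-5,
    "a": 2e-2,
    "on": 5e-3,
    "it": 4e-3,
    "or": 3e-3,
}


def word_prob(word):
    """Return a toy probability; unknown words are penalised by length."""
    if word in TOY_FREQ:
        return TOY_FREQ[word]
    # Assumed fallback: each extra character costs a factor of 10.
    return 1e-10 / 10 ** len(word)


def segment_words(text, max_word_len=20):
    """Split `text` into the most probable word sequence (unigram model)."""

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (log-probability, tuple of words) for text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            score = math.log(word_prob(word)) + rest_score
            candidates.append((score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(segment_words("campaignmonitor"))  # ['campaign', 'monitor']
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied, and memoising `best` keeps the search roughly quadratic in the length of the input.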

References