For our traffic classification based on DNS domain names, we created a corpus of words as found in the 2012 English Wikipedia.
Installation and usage
$ sudo pip install wikiwords
$ python
>>> import wikiwords
>>> wikiwords.freq("monty")
6.348454761413523e-06
>>> wikiwords.occ("python")
18972
>>> wikiwords.freq("no such word", lambda x: 1. / len(x))
0.08333333333333333
Example application
This data can be used, for instance, for word segmentation in domain names or for other applications in natural language processing. For example:
$ python
>>> import segment
>>> segment.segment("campaignmonitor")
>>> segment.segment("officesnapshots")
>>> segment.segment("nauticaldictionary")
['nautical', 'dictionary']
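The following is a minimal sketch of how unigram frequencies can drive such a segmentation. It is not the implementation behind the segment module shown above. It assumes a small, made-up frequency table (TOY_FREQ) in place of wikiwords.freq, and a hypothetical length-based penalty for unknown words. The function picks the split whose words have the highest product of frequencies, computed with memoized recursion over string positions.

```python
import math
from functools import lru_cache

# Toy unigram frequencies standing in for wikiwords.freq.
# The values are invented for illustration only.
TOY_FREQ = {
    "nautical": 1e-5,
    "dictionary": 2e-5,
    "campaign": 5e-5,
    "monitor": 1e-5,
    "on": 5e-3,
    "a": 2e-2,
}


def word_freq(word):
    """Return the toy frequency of a word.

    Unknown words get a penalty that shrinks with their length
    (an assumption of this sketch, not taken from wikiwords).
    """
    return TOY_FREQ.get(word, 1e-10 / 10 ** len(word))


def segment(text, max_word_len=20):
    """Split text into the word sequence with the highest frequency product."""

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (log-probability, words) for the suffix text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            tail_score, tail_words = best(end)
            score = math.log(word_freq(word)) + tail_score
            candidates.append((score, (word,) + tail_words))
        return max(candidates)

    return list(best(0)[1])


print(segment("nauticaldictionary"))
```

With the toy table above, the call prints ['nautical', 'dictionary']. To use real data, word_freq could be replaced by a call to wikiwords.freq with a fallback function for unknown words, as in the usage example earlier.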