中贝叶斯分类中程导弹入ENVISIONSS源例子,汉语调节和测试

词云是个很逸事物。

中文词云代码调节和测试,中文调节和测试

词云是个很有趣的东西。

用jieba断词,小说文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,停用词很关键,需求持续调控补充。

from wordcloud import WordCloud
import jieba

f = open(u'mori.txt','r').read()
##cuttext=" ".join(jieba.cut(f))
cuttext= jieba.cut(f) 
final= [] 
stopwords=open(u'stopword.txt','r').read() 

for seg in cuttext:
    ##seg = seg.encode('utf-8')
    if seg[0] not  in ['0','1','2','3','4','5','6','7','8','9']:##忽略数字
        if seg not in stopwords:
            final.append( seg) ## 列表添加   

font=r"c:/Windows/Fonts/simsun.ttc"##中文显示必须加
wordcloud = WordCloud(font_path=font,background_color="white",width=1000, height=860, margin=2).generate(" ".join(final))

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

  wordcloud.to_file(‘test.png’)

效果图:

亚洲必赢官网 1

上边是词频总括排序,词长排序的代码。

##统计词频
freqD2 = {}
for word2 in final:
  freqD2[word2] = freqD2.get(word2, 0) + 1 

##按词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True) 
_2000=counter_list[0][1] + 1
print(_2000)##用于词长词频排序用
fp = open('sort.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

##按词长词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0])*_2000+x[1], reverse=True) 
fp = open('sortlen.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

排序代码很便利,也值得借鉴,Python是个好东西,强大,易重用。

 

词云是个很有趣的东西。
用jieba断词,小说文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,…

《机器学习实战》中贝叶斯分类中程导弹入揽胜极光SS源例子,机器学习实战

跟着书中代码往下写在此处卡住了,恐怕还会有别的同学也超越了那般的难题,记下来分享。

 

随着书中代码往下写在此处卡住了,思索到恐怕还会有别的同学也境遇了那样的标题,记下来分享。

用jieba断词,随笔文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,停用词很要紧,必要不停调控补充。

怎么设置feedparser?

按书中提供的网站直接设置feedparser会提示错误说未有setuptools,然后去找setuptools,官方的说教是windows最棒用ez_setup.py安装,小编确实下载不下去官方网址的那么些ez_etup.py,这一个帖子给出了缓解方案:

ez_setup.py

将以此文件一贯拷贝到C:\\python二七文件夹中,输入命令行:python
ez_setup.py install

下一场转到放feedparser安装文件的文本夹中,命令行输入:python setup.py
install

 

先嘲讽一下,相信超越4分之三网络朋友在此地卡住的根本原因是宏大的GFW,所以不管软件FQ依然肉身FQ的伴儿们估量是无论如何也看不到那篇博文的,不想往下看的请自觉运用FQ手艺。

from wordcloud import WordCloud
import jieba

f = open(u'mori.txt','r').read()
##cuttext=" ".join(jieba.cut(f))
cuttext= jieba.cut(f) 
final= [] 
stopwords=open(u'stopword.txt','r').read() 

for seg in cuttext:
    ##seg = seg.encode('utf-8')
    if seg[0] not  in ['0','1','2','3','4','5','6','7','8','9']:##忽略数字
        if seg not in stopwords:
            final.append( seg) ## 列表添加   

font=r"c:/Windows/Fonts/simsun.ttc"##中文显示必须加
wordcloud = WordCloud(font_path=font,background_color="white",width=1000, height=860, margin=2).generate(" ".join(final))

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

小编提供的讴歌MDXSS源链接“

书中小编的乐趣是的话自源
中的文章作为分类为一的稿子,以来自源
中的文章作为分类为0的稿子

为了能够跑通示例代码,能够找两可用的昂科拉SS源作为代替。

自家用的是那八个源:

NASA Image of the
Day:

Yahoo Sports – NBA – Houston Rockets
News:

也正是说,假若算法运转正确的话,全数来自于 nasa
的稿子将会被分类为壹,全数来自于yahoo sports的休斯顿火箭队(休斯敦 罗克ets)音信将会分类为0

 

 

  wordcloud.to_file(‘test.png’)

运用本身定义的中华VSS源,当程序运转到trainNB0(array(trainMat),array(trainClasses))时会报错,怎么做?

从书中小编的例证来看,小编利用的源中文章数量较多,len(ny[‘entries’])
为 100,作者找的多少个 CRUISERSS 源唯有十-1几个左右。

>>> import feedparser
>>>ny=feedparser.parse(”)
>>> ny[‘entries’]
>>> len(ny[‘entries’])
100

因为我的一个大切诺基SS源有100篇小说,所以他得以在代码中剔除了二16个“停用词”,随机选用20篇小说作为测试集,可是当大家接纳替代HummerH贰SS源时大家唯有十篇作品却要收取20篇作品作为测试集,那样明显是会出错的。只要本身调控下测试集的数码就能够让代码跑通;假若文章中的词太少,缩短剔除的“停用词”数量能够升高算法的准确度。

 

怎么设置feedparser?

按书中提供的网站直接设置feedparser会提醒错误说并未有setuptools,然后去找setuptools,官方的传教是windows最棒用ez_setup.py安装,笔者真正下载不下来官方网站的尤其ez_etup.py,那些帖子给出了消除方案:

ez_setup.py

将以此文件一贯拷贝到C:\\python二七文件夹中,输入命令行:python
ez_setup.py install

下一场转到放feedparser安装文件的文件夹中,命令行输入:python setup.py
install

 

效果图:

假设不想将出现频率排序最高的二十7个单词移除,该如何去掉“停用词”?

可以把要去掉的停用词存放到txt文件中,使用时读取(替代移除高频词的代码)。具体要求停用哪些词能够参照那里

以下代码想健康运维须求将停用词保存至stopword.txt中。

本身的txt中保留了以下单词,效果还不易:

a
中贝叶斯分类中程导弹入ENVISIONSS源例子,汉语调节和测试。about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    return [tok.lower() for tok in listOfTokens] 
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return listOfTokens

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb

作者提供的KoleosSS源链接“

书中作者的意趣是的话自源
中的小说作为分类为一的小说,以来自源
中的文章作为分类为0的小说

为了能够跑通示例代码,能够找两可用的PAJEROSS源作为代表。

本人用的是那三个源:

NASA Image of the
Day:

Yahoo Sports – NBA – Houston Rockets
News:

也正是说,假诺算法运营正确的话,全体来自于 nasa
的篇章将会被分门别类为一,全数来自于yahoo sports的休斯顿火箭队(休斯敦 罗克ets)情报将会分类为0

 

亚洲必赢官网 2

应用本身定义的本田UR-VSS源,当程序运营到trainNB0(array(trainMat),array(trainClasses))时会报错,如何做?

从书中作者的例子来看,我运用的源中文章数量较多,len(ny[‘entries’])
为 十0,小编找的多少个 奇骏SS 源唯有十-十八个左右。

>>> import feedparser
>>>ny=feedparser.parse(”)
>>> ny[‘entries’]
>>> len(ny[‘entries’])
100

因为小编的2个CRUISERSS源有拾0篇小说,所以她得以在代码中剔除了二十七个“停用词”,随机选择20篇文章作为测试集,但是当大家运用取代PAJEROSS源时大家唯有10篇文章却要抽出20篇作品作为测试集,那样鲜明是会出错的。只要自个儿调整下测试集的数额就足以让代码跑通;假若小说中的词太少,减弱剔除的“停用词”数量得以提升算法的准确度。

 

上面是词频计算排序,词长排序的代码。

跟着书中代码往下写在此处卡住了,只怕还会有别的同学也凌驾了那般的问…

设若不想将应运而生频率排序最高的三二十个单词移除,该怎么去掉“停用词”?

能够把要去掉的停用词存放到txt文件中,使用时读取(替代移除高频词的代码)。具体须求停用哪些词能够参见那里

以下代码想通常运维需求将停用词保存至stopword.txt中。

自家的txt中保留了以下单词,效果尚可:

a
about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
亚洲必赢官网 ,he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    return [tok.lower() for tok in listOfTokens] 
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return listOfTokens

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
##统计词频
freqD2 = {}
for word2 in final:
  freqD2[word2] = freqD2.get(word2, 0) + 1 

##按词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True) 
_2000=counter_list[0][1] + 1
print(_2000)##用于词长词频排序用
fp = open('sort.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

##按词长词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0])*_2000+x[1], reverse=True) 
fp = open('sortlen.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

排序代码很有益,也值得借鉴,Python是个好东西,壮大,易重用。

 

网站地图xml地图