前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >用python讲故事(中)

用python讲故事(中)

作者头像
哒呵呵
发布2018-08-06 17:25:43
6290
发布2018-08-06 17:25:43
举报
文章被收录于专栏:鸿的学习笔记鸿的学习笔记

What just happened?

  1. 读取一些文本并将其拆分成单词。 这样就定义了情感分析的上下文的粒度,所以如果你想使用不同的抽样策略,可以通过分割不同的分隔符来改变它的地方。 你也可以使用fancier tokenizer或lemmatizer代替“split on whitespace”策略。
  2. 将单词列表合并成文本块。 我们采用“分割和合并”策略是因为有两个原因:大多数任何由单词组成的文本都可以因此可靠地运行,并且当创建用于情绪分析的样本时,我们可以独立控制长度和步长 - 使得实验不同大小上下文窗口,不同粒度(即块大小),不同的采样分辨率(即步幅)变得容易。
  3. 作为人类,我们倾向于解释上下文而不是思考它...但是机器需要在上下文之间的样本来模仿这种行为。 为了确保我们能对合理稳定的上下文进行抽样,我们需要在合并的单词列表上滑动一个窗口,以生成一堆重叠的抽样。
  4. 将样本列表发送到indicoAPI进行打分,获取分数列表。

5. Assemble a corpus of data to validate subjective(human) interpretation

现在我们有了一个描绘故事大意的管道(pipeline),我们需要一个更好的测试。第一步是找到数据。为了验证Vonnegut的假设,我最初想给他描述的同样的故事打分。 但我只读了一次哈姆雷特,这是足够了。 Vonnegut的故事可能是原型小说,但对我来说,当我不记得这些故事的背景和事件顺序时,很难验证其性能。 等一下......他提到过灰姑娘的故事,大家都知道,对吧?

我在网上搜索到一个常规版本的灰姑娘的故事,但很快发现,神话会有几十种变化。 因为有了许多版本,不可能将我对灰姑娘故事的解释归因于某个全文或某个版本。我们想要权威的版本。

最后,想想“灰姑娘最流行的版本是什么?”...我绝对记得迪斯尼的灰姑娘的版本!电影剧本会比书面故事更好吗?

事实证明,电影对目前的任务有着很多有用的限制。 书面故事通常在许多环境描写,但电影是:

· time-boxedand consumed all in a single sitting/context

· sequencesof events are more memorable when they occur on-screen as audio-visual mediavs. written as text (for me, at least)

· havesimilar lengths

· everymovie has a script, either the original or transcriptions produced by fans(这个感觉原文更有味道)

不幸的是,我找不到一个在网上免费提供的好的脚本。 然而,粉丝们已经转录了许多其他电影,包括狮子王,阿拉丁,小美人鱼,睡美人等等:

· The web’s largest movie script resource

· Top Google hit for “Disney Movie Scripts”

Extend code to iterate over each story in a corpus

现在我们有了多个文本,我们需要抽象上面的简单代码来遍历一个文本文件的语料库。 数据帧是用于在此存储和操纵结果的良好数据结构。我们还需要添加一些干净/蒙版(cleaning/munging)代码,因为来自互联网的电影脚本可能是混乱的。

# define your corpus here as a list of text files
corpus = ["aladdin.txt",
         "lionking.txt",
         "mulan.txt",
         "hunchback.txt",
         "rescuersdownunder.txt",
         "sleepingbeauty.txt",
         "littlemermaid.txt"]
# New dict to hold data
d = {}
# Map names to input files on filesystem
root_fp = os.getcwd()
corpus_fp = os.path.join(root_fp, "texts")   # put your text files in ./texts
# print("Looking for input text files: '%s'" %corpus_fp)
for t in corpus:
    fp =os.path.join(corpus_fp, t)
    print("Reading '%s'" % t)
    with open(fp,'rb') as f:
        text_name =t.split(".")[0]  # strip .txtfile extensions
        sample_col= text_name + "_sample"
        score_col =text_name + "_sentiment"
        lines =[]  # list to receive cleaned lines oftext
        # Quicktext cleaning and transformations
        for line inf:
            ifstr(line) == str(""): # there are many blank lines in movie scripts,ignore them
               continue
            else:
               line = line.replace("\n", "").lower().strip().strip('*')  #chain any other text transformations here
               lines.append(line)
       print("  %i lines read from'%s' with size: %5.2f kb" % (len(lines), t, sys.getsizeof(lines)/1024.))
        # Constructa big string of clean text
        text =" ".join(line for line in lines)
        # split onsentences (period + space)
        delim =". "
        sentences =[_ + delim for _ in text.split(delim)]  #regexes are the more robust (but less readable) way to do this...
       merged_sentences = [delim.join(s) for s in merge(sentences, 10)]  # merge sentences into chunks
        # split onwords (whitespace)
        delim =" "
        words = [_for _ in text.split()]
       merged_words = [" ".join(w) for w in merge(words, 120)]  # merge words into chunks
        # Generate samples by sliding contextwindow
        delim =" "
        samples =[delim.join(s) for s in sample_window(merged_words, 10, 1)]
       d[sample_col] = samples
       print("   submitting %isamples for '%s'" % (len(samples), text_name))
        # API toget scores
        scores =indicoio.batch_sentiment(samples)
       d[score_col] = scores
print("\n...complete!")
 Reading'aladdin.txt'
  2639 lines readfrom 'aladdin.txt' with size: 23.18 kb
   submitting 143samples for 'aladdin'
 Reading'lionking.txt'
  3506 lines readfrom 'lionking.txt' with size: 29.42 kb
   submitting 135samples for 'lionking'
 Reading'mulan.txt'
  1231 lines read from'mulan.txt' with size:  9.97 kb
   submitting 78samples for 'mulan'
 Reading'hunchback.txt'
  2659 lines readfrom 'hunchback.txt' with size: 23.18 kb
   submitting 106samples for 'hunchback'
 Reading'rescuersdownunder.txt'
  882 lines readfrom 'rescuersdownunder.txt' with size: 7.80 kb
   submitting 82samples for 'rescuersdownunder'
 Reading'sleepingbeauty.txt'
  1084 lines readfrom 'sleepingbeauty.txt' with size: 8.82 kb
   submitting 58samples for 'sleepingbeauty'
 Reading'littlemermaid.txt'
  1103 lines readfrom 'littlemermaid.txt' with size:  8.82kb
   submitting 69samples for 'littlemermaid'
...complete!
df = pd.DataFrame()
# for k,v in d.iteritems():
for k,v in sorted(d.iteritems()):  # sort to ensure dataframe is defined bylongest sequence, which happens to be Aladdin
    df[k] =pd.Series(v)  # keys -> columns; rows-> columns
print(len(df))
df.head(5)  #inspect the first 5 rows...looks OK?
143
Out[12]:

aladdin_sample

aladdin_sentiment

hunchback_sample

hunchback_sentiment

lionking_sample

lionking_sentiment

littlemermaid_sample

littlemermaid_sentiment

mulan_sample

mulan_sentiment

rescuersdownunder_sample

rescuersdownunder_sentiment

sleepingbeauty_sample

sleepingbeauty_sentiment

0

aladdin: the complete script peddler: oh i com…

0.540695

disney’s the hunchback of notre dame (as the w…

0.699814

the lion king {open, black screen} {start natu…

0.421786

the little mermaid an ocean. birds are flying …

0.971097

mulan the complete script a chinese painting o…

0.935301

the rescuers down under opening: the camera sl…

0.812134

walt disney’s sleeping beauty [the book opens …

0.932075

1

dunes. ah, salaam and good evening to you wort…

0.555306

of sounds, so many changing moods. because, yo…

0.319084

ms: siyo nqoba [we’re going to conquer] bs: in…

0.724241

merpeople, lad. thought every good sailor knew…

0.842828

the hun leader; other signals go on all the wa…

0.837544

the crocodile falls area and some of the surro…

0.893503

wishes too. we pledge our loyalty anew. hail t…

0.886473

2

peddler hurries to catch it.) wait, don’t go! …

0.523802

docks near notre dame gypsy 1: shut it up, wil…

0.386786

the sun there’s more to see than can ever be s…

0.834280

triton: i’m really looking forward to this per…

0.834772

up reserves, and as many new recruits as possi…

0.792785

[cody slides through a log, picks up a stick, …

0.954304

these monarchs dreamed one day their kingdoms …

0.813945

3

sand from the lamp into his hand.) it begins o…

0.499568

stolen goods, no doubt. take them from her. cl…

0.426293

despair and hope through faith and love {appea…

0.799421

song sebastian wrote, her voice is like a bell…

0.740849

smart boy! can you help me with my chores toda…

0.773323

of animals and the forest.] [they arrive at th…

0.922784

walk with springtime wherever she goes fauna: …

0.785493

4

get what’s coming to you. iago: what’s coming …

0.443469

to it. he is about to drop the baby down the w…

0.214561

for the crowd to view.} fs: it’s the circle of…

0.811837

with something. yeah, i got this cough. [floun…

0.709885

the doctor said three cups of tea in the morni…

0.849055

cuts two ropes. cody cuts the last rope to fre…

0.874259

merryweather: you weren’t wanted! maleficent: …

0.751146

# inspect the last 5 rows;

# since sequences are of unequal length, there should bea bunch of NaN's

# at the end for all but the longest sequence

df.tail(5)

aladdin_sample

aladdin_sentiment

hunchback_sample

hunchback_sentiment

lionking_sample

lionking_sentiment

littlemermaid_sample

littlemermaid_sentiment

mulan_sample

mulan_sentiment

rescuersdownunder_sample

rescuersdownunder_sentiment

sleepingbeauty_sample

sleepingbeauty_sentiment

138

they hold hands, but both look sad.) aladdin: …

0.363406

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

139

aladdin: jasmine, i do love you, but i’ve got …

0.752767

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

140

the nile. try that! aladdin: i wish for the ni…

0.923298

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

141

forth, the princess shall marry whomever she d…

0.787534

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

142

the blue sky leaving a trail of sparkles behin…

0.880062

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2016-12-18,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 鸿的学习笔记 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档