What just happened?
5. Assemble a corpus of data to validate subjective(human) interpretation
现在我们有了一个描绘故事大意的管道(pipeline),我们需要一个更好的测试。第一步是找到数据。为了验证Vonnegut的假设,我最初想给他描述的同样的故事打分。 但我只读了一次哈姆雷特,这是足够了。 Vonnegut的故事可能是原型小说,但对我来说,当我不记得这些故事的背景和事件顺序时,很难验证其性能。 等一下......他提到过灰姑娘的故事,大家都知道,对吧?
我在网上搜索到一个常规版本的灰姑娘的故事,但很快发现,神话会有几十种变化。 因为有了许多版本,不可能将我对灰姑娘故事的解释归因于某个全文或某个版本。我们想要权威的版本。
最后,想想“灰姑娘最流行的版本是什么?”...我绝对记得迪斯尼的灰姑娘的版本!电影剧本会比书面故事更好吗?
事实证明,电影对目前的任务有着很多有用的限制。 书面故事通常在许多环境描写,但电影是:
· time-boxedand consumed all in a single sitting/context
· sequencesof events are more memorable when they occur on-screen as audio-visual mediavs. written as text (for me, at least)
· havesimilar lengths
· everymovie has a script, either the original or transcriptions produced by fans(这个感觉原文更有味道)
不幸的是,我找不到一个在网上免费提供的好的脚本。 然而,粉丝们已经转录了许多其他电影,包括狮子王,阿拉丁,小美人鱼,睡美人等等:
· The web’s largest movie script resource
· Top Google hit for “Disney Movie Scripts”
Extend code to iterate over each story in a corpus
现在我们有了多个文本,我们需要抽象上面的简单代码来遍历一个文本文件的语料库。 数据帧是用于在此存储和操纵结果的良好数据结构。我们还需要添加一些干净/蒙版(cleaning/munging)代码,因为来自互联网的电影脚本可能是混乱的。
# define your corpus here as a list of text files
corpus = ["aladdin.txt",
"lionking.txt",
"mulan.txt",
"hunchback.txt",
"rescuersdownunder.txt",
"sleepingbeauty.txt",
"littlemermaid.txt"]
# New dict to hold data
d = {}
# Map names to input files on filesystem
root_fp = os.getcwd()
corpus_fp = os.path.join(root_fp, "texts") # put your text files in ./texts
# print("Looking for input text files: '%s'" %corpus_fp)
for t in corpus:
fp =os.path.join(corpus_fp, t)
print("Reading '%s'" % t)
with open(fp,'rb') as f:
text_name =t.split(".")[0] # strip .txtfile extensions
sample_col= text_name + "_sample"
score_col =text_name + "_sentiment"
lines =[] # list to receive cleaned lines oftext
# Quicktext cleaning and transformations
for line inf:
ifstr(line) == str(""): # there are many blank lines in movie scripts,ignore them
continue
else:
line = line.replace("\n", "").lower().strip().strip('*') #chain any other text transformations here
lines.append(line)
print(" %i lines read from'%s' with size: %5.2f kb" % (len(lines), t, sys.getsizeof(lines)/1024.))
# Constructa big string of clean text
text =" ".join(line for line in lines)
# split onsentences (period + space)
delim =". "
sentences =[_ + delim for _ in text.split(delim)] #regexes are the more robust (but less readable) way to do this...
merged_sentences = [delim.join(s) for s in merge(sentences, 10)] # merge sentences into chunks
# split onwords (whitespace)
delim =" "
words = [_for _ in text.split()]
merged_words = [" ".join(w) for w in merge(words, 120)] # merge words into chunks
# Generate samples by sliding contextwindow
delim =" "
samples =[delim.join(s) for s in sample_window(merged_words, 10, 1)]
d[sample_col] = samples
print(" submitting %isamples for '%s'" % (len(samples), text_name))
# API toget scores
scores =indicoio.batch_sentiment(samples)
d[score_col] = scores
print("\n...complete!")
Reading'aladdin.txt'
2639 lines readfrom 'aladdin.txt' with size: 23.18 kb
submitting 143samples for 'aladdin'
Reading'lionking.txt'
3506 lines readfrom 'lionking.txt' with size: 29.42 kb
submitting 135samples for 'lionking'
Reading'mulan.txt'
1231 lines read from'mulan.txt' with size: 9.97 kb
submitting 78samples for 'mulan'
Reading'hunchback.txt'
2659 lines readfrom 'hunchback.txt' with size: 23.18 kb
submitting 106samples for 'hunchback'
Reading'rescuersdownunder.txt'
882 lines readfrom 'rescuersdownunder.txt' with size: 7.80 kb
submitting 82samples for 'rescuersdownunder'
Reading'sleepingbeauty.txt'
1084 lines readfrom 'sleepingbeauty.txt' with size: 8.82 kb
submitting 58samples for 'sleepingbeauty'
Reading'littlemermaid.txt'
1103 lines readfrom 'littlemermaid.txt' with size: 8.82kb
submitting 69samples for 'littlemermaid'
...complete!
df = pd.DataFrame()
# for k,v in d.iteritems():
for k,v in sorted(d.iteritems()): # sort to ensure dataframe is defined bylongest sequence, which happens to be Aladdin
df[k] =pd.Series(v) # keys -> columns; rows-> columns
print(len(df))
df.head(5) #inspect the first 5 rows...looks OK?
143
Out[12]:
aladdin_sample | aladdin_sentiment | hunchback_sample | hunchback_sentiment | lionking_sample | lionking_sentiment | littlemermaid_sample | littlemermaid_sentiment | mulan_sample | mulan_sentiment | rescuersdownunder_sample | rescuersdownunder_sentiment | sleepingbeauty_sample | sleepingbeauty_sentiment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aladdin: the complete script peddler: oh i com… | 0.540695 | disney’s the hunchback of notre dame (as the w… | 0.699814 | the lion king {open, black screen} {start natu… | 0.421786 | the little mermaid an ocean. birds are flying … | 0.971097 | mulan the complete script a chinese painting o… | 0.935301 | the rescuers down under opening: the camera sl… | 0.812134 | walt disney’s sleeping beauty [the book opens … | 0.932075 |
1 | dunes. ah, salaam and good evening to you wort… | 0.555306 | of sounds, so many changing moods. because, yo… | 0.319084 | ms: siyo nqoba [we’re going to conquer] bs: in… | 0.724241 | merpeople, lad. thought every good sailor knew… | 0.842828 | the hun leader; other signals go on all the wa… | 0.837544 | the crocodile falls area and some of the surro… | 0.893503 | wishes too. we pledge our loyalty anew. hail t… | 0.886473 |
2 | peddler hurries to catch it.) wait, don’t go! … | 0.523802 | docks near notre dame gypsy 1: shut it up, wil… | 0.386786 | the sun there’s more to see than can ever be s… | 0.834280 | triton: i’m really looking forward to this per… | 0.834772 | up reserves, and as many new recruits as possi… | 0.792785 | [cody slides through a log, picks up a stick, … | 0.954304 | these monarchs dreamed one day their kingdoms … | 0.813945 |
3 | sand from the lamp into his hand.) it begins o… | 0.499568 | stolen goods, no doubt. take them from her. cl… | 0.426293 | despair and hope through faith and love {appea… | 0.799421 | song sebastian wrote, her voice is like a bell… | 0.740849 | smart boy! can you help me with my chores toda… | 0.773323 | of animals and the forest.] [they arrive at th… | 0.922784 | walk with springtime wherever she goes fauna: … | 0.785493 |
4 | get what’s coming to you. iago: what’s coming … | 0.443469 | to it. he is about to drop the baby down the w… | 0.214561 | for the crowd to view.} fs: it’s the circle of… | 0.811837 | with something. yeah, i got this cough. [floun… | 0.709885 | the doctor said three cups of tea in the morni… | 0.849055 | cuts two ropes. cody cuts the last rope to fre… | 0.874259 | merryweather: you weren’t wanted! maleficent: … | 0.751146 |
# inspect the last 5 rows;
# since sequences are of unequal length, there should bea bunch of NaN's
# at the end for all but the longest sequence
df.tail(5)
aladdin_sample | aladdin_sentiment | hunchback_sample | hunchback_sentiment | lionking_sample | lionking_sentiment | littlemermaid_sample | littlemermaid_sentiment | mulan_sample | mulan_sentiment | rescuersdownunder_sample | rescuersdownunder_sentiment | sleepingbeauty_sample | sleepingbeauty_sentiment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
138 | they hold hands, but both look sad.) aladdin: … | 0.363406 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
139 | aladdin: jasmine, i do love you, but i’ve got … | 0.752767 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
140 | the nile. try that! aladdin: i wish for the ni… | 0.923298 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
141 | forth, the princess shall marry whomever she d… | 0.787534 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
142 | the blue sky leaving a trail of sparkles behin… | 0.880062 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |