Telling Stories with Python (Part 2)

哒呵呵
Published 2018-08-06 17:26:51
Collected in the column: 鸿的学习笔记

# df is assumed to hold the per-story sentiment scores built in Part 1
ax = df['lionking_sentiment'].plot(colormap='jet', figsize=(16, 8))
ax.set_xlabel("sample")
ax.set_ylabel("sentiment_score")
ax.set_title("Lion King")

# Pick out a few stories to compare visually
combo = pd.DataFrame()
combo['lionking_sentiment'] = df['lionking_sentiment']
combo['aladdin_sentiment'] = df['aladdin_sentiment']
# combo['littlemermaid_sentiment'] = df['littlemermaid_sentiment']
ax2 = combo.plot(colormap='jet', figsize=(16, 8))  # ignore mismatched sequence length
ax2.set_xlabel("sample")
ax2.set_ylabel("sentiment_score")

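The two plotlines look alike, from the sequencing (a song, then something happens) to the big hump of hero happiness followed by a valley of gloom. Have we detected some kind of Disney formula for movies and stories? The curves are hard to compare by eye when the plotlines run over different lengths. We need a better way to compare them: some kind of similarity measure that is robust to unequal lengths. Note: because we sample context with a sliding window, the sentiment scores drift toward neutral as a story ends and the window starts to lose the full text.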

Smoothing kernels

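A smoothing kernel is the obvious thing to try. It filters out some of the high-frequency noise, and pandas offers several reasonable methods. As for the unequal sequence lengths, for now we simply ignore the NaN values and plot as many samples as each story has.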

# Pull out a single story here to test smoothing methods
df_sentiment = df["lionking_sentiment"].dropna()

# pd.rolling_mean was removed from pandas; Series.rolling(...).mean() is the equivalent
df_roll = df_sentiment.rolling(window=10).mean()
ax = df_roll.plot(colormap='jet', figsize=(16, 8))
ax.set_xlabel("sample")
ax.set_ylabel("sentiment_score")
ax.set_title("Lion King, smoothed with rolling mean")
ax.set_xlim((0, 110))
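
One property of a rolling mean worth seeing directly is its warm-up: until a full window of samples has accumulated, there is nothing to average. The sketch below uses a synthetic series as a stand-in for a story's sentiment scores (the data and variable names are illustrative assumptions, not taken from the original notebook) and counts how many leading samples are lost.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one story's sentiment scores (not real data).
rng = np.random.default_rng(0)
scores = pd.Series(0.5 + 0.1 * rng.standard_normal(50))

window = 10
rolled = scores.rolling(window=window).mean()

# The first (window - 1) samples have no full window yet, so they are NaN.
n_missing = int(rolled.isna().sum())
print("samples lost to warm-up:", n_missing)

# min_periods=1 averages whatever is available so far, so nothing is lost,
# at the cost of noisier estimates over the first few samples.
rolled_partial = scores.rolling(window=window, min_periods=1).mean()
print("samples lost with min_periods=1:", int(rolled_partial.isna().sum()))
```

A larger window loses proportionally more of the opening of a story, which matters when the series is only around a hundred samples long.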

Alternative smoothing method: Lowess

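The moving-average smoothing is not bad, and we are getting close to Vonnegut's simple shapes! But since we already used a sliding window to resample the text for sentiment analysis, another sliding-window method is probably not the best choice here, because it could falsely convey more confidence or stability than the scores justify. We are also trading sensitivity for smoothness. Sentiment tends to be sensitive to the balance of positive and negative weights, so the noise is probably a useful quantity to track, especially since we do not yet know how it varies between stories. In addition, larger kernels take a while to accumulate samples, eating into the beginning of a story, where interesting things can happen. A method that does not consume data to build up statistical moments may be a better choice. Let's try Lowess smoothing and compare it with the raw scores.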

import matplotlib.pyplot as plt
import statsmodels.api as sm

# frac is the fraction of the samples used for each local fit
lowess = sm.nonparametric.lowess(df_sentiment, df_sentiment.index, frac=0.05)
fig = plt.gcf()
plt.plot(df_sentiment.index, df_sentiment, '.')  # plot the values as dots
plt.plot(lowess[:, 0], lowess[:, 1])             # plot the smoothed output as solid line
fig.set_size_inches(16, 8)
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('Lion King, smoothed with Lowess')
plt.xlim((0, 110))

Hack 2: Dynamic Time Warping

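Dynamic time warping (DTW) works well for comparing sequences that have arbitrary insertions between otherwise similar data. It can also solve our problem of comparing sequences of unequal length. Intuitively, DTW looks like a good fit. Let's test it.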

How it works

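The gist is that dynamic time warping gives us a way to map one sequence onto another, using dynamic programming and a distance measure between the elements of each sequence. The best mapping from one sequence to the other is the path that minimizes the pairwise distances. A shorter distance indicates sequences with high similarity.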
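
To make the dynamic-programming idea concrete, here is a minimal pure-Python sketch of the textbook DTW recurrence with absolute difference as the element distance. The function name `dtw_distance` and the toy sequences are illustrative assumptions. This is not the implementation inside the `dtw` package, and it returns the raw accumulated cost, so its numbers are not expected to match the distances that package reports.

```python
def dtw_distance(x, y):
    """Accumulated cost of the cheapest alignment between sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # distance between two elements
            # Extend the cheapest of: a step in x, a step in y, or a step in both
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


a = [0.1, 0.5, 0.9, 0.5, 0.1]
b = [0.1, 0.1, 0.5, 0.9, 0.9, 0.5, 0.1]  # same shape as a, stretched in time
c = [0.9, 0.5, 0.1, 0.5, 0.9]            # inverted shape

print(dtw_distance(a, b))  # stretched copy aligns at zero cost
print(dtw_distance(a, c))  # different shape costs more
```

Because repeated samples can be absorbed by the warping, `a` and `b` align perfectly even though their lengths differ, which is the property we want for scripts of different lengths.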

Interpreting the path graph.

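If we put one sequence (e.g. "lionking_sentiment") on the x-axis and another (e.g. "aladdin_sentiment") on the y-axis, the diagonal path from the lower left to the upper right shows the best mapping of the x sequence onto the y sequence. For two identical sequences, the path is a perfect diagonal. For sequences that differ, the path reveals how each sequence is "warped" to accommodate the other.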
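
The path itself can be recovered by walking back through the accumulated-cost table. The sketch below is a self-contained illustration under the same assumptions as a textbook DTW (absolute difference as the element distance); `dtw_path` is a hypothetical helper, not part of the `dtw` package.

```python
def dtw_path(x, y):
    """Return the optimal alignment as a list of (index in x, index in y) pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Walk back from the upper-right corner to the lower-left corner,
    # always moving to the cheapest predecessor cell.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # step in x only
            (acc[i][j - 1], i, j - 1),          # step in y only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return path


same = [0.2, 0.4, 0.6, 0.8]
print(dtw_path(same, same))                     # identical sequences: pure diagonal
print(dtw_path([0.2, 0.8], [0.2, 0.2, 0.8]))    # first x sample is matched twice
```

Horizontal or vertical runs in the path mark places where one sequence lingers while the other moves on, which is what the bends in the plotted white line show.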

For the knowledge hounds, see the original paper that introduced DTW, and Rakthanmanon et al.

import dtw  # `pip install dtw`
lionking = df['lionking_sentiment'].dropna()
aladdin = df['aladdin_sentiment'].dropna()
print(len(lionking), len(aladdin))
# The three-value return matches the dtw release used here; other releases may differ
dist, cost, path = dtw.dtw(lionking, aladdin)  # compute the best DTW mapping
print("Minimum distance found: %-8.4f" % dist)
(135, 143)
Minimum distance found: 0.0571

from matplotlib import cm  # custom colormaps
from matplotlib.pyplot import imshow
imshow(cost.T, origin='lower', cmap=cm.hot, interpolation='nearest')
plt.plot(path[0], path[1], 'w')  # white line shows the best path
plt.xlim((-0.5, cost.shape[0] - 0.5))
plt.ylim((-0.5, cost.shape[1] - 0.5))
plt.xlabel("lion king")
plt.ylabel("aladdin")

What about a different story?

mermaid = df['littlemermaid_sentiment'].dropna()
print(len(lionking), len(mermaid))
dist, cost, path = dtw.dtw(lionking, mermaid)
print("Minimum distance found: %-8.4f" % dist)
(135, 69)
Minimum distance found: 0.1134

from matplotlib import cm  # import custom colormaps
from matplotlib.pyplot import imshow
imshow(cost.T, origin='lower', cmap=cm.hot, interpolation='nearest')
plt.plot(path[0], path[1], 'w')  # white line for the best path
plt.xlim((-0.5, cost.shape[0] - 0.5))
plt.ylim((-0.5, cost.shape[1] - 0.5))
plt.xlabel("lion king")
plt.ylabel("little mermaid")

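The Lion King and The Little Mermaid appear to have similar plotlines, but there are gaps where something happens in The Lion King with no corresponding feature in The Little Mermaid. The different pacing may be because the characters in The Lion King are fully anthropomorphized and speak many lines throughout the movie, whereas the characters in The Little Mermaid tend to tell the story through action and visuals (it is the story of a heroine who loses her voice!). Or it may be a difference in transcript length or quality showing through, which deserves deeper investigation. From the DTW path we can see that the plotlines differ in the first part of the movie but are very similar in the second half.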

5. Compare many stories to find similar plotlines

Since we have a distance metric, can we find plotlines based on a query story?

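With the DTW distance metric, it is straightforward to compare every pair of stories in our corpus. Using these distances to sort (or search for) similar (or different) stories is left as an exercise.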

# corpus holds the script filenames; the part before the "." is the column prefix
for i in corpus:
    for j in corpus:
        dist, cost, path = dtw.dtw(df[i.split(".")[0] + "_sentiment"].dropna(),
                                   df[j.split(".")[0] + "_sentiment"].dropna())
        print("DTW distance from %s to %s: '%-6.3f'" % (i.split(".")[0], j.split(".")[0], dist))
DTW distance from aladdin to aladdin: '0.000 '
DTW distance from aladdin to lionking: '0.057 '
DTW distance from aladdin to mulan: '0.091 '
DTW distance from aladdin to hunchback: '0.067 '
DTW distance from aladdin to rescuersdownunder: '0.086 '
DTW distance from aladdin to sleepingbeauty: '0.101 '
DTW distance from aladdin to littlemermaid: '0.101 '
DTW distance from lionking to aladdin: '0.057 '
DTW distance from lionking to lionking: '0.000 '
DTW distance from lionking to mulan: '0.081 '
DTW distance from lionking to hunchback: '0.072 '
DTW distance from lionking to rescuersdownunder: '0.072 '
DTW distance from lionking to sleepingbeauty: '0.082 '
DTW distance from lionking to littlemermaid: '0.113 '
DTW distance from mulan to aladdin: '0.091 '
DTW distance from mulan to lionking: '0.081 '
DTW distance from mulan to mulan: '0.000 '
DTW distance from mulan to hunchback: '0.086 '
DTW distance from mulan to rescuersdownunder: '0.034 '
DTW distance from mulan to sleepingbeauty: '0.060 '
DTW distance from mulan to littlemermaid: '0.060 '
DTW distance from hunchback to aladdin: '0.067 '
DTW distance from hunchback to lionking: '0.072 '
DTW distance from hunchback to mulan: '0.086 '
DTW distance from hunchback to hunchback: '0.000 '
DTW distance from hunchback to rescuersdownunder: '0.077 '
DTW distance from hunchback to sleepingbeauty: '0.086 '
DTW distance from hunchback to littlemermaid: '0.064 '
DTW distance from rescuersdownunder to aladdin: '0.086 '
DTW distance from rescuersdownunder to lionking: '0.072 '
DTW distance from rescuersdownunder to mulan: '0.034 '
DTW distance from rescuersdownunder to hunchback: '0.077 '
DTW distance from rescuersdownunder to rescuersdownunder: '0.000 '
DTW distance from rescuersdownunder to sleepingbeauty: '0.044 '
DTW distance from rescuersdownunder to littlemermaid: '0.059 '
DTW distance from sleepingbeauty to aladdin: '0.101 '
DTW distance from sleepingbeauty to lionking: '0.082 '
DTW distance from sleepingbeauty to mulan: '0.060 '
DTW distance from sleepingbeauty to hunchback: '0.086 '
DTW distance from sleepingbeauty to rescuersdownunder: '0.044 '
DTW distance from sleepingbeauty to sleepingbeauty: '0.000 '
DTW distance from sleepingbeauty to littlemermaid: '0.073 '
DTW distance from littlemermaid to aladdin: '0.101 '
DTW distance from littlemermaid to lionking: '0.113 '
DTW distance from littlemermaid to mulan: '0.060 '
DTW distance from littlemermaid to hunchback: '0.064 '
DTW distance from littlemermaid to rescuersdownunder:'0.059 '
DTW distance from littlemermaid to sleepingbeauty: '0.073 '
DTW distance from littlemermaid to littlemermaid: '0.000 '
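
As one possible take on the sorting exercise, the sketch below ranks stories by their distance to a query story. The distances are the Lion King row copied from the printed output above; the helper name `rank_by_similarity` is an illustrative assumption.

```python
# Distances from "lionking" to every other story, copied from the output above.
distances = {
    "aladdin": 0.057,
    "mulan": 0.081,
    "hunchback": 0.072,
    "rescuersdownunder": 0.072,
    "sleepingbeauty": 0.082,
    "littlemermaid": 0.113,
}


def rank_by_similarity(dist_to_query):
    """Return (title, distance) pairs, most similar (smallest distance) first."""
    return sorted(dist_to_query.items(), key=lambda item: item[1])


ranking = rank_by_similarity(distances)
for title, dist in ranking:
    print("%-20s %.3f" % (title, dist))
```

With these numbers, Aladdin comes out as the closest match to The Lion King and The Little Mermaid as the most distant, consistent with the two path plots earlier.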

6. The Disney movie script formula, by the plotlines

Or, “How would Vonnegut draw the shape of a Disney movie script?”

lowess_frac = 0.05  # same smoothing as above, balances detail and smoothness
lionking_lowess = sm.nonparametric.lowess(df['lionking_sentiment'], df['lionking_sentiment'].index, frac=lowess_frac)
aladdin_lowess = sm.nonparametric.lowess(df['aladdin_sentiment'], df['aladdin_sentiment'].index, frac=lowess_frac)
rescuers_lowess = sm.nonparametric.lowess(df['rescuersdownunder_sentiment'], df['rescuersdownunder_sentiment'].index, frac=lowess_frac)
hunchback_lowess = sm.nonparametric.lowess(df['hunchback_sentiment'], df['hunchback_sentiment'].index, frac=lowess_frac)
fig = plt.gcf()
# Column 0 of each result is the sample index, column 1 the smoothed score
plt.plot(lionking_lowess[:, 0], lionking_lowess[:, 1])
plt.plot(aladdin_lowess[:, 0], aladdin_lowess[:, 1])
plt.plot(rescuers_lowess[:, 0], rescuers_lowess[:, 1])
plt.plot(hunchback_lowess[:, 0], hunchback_lowess[:, 1])
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('4 similar Disney movies: [The Lion King, Aladdin, Rescuers Down Under, Hunchback of Notre Dame]')
fig.set_size_inches(16, 8)

What if we dial up the smoothing to compare vs. Vonnegut’s shapes of stories?

lowess_frac = 0.25  # heavy smoothing here to compare to Vonnegut
lionking_lowess = sm.nonparametric.lowess(df['lionking_sentiment'], df['lionking_sentiment'].index, frac=lowess_frac)
aladdin_lowess = sm.nonparametric.lowess(df['aladdin_sentiment'], df['aladdin_sentiment'].index, frac=lowess_frac)
rescuers_lowess = sm.nonparametric.lowess(df['rescuersdownunder_sentiment'], df['rescuersdownunder_sentiment'].index, frac=lowess_frac)
hunchback_lowess = sm.nonparametric.lowess(df['hunchback_sentiment'], df['hunchback_sentiment'].index, frac=lowess_frac)
fig = plt.gcf()
# Column 0 of each result is the sample index, column 1 the smoothed score
plt.plot(lionking_lowess[:, 0], lionking_lowess[:, 1])
plt.plot(aladdin_lowess[:, 0], aladdin_lowess[:, 1])
plt.plot(rescuers_lowess[:, 0], rescuers_lowess[:, 1])
plt.plot(hunchback_lowess[:, 0], hunchback_lowess[:, 1])
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('4 similar Disney movies: [The Lion King, Aladdin, Rescuers Down Under, Hunchback of Notre Dame]')
fig.set_size_inches(16, 8)

Beginnings vary, but the last half of a Disney movie is quite predictable!

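After comparing many Disney movie scripts, a clear pattern emerges. It may not be intuitive, but the fact that we found the "Disney formula" directly from the text of the movie scripts is very cool! Disney introduces characters and sets the scene in a variety of ways, but every story ends in a similar way:

1. In the middle of the story there is a hump of positive sentiment, as the hero finds friends, achieves happiness, and discovers new skills or powers.

2. Then comes a sharp drop, as the hero experiences loss, tragedy, and hardship. Usually caused by a villain!

3. About 75% of the way through the story, the hero decides to climb out of the valley of tragedy. This is usually catalyzed by the hero's sidekick or friends. Good things happen again!

4. The world is happy, and the story ends with a few minutes of positive sentiment (and a song).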

Originally published 2016-12-19. Shared from the WeChat official account 鸿的学习笔记.