首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >将RTF文件解析为R?

将RTF文件解析为R?
EN

Stack Overflow用户
提问于 2014-05-13 14:37:10
回答 1查看 8.5K关注 0票数 4

无法找到对R的支持,我试图将大量RTF文件读入R中以构造数据框架,但我很难找到解析RTF文件并忽略文件的结构/格式的好方法。实际上,我只想从每个文件中提取两行文本--但它是嵌套在文件结构中的。

我在下面粘贴了一个RTF文件样本。我想要捕捉的两个字符串是:

  1. “今天买一台26英寸液晶电视还是下个月购买32英寸高科技耐用产品?”
  2. “技术水平.和管理方面的影响”(整段)

对于如何有效地解析这一点,有什么想法吗?我认为正则表达式可能对我有帮助,但我正在努力形成正确的表达式来完成工作。

代码语言:javascript
运行
复制
{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720

\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil 
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640

\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil 
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720

\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell 
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720

\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720

\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720

\f1\fs24 \cf0 \
}
EN

Stack Overflow用户

发布于 2014-05-13 15:09:36

1)如果您在上,一种简单的方法是使用WordPad或Word读取它,然后将其保存为纯文本文档。

2)交替地,直接在R中解析它,在rtf文件中读取,用给定的模式查找行,pat生成g。然后将任何\\'字符串替换为生成noq的单引号。最后,删除pat和任何尾随的垃圾。这在示例中是可行的,但是如果除了我们已经处理的\‘之外还有其他嵌入式\字符串,则可能需要修改模式:

代码语言:javascript
运行
复制
Lines <- readLines("myfile.rtf")
pat <- "^\\\\f0.*\\\\cf0 "
g <- grep(pat, Lines, value = TRUE)
noq <- gsub("\\\\'", "'", g)
sub("\\\\.*", "", sub(pat, "", noq))

对于指定的文件,这是输出:

代码语言:javascript
运行
复制
[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."

修改了好几次。添加了Wordpad/Word解决方案。

票数 3
EN
查看全部 1 条回答
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/23634298

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档