无法找到对R的支持,我试图将大量RTF文件读入R中以构造数据框架,但我很难找到解析RTF文件并忽略文件的结构/格式的好方法。实际上,我只想从每个文件中提取两行文本--但它是嵌套在文件结构中的。
我在下面粘贴了一个RTF文件样本。我想要捕捉的两个字符串是:
对于如何有效地解析这一点,有什么想法吗?我认为正则表达式可能对我有帮助,但我正在努力形成正确的表达式来完成工作。
{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720
\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720
\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720
\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720
\f1\fs24 \cf0 \
}发布于 2014-05-13 15:09:36
1)如果您在上,一种简单的方法是使用WordPad或Word读取它,然后将其保存为纯文本文档。
2)交替地,直接在R中解析它,在rtf文件中读取,用给定的模式查找行,pat生成g。然后将任何\\'字符串替换为生成noq的单引号。最后,删除pat和任何尾随的垃圾。这在示例中是可行的,但是如果除了我们已经处理的\‘之外还有其他嵌入式\字符串,则可能需要修改模式:
Lines <- readLines("myfile.rtf")
pat <- "^\\\\f0.*\\\\cf0 "
g <- grep(pat, Lines, value = TRUE)
noq <- gsub("\\\\'", "'", g)
sub("\\\\.*", "", sub(pat, "", noq))对于指定的文件,这是输出:
[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."修改了好几次。添加了Wordpad/Word解决方案。
https://stackoverflow.com/questions/23634298
复制相似问题