
One possible approach:

<pre><code>test[, {
    # positions of each "start" whose next row (within the same id) is an "end"
    si &lt;- which(class == "start" &amp; shift(class, type = "lead") == "end")
    # id is already supplied by `by`, so it is not repeated in j
    .(start = time[si], end = time[si + 1L])
}, by = .(id)]
</code></pre>

Output:

<pre><code>   id               start                 end
1:  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 2019-06-20 00:45:00 2019-06-20 00:50:00
</code></pre>

Data:

<pre><code>library(data.table)
test &lt;- fread("id,class,time
1,start,2019-06-20 00:00:00
1,end,2019-06-20 00:05:00
1,start,2019-06-20 00:10:00
1,end,2019-06-20 00:15:00
2,end,2019-06-20 00:20:00
2,start,2019-06-20 00:25:00
2,end,2019-06-20 00:30:00
2,start,2019-06-20 00:35:00
3,end,2019-06-20 00:40:00
3,start,2019-06-20 00:45:00
3,end,2019-06-20 00:50:00
3,start,2019-06-20 00:55:00")
</code></pre>


Using <code>dplyr</code> and <code>tidyr</code>, we can first <code>filter</code> the rows that follow the <code>"start"</code> then <code>"end"</code> pattern, create groups of 2 rows, and <code>spread</code> them to wide format.

<pre><code>library(dplyr)
library(tidyr)

test %&gt;%
  group_by(id) %&gt;%
  filter(class == "start" &amp; lead(class) == "end" |
         class == "end" &amp; lag(class) == "start") %&gt;%
  group_by(group = gl(n()/2, 2)) %&gt;%
  spread(class, time) %&gt;%
  ungroup() %&gt;%
  select(-group) %&gt;%
  select(id, start, end)

#      id start               end
#   &lt;int&gt; &lt;dttm&gt;              &lt;dttm&gt;
#1      1 2019-06-20 00:00:00 2019-06-20 00:05:00
#2      1 2019-06-20 00:10:00 2019-06-20 00:15:00
#3      2 2019-06-20 00:25:00 2019-06-20 00:30:00
#4      3 2019-06-20 00:45:00 2019-06-20 00:50:00
</code></pre>


I usually use <code>cumsum()</code> in cases like this:

<pre><code>library(dplyr)
library(tidyr)

test %&gt;%
  group_by(id) %&gt;%
  arrange(time, .by_group = TRUE) %&gt;%        # sort by time within each id
  mutate(flag = cumsum(class == "start")) %&gt;% # new flag value at every "start"
  group_by(id, flag) %&gt;%
  filter(n() == 2L) %&gt;%                      # keep only start/end pairs
  ungroup() %&gt;%
  spread(class, time) %&gt;%
  select(-flag)
</code></pre>


You can keep each <code>start</code> row plus the <code>end</code> immediately after it (if any), then use <code>dcast</code> to switch from long to wide form:

<pre><code>test[,
  if (.N &gt;= 2) head(.SD, 2)
, by=.(g = rleid(id, cumsum(class=="start"))), .SDcols=names(test)][,
  dcast(.SD, id + g ~ factor(class, levels=c("start", "end")), value.var="time")
]

   id g               start                 end
1:  1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 4 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 7 2019-06-20 00:45:00 2019-06-20 00:50:00
</code></pre>

<code>rleid</code> and <code>cumsum</code> are used to find the sequences (each run of rows that begins at a <code>start</code> within one id), and <code>factor</code> is needed to tell <code>dcast</code> the column order.
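To make the grouping step concrete, here is a small sketch that only computes the group labels. The data is rebuilt inline so the block runs on its own, with <code>time</code> kept as character strings for simplicity; the names <code>runs</code>, <code>rows</code> and <code>first_class</code> are illustrative and not part of the answer above.

```r
library(data.table)

# Same rows as the question's data; time is left as character for simplicity.
test <- data.table(
  id    = rep(1:3, each = 4),
  class = c("start", "end", "start", "end",
            "end", "start", "end", "start",
            "end", "start", "end", "start"),
  time  = sprintf("2019-06-20 00:%02d:00", seq(0L, 55L, by = 5L))
)

# cumsum() increments at every "start"; rleid() then gives each distinct
# run of (id, cumsum) values its own integer label.
test[, g := rleid(id, cumsum(class == "start"))]

# One summary row per run: how many rows it has and what it begins with.
runs <- test[, .(rows = .N, first_class = class[1L]), by = .(id, g)]
runs
```

Runs 1, 2, 4 and 7 are the only ones with two rows, which matches the <code>g</code> column in the output above.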

Side note: This is essentially the same as @cheetahfly's answer (I didn't realize it when I posted): since the cumsum is increasing, it is sufficient to group by id + cumsum, and there's no need to use rleid (which is for tracking runs of values). The only difference is that my approach would keep a run like start, end, end, while the other answer would filter it out with the n() == 2 check.
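A minimal sketch of that simplification in <code>data.table</code>, grouping by id plus the cumsum with no <code>rleid</code>. It follows the stricter rule (exactly two rows per group), so a run like start, end, end would be dropped here. The data is rebuilt inline with <code>time</code> as character strings, and the names <code>pairs</code> and <code>g</code> are illustrative.

```r
library(data.table)

# Same rows as the question's data; time is left as character for simplicity.
test <- data.table(
  id    = rep(1:3, each = 4),
  class = c("start", "end", "start", "end",
            "end", "start", "end", "start",
            "end", "start", "end", "start"),
  time  = sprintf("2019-06-20 00:%02d:00", seq(0L, 55L, by = 5L))
)

# Group by id plus the running count of "start" rows. Keep a group only when
# it is exactly one "start" followed by one other row (necessarily an "end").
pairs <- test[,
  if (.N == 2L && class[1L] == "start") .(start = time[1L], end = time[2L]),
  by = .(id, g = cumsum(class == "start"))
][, g := NULL][]

pairs
```

The <code>class[1L] == "start"</code> check guards against a group made of two leading <code>end</code> rows, which would otherwise pass the row-count test.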

I have a raw data frame that looks like this:

<pre><code>test
   id class                time
1   1 start 2019-06-20 00:00:00
2   1   end 2019-06-20 00:05:00
3   1 start 2019-06-20 00:10:00
4   1   end 2019-06-20 00:15:00
5   2   end 2019-06-20 00:20:00
6   2 start 2019-06-20 00:25:00
7   2   end 2019-06-20 00:30:00
8   2 start 2019-06-20 00:35:00
9   3   end 2019-06-20 00:40:00
10  3 start 2019-06-20 00:45:00
11  3   end 2019-06-20 00:50:00
12  3 start 2019-06-20 00:55:00
</code></pre>

My goal is to map the values to an output table for each id only where there is a <code>start</code> and an <code>end</code> in consecutive order (time). Therefore, the output would look like:

<pre><code>output
  id               start                 end
1  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4  3 2019-06-20 00:45:00 2019-06-20 00:50:00
</code></pre>

I have tried the <code>dplyr</code> package, but the following fails:

<pre><code>test %&gt;% group_by(id) %&gt;% arrange(time) %&gt;% starts_with("start")
Error in starts_with(., "start") : is_string(match) is not TRUE
</code></pre>

<code>starts_with</code> always throws an error. I would like to avoid writing a for loop because I am sure this can be handled by a few chain operations. Any ideas for a workaround in <code>dplyr</code> or <code>data.table</code>?

Value mapping by condition in R
