Since you also have the <code>data.table</code> tag, I like using the <code>data.table::rleid</code> function for such tasks, i.e.

<pre><code>library(dplyr)

df %&gt;% 
 group_by(grp = data.table::rleid(b), b) %&gt;% 
 filter(n() &gt; 1)
</code></pre>

which gives,

<blockquote>
<pre><code># A tibble: 9 x 3
# Groups: grp, b [4]
 a b grp
 &lt;dbl&gt; &lt;chr&gt; &lt;int&gt;
1 10 A 1
2 20 A 1
3 40 C 3
4 50 C 3
5 70 B 5
6 80 B 5
7 90 B 5
8 140 E 10
9 150 E 10
</code></pre>
</blockquote>
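
To see why grouping by <code>rleid(b)</code> works, here is a small sketch using the question's column <code>b</code>. <code>rleid</code> gives every run of consecutive equal values its own id, so a value that reappears later (such as the second <code>A</code>) lands in a new group. The names <code>run_id</code> and <code>keep_ids</code> are only illustrative:

<pre><code>library(data.table)

b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")

# One id per run of consecutive equal values
run_id &lt;- rleid(b)
run_id
# [1]  1  1  2  3  3  4  5  5  5  6  7  8  9 10 10

# Ids of the runs that contain more than one row
keep_ids &lt;- as.integer(names(which(table(run_id) &gt; 1)))
keep_ids
# [1]  1  3  5 10
</code></pre>

These are the same <code>grp</code> values (1, 3, 5, 10) that appear in the output above.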

In <code>dplyr</code> we can use <code>lag</code> to create groups and select groups with more than 1 row. 

<pre><code>library(dplyr)

df %&gt;%
 group_by(group = cumsum(b != lag(b, default = first(b)))) %&gt;%
 filter(n() &gt; 1) %&gt;%
 ungroup() %&gt;%
 select(-group)

# a b 
# &lt;dbl&gt; &lt;chr&gt;
#1 10 A 
#2 20 A 
#3 40 C 
#4 50 C 
#5 70 B 
#6 80 B 
#7 90 B 
#8 140 E 
#9 150 E 
</code></pre>
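
As a rough illustration of the grouping step, the sketch below computes the intermediate vectors on the question's column <code>b</code>. The names <code>changed</code> and <code>group</code> are only illustrative. Each <code>TRUE</code> in <code>changed</code> marks a row whose value differs from the previous row, and <code>cumsum</code> turns those marks into a run counter:

<pre><code>library(dplyr)

b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")

# TRUE wherever the value differs from the previous row;
# default = first(b) makes the first comparison FALSE instead of NA
changed &lt;- b != lag(b, default = first(b))

# Running count of changes: rows in the same run share one number
group &lt;- cumsum(changed)
group
# [1] 0 0 1 2 2 3 4 4 4 5 6 7 8 9 9
</code></pre>

Only groups 0, 2, 4 and 9 occur more than once, which is why <code>filter(n() &gt; 1)</code> keeps exactly those rows.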

<hr>

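In base R, we can use <code>rle</code> and <code>ave</code> to <code>subset</code> rows from <code>df</code>. Passing <code>seq_along(b)</code> to <code>ave</code> keeps the group sizes numeric; with the character column <code>b</code> as the first argument the counts would be coerced to strings.

<pre><code>subset(df, ave(seq_along(b), with(rle(b), rep(seq_along(values), lengths)), FUN = length) &gt; 1)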
</code></pre>

Using the data.table input shown in the Note at the end, set N to be the number of elements in each group of consecutive elements and then keep groups for which it is greater than 1.

<pre><code>DT[, N := .N, by = rleid(b)][N &gt; 1, .(a, b)]
</code></pre>

giving:

<pre><code> a b
1: 10 A
2: 20 A
3: 40 C
4: 50 C
5: 70 B
6: 80 B
7: 90 B
8: 140 E
9: 150 E
</code></pre>
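
Note that <code>:=</code> adds the column <code>N</code> to <code>DT</code> by reference. If that side effect is unwanted, one possible variant (a sketch, not part of the original answer) collects the row numbers with <code>.I</code> and subsets by them. The names <code>idx</code> and <code>res</code> are only illustrative:

<pre><code>library(data.table)

a &lt;- seq(10, 150, 10)
b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
DT &lt;- data.table(a, b)

# Row numbers (.I) of every run with more than one row;
# the unnamed j expression ends up in column V1
idx &lt;- DT[, .I[.N &gt; 1], by = rleid(b)]$V1

# Subset by row number; DT itself keeps only columns a and b
res &lt;- DT[idx]
</code></pre>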

<h2>Note</h2>

We assume the input in reproducible form is:

<pre><code>library(data.table)
a &lt;- seq(10,150,10)
b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
DT &lt;- data.table(a, b)
</code></pre>

You want to remove duplicates except when they are consecutive. The following code flags duplicated values and values that repeat the previous row, then keeps only rows that are either not duplicates or part of a consecutive run of duplicates.

<pre><code>df %&gt;%
 mutate(duplicate = duplicated(b), 
 consecutive = c(NA, diff(as.integer(factor(b)))) == 0) %&gt;%
 filter(!duplicate | consecutive) %&gt;%
 select(-duplicate, -consecutive)
</code></pre>

Use <code>rle</code> to get the run lengths.

Assuming <code>df &lt;- data.frame(a=a,b=b)</code>, the following does the job:

<pre><code>df[-cumsum(rle(b)$lengths)[rle(b)$lengths==1],]
</code></pre>
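
One caveat with negative indexing: if no run has length 1, the index is empty and <code>df[-integer(0), ]</code> returns zero rows instead of all of them. A possible variant (a sketch under the same assumption about <code>df</code>) builds a logical index from the same <code>rle</code> result. The names <code>r</code>, <code>keep</code> and <code>res</code> are only illustrative:

<pre><code>a &lt;- seq(10, 150, 10)
b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df &lt;- data.frame(a = a, b = b)

r &lt;- rle(b)

# Expand "run is longer than 1" back to one logical value per row
keep &lt;- rep(r$lengths &gt; 1, times = r$lengths)

res &lt;- df[keep, ]
</code></pre>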

Here is another option (which should be faster):

<pre><code>D[-D[, {
 x &lt;- rowid(rleid(b)) &lt; 2
 .I[x &amp; shift(x, -1L, fill=TRUE)]
}]]
</code></pre>

Timing code:

<pre><code>library(data.table)
set.seed(0L)
nr &lt;- 1e7
nb &lt;- 1e4
DT &lt;- data.table(b=sample(nb, nr, TRUE))
#DT &lt;- data.table(b=c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E"))
DT2 &lt;- copy(DT)

mtd1 &lt;- function(df) {
 df[-cumsum(rle(b)$lengths)[rle(b)$lengths==1],]
}

mtd2 &lt;- function(D) {
 D[, N :=.N, by = rleid(b)][N &gt; 1, .(b)]
}

mtd3 &lt;- function(D) {
 D[-D[, {
 x &lt;- rowid(rleid(b)) &lt; 2
 .I[x &amp; shift(x, -1L, fill=TRUE)]
 }]]
}

bench::mark(mtd1(DT), mtd2(DT2), mtd3(DT), check=FALSE)
</code></pre>

Timings:

<pre><code># A tibble: 3 x 13
 expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc 
 &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:byt&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;bch:tm&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; 
1 mtd1(DT) 1.1s 1.1s 0.908 1.98GB 10.9 1 12 1.1s &lt;df[,1] [2,014 x ~ &lt;df[,3] [59 x ~ &lt;bch:t~ &lt;tibble [1 x ~
2 mtd2(DT2) 2.88s 2.88s 0.348 267.12MB 0 1 0 2.88s &lt;df[,1] [2,014 x ~ &lt;df[,3] [23 x ~ &lt;bch:t~ &lt;tibble [1 x ~
3 mtd3(DT) 639.91ms 639.91ms 1.56 505.48MB 4.69 1 3 639.91ms &lt;df[,1] [2,014 x ~ &lt;df[,3] [24 x ~ &lt;bch:t~ &lt;tibble [1 x ~
</code></pre>

I have a data frame where one column contains some consecutive duplicates. I want to keep the rows with consecutive duplicates (any length >1). I would prefer a solution in <code>dplyr</code> or <code>data.table</code>.

Example data:

<pre><code>a &lt;- seq(10,150,10)
b &lt;- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")

df &lt;- tibble(a, b)
</code></pre>

Data:

<pre><code># A tibble: 15 x 2
 a b 
 &lt;dbl&gt; &lt;chr&gt;
 1 10 A 
 2 20 A 
 3 30 B 
 4 40 C 
 5 50 C 
 6 60 A 
 7 70 B 
 8 80 B 
 9 90 B 
10 100 C 
11 110 A 
12 120 C 
13 130 D 
14 140 E 
15 150 E 
</code></pre>

So I would like to keep the rows with consecutive duplicates in column <code>b</code>.

Expected outcome:

<pre><code># A tibble: 9 x 2
 a b 
 &lt;dbl&gt; &lt;chr&gt;
 1 10 A 
 2 20 A 
 4 40 C 
 5 50 C 
 7 70 B 
 8 80 B 
 9 90 B 
14 140 E 
15 150 E 
</code></pre>

Thanks!

Keep consecutive duplicates
