Speeding up a script that determines whether all columns in a row are the same

Unix & Linux user

Asked 2018-08-06 19:11:07

Answers: 4 · Views: 97 · Followers: 0 · Score: 4

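I need to speed up a script that determines whether all the "columns" on each line are identical, and then writes a new file containing either the identical element or "no_match". The file is comma-delimited, has about 15,000 lines, and contains a varying number of "columns" per line.

For example: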

1-69
4-59,4-59,4-59,4-61,4-61,4-61
1-46,1-46
4-59,4-59,4-59,4-61,4-61,4-61
6-1,6-1
5-51,5-51
4-59,4-59

should produce a new file containing:

1-69
no_match
1-46
no_match
6-1
5-51
4-59

The second and fourth lines are reported as no_match because they contain non-identical columns.

Here is my far-from-elegant script:

#!/bin/bash

ind=$1 #file in
num=`wc -l "$ind"|cut -d' ' -f1` #number of lines in 'file in'
echo "alleles" > same_alleles.txt #new file to write to

#loop over every line of 'file in'
for (( i =2; i <= "$num"; i++));do
    #take first column of row being looped over (string to check match of other columns with)
    match=`awk "FNR=="$i" {print}" "$ind"|cut -d, -f1`
    #counts how many matches there are in the looped row
    match_num=`awk "FNR=="$i" {print}" "$ind"|grep -o "$match"|wc -l|cut -d' ' -f1`
    #counts number of commas in each looped row
    comma_num=`awk "FNR=="$i" {print}" "$ind"|grep -o ","|wc -l|cut -d' ' -f1`
    #number of columns in each row
    tot_num=$((comma_num + 1))
    #writes one of the identical elements if all contents of row are identical, or writes "no_match" otherwise
    if [ "$tot_num" == "$match_num" ]; then
            echo $match >> same_alleles.txt
    else
            echo "no_match" >> same_alleles.txt
    fi
done

#END

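Currently the script takes about 11 minutes to process all ~15,000 lines. I don't really know how to speed it up (honestly, I'm surprised I got it working at all). Any time shaved off would be great. Below is a smaller 100-line excerpt that can be used for testing: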

allele
4-39
1-46,1-46,1-46
4-39
4-4,4-4,4-4,4-4
3-23,3-23,3-23
3-21,3-21
4-34,4-34
3-33
4-4,4-4,4-4
4-59,4-59
3-23,3-23,3-23
1-45
1-46,1-46
3-23,3-23,3-23
4-61
1-8
3-7
4-4
4-59,4-59,4-59
1-18,1-18
3-21,3-21
3-23,3-23,3-23
3-23,3-23,3-23
3-30,3-30-3
4-39,4-39
4-61
2-70
4-38-2,4-38-2
1-69,1-69,1-69,1-69,1-69
1-69
4-59,4-59,4-59,4-61,4-61,4-61
1-46,1-46
4-59,4-59,4-59,4-61,4-61,4-61
6-1,6-1
5-51,5-51
4-59,4-59
1-18
3-7
1-69
4-30-4
4-39
1-69
1-69
4-39
3-23,3-23,3-23
4-39
2-5
3-30-3
4-59,4-59,4-59
3-21,3-21
4-59,4-59
3-9
4-59,4-59,4-59
4-31,4-31
1-46,1-46
1-46,1-46,1-46
5-51,5-51
3-48
4-31,4-31
3-7
4-61
4-59,4-59,4-59,4-61,4-61,4-61
4-38-2,4-38-2
3-21,3-21
1-69,1-69,1-69
3-23,3-23,3-23
4-59,4-59
3-48
3-48
1-46,1-46
3-23,3-23,3-23
3-30-3,3-30-3
1-46,1-46,1-46
3-64
3-73,3-73
4-4
1-18
3-7
1-46,1-46
1-3
4-61
2-70
4-59,4-59
5-51,5-51
3-49,3-49
4-4,4-4,4-4
4-31,4-31
1-69
1-69,1-69,1-69
4-39
3-21,3-21
3-33
3-9
3-48
4-59,4-59
4-59,4-59
4-39,4-39
3-21,3-21
1-18

On this excerpt, my script takes 7 seconds to complete.

Tags: arithmetic, shell-script, text-processing, scripting

4 Answers

Unix & Linux user

Accepted answer

Posted 2018-08-06 19:15:30

$ awk -F, '{ for (i=2; i<=NF; ++i) if ($i != $1) { print "no_match"; next } print $1 }' file
1-69
no_match
1-46
no_match
6-1
5-51
4-59

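Sorry, I didn't even look at your code; it's too much. When you find yourself calling awk three times on the same data in the body of a loop, you have to look for a more efficient approach. Also, once awk is involved you don't need grep and cut, because awk can easily do their jobs (not that this is needed here).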
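To illustrate the remark about grep and cut, here is a minimal sketch of how awk alone can produce the two values that the original loop obtained with cut -d, -f1 and grep -o "," | wc -l. The function names first_field and comma_count are invented for this illustration and do not appear in the question or the answer.

```shell
#!/bin/bash

# first_field: print the first comma-separated field of each input line,
# which is what `cut -d, -f1` does.
first_field() {
    awk -F, '{ print $1 }'
}

# comma_count: print the number of commas on each input line,
# which is what `grep -o "," | wc -l` computes for a single line.
# With FS="," a non-empty line has NF fields and therefore NF-1 commas.
comma_count() {
    awk -F, '{ print (NF > 0 ? NF - 1 : 0) }'
}

# Demo on two lines in the question's format.
printf '4-59,4-59,4-61\n1-69\n' | first_field
printf '4-59,4-59,4-61\n1-69\n' | comma_count
```

Each function handles the whole input in a single awk process, instead of starting several processes per line.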

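The awk script above reads the comma-delimited input one line at a time and compares each field with the first field. If any comparison fails, it prints the string no_match and moves on to the next line. If the loop finishes (no mismatch was found), it prints the first field.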
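The question's script also writes a header line ("alleles") and skips the first input line. A hedged sketch of how the same logic could cover that case follows; it assumes the first line of the input is a header, as in the 100-line excerpt, and the wrapper name same_alleles is hypothetical.

```shell
#!/bin/bash

# same_alleles: hypothetical wrapper around the awk logic described above.
# Assumes the first line of the input file is a header; it is replaced
# by the literal header "alleles" used in the question's script.
same_alleles() {
    awk -F, '
        NR == 1 { print "alleles"; next }   # emit our own header, skip the input header
        {
            for (i = 2; i <= NF; ++i)
                if ($i != $1) { print "no_match"; next }
            print $1
        }
    ' "$1"
}

# Demo on a small temporary file built from lines in the question.
tmp=$(mktemp)
printf 'allele\n1-69\n4-59,4-59,4-61\n6-1,6-1\n' > "$tmp"
same_alleles "$tmp"
rm -f "$tmp"
```

In real use the output would be redirected, for example same_alleles input.csv > same_alleles.txt, where input.csv stands for whatever file was passed to the original script.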

As a script:

#!/usr/bin/awk -f

BEGIN { FS = "," }

{
    for (i=2; i<=NF; ++i)
        if ($i != $1) {
            print "no_match"
            next
        }

    print $1
}

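FS is the input field separator; it can also be set on the command line with the -F option. awk splits each line on this character to create the fields.
NF is the number of fields in the current record ("columns on the line").
$i refers to the i:th field of the current record, where i may be a variable or a constant (as in $1).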
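The three names can be seen in action with a small sketch that prints NF and every $i for each line. The helper name show_fields is made up for this illustration.

```shell
#!/bin/bash

# show_fields: hypothetical helper that prints, for each input line,
# the field count NF followed by every field $i with its index.
show_fields() {
    awk -F, '{
        printf "NF=%d:", NF
        for (i = 1; i <= NF; ++i)
            printf " $%d=%s", i, $i
        print ""
    }'
}

# Demo on two lines in the question's format.
printf '1-46,1-46\n4-59,4-61,4-61\n' | show_fields
```

For the first demo line this prints NF=2: $1=1-46 $2=1-46.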

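Unix & Linux user

Posted 2018-08-06 19:21:53

Awk is a complete programming language, and you are already using it. But instead of invoking it several times per line for trivial tasks, use it for the whole job. Use awk's field separator rather than cut, and do all of the processing in awk.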

awk -F',' '
{
  eq = 1                       # assume every field equals the first
  for (i = 2; i <= NF; i++)
    if ($1 != $i) {
      eq = 0
      break                    # one mismatch is enough
    }
  print eq ? $1 : "no_match"
}
' "$1"

Score: 1

Unix & Linux user

Posted 2018-08-06 20:29:35

With Perl's List::MoreUtils, by evaluating the list of distinct / uniq elements in scalar context, which yields their count:

perl -MList::MoreUtils=distinct -F, -lne '
  print( (distinct @F) > 1 ? "no_match" : $F[0])
' example 
1-69
no_match
1-46
no_match
6-1
5-51
4-59
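For comparison, the same "count the distinct values" idea can be sketched in awk using an array as a set. This is an illustrative counterpart, not part of the Perl answer, and the function name distinct_check is hypothetical.

```shell
#!/bin/bash

# distinct_check: hypothetical awk counterpart of the Perl one-liner.
# For each line, count the distinct field values; more than one
# distinct value means the columns differ.
distinct_check() {
    awk -F, '{
        split("", seen)              # clear the per-line set (portable idiom)
        n = 0
        for (i = 1; i <= NF; ++i)
            if (!($i in seen)) { seen[$i] = 1; n++ }
        print (n > 1 ? "no_match" : $1)
    }'
}

# Demo on three lines taken from the question.
printf '1-69\n4-59,4-59,4-61\n6-1,6-1\n' | distinct_check
```

Unlike the early-exit loop in the accepted answer, this variant always visits every field, so it is shown for clarity rather than speed.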

Score: 1

Original page content provided by Unix & Linux.

Original link:

https://unix.stackexchange.com/questions/460887
