Suppose we have a data file in the following format:
$ cat data.txt
a:23 b:25 c:76 d:45
a:21 b:24 c:25
a:20 d:52 e:75 f:75 g:52
...
(many lines)
...
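Assuming this file is too large to read into memory, what is the fastest way to convert this data to CSV format?
The output should include a header containing every possible "key" in the file; if a line is missing a particular key, the value for that key on that line should be zero. For example: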
$ cat csv.txt
//a,b,c,d,e,f,g
23,25,76,45,0,0,0
21,24,25,0,0,0,0
20,0,0,52,75,75,52
...
(many lines)
...
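To make the per-row rule concrete, here is a minimal Python sketch of how a single input line maps to an output row once the full key list is known. The name `row_to_csv` and its parameters are hypothetical, chosen only for illustration, and the key list is taken from the example above.

```python
def row_to_csv(line, keys, default="0"):
    """Convert one 'k:v k:v ...' line into a CSV row ordered by `keys`."""
    # Parse the whitespace-separated key:value pairs on this line.
    pairs = dict(field.split(":", 1) for field in line.split())
    # Emit one value per known key, falling back to the default when absent.
    return ",".join(pairs.get(key, default) for key in keys)


# Key list assumed from the example header above.
keys = ["a", "b", "c", "d", "e", "f", "g"]
print(row_to_csv("a:21 b:24 c:25", keys))  # prints 21,24,25,0,0,0,0
```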
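Here is what I tried. It works, but I have a feeling that all the loops are slowing me down. Is there a faster, more optimized way to do this? I'm using Perl, but I'm certainly open to switching to Python or something else.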
# transform_test.pl
use strict;
use warnings;

# First pass: build the set of all keys used anywhere in the file.
my %usedKey;
open(my $in, '<', 'data.txt') or die "Cannot open data.txt: $!";
while (<$in>) {
    chomp;
    my @fields = split;
    foreach my $field (@fields) {
        my ($key, $value) = split(":", $field);
        $usedKey{$key} = 1;
    }
}
close $in;

# Build a sorted array of all used keys.
my @sorted_keys = sort keys %usedKey;

# Print the header.
my $header = "//";
foreach my $key (@sorted_keys) { $header .= "$key,"; }
chop $header;
print "$header\n";

# Second pass: read through the file again to transform the data.
open($in, '<', 'data.txt') or die "Cannot open data.txt: $!";
while (<$in>) {
    chomp;
    # Build the hash for the current line.
    my @fields = split;
    my %currentData;
    foreach my $field (@fields) {
        my ($key, $value) = split(":", $field);
        $currentData{$key} = $value;
    }
    # Build the output string by looping over all sorted keys.
    my $toPrint = "";
    foreach my $key (@sorted_keys) {
        $toPrint .= defined $currentData{$key} ? "$currentData{$key}," : "0,";
    }
    chop $toPrint;
    print "$toPrint\n";
}
close $in;
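Since switching to Python is on the table, the same two-pass idea can be sketched there as well. This is only an illustrative translation of the approach above, not a benchmarked or definitive answer: the function names `collect_keys` and `convert` are hypothetical, and it assumes every field has the form `key:value`. Both passes stream the file line by line, so only the key set and one row are held in memory at a time.

```python
def collect_keys(path):
    """First pass: gather every key that appears anywhere in the file."""
    keys = set()
    with open(path) as f:
        for line in f:  # iterating the file object reads one line at a time
            for field in line.split():
                keys.add(field.split(":", 1)[0])
    return sorted(keys)


def convert(path, out):
    """Second pass: write the '//'-prefixed header, then one CSV row per line."""
    keys = collect_keys(path)
    out.write("//" + ",".join(keys) + "\n")
    with open(path) as f:
        for line in f:
            # Map key -> value for this line only.
            row = dict(field.split(":", 1) for field in line.split())
            # Missing keys are filled with "0", as in the Perl version.
            out.write(",".join(row.get(k, "0") for k in keys) + "\n")
```

With these definitions, something like `convert("data.txt", sys.stdout)` (after `import sys`) would print the header and rows in the same layout as the Perl script. Whether it is actually faster would have to be measured on the real data.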
https://stackoverflow.com/questions/38461066