Question: Binary search in a sorted (memory-mapped?) file in Java

Stack Overflow user

Asked 2009-04-10 02:39:27

8 answers | 17.9K views | 0 followers | 33 votes

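I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a sorted text file of more than 500 GB using binary search (essentially: "seek" to a byte offset in the middle of the file, backtrack to the nearest newline, compare the line prefix with the search string, "seek" to half or double that byte offset, and repeat until found...).

I have experimented with several database solutions, but found that nothing beats this in sheer lookup speed for data sets of this size. Do you know of any existing Java library that implements this functionality? Failing that, could you point me to some idiomatic example code that does random-access reads in text files?

Alternatively, I am not familiar with the new (?) Java I/O libraries, but would it be an option to memory-map the 500 GB text file (I am on a 64-bit machine with memory to spare) and do the binary search on the memory-mapped byte array? I would be very interested to hear any experience you have with this and similar problems.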

Tags: nio, large-files, binary-search, memory-mapping, java

8 answers

Stack Overflow user

Accepted answer

Posted 2009-04-10 07:03:44

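I am a big fan of Java's MappedByteBuffer for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going.

I have similar code (seek, read, repeat until done) in my own application. I benchmarked java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio'), with raw data, graphs and all.

Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

Files larger than about 2 GB are a problem because of the int cast and .position(int pos), so I crafted a paging algorithm backed by an array of MappedByteBuffers. You will need a 64-bit system to handle files larger than 2-4 GB, because MBBs use the operating system's virtual memory system to work their magic.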

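import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

public class StusMagicLargeFileReader {
    // One mapping per page; a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes.
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private final List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte[] raw = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        try (FileChannel fc = new FileInputStream(file).getChannel()) {
            StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
            // Start in the middle of the file and walk backwards one byte at a time.
            long position = file.length() / 2;
            String candidate = buffer.getString(position--);
            // Compare with the String "\n"; equals('\n') would compare with a Character and never match.
            while (position >= 0 && !candidate.equals("\n"))
                candidate = buffer.getString(position--);
            // position + 1 is the newline offset, or 0 if the start of the file was reached...do other stuff
        }
    }

    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index * PAGE_SIZE); // last, partial page
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add(channel.map(READ_ONLY, start, length)); // pages are appended in order
        }
    }

    // Returns the single byte at bytePosition as a one-character String (platform default charset).
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}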
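
The "do other stuff" step in the snippet above is where the actual lookup would go. As a rough illustration only, here is how a prefix binary search over a mapped buffer might look. It assumes the file is sorted in unsigned byte order (for example with LC_ALL=C sort), uses '\n' line endings, is UTF-8 encoded, and fits in a single mapping (under 2 GB). The names MappedPrefixSearch, map, lineStart, lineEnd, comparePrefix and findByPrefix are invented for this sketch and are not part of the original answer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original answer.
class MappedPrefixSearch {

    /** Maps a whole file read-only. Assumption: the file is smaller than 2 GB. */
    static ByteBuffer map(Path path) throws IOException {
        try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        }
    }

    /** Offset of the first byte of the line that contains pos. */
    static int lineStart(ByteBuffer buf, int pos) {
        while (pos > 0 && buf.get(pos - 1) != '\n') pos--;
        return pos;
    }

    /** Offset of the '\n' that ends the line starting at start, or the buffer limit. */
    static int lineEnd(ByteBuffer buf, int start) {
        int end = start;
        while (end < buf.limit() && buf.get(end) != '\n') end++;
        return end;
    }

    /** Negative: line sorts before the key; zero: line starts with the key; positive: after. */
    static int comparePrefix(ByteBuffer buf, int start, int end, byte[] key) {
        int n = Math.min(end - start, key.length);
        for (int i = 0; i < n; i++) {
            int diff = (buf.get(start + i) & 0xFF) - (key[i] & 0xFF);
            if (diff != 0) return diff;
        }
        // A line shorter than the key sorts before it; otherwise the line has the prefix.
        return (end - start) < key.length ? -1 : 0;
    }

    /** Returns all lines that start with prefix, in file order. */
    static List<String> findByPrefix(ByteBuffer buf, String prefix) {
        byte[] key = prefix.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = buf.limit();
        // Binary search over byte offsets for the first line that is not before the key.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int start = lineStart(buf, mid);
            int end = lineEnd(buf, start);
            if (comparePrefix(buf, start, end, key) < 0) lo = end + 1; // continue after this line
            else hi = start;                                           // this line or an earlier one
        }
        // Collect consecutive lines that carry the prefix.
        List<String> result = new ArrayList<>();
        for (int pos = lo; pos < buf.limit(); ) {
            int end = lineEnd(buf, pos);
            if (comparePrefix(buf, pos, end, key) != 0) break;
            byte[] line = new byte[end - pos];
            for (int i = 0; i < line.length; i++) line[i] = buf.get(pos + i);
            result.add(new String(line, StandardCharsets.UTF_8));
            pos = end + 1;
        }
        return result;
    }
}
```

A call such as MappedPrefixSearch.findByPrefix(MappedPrefixSearch.map(path), "ban") would return the matching lines. For files beyond 2 GB, the absolute get calls would have to go through a paged reader like the one above, with long offsets instead of int.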

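29 votes

Stack Overflow user

Posted 2009-06-15 21:17:45

I have the same problem. I am trying to find all lines in a sorted file that start with a given prefix.

Here is a method I cooked up, which is largely a port of the Python code found here: http://www.logarithmic.net/pfh/blog/01186620415

I have tested it, but not very thoroughly yet. Note that it does not use memory mapping.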

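import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Returns every line of the sorted file that starts with the given string.
// Place this method inside any class. readLine() maps each byte to one char, so ASCII data is assumed.
public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    File file = new File(filename);
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        long length = file.length();
        long low = 0;
        long high = length;

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            // Walk back from mid to the nearest newline; the line after it is the probe.
            p = mid;
            while (p >= 0) {
                raf.seek(p);
                char c = (char) raf.readByte();
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            // line is null when mid is the file's final newline; treat that as "past the target".
            if (line != null && line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        // Back up from low to the newline before the first candidate line.
        // Clamp to the last byte so readByte() cannot hit end-of-file when nothing matches.
        p = Math.min(low, length - 1);
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }
        if (p < 0)
            raf.seek(0);

        // Collect consecutive lines that share the prefix.
        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}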

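4 votes

Stack Overflow user

Posted 2009-04-10 11:56:42

I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this: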

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

class ExternalBinarySearch {
    final RandomAccessFile file;
    // Compares a line (first argument) with the search element (second argument).
    // Insert a PrefixComparator here.
    final Comparator<String> test;

    public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
        this.file = new RandomAccessFile(f, "r");
        this.test = test;
    }

    public String search(String element) throws IOException {
        long l = file.length();
        return search(element, -1, l - 1);
    }

    /**
     * Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
     * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
     */
    private String search(String element, long low, long high) throws IOException {
        if (high - low < 1024) {
            // search directly
            long p = low;
            while (p < high) {
                String line = nextLine(p);
                if (line == null) {
                    return null; // no further line starts in this range
                }
                int r = test.compare(line, element);
                if (r > 0) {
                    return null;
                } else if (r < 0) {
                    p += line.length(); // line keeps its '\n'; one byte per char is assumed
                } else {
                    return line;
                }
            }
            return null;
        } else {
            long m = low + ((high - low) / 2);
            String line = nextLine(m);
            if (line == null) {
                return null; // no line start found near m (see the 1024-byte limit below)
            }
            int r = test.compare(line, element);
            if (r > 0) {
                return search(element, low, m);
            } else if (r < 0) {
                return search(element, m, high);
            } else {
                return line;
            }
        }
    }

    private String nextLine(long low) throws IOException {
        if (low == -1) { // Beginning of file
            file.seek(0);
        } else {
            file.seek(low);
        }
        int bufferLength = 65 * 1024;
        byte[] buffer = new byte[bufferLength];
        int r = file.read(buffer); // bytes actually read, or -1 at end of file
        int lineBeginIndex = -1;

        // search beginning of line
        if (low == -1) { // beginning of file
            lineBeginIndex = 0;
        } else {
            // normal mode: look for the next newline, but never past the bytes read
            for (int i = 0; i < Math.min(r, 1024); i++) {
                if (buffer[i] == '\n') {
                    lineBeginIndex = i + 1;
                    break;
                }
            }
        }
        if (lineBeginIndex == -1 || lineBeginIndex >= r) {
            // no line begins within next 1024 bytes, or the file ends here
            return null;
        }
        for (int i = lineBeginIndex; i < r; i++) {
            if (buffer[i] == '\n') {
                // Found end of line; the returned string keeps the trailing '\n'
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
            }
        }
        // Also reached when the last line has no trailing newline.
        throw new IllegalArgumentException("Line too long");
    }
}

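Please note: I wrote this code ad hoc. The corner cases are not tested nearly well enough, the code assumes that no single line is longer than 64K, and so on.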
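
The code above expects a PrefixComparator but does not show one. A minimal sketch, assuming ASCII data and the argument order used by search (line first, search element second), could be the following. The class is hypothetical and deliberately asymmetric, so it does not satisfy the general Comparator contract and is only meant for this kind of lookup.

```java
import java.util.Comparator;

// Hypothetical comparator: a line counts as "equal" to the key when it starts with the key.
class PrefixComparator implements Comparator<String> {
    @Override
    public int compare(String line, String prefix) {
        // Look only at the leading prefix.length() characters of the line,
        // so a trailing '\n' or the rest of the line does not affect the result.
        String head = line.length() > prefix.length()
                ? line.substring(0, prefix.length())
                : line;
        // A line shorter than the prefix compares as smaller when it is a proper prefix of it.
        return head.compareTo(prefix);
    }
}
```

It would be passed in as new ExternalBinarySearch(file, new PrefixComparator()). String.compareTo orders by UTF-16 code units, which matches the file's sort order only if the file was sorted the same way; for plain ASCII sorted bytewise that holds.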

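I also think that building an index of the offsets at which lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. With such an index you should gain a not-so-small constant factor, because there is no longer any need to search for the next line start at each step.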
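
To make the index idea concrete, here is one possible shape for it, offered as a sketch rather than a tested design. It assumes ASCII data, '\n' line endings, and an index file that simply stores each line's start offset as an 8-byte big-endian long; the names LineIndex, build, findFirst and lineAt are invented for this example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical helper: the binary search runs over line numbers instead of byte offsets.
class LineIndex {

    /** One sequential pass over the data file, writing the start offset of every line. */
    static void build(File data, File index) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(data));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(index)))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) out.writeLong(pos); // 8 bytes per line
                atLineStart = (b == '\n');
                pos++;
            }
        }
    }

    /** Returns the first line that starts with prefix, or null if there is none. */
    static String findFirst(File data, File index, String prefix) throws IOException {
        try (RandomAccessFile d = new RandomAccessFile(data, "r");
             RandomAccessFile ix = new RandomAccessFile(index, "r")) {
            long lines = ix.length() / Long.BYTES;
            long lo = 0, hi = lines;
            // Find the first line number whose line is not smaller than the prefix.
            while (lo < hi) {
                long mid = (lo + hi) >>> 1;
                if (lineAt(d, ix, mid).compareTo(prefix) < 0) lo = mid + 1;
                else hi = mid;
            }
            if (lo == lines) return null;
            String line = lineAt(d, ix, lo);
            return line.startsWith(prefix) ? line : null;
        }
    }

    /** Reads line number n (0-based) by looking up its offset in the index file. */
    static String lineAt(RandomAccessFile d, RandomAccessFile ix, long n) throws IOException {
        ix.seek(n * Long.BYTES);
        d.seek(ix.readLong());
        return d.readLine(); // maps each byte to one char, hence the ASCII assumption
    }
}
```

Each probe then costs one small read in the index plus one line read in the data file, with no backwards scan for a newline. The trade-off is the one-time sequential pass to build the index and the extra 8 bytes of disk per line.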

I know that was not the question, but building a prefix-tree data structure such as a (Patricia) trie (on disk or SSD) might be a good way to do the prefix search.

2 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/736556
