最近看到一道经典面试题: 在40亿的unsigned int数据中(乱序),给定一个数字target, 判断该target是否存在于这40亿的数据中?
准备工作: 如下代码随机生成[1, 2147483648)的整数集保存在D盘根目录下a.txt,生成数据(一行一个整数)之后(约占磁盘40G),用代码再统计一下生成的数字有3999999040(嗯?有点奇怪,不是生成40亿的数据吗?怎么是39亿多?鬼知道呢?不过够用了)
public static void build4BNumberLists() {
try {
BufferedWriter bw = new BufferedWriter(new FileWriter("D:/a.txt"));
int maxValue = 2147483647;
for (long i = 0; i < 4000000000L; i ++ ) {
int value = new Random().nextInt(maxValue) + 1;
System.out.println(value);
if (i % 1000000 == 0 ) {
System.out.println("遍历到:" + i);
}
bw.write(value + "\n");
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void totalNumberLists() {
try {
BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
String line;
long count = 0;
while((line = br.readLine()) != null) {
System.out.println(line);
count ++;
}
System.out.println("total number of numbers in this file is:" + count);
} catch (Exception e) {
}
}
i. 使用set集合add操作,将40亿的数据一次性加载进内存,然后只需要使用contains方法判断target是否存在即可
问题: 一个unsigned int的元素,需要占4B的空间,按照最坏的打算,40亿个整数不重复,那么需要40亿*4B(1.6 * 10^10B 约等 16GB)的内存,采用这种方法的话,用我自己电脑16G内存跑的话,跑到add 6500w的数据的时候,CPU直接100%,内存就已经飙升到90%了,我还是赶紧关停了程序,缺点就是需要很大的内存
public static boolean findTargetInHashSet(int target) {
try {
BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
String line;
HashSet<Integer> set = new HashSet<>();
while ((line = br.readLine()) != null) {
int value = Integer.parseInt(line);
set.add(value);
++ count;
if (count % 1000000 == 0) {
System.out.println("load the " + count + "th number");
}
}
return set.contains(target);
} catch (Exception e) {
e.printStackTrace();
}
return false;
}
ii. 不适用set集合,直接将读过来的数据与target进行比较
public static boolean findTargetInLine(long target) {
try {
BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
String line;
while((line = br.readLine()) != null) {
long value = (Long.parseLong(line));
if (value == target) {
return true;
}
}
return false;
} catch (Exception e) {
e.printStackTrace();
}
return false;
}
iii. 使用bitmap算法(引用下维基百科的解释: In computing, a bitmap is a mapping from some domain (for example, a range of integers) to bits. It is also called a bit array or bitmap index. 在计算机中,bitmap是用作某个值(例如: 给定范围的整数),映射为位(bit), 也被叫做位数组或位图)。总而言之,就是一种映射关系,即用一个bit来代表一位unsigned long int,按照正常的空间来算1个unsigned long int占8个bit,8个字节,即64bit。 用1bit可以表示原来64bit的数据,这样就节省64倍的空间。
举个例子: 给定一个long类型的数组,向其中如下的一些数据,以下是具体的位图展示
long类型是8Byte = 8 * 8 = 64bit, 让每一个位代表一个值,假设这批数字的最大值max = 40亿, 这样我们可以开辟一个
(400000000 / 64 + 1)空间的大小, 数组中每一个long类型的值是64bit, 实际代表了64个long值:
a[0]: 0~63
a[1]: 64~127
.......
arr[62500000]:40亿~40亿+63, 开辟+1个空间大小的原因是下标是从0开始的,不+1的话,无法表示[40亿, 40亿+63]
开辟62500001个空间大小的long类型数组,为了好算,即625000000*8=5*10^8约等于0.5G = 512MB
如下是arr数组中的值,存储为bitmap之后的映射关系如下,存在即1不存在则0:
long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L, 3999999997L, 0, 1, 32, 63, 64, 65, 128, 256, 1999};
a[0][1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
a[1][1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[2][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[4][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[31][0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[62499999][0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
a[62500000][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
按照如上的原理分析, 写出如下几个函数:
// 开辟一个long数组,限定该代码中
public class BitLongMap {
private long maxSize;
private long arr[];
public BitLongMap() {
maxSize = 4000000000L;
int length = (int)(maxSize / 64) + 1;
arr = new long[length];
}
}
add(long value): 将数据集插入到分配的数组中,完成映射
public void add(long value) {
// 注意这里的1L一定要加L,不加L的话,默认是按照int 32位来处理的,得不到预期的结果
arr[(int)(value / 64)] |= 1L << (value % 64);
}
exist(long target): 判断给定的target是否存在于数据集中
public boolean exist(long target) {
return (arr[(int)(target / 64) ] & (1L << (target % 64))) != 0;
}
sort(): 给定的数据集进行排序
public void sort(long max) {
int row = (int)(max / 64L) + 1;
for (int i = 0; i < row; i ++ ) {
long value = arr[i];
for (int j = 0; j < 64; j ++ ) {
if ((value & (1L << j)) != 0) {
System.out.print(i * 64L + j + " ");
}
}
}
System.out.println();
}
完整代码如下
public class BitLongMap {
private long maxSize;
private long arr[];
public BitLongMap() {
maxSize = 4000000000L;
int length = (int)(maxSize / 64) + 1;
arr = new long[length];
}
public void add(long value) {
arr[(int)(value / 64)] |= 1L << (value % 64);
display(1);
System.out.println();
}
public boolean exist(long target) {
return (arr[(int)(target / 64) ] & (1L << (target % 64))) != 0;
}
// 打印对应的位信息
public void display(int row) {
for (int i = 0; i < row; i ++ ) {
long value = arr[i];
if (value != 0) {
List<Long> ans = new ArrayList<>();
for (int j = 0; j < 64; j ++ ) {
ans.add(value & 1);
value >>= 1;
}
System.out.println("a[" + i + "]" + ans.toString());
}
}
}
// 61 62 63 64 128 198 1111 1999 19999 2147483647 2147483648 3999999999 4000000000
public void sort(long max) {
int row = (int)(max / 64L) + 1;
for (int i = 0; i < row; i ++ ) {
long value = arr[i];
for (int j = 0; j < 64; j ++ ) {
if ((value & (1L << j)) != 0) {
System.out.print(i * 64L + j + " ");
}
}
}
System.out.println();
}
// 4B 不重复的正整数数中求出中位数
public void middleNumber() {
int row = (int)(maxSize / 64L) + 1;
long count = 0;
for (int i = 0; i < row; i ++ ) {
long value = arr[i];
for (int j = 0; j < 64; j ++ ) {
if ((value & (1L << j)) != 0) {
count ++;
}
}
}
System.out.println(count);
long index = 0;
// even
long leftValue = 0, middleValue = 0, rightValue = 0;
// index = count / 2, count / 2 - 1 求和 / 2
for (int i = 0; i < row; i ++ ) {
long value = arr[i];
for (int j = 0; j < 64; j ++ ) {
if ((value & (1L << j)) != 0) {
++ index;
if (index - 1 == count / 2 - 1) {
leftValue = i * 64L + j;
}
if (index - 1 == count / 2) {
rightValue = i * 64L + j;
middleValue = rightValue;
}
if (index > count / 2) {
break;
}
}
}
}
if ((count & 1) == 0) {
// 不过这里还是要考虑要爆long
System.out.println((leftValue + rightValue) / 2);
} else {
System.out.println(middleValue);
}
}
public static void main(String[] args) {
BitLongMap bitLongMap = new BitLongMap();
// long[] arr = new long[]{64, 63, 62, 61, 128, 198, 1111, 19999, 2147483647, 2147483648L, 1999, 3999999999L, 4000000000L};
// long[] arr = new long[]{64, 63, 62, 61, 128, 198, 1111, 1222, 19999, 2147483647, 2147483648L, 1999, 3999999999L, 4000000000L};
// long[] arr = new long[]{64, 32, 198};
// long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L};
// long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L, 3999999997L, 0, 1, 32, 63, 64, 65, 128, 256, 1999};
long[] arr = new long[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 31, 32, 33, 63};
for (int i = 0; i < arr.length; i ++ ) {
bitLongMap.add(arr[i]);
}
// System.out.println(bitLongMap.exist(65));
// System.out.println(bitLongMap.exist(64));
// System.out.println(bitLongMap.exist(63));
// System.out.println(bitLongMap.exist(60));
// System.out.println(bitLongMap.exist(55));
// System.out.println(bitLongMap.exist(32));
//
// System.out.println(1L << 63);
// System.out.println(1 << 31);
// bitLongMap.display((int)(4000000000L / 64) + 1);
// bitLongMap.sort(4000000000L);
// bitLongMap.middleNumber();
}
}
看了网上一些别人的文章,说40亿数据中查找target使用bitmap像是优秀的解法,但是我个人觉得这个bitmap算法在这个场景没什么用?原因是: bitmap虽然节省很多内存,但是为什么一定要把数据一次性加载进内存呢?比如使用上述第二种方法bufferedreader按行读取即可, 时间复杂度也是O(n)。
当然我认为bitmap是在如下的场景下会更适用些(请注意题目的约束条件这里只描述了大致意思):
等等......
最后为了更好的学习bitmap的相关算法,JDK中提供了BitSet类以及Google的EWAHCompressedBitmap都有具体的源代码参考,创作不易,点个赞再走~
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
扫码关注腾讯云开发者
领取腾讯云代金券
Copyright © 2013 - 2025 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有
深圳市腾讯计算机系统有限公司 ICP备案/许可证号:粤B2-20090059 深公网安备号 44030502008569
腾讯云计算(北京)有限责任公司 京ICP证150476号 | 京ICP备11018762号 | 京公网安备号11010802020287
Copyright © 2013 - 2025 Tencent Cloud.
All Rights Reserved. 腾讯云 版权所有