判断一个数是否在40亿个整数中？

原创

Ritchie

修改于 2023-03-03 09:31:26

1.3K00

代码可运行

文章被收录于专栏：Ritchie的专栏Ritchie的专栏

运行总次数：0

代码可运行

最近看到一道经典面试题: 在40亿的unsigned int数据中(乱序)，给定一个数字target，判断该target是否存在于这40亿的数据中？

准备工作: 如下代码随机生成[1, 2147483648)的整数集保存在D盘根目录下a.txt，生成数据(一行一个整数)之后（约占磁盘40G），用代码再统计一下生成的数字有3999999040（嗯？有点奇怪，不是生成40亿的数据吗？怎么是39亿多？鬼知道呢？不过够用了）

public static void build4BNumberLists() {
		try {
			BufferedWriter bw = new BufferedWriter(new FileWriter("D:/a.txt"));
			int maxValue = 2147483647;
			for (long i = 0; i < 4000000000L; i ++ ) {
				int value = new Random().nextInt(maxValue) + 1;
				System.out.println(value);
				if (i % 1000000 == 0 ) {
					System.out.println("遍历到：" + i);
				}
				bw.write(value + "\n");
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

public static void totalNumberLists() {
   try {
      BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
      String line;
      long count = 0;
      while((line = br.readLine()) != null) {
         System.out.println(line);
         count ++;
      }
      System.out.println("total number of numbers in this file is:" + count);
   } catch (Exception e) {

   }
}

i. 使用set集合add操作，将40亿的数据一次性加载进内存，然后只需要使用contains方法判断target是否存在即可

问题: 一个unsigned int的元素，需要占4B的空间，按照最坏的打算，40亿个整数不重复，那么需要40亿*4B（1.6 * 10^10B 约等 16GB）的内存，采用这种方法的话，用我自己电脑16G内存跑的话，跑到add 6500w的数据的时候，CPU直接100%，内存就已经飙升到90%了，我还是赶紧关停了程序，缺点就是需要很大的内存

public static boolean findTargetInHashSet(int target) {
		try {
			BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
			String line;

			HashSet<Integer> set = new HashSet<>();
			while ((line = br.readLine()) != null) {
				int value = Integer.parseInt(line);
				set.add(value);
				++ count;
				if (count % 1000000 == 0) {
					System.out.println("load the " + count + "th number");
				}
			}
			return set.contains(target);
		} catch (Exception e) {
			e.printStackTrace();
		}
		return false;
	}

ii. 不适用set集合，直接将读过来的数据与target进行比较

public static boolean findTargetInLine(long target) {
		try {
			BufferedReader br = new BufferedReader(new FileReader("D:/a.txt"));
			String line;
			while((line = br.readLine()) != null) {
				long value = (Long.parseLong(line));
				if (value == target) {
					return true;
				}
			}
			return false;
		} catch (Exception e) {
			e.printStackTrace();
		}
		return false;
	}

iii. 使用bitmap算法（引用下维基百科的解释: In computing, a bitmap is a mapping from some domain (for example, a range of integers) to bits. It is also called a bit array or bitmap index. 在计算机中，bitmap是用作某个值（例如: 给定范围的整数），映射为位(bit)，也被叫做位数组或位图）。总而言之，就是一种映射关系，即用一个bit来代表一位unsigned long int，按照正常的空间来算1个unsigned long int占8个bit，8个字节，即64bit。用1bit可以表示原来64bit的数据，这样就节省64倍的空间。

举个例子:  给定一个long类型的数组，向其中如下的一些数据，以下是具体的位图展示
long类型是8Byte = 8 * 8 = 64bit， 让每一个位代表一个值，假设这批数字的最大值max = 40亿, 这样我们可以开辟一个
(400000000 / 64 + 1)空间的大小， 数组中每一个long类型的值是64bit, 实际代表了64个long值:
a[0]: 0~63
a[1]: 64~127
.......
arr[62500000]:40亿~40亿+63, 开辟+1个空间大小的原因是下标是从0开始的，不+1的话，无法表示[40亿, 40亿+63]

开辟62500001个空间大小的long类型数组，为了好算，即625000000*8=5*10^8约等于0.5G = 512MB


如下是arr数组中的值，存储为bitmap之后的映射关系如下，存在即1不存在则0:
long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L, 3999999997L, 0, 1, 32, 63, 64, 65, 128, 256, 1999};

a[0][1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
a[1][1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[2][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[4][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[31][0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
a[62499999][0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
a[62500000][1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

按照如上的原理分析, 写出如下几个函数:

// 开辟一个long数组，限定该代码中
public class BitLongMap {
	
	private long maxSize;
	private long arr[];
	
	public BitLongMap() {
		maxSize = 4000000000L;
		int length = (int)(maxSize / 64) + 1;
		arr = new long[length];
	}
}

add(long value): 将数据集插入到分配的数组中，完成映射

public void add(long value) {
		// 注意这里的1L一定要加L，不加L的话，默认是按照int 32位来处理的，得不到预期的结果
		arr[(int)(value / 64)] |= 1L << (value % 64);
	}

exist(long target): 判断给定的target是否存在于数据集中

public boolean exist(long target) {
		return (arr[(int)(target / 64) ] & (1L << (target % 64))) != 0;
	}

sort(): 给定的数据集进行排序

public void sort(long max) {
		int row = (int)(max / 64L) + 1;
		for (int i = 0; i < row; i ++ ) {
			long value = arr[i];
			for (int j = 0; j < 64; j ++ ) {
				if ((value & (1L << j)) != 0) {
					System.out.print(i * 64L + j + " ");
				}
			}
		}
		System.out.println();
	}

完整代码如下

public class BitLongMap {
	
	private long maxSize;
	private long arr[];
	
	public BitLongMap() {
		maxSize = 4000000000L;
		int length = (int)(maxSize / 64) + 1;
		arr = new long[length];
	}

	public void add(long value) {
		arr[(int)(value / 64)] |= 1L << (value % 64);
		display(1);
		System.out.println();

	}
	
	public boolean exist(long target) {
		return (arr[(int)(target / 64) ] & (1L << (target % 64))) != 0;
	}
	
	// 打印对应的位信息
	public void display(int row) {
		for (int i = 0; i < row; i ++ ) {
			long value = arr[i];
			if (value != 0) {
				List<Long> ans = new ArrayList<>();
				for (int j = 0; j < 64; j ++ ) {
					ans.add(value & 1);
					value >>= 1;
				}
				System.out.println("a[" + i + "]" + ans.toString());
			}
			
		}
	}
	
	// 61 62 63 64 128 198 1111 1999 19999 2147483647 2147483648 3999999999 4000000000
	public void sort(long max) {
		int row = (int)(max / 64L) + 1;
		for (int i = 0; i < row; i ++ ) {
			long value = arr[i];
			for (int j = 0; j < 64; j ++ ) {
				if ((value & (1L << j)) != 0) {
					System.out.print(i * 64L + j + " ");
				}
			}
		}
		System.out.println();
	}
	
	// 4B 不重复的正整数数中求出中位数
	public void middleNumber() {
		int row = (int)(maxSize / 64L) + 1;
		long count = 0;
		for (int i = 0; i < row; i ++ ) {
			long value = arr[i];
			for (int j = 0; j < 64; j ++ ) {
				if ((value & (1L << j)) != 0) {
					count ++;
				}
			}
		}
		System.out.println(count);
		long index = 0;
		// even
		long leftValue = 0, middleValue = 0, rightValue = 0;
		// index = count / 2, count / 2 - 1 求和 / 2
		for (int i = 0; i < row; i ++ ) {
			long value = arr[i];
			for (int j = 0; j < 64; j ++ ) {
				
				if ((value & (1L << j)) != 0) {
					++ index;
					if (index - 1 == count / 2 - 1) {
						leftValue = i * 64L + j;
					}
					if (index - 1 == count / 2) {
						rightValue = i * 64L + j;
						middleValue = rightValue;
					}
					
					if (index > count / 2) {
						break;
					}
				}
			}
		}
		
		if ((count & 1) == 0) {
			// 不过这里还是要考虑要爆long
			System.out.println((leftValue + rightValue) / 2);
		} else {
			System.out.println(middleValue);
		}
		
	}
	
	
	public static void main(String[] args) {
		BitLongMap bitLongMap = new BitLongMap();
//		long[] arr = new long[]{64, 63, 62, 61, 128, 198, 1111, 19999, 2147483647, 2147483648L, 1999, 3999999999L, 4000000000L};
//		long[] arr = new long[]{64, 63, 62, 61, 128, 198, 1111, 1222, 19999, 2147483647, 2147483648L, 1999, 3999999999L, 4000000000L};
//		long[] arr = new long[]{64, 32, 198};
//		long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L};
		
//		long[] arr = new long[]{4000000000L, 3999999999L, 3999999998L, 3999999997L, 0, 1, 32, 63, 64, 65, 128, 256, 1999};
		long[] arr = new long[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 31, 32, 33, 63};
		for (int i = 0; i < arr.length; i ++ ) {
			bitLongMap.add(arr[i]);
		}
		
//		System.out.println(bitLongMap.exist(65));
//		System.out.println(bitLongMap.exist(64));
//		System.out.println(bitLongMap.exist(63));
//		System.out.println(bitLongMap.exist(60));
//		System.out.println(bitLongMap.exist(55));
//		System.out.println(bitLongMap.exist(32));
//
//		System.out.println(1L << 63);
//		System.out.println(1 << 31);
//		bitLongMap.display((int)(4000000000L / 64) + 1);
//		bitLongMap.sort(4000000000L);
//		bitLongMap.middleNumber();
	}
}

看了网上一些别人的文章，说40亿数据中查找target使用bitmap像是优秀的解法，但是我个人觉得这个bitmap算法在这个场景没什么用？原因是: bitmap虽然节省很多内存，但是为什么一定要把数据一次性加载进内存呢？比如使用上述第二种方法bufferedreader按行读取即可, 时间复杂度也是O(n)。

当然我认为bitmap是在如下的场景下会更适用些（请注意题目的约束条件这里只描述了大致意思）：