Fortunately, we can use hashing to number the elements of $S$. Suppose $h$ is an $n$-bit hash function. We're going to represent a set $S = \{x_1, \ldots, x_{|S|}\}$ using a bit array containing $2^n$ elements. In particular, for each $x_j$ we set the $h(x_j)$th element in the bit array, where we regard $h(x_j)$ as a number in the range $0, 1, \ldots, 2^n - 1$. More explicitly, we can add an element $x$ to the set by setting bit number $h(x)$ in the bit array. And we can test whether $x$ is an element of $S$ by checking whether bit number $h(x)$ in the bit array is set.
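To make the scheme concrete, here is a minimal sketch in Python. The class name `HashedBitArray` and the choice of SHA-256 (reduced to its low $n$ bits) as the hash function are assumptions made for illustration; the text does not specify either.

```python
import hashlib


class HashedBitArray:
    """Represent a set with one bit per possible n-bit hash value."""

    def __init__(self, n_bits):
        self.size = 2 ** n_bits                    # number of bits in the array
        self.bits = bytearray((self.size + 7) // 8)

    def _index(self, item):
        # Assumed hash: SHA-256 of the item, reduced to the range 0 .. 2^n - 1.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, item):
        # Adding an element sets the bit numbered by its hash.
        i = self._index(item)
        self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item):
        # Membership test: is the bit numbered by the hash set?
        i = self._index(item)
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```

A membership test can return a false positive when two different items hash to the same bit, but it never returns a false negative for an item that was added.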

Each thread maintained a connection to a redis server. For each domain being crawled by the thread, a redis key-value pair was used to keep track of the current position in the URL frontier file for that domain. I used redis (and the Python bindings) to store this information in a fashion that was both persistent and fast to look up. The persistence was important because it meant that the crawler could be stopped and started at will, without losing track of where it was in the URL frontier.
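The position tracking might look roughly like the sketch below. It is not the crawler's actual code: the function name `next_url`, the key prefix `frontier_pos:`, and the one-URL-per-line file layout are all assumptions. The `store` argument only needs `get` and `set` methods, which a `redis.Redis` connection from the Python bindings provides.

```python
def next_url(store, domain, frontier_path):
    """Return the next URL for `domain`, or None when the frontier file is exhausted.

    `store` is any object with redis-style get/set, e.g. a redis.Redis connection.
    """
    key = "frontier_pos:" + domain           # hypothetical key naming scheme
    raw = store.get(key)                      # bytes, or None if never stored
    pos = int(raw) if raw is not None else 0
    with open(frontier_path, "rb") as f:
        f.seek(pos)                           # resume where the last call stopped
        line = f.readline()
        if not line:
            return None
        store.set(key, f.tell())              # persist the new byte offset
    return line.decode("utf-8").strip()
```

Because the offset lives in the store rather than in the thread, a restarted crawler that reconnects to the same store picks up at the same place in the file.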

The problem is that with an $n$-bit hash function, the basic hashing scheme used $n|S|$ bits of memory, while hashing into a bit array uses $2^n$ bits, but doesn't change the probability of failure. That's exponentially more memory! Intuitively, it's not hard to see why this approach is so memory inefficient compared to the basic hashing scheme.
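A quick calculation shows the size of the gap. The sketch assumes the basic scheme stores one $n$-bit hash per element (so $n|S|$ bits in total) and that the bit array holds one bit per possible hash value (so $2^n$ bits). The function names and the sample values are illustrative only.

```python
def basic_hashing_bits(n, set_size):
    # One n-bit hash stored per element of the set.
    return n * set_size


def bit_array_bits(n):
    # One bit for every possible n-bit hash value.
    return 2 ** n


# Illustrative values: a 32-bit hash and a set of one million elements.
n, set_size = 32, 1_000_000
print(basic_hashing_bits(n, set_size))   # 32000000 bits, about 4 MB
print(bit_array_bits(n))                 # 4294967296 bits, about 512 MB
```

Adding one bit to the hash adds $|S|$ bits to the first figure but doubles the second.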

    // Assumes ethers v5, where BigNumber and constants are top-level exports
    const { BigNumber, constants } = require("ethers");

    // From a decimal string
    BigNumber.from("42")

    // From a HexString
    BigNumber.from("0x2a")

    // From a negative HexString
    BigNumber.from("-0x2a")

    // From an Array (or Uint8Array)
    BigNumber.from([ 42 ])

    // From an existing BigNumber
    let one1 = constants.One;
    let one2 = BigNumber.from(one1)

    // which returns the same instance
    one1 === one2
    // true

    // From a (safe) number
    BigNumber.from(42)

    // From a ES2015 BigInt (only on platforms with BigInt support)
    BigNumber.from(42n)

    // Numbers outside the safe range fail:
    BigNumber.from(Number.MAX_SAFE_INTEGER);
    // [Error: overflow [ See: ]]

Second, the Ethers BigNumber provides all the functionality required internally and should generally be sufficient for most developers while not exposing some of the more advanced and rare functionality. So it will be easier to swap out the underlying library without impacting consumers.

The reason for using threads is that the Python standard library uses blocking I/O to handle HTTP network connections. This means that a single-threaded crawler would spend most of its time idling, usually waiting on the network connection of the remote machine being crawled. It's much better to use a multi-threaded crawler, which can make fuller use of the resources available on an EC2 instance. I chose the number of crawler threads (141) empirically: I kept increasing the number of threads until the speed of the crawler started to saturate. With this number of threads the crawler was using a considerable fraction of the CPU capacity available on the EC2 instance. My informal testing suggested that it was CPU which was the limiting factor, but that I was not so far away from the network and disk speed becoming bottlenecks; in this sense, the EC2 extra large instance was a good compromise. Memory usage was never an issue. It's possible that for this reason EC2's high-CPU extra large instance type would have been a better choice; I only experimented with this instance type with early versions of the crawler, which were more memory-limited.
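The general pattern of hiding blocking I/O behind several threads can be sketched as follows. This is an illustration rather than the crawler's actual code: `run_workers` and its `fetch` parameter are hypothetical names, and `fetch` stands in for whatever blocking download call is used.

```python
import queue
import threading


def run_workers(urls, fetch, num_threads):
    """Fetch every URL using num_threads worker threads; return {url: result}."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return                       # no work left for this thread
            try:
                page = fetch(url)            # blocks; other threads keep running
            except Exception as exc:
                page = exc                   # record the failure, keep crawling
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In practice `fetch` could wrap something like `urllib.request.urlopen(url).read()`, and `num_threads` would be tuned the way the text describes: raise it until throughput stops improving.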

