2020-06-04

How does HashMap work internally? An in-depth analysis of HashMap.

Concept analysis

HashMap class diagram structure


The class diagram here is based on JDK 1.6.
As shown in the figure:

HashMap storage structure


Using a HashMap is so simple that many programmers never ask the next question: how does it actually store its data, and what is its storage structure? In fact, step just a little past put and get and you see its true face. Put simply, the storage structure of HashMap is built from an array combined with linked lists.
As shown:


As the figure shows, HashMap is an array along the Y axis, with linked lists stored along the X axis. As everyone knows, an array occupies a contiguous block of memory of fixed size that, once allocated, cannot be taken over by other references. Its strength is fast lookup, O(1), while insertion and deletion are comparatively slow, O(n). A linked list is stored non-contiguously and has no fixed size; its characteristics are the opposite of the array's: fast insertion and deletion, slow lookup. HashMap can be seen as a compromise between the two.

The basic principle of HashMap


1. First determine whether the key is null. If it is null, look directly in the chain at Entry[0]. If it is not null, first compute the key's hashCode(), then apply a second hash to it; the result is the hash value, an int.

2. The hash value then has to be mapped onto the array: taking its remainder against the length of Entry[] yields the index into the Entry array.

3. Finding the array slot means finding the head of a linked list; inserting, deleting, and querying a value then proceed as ordinary linked-list operations.

Introduction to HashMap concept

Variable (term)

size
The number of entries currently stored in the HashMap.

threshold (critical value)
When the HashMap's size reaches this threshold, the table must be resized and reallocated.

loadFactor (load factor)
The load factor of the HashMap; the default is 0.75 (75%).

modCount (modification count)
The total number of structural modifications (insertions and deletions) made to the HashMap.

Entry (entity)
The actual node in which the HashMap stores an object, composed of key, value, hash, and next.
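
For reference, here is a simplified sketch of that Entry node as it appears in JDK 1.6 (a static nested class inside HashMap; accessor methods and equals/hashCode omitted):

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;       // the key; fixed once the node is created
    V value;           // the mapped value; replaced on a duplicate put
    Entry<K,V> next;   // the next node in the same bucket's chain
    final int hash;    // the cached hash, so it need not be recomputed

    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }
    // getKey(), getValue(), setValue(), equals(), hashCode() omitted
}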

HashMap initialization

By default, most people initialize a HashMap by calling new HashMap(). Here I analyze the constructor new HashMap(int initialCapacity, float loadFactor); the code is as follows:

public HashMap(int initialCapacity, float loadFactor) {
     // The initial capacity is capped at MAXIMUM_CAPACITY = 1 << 30.
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;

     // The load factor (DEFAULT_LOAD_FACTOR = 0.75) is validated; it later determines threshold.
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        // Find a power of 2 >= initialCapacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;

        this.loadFactor = loadFactor;
        threshold = (int)(capacity * loadFactor);
        table = new Entry[capacity];
        init();
    }
It can be seen from the above code that the real capacity must be settled at initialization: because the index into the Entry array is computed with a bitwise AND (see indexFor below) rather than an actual modulo, the length of the Entry array must be a power of two.
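
As a standalone illustration (my own sketch, not JDK code), the rounding loop turns any requested capacity into the next power of two:

public class CapacityDemo {
    public static void main(String[] args) {
        int initialCapacity = 10; // the requested capacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;       // 1 -> 2 -> 4 -> 8 -> 16
        System.out.println(capacity); // prints 16
    }
}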

Hash calculation and collision in HashMap

HashMap's hash calculation first calls the key's hashCode(), and then applies a second (supplemental) hash. The code is as follows:

// Second hash of the key's hashCode
int hash = hash(key.hashCode());

int i = indexFor(hash, table.length);
Before rushing into HashMap's own hash algorithm, let's first look at the JDK's String hash algorithm. The code is as follows:

 /**
     * Returns a hash code for this string. The hash code for a
     * <code>String</code> object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using <code>int</code> arithmetic, where <code>s[i]</code> is the
     * <i>i</i>th character of the string, <code>n</code> is the length of
     * the string, and <code>^</code> indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return  a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
As the JDK documentation shows, the formula is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where s[i] is the character at index i and n is the length of the string. Why the fixed constant 31? There has been plenty of discussion about this 31; in essence it is a deliberately tuned choice. The main reference is Joshua Bloch's Effective Java, quoted as follows:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

In short, 31 was chosen because it is an odd prime: with an even number, overflow during multiplication would lose information, since multiplying by 2 is just a shift. The advantage of using a prime is less clear, but it has become the traditional choice. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i is i shifted left 5 bits minus i, that is, 31 * i == (i << 5) - i. Modern VMs perform this optimization automatically.
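
Both claims are easy to check yourself; a minimal sketch:

public class HashConstantDemo {
    public static void main(String[] args) {
        // The documented String formula by hand: for "ab",
        // s[0]*31^1 + s[1]*31^0 = 97*31 + 98 = 3105.
        System.out.println("ab".hashCode()); // 3105
        System.out.println('a' * 31 + 'b');  // 3105

        // Multiplication by 31 as a shift and a subtraction.
        int i = 12345;
        System.out.println(31 * i);          // 382695
        System.out.println((i << 5) - i);    // 382695
    }
}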

Now to the main question: why does HashMap do a second hash? The code is as follows:

static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
Before answering this question, let's take a look at how HashMap uses the hash value to locate the array index.

/**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }
Here h is the hash value and length is the length of the array. Because length is a power of two, this bitwise AND is exactly the remainder h % length. This is the classic grouping trick: to split, say, 100 numbers into 16 groups, you take each number modulo 16. It is used everywhere.

Now that we know the grouping principle, let's look at a few examples. The code is as follows:

        int h=15,length=16;
        System.out.println(h & (length-1));
        h=15+16;
        System.out.println(h & (length-1));
        h=15+16+16;
        System.out.println(h & (length-1));
        h=15+16+16+16;
        System.out.println(h & (length-1));
The result is 15 every time. Why? Convert to binary and it becomes clear:

System.out.println(Integer.parseInt("0001111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0011111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0111111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("1111111", 2) & Integer.parseInt("0001111", 2));
Here you can see that in the bitwise AND, only the low bits take part in the result; the high bits contribute nothing, because the corresponding bits of length-1 are 0. The consequence: as long as the low bits are the same, the index is the same no matter what the high bits are. Left unfixed, keys whose hashCodes differ only in the high bits would always collide into the same array slot, the chain at that slot would grow without bound, and lookups would degrade into a slow linear scan; hardly high performance. So HashMap must solve this and spread keys across the array as evenly as possible, avoiding such hash pile-ups.
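
To see the cure in action ahead of the explanation, here is a small self-contained sketch (the two hash values are made up for the demonstration) using the JDK 1.6 mixing function that is analyzed next:

public class SecondHashDemo {
    // The supplemental hash from JDK 1.6's HashMap (shown again below).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int length = 16;
        int h1 = 15;              // low bits 1111
        int h2 = (1 << 20) | 15;  // same low bits, different high bits

        // Raw hashCodes: both land in bucket 15.
        System.out.println(h1 & (length - 1));       // 15
        System.out.println(h2 & (length - 1));       // 15

        // After the second hash, the high bits influence the index.
        System.out.println(hash(h1) & (length - 1)); // 15
        System.out.println(hash(h2) & (length - 1)); // 12
    }
}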

Back to the topic: how does HashMap handle this problem, that is, how is the second hash done?

 static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
This is the function that mitigates such collisions, by mixing the high bits into the low bits. In general, there are several ways to resolve hash conflicts:
1. Open addressing (linear probing and rehashing, quadratic probing and rehashing, pseudo-random probing and rehashing)
2. The rehashing method (using a second hash function)
3. The chain address method (separate chaining)
4. Establishing a public overflow area

HashMap uses the chain address method. The other methods will be covered separately in future posts, so I will not introduce them here.

HashMap put() analysis

With the basic concepts out of the way, it is time for the main event: how an object gets stored into a HashMap. The code is as follows:

 /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }
As you can see from the code, the steps are as follows:

(1) First determine whether the key is null. If it is null, putForNullKey(value) is called instead. The code is as follows:

 /**
     * Offloaded version of put for null keys
     */
    private V putForNullKey(V value) {
        for (Entry<K,V> e = table[0]; e != null; e = e.next) {
            if (e.key == null) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        addEntry(0, null, value, 0);
        return null;
    }
As the code shows, a null key is by default stored in the chain headed at table[0]. The method traverses each Entry node of that chain; if it finds a node whose key is null, it replaces the value with the new one and returns the old value. If no node with a null key is found, it adds a new node.
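
The observable behavior is easy to confirm with a minimal usage sketch:

import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        System.out.println(map.put(null, "first"));  // null: no previous mapping
        System.out.println(map.put(null, "second")); // "first": old value returned
        System.out.println(map.get(null));           // "second"
        System.out.println(map.size());              // 1: only one null-key slot
    }
}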

(2) Otherwise, compute the key's hashCode(), apply the second hash to the result, and find the index i into the Entry array via indexFor(hash, table.length).

(3) Then traverse the linked list headed at table[i]. If a node with the same hash and an equal key is found, its value is replaced with the new one and the old value is returned.

(4) What does modCount do? Let me answer that. As we all know, HashMap is not thread-safe, but in applications that can tolerate occasional failure, you may not want to pay Hashtable's synchronization overhead just to guard against a small chance of concurrent modification. Instead, HashMap uses a fail-fast mechanism to deal with this, and in the source you will find modCount declared accordingly.

In JDK 1.6, modCount is declared with the volatile keyword, which under the JVM specification means that as soon as one thread changes modCount, other threads read the latest value. Within HashMap, modCount really only plays its key role during iteration.

private abstract class HashIterator<E> implements Iterator<E> {
        Entry<K,V> next;    // next entry to return
        int expectedModCount;    // For fast-fail
        int index;        // current slot
        Entry<K,V> current;    // current entry

        HashIterator() {
            expectedModCount = modCount;
            if (size > 0) { // advance to first entry
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        }

        public final boolean hasNext() {
            return next != null;
        }

        final Entry<K,V> nextEntry() {
        // Here is the key check: fail fast if the map was modified
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Entry<K,V> e = next;
            if (e == null)
                throw new NoSuchElementException();

            if ((next = e.next) == null) {
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        current = e;
            return e;
        }

        public void remove() {
            if (current == null)
                throw new IllegalStateException();
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Object k = current.key;
            current = null;
            HashMap.this.removeEntryForKey(k);
            expectedModCount = modCount;
        }

    }
When iteration begins, modCount is copied into expectedModCount. During the iteration, each step compares the two to decide whether the HashMap has been structurally modified, internally or by another thread. If the values of modCount and expectedModCount differ, it proves that another thread is modifying the structure of the HashMap, and an exception is thrown.

That is why HashMap's put, remove, and similar structural operations all count via modCount++.
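
You do not even need a second thread to see the mechanism fire: any structural modification made outside the iterator pushes modCount past expectedModCount. A minimal sketch:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);
        map.put("b", 2);

        Iterator<String> it = map.keySet().iterator();
        it.next();
        map.remove("a"); // structural modification: modCount++
        it.next();       // throws ConcurrentModificationException
    }
}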

(5) If no node with the same hash and key is found, a new node is added via addEntry(). The code is as follows:

    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }
There is a neat trick in how nodes are added here: each new node is inserted at the head of the chain, and the new head's next points to the old head node.
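
The same move on a bare singly linked list, as a minimal self-contained sketch (Node here is a stand-in for illustration, not the JDK class):

public class HeadInsertDemo {
    static class Node {
        final String key;
        final Node next;
        Node(String key, Node next) { this.key = key; this.next = next; }
    }

    public static void main(String[] args) {
        Node head = null;
        head = new Node("first", head);   // chain: first
        head = new Node("second", head);  // chain: second -> first
        head = new Node("third", head);   // chain: third -> second -> first
        for (Node n = head; n != null; n = n.next)
            System.out.print(n.key + " "); // prints: third second first
    }
}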

(6) If the size of the HashMap exceeds the threshold, the table must be resized and expanded; see the resize() analysis below.

HashMap get() analysis

Once you understand put above, get is easy to follow. The code is as follows:

 public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }
Do not underestimate this code; the trouble it can cause is huge. Remember, HashMap is not thread-safe, and this very loop is where that bites. The classic failure: a concurrent resize in another thread relinks entries (transfer() re-inserts each node at the head of its new chain) and can leave a chain with a cycle in it; a get() walking that chain never reaches a terminating null and loops forever. The end result is dreadful throughput with the CPU pegged. Keep this firmly in mind.
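
If a map must be shared across threads, the standard alternatives are a synchronized wrapper or ConcurrentHashMap; this is general JDK guidance rather than anything specific to the code above. A minimal sketch:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapDemo {
    public static void main(String[] args) {
        // Every call goes through one lock: safe, but coarse-grained.
        Map<String, Integer> synced =
                Collections.synchronizedMap(new HashMap<String, Integer>());
        // Lock striping (in JDK 1.6) gives much better concurrency.
        Map<String, Integer> concurrent =
                new ConcurrentHashMap<String, Integer>();

        synced.put("a", 1);
        concurrent.put("b", 2);
        System.out.println(synced.get("a") + " " + concurrent.get("b")); // 1 2
    }
}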

HashMap size() analysis


The size() of HashMap is very simple: it is not computed on demand. Instead, the counter is incremented every time an Entry is added and decremented on removal, trading space for time. Because the class is not thread-safe anyway, this works and is highly efficient.

HashMap resize() analysis


When the size of the HashMap exceeds the threshold, its capacity needs to be expanded. The code is as follows:

void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }

        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable);
        table = newTable;
        threshold = (int)(newCapacity * loadFactor);
    }
As the code shows, if the old capacity has already reached the maximum, the method simply raises the threshold to Integer.MAX_VALUE and returns. Otherwise a new Entry array is created; at the call site in addEntry, its length is twice that of the old Entry array. The old Entry[] is then copied into the new one. The code is as follows:

void transfer(Entry[] newTable) {
        Entry[] src = table;
        int newCapacity = newTable.length;
        for (int j = 0; j < src.length; j++) {
            Entry<K,V> e = src[j];
            if (e != null) {
                src[j] = null;
                do {
                    Entry<K,V> next = e.next;
                    int i = indexFor(e.hash, newCapacity);
                    e.next = newTable[i];
                    newTable[i] = e;
                    e = next;
                } while (e != null);
            }
        }
    }
Note that during the copy, each entry's array index is recomputed via int i = indexFor(e.hash, newCapacity); against the new capacity. Also note that transfer() re-inserts each entry at the head of its new chain, reversing the order within a chain; this relinking is exactly what can create a cycle under concurrent resizing, as discussed in the get() section above.
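
A small sketch makes the redistribution visible: against the doubled capacity, an entry either keeps its old index i or moves to i + oldCapacity, depending on one extra bit of its hash (hash values made up for the demonstration):

public class RehashDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        int h1 = 5;       // bit 4 is 0: index stays the same
        int h2 = 16 + 5;  // bit 4 is 1: index moves by oldCap

        System.out.println(indexFor(h1, oldCap) + " -> " + indexFor(h1, newCap)); // 5 -> 5
        System.out.println(indexFor(h2, oldCap) + " -> " + indexFor(h2, newCap)); // 5 -> 21
    }
}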

Beyond this, HashMap contains some iterator code that I will not walk through one by one here. In JDK 1.7 the implementation of HashMap was also revised, notably with a hash seed participating in the hash computation.

That just about completes the source code analysis of HashMap; next I will analyze the source code of ConcurrentHashMap. ConcurrentHashMap makes up for both HashMap's lack of thread safety and Hashtable's poor performance, and is currently the high-performance, thread-safe map class of choice.
