the two (or more) operations can occur simultaneously by executing in different functional units.
Consider again the steps the MOV( mem/reg, reg ) instruction requires:
" Fetch the instruction byte from memory.
" Update the EIP register to point at the next byte.
" Decode the instruction to see what it does.
" If required, fetch a displacement operand from memory.
" If required, update EIP to point beyond the displacement.
" Compute the address of the operand, if required (i.e., EBX+xxxx) .
" Fetch the operand.
" Store the fetched value into the destination register
The first operation uses the value of the EIP register (so we cannot overlap incrementing EIP with it) and it uses the bus to
fetch the instruction opcode from memory. Every step that follows this one depends upon the opcode it fetches from memory,
so it is unlikely we will be able to overlap the execution of this step with any other.
The second and third operations do not share any functional units, nor does decoding an opcode depend upon the value of
the EIP register. Therefore, we can easily modify the control unit so that it increments the EIP register at the same time it
decodes the instruction. This will shave one cycle off the execution of the MOV instruction.
The third and fourth operations above (decoding and optionally fetching the displacement operand) do not look like they
can be done in parallel, since you must decode the instruction to determine whether the CPU needs to fetch an operand from
memory. However, we could design the CPU to go ahead and fetch the operand anyway, so that it's available if we need it. There is
one problem with this idea, though: we need the address of the operand to fetch (the value in the EIP register), so we
must wait until we are done incrementing the EIP register before fetching this operand. If we are incrementing EIP at the same
time we're decoding the instruction, we will have to wait until the next cycle to fetch this operand.
Since the next three steps are optional, there are several possible instruction sequences at this point:
#1 (step 4, step 5, step 6, and step 7) e.g., MOV( [ebx+1000], eax )
#2 (step 4, step 5, and step 7) e.g., MOV( disp, eax ) -- assume disp's address is 1000
#3 (step 6 and step 7) e.g., MOV( [ebx], eax )
#4 (step 7) e.g., MOV( ebx, eax )
In the sequences above, step seven always relies on the previous steps in the sequence. Therefore, step seven cannot exe-
cute in parallel with any of the other steps. Step six also relies upon step four. Step five cannot execute in parallel with step four,
since step four uses the value in the EIP register; however, step five can execute in parallel with any of the remaining steps. Therefore,
we can shave one cycle off the first two sequences above as follows:
#1 (step 4, step 5/6, and step 7)
#2 (step 4, step 5/7)
#3 (step 6 and step 7)
#4 (step 7)
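The cycle savings from overlapping step five can be tallied mechanically. The following Python sketch counts cycles for the four sequences above; it is purely illustrative bookkeeping (the sequence names and the rule that step five piggybacks on any step other than step four come from the discussion above), not a model of a real CPU:

```python
# Hypothetical cycle counter for the four MOV sequences above.
# Step 4 = fetch displacement, 5 = update EIP past it,
# 6 = compute operand address, 7 = fetch operand.
SEQUENCES = {
    "MOV([ebx+1000], eax)": [4, 5, 6, 7],   # sequence #1
    "MOV(disp, eax)":       [4, 5, 7],      # sequence #2
    "MOV([ebx], eax)":      [6, 7],         # sequence #3
    "MOV(ebx, eax)":        [7],            # sequence #4
}

def cycles(steps, overlap_step5=False):
    """One cycle per step; step 5 may overlap any step except step 4."""
    if overlap_step5 and 5 in steps:
        # Step 5 runs in parallel with a following step, costing nothing.
        return len(steps) - 1
    return len(steps)

for name, steps in SEQUENCES.items():
    print(name, cycles(steps), "->", cycles(steps, overlap_step5=True))
```

Running this shows sequences #1 and #2 each losing one cycle (4 to 3 and 3 to 2), while #3 and #4, which never execute step five, are unchanged.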
Of course, there is no way to overlap the execution of steps seven and eight in the MOV instruction since it must surely
fetch the value before storing it away. By combining these steps, we obtain the following steps for the MOV instruction:
" Fetch the instruction byte from memory.
" Decode the instruction and update ip
" If required, fetch a displacement operand from memory.
" Compute the address of the operand, if required (i.e., ebx+xxxx) .
" Fetch the operand, if required update EIP to point beyond xxxx.
" Store the fetched value into the destination register
By adding a small amount of logic to the CPU, we've shaved one or two cycles off the execution of the MOV instruction.
This simple optimization works with most of the other instructions as well.
Consider what happens when the MOV instruction above executes on a CPU with a 32-bit data bus. If the MOV instruc-
tion fetches an eight-bit displacement from memory, the CPU may actually wind up fetching the three bytes following the
displacement along with the displacement value (since the 32-bit data bus lets us fetch four bytes in a single bus cycle). The
second byte on the data bus is actually the opcode of the next instruction. If we could save this opcode until the execution of
the next instruction, we could shave a cycle off its execution time, since it would not have to fetch the opcode byte. Furthermore,
since the instruction decoder is idle while the CPU is executing the MOV instruction, we can actually decode the next instruc-
tion while the current instruction is executing, thereby shaving yet another cycle off the execution of the next instruction. This,
effectively, overlaps a portion of the MOV instruction with the beginning of the execution of the next instruction, allowing
additional parallelism.
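The "save the extra bytes for later" idea amounts to a small prefetch queue. Here is a rough Python sketch of that behavior; the four-byte bus width matches the discussion above, but the queue structure and names are assumptions for illustration, not a model of any particular CPU:

```python
# Sketch of a prefetch queue fed by a 32-bit (4-byte) data bus.
# Each bus cycle fetches four bytes; already-queued bytes cost nothing.
class PrefetchQueue:
    BUS_WIDTH = 4  # bytes per bus cycle on a 32-bit data bus

    def __init__(self, memory):
        self.memory = memory     # bytes object standing in for RAM
        self.queue = []          # bytes fetched but not yet consumed
        self.next_addr = 0
        self.bus_cycles = 0

    def fetch_byte(self):
        """Return the next instruction byte, refilling the queue as needed."""
        if not self.queue:
            chunk = self.memory[self.next_addr:self.next_addr + self.BUS_WIDTH]
            self.queue.extend(chunk)
            self.next_addr += self.BUS_WIDTH
            self.bus_cycles += 1  # one bus cycle brings in four bytes
        return self.queue.pop(0)

code = bytes(range(8))           # eight "instruction stream" bytes
pq = PrefetchQueue(code)
stream = [pq.fetch_byte() for _ in range(8)]
print(stream, pq.bus_cycles)
```

Consuming all eight bytes costs only two bus cycles instead of eight, which is the saving the text describes: bytes fetched alongside a displacement are kept for the next instruction rather than fetched again.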
Can we improve on this? The answer is yes. Note that during the execution of the MOV instruction the CPU is not access-