Merge pull request #208 from paulusmack/faster
Make the core go faster Several major improvements in here: - Simple branch predictor - Reduced latency for mispredicted branches and interrupts by removing fetch2 stage - Cache improvements o Request critical dword first on refill o Handle hits while refilling, including on line being refilled o Sizes doubled for both D and I - Loadstore improvements: can now do one load or store every two cycles in most cases - Optimized 2-cycle multiplier for Xilinx 7-series parts using DSP slices - Timing improvements, including: o Stash buffer in decode1 o Reduced width of execute1 result mux o Improved SPR decode in decode1 o Some non-critical operation take a cycle longer so we can break some long combinatorial chains - Core logging: logs 256 bits of info every cycle into a ring buffer, to help with debugging and performance analysis This increases the LUT usage for the "synth" + A35 target from 9182 to 10297 = 12%.
Showing