ARM big.LITTLE Coherency

http://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf,
http://www.arm.com/products/system-ip/interconnect/corelink-cci-400.php

 

The two clusters are connected through a "Cache Coherent Interconnect" – in this case the ARM CoreLink™ CCI-400 interconnect IP. The system is completed by the CoreLink GIC-400, which provides dynamically configurable interrupt distribution to all the cores.

[Figure: Cortex-A15/Cortex-A7 system connected through the CCI-400; the lower part shows the steps of a coherent read between the clusters]

The bus interfaces of the Cortex-A15 and Cortex-A7 processors make use of the AMBA® AXI Coherency Extensions (ACE) to the widely used AMBA AXI protocol, which provides for coherent data transfer at the bus level. ACE adds three coherency channels on top of the normal five channels of AMBA AXI. As an example, the lower part of the figure shows the steps in a coherent data read from the Cortex-A7 cluster to the Cortex-A15 cluster:

1. The Cortex-A7 cluster issues a Coherent Read Request through the RADDR channel.
2. The CCI-400 hands the request over to the Cortex-A15 processor's ACADDR channel to snoop into the Cortex-A15 processor's cache.
3. On receiving the request from the CCI-400, the Cortex-A15 processor checks the data availability and reports this back through the CRRESP channel.
4. If the requested data is in the cache, the Cortex-A15 processor places it on the CDATA channel.
5. The CCI-400 moves the data from the Cortex-A15 processor's CDATA channel to the Cortex-A7 processor's RDATA channel, resulting in a cache linefill in the Cortex-A7 processor.

The CCI-400 and the ACE protocol enable full coherency between the Cortex-A15 and Cortex-A7 clusters, allowing data sharing to take place without external memory transactions.
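To make the snoop sequence concrete, here is a small toy model in C of the coherent-read flow described above. Everything in it (the cluster_t type, the one-line "cache", the channel names in the comments) is illustrative only; the real protocol is implemented in the interconnect hardware, not in software.

```c
/* Toy model of the ACE coherent-read flow: requester issues a read,
 * the interconnect snoops the peer cluster, and a hit is serviced
 * from the peer's cache without touching external memory. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 64

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line_t;

typedef struct {
    const char  *name;
    cache_line_t line;          /* one-line "cache" keeps the example tiny */
} cluster_t;

/* ACADDR -> CRRESP -> CDATA: the interconnect snoops the peer cluster. */
static bool snoop(cluster_t *peer, uint64_t addr, uint8_t *out)
{
    if (peer->line.valid && peer->line.tag == addr) {   /* CRRESP: hit  */
        memcpy(out, peer->line.data, LINE_SIZE);        /* CDATA: data  */
        return true;
    }
    return false;                                       /* CRRESP: miss */
}

/* RADDR/RDATA: coherent read issued by the requesting cluster. */
static void coherent_read(cluster_t *req, cluster_t *peer, uint64_t addr)
{
    uint8_t buf[LINE_SIZE] = {0};

    if (snoop(peer, addr, buf))
        printf("%s: linefill from %s cache (no DRAM access)\n",
               req->name, peer->name);
    else
        printf("%s: snoop miss, fetch line from external memory\n",
               req->name);

    req->line.valid = true;     /* linefill into the requester */
    req->line.tag   = addr;
    memcpy(req->line.data, buf, LINE_SIZE);
}

int main(void)
{
    cluster_t a15 = { "Cortex-A15", { true, 0x1000, {0} } };
    cluster_t a7  = { "Cortex-A7",  { false, 0,     {0} } };

    coherent_read(&a7, &a15, 0x1000);   /* hit in the A15 cache  */
    coherent_read(&a7, &a15, 0x2000);   /* miss -> external read */
    return 0;
}
```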

All interfaces support 128-bit-wide data, allowing systems to scale to tens of GB/s of data bandwidth to support high-definition multimedia requirements and the latest high-performance networking interfaces.
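As a rough sanity check on the "tens of GB/s" figure, the snippet below computes the peak bandwidth of a single 128-bit interface under an assumed 500 MHz interconnect clock and one data beat per cycle (both are assumptions, not numbers from the text).

```c
/* Back-of-envelope peak bandwidth for one 128-bit ACE interface. */
#include <stdio.h>

int main(void)
{
    const double width_bytes = 128.0 / 8.0;   /* 16 bytes per beat     */
    const double clock_hz    = 500e6;         /* assumed clock rate    */
    double gbytes_per_s = width_bytes * clock_hz / 1e9;

    printf("Peak per-interface bandwidth: %.1f GB/s\n", gbytes_per_s);
    /* ~8 GB/s per interface; several masters and slaves together
     * reach the tens of GB/s the text mentions. */
    return 0;
}
```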

Without hardware coherency, software is responsible for cache maintenance, including cleaning, flushing, and invalidating caches. This takes significant processing cycles and energy, as data is cleaned out from the caches to external memory. The hardware coherency introduced with AMBA 4 ACE allows the different processing engines to view each other's caches and removes or reduces the need for these cache maintenance operations. Hardware coherency ensures that any data cached in the small core can be passed seamlessly to the large core without having to access external memory.
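The contrast can be sketched in code. In the snippet below, clean_dcache_range() and invalidate_dcache_range() are hypothetical stubs standing in for platform-specific clean/invalidate-by-address operations; with ACE hardware coherency the equivalent hand-off needs no maintenance at all.

```c
/* Sketch contrasting software cache maintenance with ACE hardware
 * coherency. The *_dcache_range() helpers are hypothetical stubs; a
 * real port would issue per-line clean/invalidate-by-VA operations. */
#include <stddef.h>
#include <stdint.h>

static void clean_dcache_range(void *addr, size_t len)
{
    (void)addr; (void)len;   /* platform-specific: write back dirty lines */
}

static void invalidate_dcache_range(void *addr, size_t len)
{
    (void)addr; (void)len;   /* platform-specific: discard stale lines    */
}

/* Without hardware coherency: producer cleans, consumer invalidates,
 * and the shared data makes a round trip through external memory. */
void share_buffer_software(uint8_t *buf, size_t len)
{
    clean_dcache_range(buf, len);        /* outbound core: push to DRAM  */
    /* ...notify the other core...                                       */
    invalidate_dcache_range(buf, len);   /* inbound core: force a reload */
}

/* With AMBA 4 ACE: no maintenance needed; the inbound core's read
 * snoops the outbound core's cache directly, as shown earlier. */
void share_buffer_hw_coherent(uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
}
```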

As a result, the Cortex-A15/Cortex-A7 system is designed to migrate a task in less than 20,000 cycles, or 20 microseconds with the processors operating at 1 GHz. Fewer than 2,000 instructions are required for the save-restore, and because the two processors are architecturally identical there is a one-to-one mapping between the state registers of the inbound and outbound processors. (http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf)
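The numbers are easy to verify, and the save-restore itself reduces to a one-to-one state copy. The sketch below is illustrative only; NUM_STATE_REGS and the cpu_state_t layout are assumptions, not the actual architectural state list.

```c
/* 20,000 cycles at 1 GHz is 20 us; state transfer is a direct copy
 * because the inbound and outbound cores are architecturally identical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_STATE_REGS 128   /* assumed size of the saved context */

typedef struct {
    uint64_t reg[NUM_STATE_REGS];
} cpu_state_t;

/* One-to-one mapping: outbound state restores directly on the inbound core. */
static void migrate(const cpu_state_t *outbound, cpu_state_t *inbound)
{
    memcpy(inbound, outbound, sizeof(*inbound));
}

int main(void)
{
    const double cycles = 20000.0, freq_hz = 1e9;
    printf("Migration time: %.0f us\n", cycles / freq_hz * 1e6);  /* 20 us */

    cpu_state_t a15_state = {{0}}, a7_state;
    migrate(&a15_state, &a7_state);
    return 0;
}
```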

Regarding the private-cache warm-up penalty, prior work shows that performance often improves when the private LLCs of the big and little cores are powered on together [Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE)]. Thus, we ignore the warm-up penalty. Prior work has also suggested that the power overhead of task migration is < 0.75% [Thread Motion: Fine-Grained Power Management for Multi-Core Systems]. Thus, we do not consider the additional energy consumption of our scheduling mechanism.
