Heracles Wiki

Release notes for the latest release (Heracles 2.0):

=============================================================================
                      Release Notes for Heracles v2.0
=============================================================================
This release adds cache coherence and multi-threading. Cache coherence is provided either through remote access (RA) or through a directory-based protocol. For directory-based cache coherence, only MESI is supported (snoopy and MSI are not).

In the current version of the release, MIPS 2-way hyper-threading is fully functional, and the software toolchain has programming support for it. In the GUI, only RA cache coherence is supported for MIPS 2-way hyper-threading. Multi-threading with migration has been removed pending further testing. The thread switching policy can be changed on line 191 of mips_core/two_threads/7_Stage_2_Thread_MIPS_Core.v.

No viable and stable Windows MIPS cross-compiler could be developed, tested, and added to this release, so compilation still has to go through a Linux system.

Software Notes
Please read the original paper, or check out the Fibonacci example, to see how to automatically partition an application across multiple cores using "#pragma Heracles core x".
When using MIPS 2-way hyper-threading, function names should be unique to avoid aliasing when the program is flattened.
Only the second hardware thread needs to be explicitly set up inside the core job, with "#pragma Heracles thread 1".
There is no "#pragma Heracles thread 0", because the core job is assigned to hardware thread 0 by default.
The thread number could have been omitted, but it is kept to allow a possible extension to more hardware threads.

There can be interactions between the two hardware threads. In such cases, careful synchronization and execution time allocations should be implemented to avoid deadlock:
" ...
looking(&lock0, 2);
#pragma Heracles thread 1 {
    // do work
    release(&lock0, 2);
}
check_lock(&lock0, 100);
...
"
Variables used in thread 1 must be declared global with the keyword HGlobal.

Execution Notes
When running the configuration examples for the first time, the user needs to reload the binaries and apply them on the last panel, so that the executable points to the proper local user directory.
#  --- Start ---
#  Reset .................................
#  Reset .................................
...
#  Reset .................................
#  Core [          0] Start [1] Program [00000010]
...
#  Running ...............................
************************* Output of a core ********************************
# -------------------------- CORE           3 -----------------------------------
#  Result: Register [17] Value =          1
#  Current cycles [      2953] instructions executed [       688]
# -------------------------------------------------------------------------------
# ------------------------- Core 3 data back-end ------------------------------
# Local accesses [       352]  Remote services [         9]
# ------------------------------------------------------------------------------
# ------------------------ Core 3 D-Cache front-end  ---------------------------
# Local requests [       352]  Remote requests [         0]
# ------------------------------------------------------------------------------
# -------------------------- Cache           6: ---------------------------------
# Total hits [      2047]    | Total misses [        27]
# -------------------------------------------------------------------------------
# -------------------------- Cache           7: ---------------------------------
# Total hits [       348]    | Total misses [         4]
# -------------------------------------------------------------------------------
# ----------------------- Router 3 link utilization -----------------------------
# Total run cycles       [      2943]
# Port         [          0]  used cycles [        16]
# Port         [          1]  used cycles [        11]
# Port         [          2]  used cycles [         0]
# Port         [          3]  used cycles [         0]
# Port         [          4]  used cycles [         1]
# -------------------------------------------------------------------------------
# Data [00000001]

This is the normal execution report. The user can modify it so that some of the statistics are not printed, in particular when a core has multiple outputs, as in the matrix example.
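The raw counters in the report can be combined into the usual derived metrics. The sketch below is not part of the Heracles sources: the helper names hit_rate and ipc are hypothetical, and the only inputs assumed are the counters printed in the sample report above.

```c
#include <assert.h>

/* Hypothetical helpers (not part of Heracles) that turn the raw counters
 * printed in the execution report into derived metrics. */

/* Fraction of cache accesses that hit: hits / (hits + misses). */
double hit_rate(unsigned long hits, unsigned long misses)
{
    assert(hits + misses > 0);   /* at least one access must be recorded */
    return (double)hits / (double)(hits + misses);
}

/* Instructions per cycle: instructions executed / current cycles. */
double ipc(unsigned long instructions, unsigned long cycles)
{
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}
```

For the sample report, hit_rate(2047, 27) is 2047/2074, roughly 0.987 for Cache 6, and ipc(688, 2953) is roughly 0.233 for core 3.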
 

The general software compilation process is:

1- Create the assembly file (filename.s) from the C file by executing: ./softwareToolchain/mips-linux-gnu-gcc -S filename.c

2- Use isa-checker to get the .asm file by executing: ./softwareToolchain/isa-checker filename.s

3- Create object file from .asm file by executing: ./softwareToolchain/mips-linux-gnu-as filename.asm -o filename.o

4- Create dump file by executing:
./softwareToolchain/mips-linux-gnu-objdump --disassemble-all --disassemble-zeroes filename.o > filename.dump

5- Get vmh file by executing: ./softwareToolchain/objdump2vmh.pl filename.dump filename.vmh

6- Based on the core and memory configuration, get the .mem file to be loaded onto the cores by executing:
./softwareToolchain/linker filename.vmh x xx xx (see below for more detailed usage)


Software side memory and I/O management:

The user can modify the 'linker.cpp' and 'isa-checker.cpp' files.

To compile them, execute:
~\softwareToolchain> g++ -Wno-write-strings -o isa-checker isa-checker.cpp
~\softwareToolchain> g++ -Wno-write-strings -o linker linker.cpp

'isa-checker.cpp' transforms a .s file into an .asm file. Its main function is to replace unsupported instructions and macros with supported instructions.
Usage: ./softwareToolchain/isa-checker filename.s

Here we also set the following assembly directives:
.space 0x10
.set noreorder
.set nomacro
.set nomips16
.set nomicromips
This is to avoid branch delay slots and instructions starting at address zero.
As part of a simple debugging and I/O function, we move the final answer or execution status to register 17 by adding the instruction "addi $17, $2, 0" (see line 349).

'linker.cpp' transforms a .vmh file into a .mem file. It is a powerful tool that does, among other things, the following:

1- Set which core is the target executing core for this memory file:
./softwareToolchain/linker filename.vmh 2
When the core is not set, the default is zero
(e.g., ./softwareToolchain/linker filename.vmh).

2- Set the stack frame pointer for the program on a given core:
./softwareToolchain/linker filename.vmh 2 512
The default is 1024.

3- Set the number of bits in the effective local real address:
./softwareToolchain/linker filename.vmh 2 512 16
This allows the stack data to be stored in local memory rather than at a remote core.
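The positional-argument convention above can be sketched as follows. This is an illustration only and is not taken from linker.cpp: the struct, the function name parse_linker_args, the decimal parsing, and the -1 sentinel for an omitted address-bit count are all assumptions (the text states no default for the third option).

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative only: mirrors the positional-argument convention described
 * above, not the actual code in linker.cpp. */
struct linker_args {
    const char *vmh_file;  /* argv[1]: input .vmh file                      */
    long core_id;          /* argv[2]: target core, default 0               */
    long stack_pointer;    /* argv[3]: stack frame pointer, default 1024    */
    long local_addr_bits;  /* argv[4]: local address bits; -1 if not given  */
};

struct linker_args parse_linker_args(int argc, char **argv)
{
    struct linker_args a;
    assert(argc >= 2 && argc <= 5);   /* the .vmh file is mandatory */
    a.vmh_file        = argv[1];
    /* ASSUMPTION: numeric arguments are parsed as decimal. */
    a.core_id         = (argc > 2) ? strtol(argv[2], NULL, 10) : 0;
    a.stack_pointer   = (argc > 3) ? strtol(argv[3], NULL, 10) : 1024;
    a.local_addr_bits = (argc > 4) ? strtol(argv[4], NULL, 10) : -1;
    return a;
}
```

With only the file name given, the sketch yields core 0 and stack pointer 1024; with "filename.vmh 2 512 16" it yields core 2, stack pointer 512, and 16 local address bits.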

Important usage information:

For manual configuration, get familiar with the 'Real_Cores_Mesh_Wrapper.v' file in the sourceCode/testbench/reference folder, because it provides a very good use case.

ID_BITS    = ROW_BITS + COLUMN_BITS;  -- to identify cores
FLOW_BITS  = (2*ID_BITS) + EXTRA;     -- to identify a flow per the cores
                                      -- communicating in the flow;
                                      -- EXTRA bits help to establish
                                      -- multiple unique flows between
                                      -- two cores.

FLIT_WIDTH = FLOW_BITS + TYPE_BITS + VC_BITS + DATA_WIDTH;
PORTS      = OUT_PORTS + 4*(SWITCH_TO_SWITCH);  -- in the current version
                                                -- OUTPORTS = INPORTS

RT_WIDTH   = PORTS_BITS + VC_BITS + 1;  -- for a given flow specified by
                                        -- 'route_table_address', provides its forwarding
                                        -- (output) port, the VC at the next hop, and the
                                        -- valid or invalid state associated with that entry
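As a worked example of the parameter relations above, the sketch below evaluates them for one set of input values. It is not taken from the Heracles sources: the struct and function names are hypothetical, and PORTS_BITS is assumed to be the number of bits needed to index PORTS ports (ceil(log2(PORTS))), which the text does not define.

```c
#include <assert.h>

/* Smallest number of bits able to index n distinct values: ceil(log2(n)). */
unsigned bits_for(unsigned n)
{
    unsigned bits = 0;
    assert(n > 0 && n <= 0x80000000u);   /* keep the shift below in range */
    while ((1u << bits) < n)
        bits++;
    return bits;
}

struct noc_widths {
    unsigned id_bits, flow_bits, flit_width, ports, rt_width;
};

struct noc_widths derive_widths(unsigned row_bits, unsigned column_bits,
                                unsigned extra, unsigned type_bits,
                                unsigned vc_bits, unsigned data_width,
                                unsigned out_ports, unsigned switch_to_switch)
{
    struct noc_widths w;
    w.id_bits    = row_bits + column_bits;
    w.flow_bits  = 2 * w.id_bits + extra;
    w.flit_width = w.flow_bits + type_bits + vc_bits + data_width;
    w.ports      = out_ports + 4 * switch_to_switch;
    /* ASSUMPTION: PORTS_BITS = ceil(log2(PORTS)); not defined in the text. */
    w.rt_width   = bits_for(w.ports) + vc_bits + 1;
    return w;
}
```

With the hypothetical inputs ROW_BITS = 1, COLUMN_BITS = 1, EXTRA = 2, TYPE_BITS = 2, VC_BITS = 1, DATA_WIDTH = 32, OUT_PORTS = 1 and SWITCH_TO_SWITCH = 1, this gives ID_BITS = 2, FLOW_BITS = 6, FLIT_WIDTH = 41, PORTS = 5 and, under the PORTS_BITS assumption, RT_WIDTH = 5.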

core_ID      <= x;   -- specifies which core to set up
ON           <= 1;   -- turns on the router at core_ID
PROG         <= 1;   -- programs the router
start        <= 1;   -- starts the core itself
prog_address <= xx;  -- starting instruction address at the core

operation <= x;  -- selects which activation signals are applied at core_ID
                 -- if(operation[0] == 1) cores_ON[core_ID] <= ON;
                 -- if(operation[1] == 1) cores_reset[core_ID] <= reset;
                 -- if(operation[2] == 1) cores_PROG[core_ID] <= PROG;
                 -- if(operation[3] == 1) cores_start[core_ID] <= start;
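The effect of the operation bits can be modeled in software as below. This is a behavioral sketch of the four conditions listed above, not the wrapper's actual logic; the struct, the MAX_CORES bound, and the function name apply_operation are hypothetical.

```c
#include <assert.h>

#define MAX_CORES 16   /* hypothetical bound chosen for this sketch */

struct core_controls {
    int cores_ON[MAX_CORES];
    int cores_reset[MAX_CORES];
    int cores_PROG[MAX_CORES];
    int cores_start[MAX_CORES];
};

/* One mask per operation bit, in the order given in the text. */
enum { OP_ON = 1 << 0, OP_RESET = 1 << 1, OP_PROG = 1 << 2, OP_START = 1 << 3 };

/* Update only the control signals whose operation bit is set. */
void apply_operation(struct core_controls *c, unsigned core_ID,
                     unsigned operation,
                     int ON, int reset, int PROG, int start)
{
    assert(core_ID < MAX_CORES);
    if (operation & OP_ON)    c->cores_ON[core_ID]    = ON;
    if (operation & OP_RESET) c->cores_reset[core_ID] = reset;
    if (operation & OP_PROG)  c->cores_PROG[core_ID]  = PROG;
    if (operation & OP_START) c->cores_start[core_ID] = start;
}
```

For example, operation = 0x5 (binary 0101) updates only cores_ON and cores_PROG for the selected core and leaves cores_reset and cores_start untouched.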

route_table_address <= x;   -- flow ID used to index into the routing table
route_table_data    <= xx;  -- flow forwarding info at that router [port_vc_v]
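A routing table entry of RT_WIDTH = PORTS_BITS + VC_BITS + 1 bits could be assembled as in the sketch below. The bit layout is an assumption read off the "[port_vc_v]" comment (port in the upper bits, then the VC, then the valid bit in the least significant position); check it against 'Real_Cores_Mesh_Wrapper.v' before relying on it. The function name pack_route_entry is hypothetical.

```c
#include <assert.h>

/* ASSUMPTION: fields are packed as [port | vc | valid], with the valid bit
 * in the least significant position, following the "[port_vc_v]" comment. */
unsigned pack_route_entry(unsigned port, unsigned vc, unsigned valid,
                          unsigned ports_bits, unsigned vc_bits)
{
    assert(ports_bits >= 1);
    assert(ports_bits + vc_bits + 1 <= 32);   /* entry must fit in 32 bits */
    assert(port < (1u << ports_bits));
    assert(vc < (1u << vc_bits));
    assert(valid <= 1);
    return (port << (vc_bits + 1)) | (vc << 1) | valid;
}
```

Under this layout, with PORTS_BITS = 3 and VC_BITS = 1, a valid entry forwarding to port 2 on VC 1 is pack_route_entry(2, 1, 1, 3, 1), which is binary 01011 (decimal 11).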

The setup is such that the following programming models can be supported:

1- Execute a single program on a single core.
If the user is certain that no on-chip routing will be performed, the routers can be left off. Start only the core in question after the program is loaded.
Make sure to compile the program for that core using ./softwareToolchain/linker filename.vmh core_id
Open filename_core_id.txt and use the Starting Address data to set prog_address for that core.

2- Run the same program on multiple cores
a) prog_address is set to the same value on all cores, and the program is loaded only on the core whose address is used. The cores fetch instructions from the same memory address, and the stack spaces are at the core that holds the instructions. This may lead to stack corruption if not managed properly.

b) prog_address is set to the same value on all cores, and the program is loaded only on the core whose address is used. The cores fetch instructions from the same memory address, but the stack spaces are local to each core.

c) prog_address is set to the same value on all cores, and the program is loaded only on the core whose address is used. The cores fetch instructions from the same memory address, but the stack spaces are on other cores.

d) prog_address is set to local memory, but the program compiled for one core is loaded onto another core. For example, the following loads the program 'fibonacci_core0.mem', compiled for core 0, onto core 3 (MESH_NODE[3]):
mem load -i {~/applications/binaries/examples/fibonacci_core0.mem} -format hex {/tb_Real_Cores_Mesh_top/TB_M/U/MESH_NODE[3]/NODES/Memory_Sub_System/mem_packetizer/m_memory/RAM_Block/ram}
This approach automatically keeps the program stack on core 0.

3- Run multiple programs on multiple cores
Compile each program for a specific core using ./softwareToolchain/linker filename.vmh core_id
Use the Starting Address data in filename_core_id.txt to set prog_address for each core.

Other useful information:

  • Note that on some simulators, you may want to zero out the caches of active cores before starting (on an FPGA, they are zero by default on startup). On ModelSim, the cache structures are filled with x on startup. For example, to zero out the instruction cache of core zero, use: 'mem load -filltype value -filldata 0 -fillradix symbolic -skip 0 {/tb_Real_Cores_Mesh_top/TB_M/U/MESH_NODE[0]/NODES/Memory_Sub_System/mem_packetizer/unified/ICache/CACHE_RAM/ram}'

  • The filename_coreX.txt file gives the core number, the starting address for that core, and the stack frame space. Because the current version has no OS, it is the user's responsibility not to overflow the stack frame, and its size should be adjusted to fit the memory arrangement.

 
