diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt new file mode 100644 index 00000000000..19033a0bb52 --- /dev/null +++ b/Documentation/perf-counters.txt @@ -0,0 +1,104 @@ + +Performance Counters for Linux +------------------------------ + +Performance counters are special hardware registers available on most modern +CPUs. These registers count the number of certain types of hw events: such +as instructions executed, cachemisses suffered, or branches mis-predicted - +without slowing down the kernel or applications. These registers can also +trigger interrupts when a threshold number of events have passed - and can +thus be used to profile the code that runs on that CPU. + +The Linux Performance Counter subsystem provides an abstraction of these +hardware capabilities. It provides per task and per CPU counters, and +it provides event capabilities on top of those. + +Performance counters are accessed via special file descriptors. +There's one file descriptor per virtual counter used. + +The special file descriptor is opened via the perf_counter_open() +system call: + + int + perf_counter_open(u32 hw_event_type, + u32 hw_event_period, + u32 record_type, + pid_t pid, + int cpu); + +The syscall returns the new fd. The fd can be used via the normal +VFS system calls: read() can be used to read the counter, fcntl() +can be used to set the blocking mode, etc. + +Multiple counters can be kept open at a time, and the counters +can be poll()ed. + +When creating a new counter fd, 'hw_event_type' is one of: + + enum hw_event_types { + PERF_COUNT_CYCLES, + PERF_COUNT_INSTRUCTIONS, + PERF_COUNT_CACHE_REFERENCES, + PERF_COUNT_CACHE_MISSES, + PERF_COUNT_BRANCH_INSTRUCTIONS, + PERF_COUNT_BRANCH_MISSES, + }; + +These are standardized types of events that work uniformly on all CPUs +that implements Performance Counters support under Linux. If a CPU is +not able to count branch-misses, then the system call will return +-EINVAL. + +[ Note: more hw_event_types are supported as well, but they are CPU + specific and are enumerated via /sys on a per CPU basis. Raw hw event + types can be passed in as negative numbers. For example, to count + "External bus cycles while bus lock signal asserted" events on Intel + Core CPUs, pass in a -0x4064 event type value. ] + +The parameter 'hw_event_period' is the number of events before waking up +a read() that is blocked on a counter fd. Zero value means a non-blocking +counter. + +'record_type' is the type of data that a read() will provide for the +counter, and it can be one of: + + enum perf_record_type { + PERF_RECORD_SIMPLE, + PERF_RECORD_IRQ, + }; + +a "simple" counter is one that counts hardware events and allows +them to be read out into a u64 count value. (read() returns 8 on +a successful read of a simple counter.) + +An "irq" counter is one that will also provide an IRQ context information: +the IP of the interrupted context. In this case read() will return +the 8-byte counter value, plus the Instruction Pointer address of the +interrupted context. + +The 'pid' parameter allows the counter to be specific to a task: + + pid == 0: if the pid parameter is zero, the counter is attached to the + current task. + + pid > 0: the counter is attached to a specific task (if the current task + has sufficient privilege to do so) + + pid < 0: all tasks are counted (per cpu counters) + +The 'cpu' parameter allows a counter to be made specific to a full +CPU: + + cpu >= 0: the counter is restricted to a specific CPU + cpu == -1: the counter counts on all CPUs + +Note: the combination of 'pid == -1' and 'cpu == -1' is not valid. + +A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts +events of that task and 'follows' that task to whatever CPU the task +gets schedule to. Per task counters can be created by any user, for +their own tasks. + +A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts +all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege. +