The **main metric** for evaluation is the **segment-based error rate**.

A detailed description of the metrics and their calculation procedure is presented in [Mesaros2016].

## Segment-based metrics

Segment-based evaluation is done on a fixed time grid, using segments of one second in length to compare the ground truth and the system output.
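As a rough illustration of the time grid, the sketch below turns a list of annotated events into a binary activity matrix with one row per one-second segment. The function name `events_to_segments`, the `(onset, offset, label)` tuple layout, and the rule that an event counts as active in every segment it overlaps are assumptions made for this example, not details taken from the task description.

```python
import math
import numpy as np

def events_to_segments(events, labels, duration, segment_length=1.0):
    """Return a boolean matrix of shape (n_segments, n_labels).

    events: iterable of (onset, offset, label) tuples, times in seconds.
    Assumption: an event is active in every segment it overlaps.
    """
    n_segments = math.ceil(duration / segment_length)
    activity = np.zeros((n_segments, len(labels)), dtype=bool)
    for onset, offset, label in events:
        first = math.floor(onset / segment_length)  # first overlapped segment
        last = math.ceil(offset / segment_length)   # one past the last segment
        activity[first:last, labels.index(label)] = True
    return activity

labels = ["car", "speech"]
events = [(0.2, 1.5, "car"), (2.0, 3.0, "speech")]
activity = events_to_segments(events, labels, duration=4.0)
```

Here the `car` event spans segments 0 and 1, the `speech` event falls in segment 2, and segment 3 is empty. The same conversion would be applied to both the ground truth and the system output before comparing them.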

**In each segment $k$** we count:

- **true positives** $TP$: events indicated as active by both the ground truth and the system output;
- **false positives** $FP$: events indicated as active by the system output but inactive by the ground truth;
- **false negatives** $FN$: events indicated as inactive by the system output but active by the ground truth;
- **substitutions** $S$: events that the system output indicates as active with a wrong label. One substitution is equivalent to one false positive and one false negative: the system did not detect the correct event (false negative for the correct class) but detected something else (false positive for another class);
- **insertions** $I$: false positives remaining after subtracting the substitutions;
- **deletions** $D$: false negatives remaining after subtracting the substitutions;
- **reference events** $N$: number of events active in the ground truth in segment $k$.

#### Error rate

The **error rate** is calculated as described in [Poliner2007] over all test data, based on the total number of insertions, deletions and substitutions:
\begin{equation*}
ER=\frac{\sum {S(k)}+\sum{D(k)}+\sum{I(k)}} {\sum N(k)}
\end{equation*}
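The segment-based error rate can be sketched in a few lines. The example assumes the ground truth and the system output are already given as binary activity matrices with one row per segment and one column per class; the function name `segment_error_rate` and the toy data are illustrative only. Substitutions are taken per segment as the smaller of the false positive and false negative counts, following the definitions above.

```python
import numpy as np

def segment_error_rate(reference, estimated):
    """reference, estimated: binary arrays of shape (n_segments, n_classes)."""
    reference = np.asarray(reference, dtype=bool)
    estimated = np.asarray(estimated, dtype=bool)

    # False negatives and false positives per segment, counted over classes.
    fn = np.sum(reference & ~estimated, axis=1)
    fp = np.sum(~reference & estimated, axis=1)

    # One substitution pairs one false positive with one false negative.
    substitutions = np.minimum(fn, fp)
    deletions = fn - substitutions    # false negatives left over
    insertions = fp - substitutions   # false positives left over
    n_ref = np.sum(reference, axis=1) # active reference events per segment

    total_errors = substitutions.sum() + deletions.sum() + insertions.sum()
    return total_errors / n_ref.sum()

reference = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
estimated = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0]]
er = segment_error_rate(reference, estimated)
print(er)  # 0.75
```

In this toy case the second segment contributes one substitution, the third one deletion, and the fourth one insertion, against four reference events in total. The sketch does not handle the case of an empty ground truth, where the denominator is zero.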

#### F-score

The **F-score** is calculated over all test data based on the total number of false positives, false negatives and true positives:

\begin{equation*}
\label{eq-fscore}
F=\frac{2P \cdot R}{P+R}, \quad \text{where} \quad
P=\frac {\sum TP(k)} {\sum TP(k)+\sum FP(k) },\quad
R=\frac{\sum TP(k)} { \sum TP(k)+\sum FN(k) }
\end{equation*}
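A minimal sketch of the segment-based F-score, assuming binary activity matrices of shape (segments, classes) for the ground truth and the system output. The function name `segment_f_score` and the toy data are hypothetical. The counts are accumulated over all segments and classes before precision and recall are computed, as in the formula above.

```python
import numpy as np

def segment_f_score(reference, estimated):
    """reference, estimated: binary arrays of shape (n_segments, n_classes)."""
    reference = np.asarray(reference, dtype=bool)
    estimated = np.asarray(estimated, dtype=bool)

    # Totals over all segments and classes.
    tp = np.sum(reference & estimated)
    fp = np.sum(~reference & estimated)
    fn = np.sum(reference & ~estimated)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

reference = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
estimated = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0]]
f = segment_f_score(reference, estimated)
print(f)  # 0.5
```

With two true positives, two false positives and two false negatives, precision and recall are both 0.5. The sketch leaves out guards for the degenerate cases where there are no true positives or no active entries at all.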

## Event-based metrics

Event-based evaluation considers true positives, false positives and false negatives with respect to event instances.

**Definition**: An event in the system output is considered correctly detected if its temporal position overlaps with the temporal position of an event with the same label in the ground truth. A tolerance is allowed for the onset and the offset: 200 ms for the onset, and 200 ms or half of the event length for the offset.

We count for all sequences:

- **true positives** $TP$: correctly detected events;
- **false positives** $FP$: events in the system output that are not correct according to the definition;
- **false negatives** $FN$: events in the ground truth that have not been correctly detected according to the definition;
- **substitutions** $S$: events in the system output that have the correct temporal position but an incorrect class label;
- **insertions** $I$: events in the system output that are neither correct nor substitutions;
- **deletions** $D$: events in the ground truth that are neither correctly detected nor substituted;
- **reference events** $N$: number of events in the ground truth.

#### Error rate

\begin{equation*}
ER=\frac{S+D+I}{N}
\end{equation*}

#### F-score

\begin{equation*}
F=\frac{2P \cdot R}{P+R}, \quad \text{where} \quad P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN}
\end{equation*}
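The event-based counts and both metrics can be sketched as follows. This is only an illustration under several assumptions: events are `(onset, offset, label)` tuples in seconds, matching is done greedily in list order, and the offset tolerance is taken as the larger of 200 ms and half of the reference event length. The helper names `timing_matches` and `event_based_metrics` are hypothetical, and a reference implementation may resolve ties and tolerances differently.

```python
def timing_matches(ref, est, collar=0.2, length_fraction=0.5):
    """True if onset and offset of est fall within tolerance of ref."""
    ref_onset, ref_offset, _ = ref
    est_onset, est_offset, _ = est
    # Assumption: offset tolerance is the larger of the collar and
    # half of the reference event length.
    offset_tol = max(collar, length_fraction * (ref_offset - ref_onset))
    return (abs(ref_onset - est_onset) <= collar
            and abs(ref_offset - est_offset) <= offset_tol)

def event_based_metrics(reference, estimated):
    ref_used = [False] * len(reference)
    est_used = [False] * len(estimated)

    def greedy_match(require_same_label):
        """Pair unused events one-to-one; return the number of pairs."""
        count = 0
        for e_idx, est in enumerate(estimated):
            if est_used[e_idx]:
                continue
            for r_idx, ref in enumerate(reference):
                if ref_used[r_idx] or not timing_matches(ref, est):
                    continue
                if require_same_label and ref[2] != est[2]:
                    continue
                ref_used[r_idx] = est_used[e_idx] = True
                count += 1
                break
        return count

    tp = greedy_match(require_same_label=True)
    fp = len(estimated) - tp
    fn = len(reference) - tp
    # Remaining events with matching timing but a different label.
    substitutions = greedy_match(require_same_label=False)
    deletions = fn - substitutions
    insertions = fp - substitutions

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "S": substitutions, "D": deletions, "I": insertions,
        "ER": (substitutions + deletions + insertions) / len(reference),
        "F": 2 * precision * recall / (precision + recall),
    }

reference = [(0.0, 2.0, "car"), (3.0, 4.0, "speech"), (5.0, 6.0, "dog")]
estimated = [(0.1, 2.3, "car"), (3.05, 4.1, "dog"), (8.0, 9.0, "car")]
result = event_based_metrics(reference, estimated)
```

In this toy case the first `car` event is a true positive, the `dog` estimate at 3.05 s is a substitution for the `speech` reference, the reference `dog` event is a deletion, and the late `car` estimate is an insertion. The sketch does not guard against empty event lists or the case of zero true positives.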

## References

[Mesaros2016] Mesaros, A., Heittola, T., and Virtanen, T. "Metrics for polyphonic sound event detection", Applied Sciences, 6(6):162, 2016.

[Poliner2007] Poliner, G. and Ellis, D.P.W. "A Discriminative Model for Polyphonic Piano Transcription", EURASIP Journal on Advances in Signal Processing, 2007.