An adverse drug effect (ADE) is any harmful event resulting from medical drug treatment. Despite their importance, ADEs are often under-reported in official channels. Some research has therefore turned to detecting discussions of ADEs in social media. Impressive results have been achieved in various attempts to detect ADEs. In a high-stakes domain such as medicine, however, an in-depth evaluation of a model′s abilities is crucial. We address the issue of thorough performance evaluation in detecting ADEs with hand-crafted templates for four capabilities, temporal order, negation, sentiment and beneficial effect. We find that models with similar performance on held-out test sets have varying results on these capabilities.