Purpose: To externally test four chest radiograph (CXR) classifiers on a large, diverse, real-world dataset with robust subgroup analysis. Materials and Methods: In this retrospective study, adult posteroanterior CXRs (January 2016–December 2020) and associated radiology reports from Trillium Health Partners (THP) in Ontario, Canada, were extracted and deidentified. An open-source natural language processing tool was locally validated and used to generate ground truth labels for the 197,540-image dataset from the associated radiology reports. Four classifiers generated predictions on each CXR. Performance was evaluated using accuracy, positive predictive value, negative predictive value, sensitivity, specificity, F1 score, and Matthews correlation coefficient for the overall dataset and for patient, setting, and pathology subgroups. Results: Classifiers demonstrated 68%–77% accuracy, 64%–75% sensitivity, and 82%–94% specificity on the external testing dataset. Algorithms showed decreased sensitivity for solitary findings (43%–65%), patients under 40 years of age (27%–39%), and patients in the emergency room (38%–60%), and decreased specificity on normal CXRs with support devices (59%–85%). Performance differences by sex and ancestry corresponded to movement along each algorithm's receiver operating characteristic curve and were likely due to calibration. Conclusion: Performance of deep learning CXR classifiers was subject to patient, setting, and pathology factors, demonstrating that subgroup analysis is necessary to inform implementation and to monitor ongoing performance to ensure optimal quality, safety, and equity.
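
The seven performance measures listed in Materials and Methods can all be derived from a binary confusion matrix, and a subgroup analysis repeats the same calculation within each stratum. The following is a minimal illustrative sketch, not the study's actual code: the function names (`binary_metrics`, `subgroup_metrics`), the record layout, and the toy subgroup labels are assumptions introduced here, and it treats each finding as a single binary label with an already thresholded prediction.

```python
from collections import defaultdict
from math import sqrt


def safe_div(num, den):
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None


def binary_metrics(tp, fp, tn, fn):
    """Compute the seven metrics named in the abstract from confusion-matrix counts.

    Hypothetical helper for illustration; undefined metrics are returned as None.
    """
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "ppv": safe_div(tp, tp + fp),            # positive predictive value
        "npv": safe_div(tn, tn + fn),            # negative predictive value
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


def subgroup_metrics(records):
    """Tally a confusion matrix per subgroup, then compute metrics for each.

    `records` is an iterable of (subgroup, y_true, y_pred) tuples with 0/1 labels
    (an assumed layout, e.g. subgroup = care setting or age band).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for subgroup, y_true, y_pred in records:
        if y_true and y_pred:
            key = "tp"
        elif y_pred:
            key = "fp"
        elif y_true:
            key = "fn"
        else:
            key = "tn"
        counts[subgroup][key] += 1
    return {group: binary_metrics(**c) for group, c in counts.items()}


if __name__ == "__main__":
    # Toy records, not study data: (subgroup, report-derived label, classifier output).
    toy = [
        ("emergency", 1, 1), ("emergency", 1, 0), ("emergency", 0, 0),
        ("inpatient", 1, 1), ("inpatient", 0, 1),
    ]
    for group, metrics in subgroup_metrics(toy).items():
        print(group, metrics)
```

Reporting sensitivity and specificity separately for each subgroup, rather than accuracy alone, is what makes it possible to see whether a difference reflects a shifted operating point on the same receiver operating characteristic curve or a genuine loss of discriminative ability.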