Metabolic profiles to predict long-term cancer and mortality: the use of latent class analysis

Background Metabolites are genetically and environmentally determined. Consequently, they can be used to characterize environmental exposures and reveal biochemical mechanisms that link exposure to disease. To explore disease susceptibility and improve population risk stratification, we aimed to identify metabolic profiles linked to carcinogenesis and mortality and their intrinsic associations by characterizing subgroups of individuals based on serum biomarker measurements. We included 13,615 participants from the Swedish Apolipoprotein MOrtality RISk Study who had measurements for 19 biomarkers representative of central metabolic pathways. Latent Class Analysis (LCA) was applied to characterize individuals based on their biomarker values (according to medical cut-offs), and the resulting classes were then examined as predictors of cancer and death using multivariable Cox proportional hazards models.
Results LCA identified four metabolic profiles within the population: (1) normal values for all markers (63% of population); (2) abnormal values for lipids (22%); (3) abnormal values for liver functioning (9%); (4) abnormal values for iron and inflammation metabolism (6%). All abnormal metabolic profiles (classes 2–4) were associated with an increased risk of cancer and mortality compared to class 1 (e.g. HR for overall death was 1.26 (95% CI: 1.16–1.37), 1.67 (95% CI: 1.47–1.90), and 1.21 (95% CI: 1.05–1.41) for class 2, 3, and 4, respectively).
Conclusion We present an innovative approach to risk stratify a well-defined population for cancer and mortality based on LCA-defined metabolic subgroups. Our results indicate that standard-of-care baseline serum markers, when assembled into meaningful metabolic profiles, could help assess long-term risk of disease and provide insight into disease susceptibility and etiology.
Electronic supplementary material The online version of this article (10.1186/s12860-019-0210-7) contains supplementary material, which is available to authorized users.

field of cancer epidemiology [14]. It refers to every non-genetic exposure to which an individual is subjected from conception to death [14,15]. Specifically, metabolites, part of the internal exposome, are both genetically and environmentally determined and can consequently be used to characterize environmental exposures and reveal biochemical mechanisms that link exposure to disease [15][16][17][18]. Hence, the internal distribution of metabolites and their interactions might help unravel cancer susceptibility in a population.
With the overall goal of identifying statistical methods to stratify individuals by their underlying risk of developing cancer and their mortality risk, we took a data-driven approach, using standard serum markers available from routine health check-ups to study susceptibility to cancer and death in a well-defined cohort of 13,615 participants from the AMORIS study (Apolipoprotein MOrtality RISk) [19,20]. More specifically, the study set out to explore population heterogeneity and cancer susceptibility by investigating serum metabolic profiles using latent class analysis (LCA). This data reduction method clusters covariates based on models of data distribution probabilities. It allows for evaluation of clusters of biomarkers linked to carcinogenesis and their intrinsic associations, which ultimately helps us assess their possible role in predicting long-term cancer and mortality.

Characteristics of the study population
A total of 1,956 individuals (14.37%) developed cancer after at least 3 years of follow-up, including 655 breast and genito-urinary cancers, 330 digestive cancers, 133 respiratory cancers and 129 lymphatic and hematopoietic cancers. The mean follow-up time for cancer was 16.6 years; the median follow-up time in the cohort was 17.22 years (minimum 3.01 years, maximum 24.77 years). A total of 3,158 participants (23.20%) died during a mean follow-up of 17.3 years, including 706 cancer-specific deaths. Study population characteristics by cancer status are shown in Table 1.
Latent class analysis characterizes the study population into four metabolic profiles
LCA was executed using the dichotomized values of the biomarkers to facilitate the biological interpretation of the results. The Chi-squared distribution criterion for model selection indicated a best-fit model comprising four LCA classes, while AIC and BIC stabilized at four classes (Fig. 1a, b) [21]. From 12 classes onwards, the models no longer converged to a local maximum of the likelihood. The class allocation of the observations (individuals), the class-conditional probability of each biomarker and the latent mixing proportions were obtained by running the poLCA package in the R statistical language. Table 2 and Fig. 2 outline the LCA-derived classes with the estimated class population proportions, the class-conditional probabilities for each of the biomarkers and the biological interpretation of the LCA-derived classes. The four mutually exclusive classes characterize the population in metabolic profiles based on class-conditional probabilities: (1) those with a probability below 0.3 of an abnormal value for every marker, and therefore considered the normal class (63% of the population); (2) those with abnormal values for lipid markers (22%); (3) those with abnormal values for liver function markers (9%); (4) those with abnormal values for iron and inflammation metabolism (6%).
A validation of the characterization of the population obtained with the latent class methodology is outlined in Additional file 1: Table S3. The baseline clinical characteristics of the individuals by LCA-derived metabolic class (Additional file 1: Table S3) replicate the results displayed in Table 2 for the class-conditional probabilities.
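As an illustration of how class allocation follows from the class-conditional probabilities and the mixing proportions, the sketch below computes posterior class membership for one individual under the local independence assumption and assigns the most probable class. It is a minimal Python sketch, not the poLCA implementation used in the study; the function names and the two-class, three-marker numbers are hypothetical and are not taken from Table 2.

```python
from math import prod


def posterior_class_probabilities(indicators, mixing, cond_probs):
    """Return P(class | observed binary indicators) under local independence.

    indicators: list of 0/1 values for one individual (1 = abnormal value).
    mixing: list of class proportions that sum to 1.
    cond_probs: cond_probs[k][j] = P(marker j is abnormal | class k).
    """
    joint = []
    for pi_k, probs_k in zip(mixing, cond_probs):
        # Product over markers of the Bernoulli probability of the observed value.
        likelihood = prod(p if y == 1 else 1.0 - p
                          for y, p in zip(indicators, probs_k))
        joint.append(pi_k * likelihood)
    total = sum(joint)
    return [value / total for value in joint]


def modal_class(posteriors):
    """0-based index of the class with the highest posterior probability."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


# Hypothetical two-class, three-marker example (illustrative values only).
mixing = [0.7, 0.3]
cond_probs = [[0.1, 0.1, 0.2],   # a "normal"-like class
              [0.8, 0.7, 0.6]]   # an "abnormal"-like class
post = posterior_class_probabilities([1, 1, 0], mixing, cond_probs)
print(post, modal_class(post))
```

Assigning each individual to the class with the highest posterior probability is one common convention; it yields mutually exclusive classes of the kind described above.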

LCA-derived metabolic profiles as cancer and mortality predictors
We then investigated the ability of the four LCA-derived metabolic profiles to predict overall cancer risk, risk of specific cancer types, cancer mortality and overall mortality, using the healthy metabolic profile (Class 1) as the reference level (Tables 3 and 4).
All metabolic profiles increased risk of cancer and mortality compared to Class 1. For instance, individuals in Class 3 (abnormal liver function profile) had a higher risk of overall cancer (HR: 1.28 (95% CI: 1.10-1.50)), but also worse cancer-specific survival and overall survival compared to those in Class 1 (Tables 3 and 4). Class 2 (abnormal lipid profile) and Class 4 (abnormal iron and inflammatory markers) were positively associated with overall death, while Class 2 was also associated with cancer-specific death. The results were consistent for both time-scales (Tables 3 and 4).
When assessing the risk of specific cancer types, several patterns emerged (Tables 3 and 4). Individuals in Class 2 (abnormal lipid markers) had a higher risk of lymphatic and hematopoietic tissue cancer (HR: 1.72 (95% CI: 1.15-2.56)). There was a greater risk of digestive cancers in individuals in Class 3 (abnormal values of liver enzymes) (HR: 2.12 (95% CI: 1.54-2.91)), while individuals in Class 4 (abnormal iron markers and inflammation) had a higher risk of buccal and oral system cancers than individuals in Class 1 (HR: 3.94 (95% CI: 1.38-11.30)) (Table 3).
Moreover, the risk of connective tissue and endocrine gland cancers was higher in individuals in the liver metabolic profile (HR: 2.65 (95% CI: 1.00-7.02)) and in participants in the iron marker and inflammation profile (HR: 3.00 (95% CI: 1.11-8.11)). Similar associations were observed when using age as the time scale in the multivariable Cox proportional hazards regression model (Tables 3 and 4).

Discussion
We demonstrated that standard-of-care baseline serum markers, when assembled into meaningful metabolic profiles, can help stratify the population for cancer risk, cancer mortality and overall mortality. More specifically, we observed that abnormal values for markers of lipid metabolism, liver function, and inflammatory and iron metabolism group participants into metabolic profiles that are predictive of long-term cancer risk and/or mortality.

Metabolic profiles
Among the biological pathways addressed in our LCA, abnormalities in lipid metabolism were the most common. Hyperlipidemia was present in about a quarter of the study population, accounting for the largest abnormal metabolic profile. The weight of the lipid profile in the analysis was consistent with the global prevalence of hypercholesterolemia among adults (37% for males and 40% for females) in the 2008 estimates of the World Health Organization (WHO) Global Health Observatory and with the results for the Swedish population in the WHO MONICA project [22]. Dyslipidemias are associated with a higher risk of CVD and other chronic diseases such as cancer, as also observed in our study [23]. Liver dysfunction, iron deficiency and altered inflammatory marker profiles also distinguished important subgroups in our study population. About 9% of our population had abnormal values for markers of liver function (GGT, AST and ALT), which is similar to the results of a population-based survey in the United States that estimated that abnormal alanine aminotransferase (ALT) was present in 9% of respondents in the absence of viral hepatitis C or excessive alcohol consumption [24]. Moreover, these enzymes are known to be linked to cancer because of their role in preserving intracellular homeostasis under oxidative stress [25][26][27], which is concordant with the results of these analyses. The iron and inflammatory marker profile comprised 6% of individuals in the study and was predominantly driven by low levels of serum iron and TIBC, as well as high levels of CRP and leukocytes. This could point towards anemia of inflammation, a chronic inflammatory condition with low iron values that occurs because iron deficiency provides the body with infection resistance, which demonstrates the tight connection between the inflammatory response and iron homeostasis [28]. This condition has been reported in more than 30% of cancer patients at the time of diagnosis.
Metabolic profiles as a risk factor for long-term cancer and mortality
The three classes of abnormal metabolic profiles described above were all associated with an increased risk of cancer and worse survival compared to the healthy class. The findings therefore confirm the key importance of these metabolic pathways in maintaining intracellular homeostasis and show how their imbalance can be related to the etiology of cancer and to mortality [2]. The LCA adapted in this study thus illustrates how a biomarker-wide approach can help assess markers of the blood exposome in the context of carcinogenesis and mortality [29] (Fig. 3). More specifically, individuals presenting abnormal liver function markers had worse outcomes in terms of overall cancer risk and cancer death, and showed a positive association with diagnosis of digestive, connective tissue and endocrine cancers. Moreover, the participants with this profile had a higher probability of overall death. These results are consistent with previously published data. A positive association between elevated GGT and overall cancer risk, with no interaction with ALT, was previously found in the AMORIS cohort [30] and has also been reported in other large cohort studies [31,32]. These studies also found strong associations between elevated levels of GGT and the incidence of digestive and respiratory cancers. Elevated GGT has been associated with mortality from all causes, liver disease, cancer and diabetes, while ALT showed associations only with liver disease death in a large US cohort [33]. However, a study based on an elderly population found that GGT was associated with increased cardiovascular disease mortality, and ALP and AST with increased cancer-related mortality [34]. Moreover, a meta-analysis evaluating the associations between liver enzymes and all-cause mortality found positive independent associations of baseline levels of GGT and ALP with all-cause mortality [35].
In the present study, the liver biomarker profile was positively associated with all the outcomes studied, suggesting a key role of this pathway in the development of cancer, probably related to its active role in maintaining intracellular redox regulation. Further investigations are necessary to establish the potential of the altered liver enzyme profile as a tool for cancer risk stratification. Individuals allocated to the lipid profile showed positive associations with cancer mortality and overall mortality, and a higher risk of lymphatic and hematopoietic cancers. The link between hyperlipidemia and mortality has been studied broadly, with established links to cancer and all-cause mortality [36][37][38]. The association between lipids and lymphatic and hematopoietic cancers is more controversial, as other studies found an inverse association between these cancers and high levels of serum cholesterol [39,40]. However, a systematic literature review from 2016 found no association [41].
Participants clustered in the unbalanced iron and inflammation profile had an increased risk of endocrine, buccal and oral cancers and a higher risk of all-cause death. Altered inflammation and iron metabolism are key metabolic 'hallmarks of cancer' [2,42,43]. Our observation of an association with an increased risk of buccal and oral cancer corroborates previous findings in AMORIS [42].

Population heterogeneity and risk stratification: the need for data reduction techniques
The modulating effect of population heterogeneity on the association between potential risk factors and disease is a new avenue for understanding the variability of risk in the population [44]. For instance, in a targeted metabolomics exercise, Shan et al. performed a principal component analysis and a time-to-event analysis to identify metabolic profiles that predict risk of CVD [13]. Another study used Monte Carlo cross-validation and Lasso logistic regression to evaluate serum biomarkers as an alternative to fecal immunochemical testing to improve detection of colorectal cancer [11]. In 2010, the European Prospective Investigation on Cancer and Nutrition (EPIC) cohort reported that a specific prediagnostic plasma phospholipid fatty acid profile could predict the risk of gastric cancer [45]. As rationalized in the HELIX project, these multiple profiling approaches aim to identify groups of individuals in the population who share a similar exposome that might account for differences in the specific risk under study [46]. Together with these studies, our systematic data integration approach based on LCA demonstrates the potential of investigating population heterogeneity using metabolic profiles as risk factors for predicting long-term cancer risk and mortality. However, to establish the predictive capability of these LCA metabolic profiles and implement their use in a clinical setting, further validation studies that allow sensitivity and specificity to be measured will be needed, such as a nested case-control study in AMORIS that could determine how well the metabolic profiles estimate cancer risk and mortality.

Strengths and limitations
The present study was conducted in a large and well-defined population, applying a multi-faceted approach covering the main biological pathways to assess biomarker profiles that could indicate cancer risk, cancer survival and mortality. The major strength of these analyses lies in the innovative avenue to study population heterogeneity and susceptibility to disease and mortality in a large cohort of participants with multiple measurements, all measured on fresh blood samples on the same day at the same clinical laboratory. We included all the markers available in the cohort for a large population (n > 13000); however, not every marker of the central metabolic pathways was available in the database (e.g. complete blood count). Lifestyle factors established as cancer risk factors, such as tobacco smoking, low physical activity, poor diet, alcohol intake, obesity and hypertension, were only partially available in AMORIS, which limited their use in the study. To mitigate the lack of some of these external factors, such as BMI, the analyses were adjusted for the Charlson Comorbidity Index, which includes comorbidities such as obesity and hypertension. The lack of other lifestyle factors, such as alcohol consumption, was mitigated by using information on serum biomarkers such as gamma glutamyl transferase and other liver enzymes. All participants were selected by analyzing blood samples from health check-ups in non-hospitalized individuals from the greater Stockholm area, ensuring good internal validity in the study.
Future studies will benefit from a longitudinal approach with repeated serum marker measurements that will capture the population's phenotypic variations in relation to disease over long periods of time and will help to improve our understanding of the biomarkers' impact on carcinogenesis and mortality.
Fig. 3 Study statistical pipeline describing the methodology followed in the project. We explored the blood exposome using metabolic markers of the population to assess how population heterogeneity is associated with cancer risk and mortality

Conclusion
Our findings support the recently expressed need for a shift from the classical epidemiological approach of assessing one exposure to a systemic approach with multiple exposures. The LCA adapted in this study illustrates how a biomarker-wide approach can help assess population susceptibility to disease and provide insight into disease etiology in the context of carcinogenesis and mortality (Fig. 3). Given the environmental and genetic modulation of metabolic molecules, metabolic profiling based on standard of care serum markers could become a useful non-invasive predictive signature for risk stratification and an important area of research for mechanisms and clinical relevance.

Study design and study population
The AMORIS study, a large prospective cohort study, has been described in detail elsewhere [19,47,48]. Briefly, the AMORIS database is based on linkages with the Central Automation Laboratory (CALAB) database, which analyzed fresh blood samples from subjects from the greater Stockholm area. All individuals were either healthy individuals referred for clinical laboratory testing as part of a general health check-up or outpatients between 1985 and 1996. The AMORIS cohort has been linked to several Swedish national registries such as the National Cancer Register, the Patient Register, the Cause of Death Register, the consecutive Swedish Censuses during , and the National Register of Emigration, using the Swedish 10-digit personal identity number. These linkages provide detailed information on demographics, lifestyle, socio-economic status, vital status, cancer diagnosis, comorbidities and emigration. The AMORIS study conformed to the Declaration of Helsinki and was approved by the ethics board of the Karolinska Institute.
These biomarkers were selected to reflect common metabolic pathways: lipid (TC, TG, ApoA-I, ApoB, HDL and LDL) and glucose metabolism (Glucose, FAMN), liver function (GGT, ALT and AST), inflammation (Albumin, WBC and CRP), iron metabolism (FE and TIBC), kidney function (Creatinine) and phosphate (Phosphate and Calcium). The blood metabolites included in the analysis were all the standard serum markers available from routine health check-ups. Most of the markers included have previously been studied individually in AMORIS; however, no systemic integrative approach to examine the interactions among metabolic markers and susceptibility to cancer has been conducted to date [30,42,[50][51][52][53][54][55][56][57][58][59]. All participants were free from cancer at the time of study entry, and none were diagnosed with cancer within the first three years of follow-up, to avoid reverse causation.
The main exposure variables for the analyses were the above-mentioned metabolic biomarkers, for which the values were categorized using standardized clinical cutoffs based on recognized medical criteria to facilitate interpretation of the results (Additional file 1: Table S2). The main outcomes were first cancer diagnosis (as registered in the National Cancer Register using ICD-9 for the years 1987-1992, ICD-O/2 for the years 1993-2004 and ICD-O/3 from 2005 onwards) and mortality. As secondary outcomes, we explored those cancer types for which there were more than 30 events during follow-up. Likewise, cancer mortality was explored. Follow-up time was assessed specifically for each of the outcomes studied. For cancer diagnosis, follow-up time was defined as the time from blood draw until the date of first cancer diagnosis, death, emigration or study closing date (31st of December 2012), whichever occurred first. The follow-up time for death was defined as the time from blood draw until the date of death, emigration or study closing date (31st of December 2012), whichever occurred first.
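The follow-up definitions above amount to taking the earliest of several dates. The following Python sketch shows one way this could be computed; the function name, its arguments and the 365.25-day year are assumptions made for illustration, and only the study closing date comes from the text.

```python
from datetime import date

STUDY_END = date(2012, 12, 31)  # study closing date stated in the text


def follow_up(blood_draw, cancer=None, death=None, emigration=None,
              study_end=STUDY_END):
    """Return (years, reason): follow-up ends at the earliest available date.

    Leave cancer=None to obtain follow-up for the death outcome.
    Events dated after study_end are censored at study_end.
    """
    candidates = [(reason, when)
                  for reason, when in (("cancer", cancer),
                                       ("death", death),
                                       ("emigration", emigration))
                  if when is not None]
    candidates.append(("study_end", study_end))
    # min() keeps the first entry on ties, so an event on the closing date
    # is still reported as that event rather than as censoring.
    reason, end = min(candidates, key=lambda item: item[1])
    years = (end - blood_draw).days / 365.25  # assumed year length
    return years, reason


years, reason = follow_up(date(1990, 1, 1), cancer=date(2000, 1, 1))
print(round(years, 2), reason)
```

Returning the reason alongside the duration makes it straightforward to derive the event indicator (event versus censoring) needed for time-to-event models.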
Information on the following potential confounders was also incorporated: age, sex and comorbidities. The latter was quantified using the Charlson Comorbidity Index (CCI) calculated based on data from the National Patient Register. The CCI comprises 19 disease categories, all assigned a weight. The sum of an individual's weights was used to create the CCI ranging from no comorbidity to severe comorbidity (0, 1, 2, and ≥ 3) [60].
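The CCI categorization described above can be expressed in a few lines. This is a minimal Python sketch with a hypothetical function name; the weights passed in the example are placeholders, not the published Charlson weights.

```python
def cci_category(weights):
    """Sum an individual's comorbidity weights and map the total
    to the categories used in the analysis: "0", "1", "2" or ">=3"."""
    total = sum(weights)
    return ">=3" if total >= 3 else str(total)


# Placeholder weights for three hypothetical individuals.
print(cci_category([]), cci_category([1]), cci_category([1, 2]))
```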

Data analysis
First, we calculated Pearson correlation coefficients to measure the strength of association between the biomarkers included in the analysis. Pearson's correlation analyses showed strong correlation between the different biomarkers in the lipid metabolism (TC, LDL and ApoB (r > 0.7); HDL and ApoA-I (r > 0.8)). We replaced the individual lipid biomarkers with the established ApoB/ApoA-I ratio and log (TG/HDL) ratio [20,49,61,62] to avoid collinearity and to comply with the principle of local independence required by latent class analysis [63]. Most of the markers were normally distributed, except for the liver biomarkers.
Latent Class Analysis (LCA) [63,64] is a model-based clustering method that reduces the dimension of the data by clustering covariates into latent classes, using a probabilistic model that describes the data distribution, and it assesses the probability that individuals belong to certain latent classes. LCA avoids the use of a linear combination or a random distance definition to reduce the number of covariates [65] and has recently been employed in health sciences [21,66]. More specifically, we applied LCA to characterize different classes of individuals based on their metabolic profiles [67] and to evaluate intrinsic associations between the biomarkers, using the poLCA package [68] in the R statistical programming language. We first determined the optimal number of LCA-derived classes by fitting step-wise models with different numbers of classes, starting with the null model and adding one extra class in each model until reaching the total number of biomarkers in the data, for as long as the model kept converging to a local maximum of the likelihood. The criteria used for model selection (Akaike information criterion (AIC), Bayesian information criterion (BIC) and Chi-squared distribution) were evaluated to identify the model with the best goodness of fit and to define the optimal number of LCA-derived metabolic classes that characterized our dataset. To identify which sets of biomarkers predominantly explained each latent class, how the classes were distributed across the study population and which individuals were allocated to each class, we assessed the conditional probabilities, mixing proportions and class memberships of the best-fitting latent class model.
Once each subject was assigned to an LCA-derived metabolic class, we conducted multivariable Cox proportional hazards regression to examine whether the LCA-derived metabolic classes were associated with long-term risk of overall cancer as well as specific cancer types. In addition, we evaluated how the classes were associated with all-cause death and cancer-specific death. All models were adjusted for age, sex, and CCI. We performed a sensitivity analysis using age as the time-scale, as age is potentially a strong confounder. Moreover, Schoenfeld residuals were tested to check the proportional hazards assumption of the multivariable Cox proportional hazards regression analysis.
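The correlation screen and the two lipid ratios can be sketched as follows. This is an illustrative Python sketch with hypothetical function names and made-up input values; the text writes "log (TG/HDL)" without stating the base, so base 10 is an assumption here.

```python
from math import log10, sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lipid_ratios(apo_b, apo_a1, tg, hdl):
    """Return (ApoB/ApoA-I, log10(TG/HDL)); the log base is an assumption."""
    return apo_b / apo_a1, log10(tg / hdl)


# Made-up values for illustration only.
tc = [4.8, 5.6, 6.3, 7.1, 5.0]
ldl = [2.9, 3.6, 4.2, 4.9, 3.0]
print(round(pearson_r(tc, ldl), 3))
print(lipid_ratios(apo_b=1.0, apo_a1=1.25, tg=2.0, hdl=1.0))
```

Collapsing highly correlated markers into ratios reduces the number of strongly dependent indicators entering the latent class model, which is the motivation stated above.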
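To make the model selection step concrete, the sketch below computes AIC and BIC for latent class models with binary indicators, where a model with K classes and J indicators has (K - 1) + K * J free parameters. It is a simplified Python illustration: the log-likelihood values and the number of indicators are hypothetical, the function names are invented for this example, and picking the minimum is a simplification of the procedure described above, which also considered the Chi-squared criterion.

```python
from math import log


def lca_free_parameters(n_classes, n_indicators):
    """Free parameters of an LCA with binary indicators:
    (K - 1) mixing proportions plus K * J class-conditional probabilities."""
    return (n_classes - 1) + n_classes * n_indicators


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 logL + 2 p."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 logL + p ln(N)."""
    return -2.0 * log_likelihood + n_params * log(n_obs)


def best_by(criterion_values):
    """Number of classes with the smallest criterion value."""
    return min(criterion_values, key=criterion_values.get)


# Hypothetical maximized log-likelihoods per number of classes.
log_liks = {1: -52000.0, 2: -50500.0, 3: -50100.0, 4: -49900.0, 5: -49890.0}
n_obs = 13615        # cohort size from the text
n_indicators = 15    # assumed number of binary indicators
bics = {k: bic(ll, lca_free_parameters(k, n_indicators), n_obs)
        for k, ll in log_liks.items()}
aics = {k: aic(ll, lca_free_parameters(k, n_indicators))
        for k, ll in log_liks.items()}
print(best_by(aics), best_by(bics))
```

Both criteria penalize the log-likelihood by the number of parameters, so a further class is only favored when it improves the fit by more than the added penalty.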
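For reference, the regression described above can be written in the standard Cox proportional hazards form, with Class 1 as the reference category. The notation is introduced here for illustration and is not taken from the original article: c denotes the LCA-derived class, z the adjustment covariates (age, sex and CCI), and the confidence interval shown is the usual Wald interval, which is assumed rather than stated in the text.

```latex
\[
h(t \mid c, \mathbf{z}) = h_0(t)\,
\exp\!\Big( \sum_{k=2}^{4} \beta_k \,\mathbb{1}[c = k]
            + \boldsymbol{\gamma}^{\top}\mathbf{z} \Big),
\qquad
\mathrm{HR}_k = e^{\beta_k},
\qquad
95\%\ \mathrm{CI}:\ \exp\!\big( \hat{\beta}_k \pm 1.96\,\mathrm{SE}(\hat{\beta}_k) \big)
\]
```

Under this form, each reported hazard ratio compares one abnormal metabolic class with Class 1 at the same values of the adjustment covariates, and the proportional hazards assumption states that this ratio does not change over follow-up time.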
Data management and statistical analyses were performed using Statistical Analysis Systems (SAS) release 4.3 (SAS Institute, Cary, NC) and R version 3.0.2 (R Foundation for Statistical Computing, Vienna, Austria).

Additional file
Additional file 1: Table S1. Laboratory fully automated methods with automatic calibration were performed at one accredited laboratory (CALAB) to measure the serum biomarkers examined in the study. Table S2. Panel of serum markers describing standard medical cutoffs information. Table S3

Availability of data and materials
The authors confirm that, for ethical and legal reasons, there are restrictions on general public access to the data underlying the findings of this study. The database comprises not only the AMORIS cohort but is a merged database. This includes AMORIS plus information from the Swedish National Patient Registry, the National Cause of Death Registry, SWEDEHEART, the Work Lipids, Fibrinogen study, the Cohort of Swedish Men Study, the Swedish Mammography Cohort, the cohort of 60-year-old subjects in Stockholm, the Sollentuna Primary Prevention study and the National Prescribed Drug Register. The merged database from these sources contains sensitive information and is therefore anonymized and located on a secure server with restricted access at the Institute of Environmental Medicine, Karolinska Institutet in Stockholm. Professor Maria Feychting (maria.feychting@ki.se) and Sofia Carlsson (sofia.carlsson@ki.se) are both members of the Steering Committee of the AMORIS cohort and are based at the Unit of Epidemiology, Institute of Environmental Medicine, which hosts the database. They would both be able to respond to external requests for data access, provided that the interested party can obtain approval from the data owners, including the National Board of Health and Welfare in Sweden (http://www.socialstyrelsen.se/english) and Statistics Sweden (http://www.scb.se/en_/), as well as from the owners of the research registers at Karolinska Institutet, Stockholm, Sweden. To ensure persistent and long-term database storage and availability, the AMORIS cohort database is stored at the Institute of Environmental Medicine and the storage follows the principles kept at Karolinska Institutet. Subject to permission and the restrictions above, the database can be accessed remotely through a secure LAN solution.
Ethics approval and consent to participate
The study conformed to the Declaration of Helsinki and was approved by the ethics board of Karolinska Institutet, which waived the need for consent.