An IDIBAPS-Clínic study shows for the first time the benefits of Google-PubMed "big data" for the geoepidemiology of systemic autoimmune diseases

Big data is changing the landscape of scientific research, including medicine. In areas such as epidemiology and genetics, it allows entire or almost entire populations to be analyzed in studies that can be relatively basic (determining the impact of a specific disease on a population) or very complex (analyzing the effectiveness of new drugs in a whole population, or large-scale genetic research).

Now a geoepidemiologic study of 394,827 patients with systemic autoimmune diseases (SAD), led by the Systemic Autoimmune Diseases team at IDIBAPS, the Transverse Group for Research in Primary Care from IDIBAPS and CAPSBE, and the Hospital Clínic, in cooperation with Tel Aviv University, King's College London and Harvard Medical School in Boston, shows the results of a massive data search combining Google and PubMed, and the benefits it can bring to medical research on SAD. The study is available online in the journal Autoimmunity Reviews.

Manuel Ramos-Casals, Pilar Brito-Zerón, Belchin Kostov, Antoni Sisó-Almirall, Xavier Bosch, David Buss, Antoni Trilla, John H. Stone, Munther A. Khamashta and Yehuda Shoenfeld are the authors of this study, which explores the potential of the Google search engine for collecting a large amount of data on SAD referenced in the scientific database PubMed. The aim is to obtain a high-definition geoepidemiologic 'map' for each disease using a large number of epidemiological, geographical, ethnic and clinical variables.


The team performed a text search on Google from 20th to 31st January 2015 using the search algorithm “systemic autoimmune disease” “1000 ... 100,000 patients” and “www.ncbi.nlm.nih.gov”. To the references returned by Google, the researchers manually added references from the reference lists of relevant articles that the search had not retrieved. They excluded reviews, meta-analyses, epidemiological studies in patients without systemic autoimmune diseases, and duplicated cohorts.

The resulting references were analyzed independently by two reviewers. From the selected studies, they extracted data on the type of SAD, the number of patients, country, data sources, types of databases, inclusion and exclusion criteria, distribution by gender, age at the start of the selected study, diagnosis and ethnicity.

The most studied SAD

Of all the SAD searched for, the study collected relevant data on 15: systemic lupus erythematosus (SLE), Kawasaki disease, giant cell arteritis, Behçet disease, systemic sclerosis, primary Sjögren syndrome, sarcoidosis, primary immunodeficiencies, amyloidosis, medium-sized vasculitis, polymyalgia rheumatica, inflammatory myopathies, granulomatosis with polyangiitis, Schönlein-Henoch disease and antiphospholipid syndrome. The study did not obtain relevant data on 10 SAD: relapsing polychondritis, Still disease, haemophagocytic syndrome, IgG4-related disease, polyarteritis nodosa, Takayasu arteritis, Buerger disease, eosinophilic granulomatosis with polyangiitis, microscopic polyangiitis and Cogan disease. In total, the researchers collected data from 85 studies covering 394,827 patients, of whom 359,838 were finally analyzed.

The SAD with the highest number of reported cases in children was Kawasaki disease, whereas in adults the most reported were systemic lupus erythematosus (SLE), giant cell arteritis, Behçet disease, systemic sclerosis and Sjögren syndrome.

The analysis shows a predominant use of medical databases over administrative ones (74% vs. 26%), of public health system sources over health insurance company sources (88% vs. 12%), of patient-based study designs over population-based designs (82% vs. 18%), and of classification criteria over clinical approaches and diagnostic codes (53% vs. 22% vs. 25%).

Gender and age in SAD

The female predisposition to develop an autoimmune disease is well known, although the specific cause of this gender predominance remains unknown. However, the gender ratio varies significantly depending on the disease, and there are also differences between the SAD that develop in adults and those that occur in children.

The massive data study shows that 73.1% of patients were women, an overall female-to-male ratio of about 3 to 1. The most unequal gender ratio was found in primary Sjögren syndrome, with nearly 10 women affected for every man, followed by SLE, systemic sclerosis (SSc) and antiphospholipid syndrome (APS), all with a ratio of almost 5 women for every man. Giant cell arteritis (GCA), polymyalgia rheumatica and inflammatory myopathies had a female-to-male ratio of 2-3 to 1. Sarcoidosis and Behçet disease showed a very balanced ratio (1:1), while in vasculitis and amyloidosis the ratio was reversed, with the diseases being more common in men than in women. The three childhood SAD (primary immunodeficiencies, Kawasaki disease and Schönlein-Henoch disease) showed a predominance in boys, with 58% of cases.
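The link between the percentages and the ratios quoted above is simple arithmetic, and can be sketched in a few lines of Python. This is only an illustration: the function name `sex_ratio` is hypothetical and is not taken from the study, and the only inputs used are the 73.1% share of women and the 58% share of boys reported in this article.

```python
def sex_ratio(majority_share_pct: float) -> float:
    """Convert the percentage of one sex in a cohort into a ratio
    of that sex to the other (hypothetical helper, for illustration)."""
    if not 0 <= majority_share_pct < 100:
        raise ValueError("share must be a percentage in [0, 100)")
    return majority_share_pct / (100.0 - majority_share_pct)


# 73.1% women overall -> women per man
women_per_man = sex_ratio(73.1)
print(f"{women_per_man:.1f} women per man")

# 58% boys in the three childhood SAD -> boys per girl
boys_per_girl = sex_ratio(58.0)
print(f"{boys_per_girl:.1f} boys per girl")
```

On these figures the overall ratio comes out at roughly 2.7 women per man, which the article rounds to 3 to 1, and the childhood ratio at roughly 1.4 boys per girl.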

Regarding the age of onset of SAD, the big data study shows great variability: for some diseases the age of onset is between 0 and 1 year, while for others, such as APS and Behçet disease, the maximum age at diagnosis is 94-96 years.

The massive data confirm a well-known feature of SAD: each disease predominantly affects a specific age group. Kawasaki disease (2.57 years at onset), primary immunodeficiencies (3.3 years) and Schönlein-Henoch disease (5.24 years) affect children, while SLE (33 years), Behçet disease (36 years) and sarcoidosis (38 years) affect young adults. Systemic sclerosis (51 years) and vasculitis (52 years) are found in middle-aged people, while amyloidosis (63 years), polymyalgia rheumatica (PMR) and giant cell arteritis (GCA) (73 years) predominantly affect the elderly.

SAD geography and ethnicity

The researchers found that each disease shows a significant geographical concentration: 6 of 7 studies of patients with giant cell arteritis (GCA) and polymyalgia rheumatica (PMR) were carried out in northern Europe; 5 of 6 studies of amyloidosis were performed in the US; and most studies of Behçet disease (11 of 12) and Kawasaki disease (5 of 7) were conducted in Asia. A significant geographical concentration was also found in the origin of the databases used in the studies analyzed, with a predominant use of administrative databases over medical ones.

Regarding the ethnicity of patients, only 14 studies (16%) analyzed the ethnic breakdown. Five of these were conducted with very homogeneous samples, with 90% Caucasian or Asian patients. The other nine were conducted with multiethnic cohorts from the Americas, mainly from the United States but also from Canada and Latin America.

Of these nine studies, six were carried out in a well-defined geographical area, which allowed the research team to compare the ethnic distribution of patients with that reported for the general population of the same area. The results show a higher frequency of three SAD (SLE, inflammatory myopathies and Kawasaki disease) in African American patients, and a higher frequency of Kawasaki disease in Asian patients. Two of these studies, carried out in patients with SLE in California, found that the disease was more frequent in Caucasian patients and less frequent in Hispanic patients, possibly reflecting differences in these groups' access to medical care plans.

The future of medical research in the 21st century

Medical research in the twenty-first century has begun a radical change thanks to the growing convergence of medicine with technological advances, especially in the digital field. The most innovative projects promote the interaction of medicine with the great innovations of the early part of this century (Big Data, Internet of Things (IoT), 3D Printing, Analytics and Social Media). The leaders of medical research in the coming years will need not only a high level of medical expertise in their field, but also sound knowledge of these new technological areas and the ability to work in and lead teams that bring together, besides doctors, biologists, geneticists and statisticians, also mathematicians, bioinformaticians, programmers, technicians and experts in social communication, robotics or 3D printing.


Ramos-Casals M, et al. Google-driven search for big data in autoimmune geoepidemiology: Analysis of 394,827 patients with systemic autoimmune diseases. Autoimmun Rev (2015).